## Scaling Laws for Neural Language Models
Jared Kaplan∗ Johns Hopkins University, OpenAI jaredk@jhu.edu
Sam McCandlish∗ OpenAI sam@openai.com
Tom Henighan OpenAI henighan@openai.com
Tom B. Brown OpenAI tom@openai.com
Benjamin Chess OpenAI bchess@openai.com
Rewon Child OpenAI rewon@openai.com
Scott Gray OpenAI scott@openai.com
Alec Radford OpenAI alec@openai.com
Jeffrey Wu OpenAI jeffwu@openai.com
Dario Amodei OpenAI damodei@openai.com
## Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
∗ Equal contribution.
Contributions: Jared Kaplan and Sam McCandlish led the research. Tom Henighan contributed the LSTM experiments. Tom Brown, Rewon Child, and Scott Gray, and Alec Radford developed the optimized Transformer implementation. Jeff Wu, Benjamin Chess, and Alec Radford developed the text datasets. Dario Amodei provided guidance throughout the project.
## Contents
- 1 Introduction
- 2 Background and Methods
- 3 Empirical Results and Basic Power Laws
- 4 Charting the Infinite Data Limit and Overfitting
- 5 Scaling Laws with Model Size and Training Time
- 6 Optimal Allocation of the Compute Budget
- 7 Related Work
- 8 Discussion
- Appendices
- A Summary of Power Laws
- B Empirical Model of Compute-Efficient Frontier
- C Caveats
- D Supplemental Figures
## 1 Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models [RNSS18, DCLT18, YDY+19, LOG+19, RSR+19] approaching human-level performance on many specific tasks [WPN+19], including the composition of coherent multi-paragraph prompted text samples [RWC+19].
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language tasks allow us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
## 1.1 Summary
Our key findings for Transformer language models are as follows:
Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute² used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

² Here we display predicted compute when using a sufficiently small batch size. See Figure 13 for comparison to the purely empirical data.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Scatter Plots: Test Loss vs. Compute, Dataset Size, and Parameters
### Overview
The image presents three scatter plots illustrating the relationship between test loss and three different factors: compute (measured in PF-days, non-embedding), dataset size (measured in tokens), and parameters (non-embedding). Each plot shows a decreasing trend in test loss as the corresponding factor increases.
### Components/Axes
**Plot 1: Test Loss vs. Compute**
* **Y-axis:** Test Loss, linear scale from 2 to 7.
* **X-axis:** Compute, logarithmic scale from 10^-9 to 10^1, labeled "PF-days, non-embedding".
* **Data:** Multiple light blue lines, a black line representing an average, and a dashed orange line representing a fitted curve.
* **Fitted Curve Equation (orange dashed line):** L = (Cmin / (2.3 * 10^8))^-0.050
**Plot 2: Test Loss vs. Dataset Size**
* **Y-axis:** Test Loss, linear scale from 2.7 to 4.2.
* **X-axis:** Dataset Size, logarithmic scale from 10^7 to 10^9, labeled "tokens".
* **Data:** Blue data points connected by a blue line, and a gray fitted curve.
* **Fitted Curve Equation (gray line):** L = (D / (5.4 * 10^13))^-0.095
**Plot 3: Test Loss vs. Parameters**
* **Y-axis:** Test Loss, linear scale from 2.4 to 5.6.
* **X-axis:** Parameters, logarithmic scale from 10^5 to 10^9, labeled "non-embedding".
* **Data:** Blue data points connected by a blue line, and a gray fitted curve.
* **Fitted Curve Equation (gray line):** L = (N / (8.8 * 10^13))^-0.076
### Detailed Analysis
**Plot 1: Test Loss vs. Compute**
* The light blue lines show individual runs, while the black line represents an average trend.
* The test loss decreases as compute increases.
* The orange dashed line represents the fitted curve, which approximates the average trend.
* At Compute = 10^-9, Test Loss is approximately 6.7.
* At Compute = 10^1, Test Loss is approximately 2.7.
**Plot 2: Test Loss vs. Dataset Size**
* The blue line with data points shows a clear decreasing trend.
* The gray line represents the fitted curve.
* At Dataset Size = 10^7, Test Loss is approximately 4.0.
* At Dataset Size = 10^9, Test Loss is approximately 2.8.
**Plot 3: Test Loss vs. Parameters**
* The blue line with data points shows a decreasing trend.
* The gray line represents the fitted curve.
* At Parameters = 10^5, Test Loss is approximately 5.5.
* At Parameters = 10^9, Test Loss is approximately 3.8.
### Key Observations
* All three plots show a negative correlation between test loss and the respective factor (compute, dataset size, and parameters).
* The fitted curves provide a mathematical representation of these relationships.
* The logarithmic scale on the x-axes suggests that the impact of each factor diminishes as its value increases.
### Interpretation
The plots demonstrate that increasing compute, dataset size, and the number of parameters generally leads to a reduction in test loss. This suggests that larger models, trained on more data, and with more computational resources, tend to perform better. The specific equations provided for the fitted curves quantify the relationship between test loss and each factor, allowing for predictions and comparisons. The diminishing returns observed due to the logarithmic scale highlight the importance of optimizing resource allocation to achieve the greatest reduction in test loss.
</details>
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D , and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors N,D,C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The performance penalty depends predictably on the ratio N 0 . 74 /D , meaning that every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty. (Section 4)
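As a quick arithmetic check of the 8x/5x figure, using the exponent quoted above:

$$8^{0.74} = e^{0.74 \ln 8} \approx 4.7 ,$$

so an 8x increase in N calls for roughly a 5x increase in D to keep the overfitting penalty fixed.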
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss - in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D , we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.
Figure 2 We show a series of language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings).
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Chart: Model Performance vs. Resources
### Overview
The image presents two line charts comparing the performance of machine learning models with varying sizes (number of parameters) against different resource metrics. The left chart shows "Test Loss" as a function of "Tokens Processed," while the right chart shows "Test Loss" as a function of "Compute (PF-days)." The line color indicates the number of parameters in the model, ranging from 10^3 (purple) to 10^9 (yellow).
### Components/Axes
**Left Chart:**
* **Title:** Larger models require fewer samples to reach the same performance
* **Y-axis:** Test Loss, with values ranging from 4 to 10.
* **X-axis:** Tokens Processed, with a logarithmic scale ranging from 10^1 to 10^11.
* **Annotations:** "10^3 Params" and "10^9 Params" with arrows pointing to the corresponding regions of the plot.
**Right Chart:**
* **Title:** The optimal model size grows smoothly with the loss target and compute budget
* **Y-axis:** Test Loss, with values ranging from 4 to 10.
* **X-axis:** Compute (PF-days), with a logarithmic scale ranging from 10^-9 to 10^0.
* **Annotation:** "Compute-efficient training stops far short of convergence" with an arrow pointing to a horizontal line segment on the right side of the chart.
**Legend:**
* **Position:** Top-right of the combined image.
* **Title:** Line color indicates number of parameters
* **Colors and Labels:**
* Purple: 10^3
* Teal: 10^6
* Yellow: 10^9
### Detailed Analysis
**Left Chart (Test Loss vs. Tokens Processed):**
* **General Trend:** All lines show a decreasing trend, indicating that test loss decreases as more tokens are processed.
* **Purple Lines (10^3 Parameters):** These lines start at a test loss of approximately 10 and decrease to a final test loss between 5 and 7. The lines flatten out around 10^9 tokens processed.
* **Teal Lines (10^6 Parameters):** These lines also start at a test loss of approximately 10 and decrease to a final test loss between 4 and 6. The lines flatten out around 10^8 tokens processed.
* **Yellow Lines (10^9 Parameters):** These lines start at a test loss of approximately 10 and decrease to a final test loss between 3 and 5. The lines flatten out around 10^7 tokens processed.
* **Observation:** Models with more parameters (yellow lines) reach lower test loss values with fewer tokens processed compared to models with fewer parameters (purple lines).
**Right Chart (Test Loss vs. Compute):**
* **General Trend:** All lines show a decreasing trend, indicating that test loss decreases as more compute is used.
* **Purple Lines (10^3 Parameters):** These lines start at a test loss of approximately 10 and decrease to a final test loss between 4 and 7. Some lines show plateaus at a test loss of around 6.
* **Teal Lines (10^6 Parameters):** These lines also start at a test loss of approximately 10 and decrease to a final test loss between 3 and 5.
* **Yellow Lines (10^9 Parameters):** These lines start at a test loss of approximately 10 and decrease to a final test loss between 3 and 4.
* **Horizontal Line Segment:** Located on the right side of the chart, near the bottom. It spans from approximately 10^-4 to 10^-2 on the x-axis (Compute).
* **Observation:** Models with more parameters (yellow lines) reach lower test loss values with less compute compared to models with fewer parameters (purple lines). The horizontal line segment indicates a point where compute-efficient training stops far short of convergence.
### Key Observations
* Larger models (more parameters) achieve lower test loss with fewer tokens processed and less compute.
* The relationship between model size, compute, and performance is smooth.
* Compute-efficient training may stop before full convergence.
### Interpretation
The charts demonstrate the trade-offs between model size, training data (tokens processed), compute resources, and model performance (test loss). The data suggests that increasing model size can lead to better performance with fewer training samples and less compute. However, the "Compute-efficient training stops far short of convergence" annotation indicates that there may be diminishing returns to increasing compute, and that training may be stopped early for efficiency reasons. The charts highlight the importance of considering model size and compute budget when training machine learning models. The trend suggests that larger models are more efficient in terms of data and compute requirements to achieve a certain level of performance.
</details>
Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Area Chart: Compute vs. Multiplicative Contribution
### Overview
The image is an area chart that illustrates the relationship between compute (measured in PF-days) and multiplicative contribution, highlighting the impact of serial steps, batch size, and model size. The chart uses a log-log scale for both axes. The areas are colored to represent different factors influencing the multiplicative contribution.
### Components/Axes
* **X-axis:** Compute (PF-days), logarithmic scale from 10<sup>-8</sup> to 10<sup>0</sup>. Axis markers are present at 10<sup>-8</sup>, 10<sup>-6</sup>, 10<sup>-4</sup>, 10<sup>-2</sup>, and 10<sup>0</sup>.
* **Y-axis:** Multiplicative Contribution, logarithmic scale from 10<sup>0</sup> to 10<sup>8</sup>. Axis markers are present at 10<sup>0</sup>, 10<sup>2</sup>, 10<sup>4</sup>, 10<sup>6</sup>, and 10<sup>8</sup>.
* **Areas:**
* **Green:** Represents "<10x Serial Steps". The upper bound of this area is labeled "Minimum serial steps increases negligibly".
* **Orange:** Represents "100x Batch Size".
* **Blue:** Represents ">1,000,000x Model Size". The lower bound of this area starts at the x and y axis origin.
* **Annotations:**
* "Data requirements grow relatively slowly" is positioned to the right of the orange area.
* "Optimal model size increases very quickly" is positioned below the previous annotation, to the right of the blue area.
### Detailed Analysis
* **Green Area (<10x Serial Steps):**
* The green area starts at approximately (10<sup>-8</sup>, 10<sup>0</sup>) and increases linearly on the log-log scale.
* At a compute of 10<sup>0</sup> PF-days, the multiplicative contribution is approximately 10<sup>5</sup>.
* The trend is upward, indicating that as compute increases, the multiplicative contribution due to serial steps also increases, but negligibly.
* **Orange Area (100x Batch Size):**
* The orange area starts where the green area ends and increases linearly on the log-log scale.
* At a compute of 10<sup>0</sup> PF-days, the multiplicative contribution is approximately 10<sup>7</sup>.
* The trend is upward, indicating that as compute increases, the multiplicative contribution due to batch size also increases.
* **Blue Area (>1,000,000x Model Size):**
* The blue area starts at approximately (10<sup>-8</sup>, 10<sup>0</sup>) and increases linearly on the log-log scale.
* At a compute of 10<sup>0</sup> PF-days, the multiplicative contribution is approximately 10<sup>8</sup>.
* The trend is upward, indicating that as compute increases, the multiplicative contribution due to model size increases very quickly.
### Key Observations
* The chart uses a log-log scale, which compresses the data and allows for the visualization of exponential relationships.
* The model size has the most significant impact on multiplicative contribution, as indicated by the largest area.
* Serial steps have the least impact on multiplicative contribution, as indicated by the smallest area.
* The annotations highlight that data requirements grow relatively slowly, while optimal model size increases very quickly.
### Interpretation
The chart illustrates the trade-offs between different factors influencing the multiplicative contribution in a computational model. It suggests that increasing model size has the most significant impact on performance, but also implies that this comes with a cost of rapidly increasing data requirements. Serial steps have a relatively minor impact. The chart emphasizes the importance of optimizing model size for performance, but also highlights the need to manage data requirements effectively. The logarithmic scales suggest exponential relationships between compute and multiplicative contribution for each factor.
</details>
## 1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only one of the number of non-embedding parameters N , the dataset size D , or the optimally allocated compute budget C_min (see Figure 1):
1. For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
$$L(N) = \left( N_c / N \right)^{\alpha_N}; \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13} \ \text{(non-embedding parameters)} \quad (1.1)$$
2. For large models trained with a limited dataset with early stopping:
$$L(D) = \left( D_c / D \right)^{\alpha_D}; \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13} \ \text{(tokens)} \quad (1.2)$$
3. When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal³ use of compute):
$$L(C_{\min}) = \left( C_c^{\min} / C_{\min} \right)^{\alpha_C^{\min}}; \quad \alpha_C^{\min} \sim 0.050, \quad C_c^{\min} \sim 3.1 \times 10^{8} \ \text{(PF-days)} \quad (1.3)$$
³ We also observe an empirical power-law trend with the training compute C (Figure 1) while training at fixed batch size, but it is the trend with C_min that should be used to make predictions. They are related by Equation (5.5).
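For concreteness, here is a minimal Python sketch (ours, not code from the paper) that evaluates the fits (1.1)-(1.3) with the constants quoted above; the function names and example arguments are illustrative.

```python
# Fitted constants quoted in Equations (1.1)-(1.3).
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens
ALPHA_C, C_C = 0.050, 3.1e8    # PF-days of optimally allocated compute

def loss_from_params(n):
    """Predicted converged test loss (nats/token) for n non-embedding parameters, Eq. (1.1)."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d):
    """Predicted early-stopped test loss for a large model trained on d tokens, Eq. (1.2)."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):
    """Predicted test loss for compute-optimal training using c_min PF-days, Eq. (1.3)."""
    return (C_C / c_min) ** ALPHA_C

print(loss_from_params(1.5e9))   # a ~1.5B-parameter model trained to convergence
print(loss_from_data(2.3e10))    # a large model trained on ~23B tokens
print(loss_from_compute(1.0))    # one PF-day of optimally allocated compute
```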
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Chart: Loss vs Model and Dataset Size & Loss vs Model Size and Training Steps
### Overview
The image presents two scatter plots side-by-side, both examining the relationship between 'Loss' and other variables. The left plot explores 'Loss' in relation to 'Tokens in Dataset' for different model sizes ('Params'). The right plot explores 'Loss' in relation to 'Estimated Smin' (training steps) for different model sizes ('Params'). The model sizes are color-coded in both plots, allowing for comparison.
### Components/Axes
**Left Plot:**
* **Title:** Loss vs Model and Dataset Size
* **Y-axis:** Loss (linear scale, values ranging from approximately 2.5 to 4.5)
* **X-axis:** Tokens in Dataset (logarithmic scale, values ranging from 10^7 to 10^10)
* **Legend (Params):** Located on the right side of the left plot.
* Yellow: 708M
* Light Green: 302M
* Green: 85M
* Blue: 3M
* Dark Blue: 25M
* Purple: 393.2K
**Right Plot:**
* **Title:** Loss vs Model Size and Training Steps
* **Y-axis:** Loss (linear scale, values ranging from approximately 2.4 to 4.4)
* **X-axis:** Estimated Smin (logarithmic scale, values ranging from 10^4 to 10^5)
* **Secondary Y-axis:** Parameters (non-embed) (logarithmic scale, values ranging from 10^6 to 10^8). This axis is represented by a color gradient.
* **Color Gradient Legend:** Located on the right side of the right plot. The color gradient maps to the "Parameters (non-embed)" values. Yellow represents lower parameter values, and purple represents higher parameter values.
### Detailed Analysis
**Left Plot: Loss vs Model and Dataset Size**
* **708M (Yellow):** The loss decreases from approximately 4.3 at 10^7 tokens to approximately 2.7 at 10^10 tokens.
* **302M (Light Green):** The loss decreases from approximately 4.1 at 10^7 tokens to approximately 2.9 at 10^10 tokens.
* **85M (Green):** The loss decreases from approximately 3.9 at 10^7 tokens to approximately 3.1 at 10^10 tokens.
* **3M (Blue):** The loss decreases from approximately 3.7 at 10^7 tokens to approximately 3.3 at 10^10 tokens.
* **25M (Dark Blue):** The loss decreases from approximately 4.2 at 10^7 tokens to approximately 3.6 at 10^10 tokens.
* **393.2K (Purple):** The loss remains relatively constant, starting at approximately 4.6 at 10^7 tokens and ending at approximately 4.3 at 10^10 tokens.
**Trend Verification (Left Plot):** All data series, except for the 393.2K series, show a decreasing trend in loss as the number of tokens in the dataset increases. The 393.2K series remains relatively flat.
**Right Plot: Loss vs Model Size and Training Steps**
* The data series are color-coded based on the "Parameters (non-embed)" values, ranging from yellow (lower values) to purple (higher values).
* All data series show a decreasing trend in loss as the estimated Smin (training steps) increases.
* The series with larger parameter counts generally reach lower loss values across the range of estimated Smin.
* The series with smaller parameter counts level off at noticeably higher loss values.
**Trend Verification (Right Plot):** All data series show a decreasing trend in loss as the estimated Smin increases.
### Key Observations
* In the left plot, larger models (higher 'Params' values) generally exhibit lower loss for a given number of tokens in the dataset, except for the 393.2K model.
* In the right plot, increasing the estimated Smin (training steps) generally leads to a decrease in loss for all model sizes.
* The right plot shows a clear relationship between the number of parameters and the loss, with larger models generally reaching lower loss values at a given number of training steps.
### Interpretation
The plots suggest that increasing both the dataset size (number of tokens) and the number of training steps (estimated Smin) leads to a reduction in loss. In the left plot, larger models generally perform better (lower loss) at any given dataset size. In the right plot, larger models likewise reach lower loss at a given number of training steps, consistent with the sample-efficiency findings in the text. The 393.2K model in the left plot is an outlier, as it does not show a significant decrease in loss with increasing dataset size, suggesting that it is under-parameterized for the task. The color gradient in the right plot provides a visual representation of how the number of parameters affects the loss.
</details>
Figure 4 Left : The early-stopped test loss L ( N,D ) varies predictably with the dataset size D and model size N according to Equation (1.5). Right : After an initial transient period, learning curves for all model sizes N can be fit with Equation (1.6), which is parameterized in terms of S min , the number of steps when training at large batch size (details in Section 5.1).
These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N , and over two orders of magnitude in D . They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2 training set [RWC+19]. The power laws α_N, α_D, α_C^min specify the degree of performance improvement expected as we scale up N , D , or C_min; for example, doubling the number of parameters yields a loss that is smaller by a factor 2^(-α_N) ≈ 0.95. The precise numerical values of N_c, C_c^min, and D_c depend on the vocabulary size and tokenization and hence do not have a fundamental meaning.
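For instance, the factor quoted for doubling the parameter count follows directly from the fitted exponent:

$$2^{-\alpha_N} = 2^{-0.076} \approx 0.95 .$$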
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism ([MKAT18]), also roughly obeys a power law in L :
$$B_{\rm crit}(L) = \frac{B_*}{L^{1/\alpha_B}}, \quad B_* \sim 2 \times 10^{8} \ \text{tokens}, \quad \alpha_B \sim 0.21 \quad (1.4)$$
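Assuming the power-law form and constants above, a one-function Python sketch (the helper name is ours):

```python
def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """Critical batch size in tokens, B_crit(L) = B* / L^(1/alpha_B), as in Equation (1.4)."""
    return b_star / loss ** (1.0 / alpha_b)

# At a loss of roughly 2.7 this gives a batch size on the order of 1-2 million tokens,
# matching the value quoted for the largest models (Section 5.1).
print(critical_batch_size(2.7))
```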
Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to D ∝ N^(α_N/α_D) ∼ N^0.74. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overfitting:
$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D} \quad (1.5)$$
with fits pictured on the left in figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
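A minimal Python sketch of Equation (1.5) (illustrative; the constants are the fitted values from Section 1.2):

```python
def loss_n_d(n, d, n_c=8.8e13, d_c=5.4e13, alpha_n=0.076, alpha_d=0.095):
    """Early-stopped test loss L(N, D) from Equation (1.5)."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

# At fixed N, enlarging D eventually stops helping once the N-dependent term dominates.
print(loss_n_d(3e8, 1e9), loss_n_d(3e8, 1e11))
```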
When training a given model for a finite number of parameter update steps S in the infinite data limit, after an initial transient period, the learning curves can be accurately fit by (see the right of figure 4)
$$L(N, S) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( \frac{S_c}{S_{\min}(S)} \right)^{\alpha_S} \quad (1.6)$$
where S_c ≈ 2.1 × 10^3 and α_S ≈ 0.76, and S_min(S) is the minimum possible number of optimization steps (parameter updates) estimated using Equation (5.4).
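Similarly, a sketch of the learning-curve fit in Equation (1.6), with the constants quoted above (illustrative only):

```python
def loss_n_s(n, s_min, n_c=8.8e13, alpha_n=0.076, s_c=2.1e3, alpha_s=0.76):
    """Learning-curve fit L(N, S_min) from Equation (1.6)."""
    return (n_c / n) ** alpha_n + (s_c / s_min) ** alpha_s

# Loss predicted for a 300M-parameter model after 10^4 and 10^5 effective steps:
print(loss_n_s(3e8, 1e4), loss_n_s(3e8, 1e5))
```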
When training within a fixed compute budget C , but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N , optimal batch size B , optimal number of steps S , and dataset size D should grow as
$$N \propto C_{\min}^{\alpha_C^{\min}/\alpha_N}, \quad B \propto C_{\min}^{\alpha_C^{\min}/\alpha_B}, \quad S \propto C_{\min}^{\alpha_C^{\min}/\alpha_S}, \quad D = B \cdot S \quad (1.7)$$
with
$$\alpha_C^{\min} \equiv 1 / \left( 1/\alpha_S + 1/\alpha_B + 1/\alpha_N \right) \approx 0.054 \quad (1.8)$$
which closely matches the empirically optimal results N ∝ C_min^0.73, B ∝ C_min^0.24, and S ∝ C_min^0.03. As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become increasingly sample efficient. In practice, researchers typically train smaller models for longer than would
be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law (see Equation (1.3)).
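The allocation exponents in Equations (1.7)-(1.8) follow directly from the fitted α values; a short sketch (α_B is the batch-size exponent from Equation (1.4)):

```python
ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)
print(alpha_c_min)            # ~0.052 with these rounded inputs (quoted as ~0.054)

print(alpha_c_min / ALPHA_N)  # predicted exponent for N(C_min); empirically N ~ C_min^0.73
print(alpha_c_min / ALPHA_B)  # predicted exponent for B(C_min); empirically B ~ C_min^0.24
print(alpha_c_min / ALPHA_S)  # predicted exponent for S(C_min); empirically S ~ C_min^0.03
```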
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers [DGV + 18].
## 1.3 Notation
We use the following notation:
- L - the cross entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
- N - the number of model parameters, excluding all vocabulary and positional embeddings
- C ≈ 6NBS - an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (i.e., parameter updates). We quote numerical values in PF-days, where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations (a short worked sketch follows this list).
- D - the dataset size in tokens
- B crit - the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
- C min - an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.
- S min - an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.
- α_X - power-law exponents for the scaling of the loss as L(X) ∝ 1/X^(α_X), where X can be any of N, D, C, S, B, C_min.
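As a worked illustration of the C ≈ 6NBS estimate and the PF-day unit (a sketch; the helper name and example numbers are ours, with the batch and step count taken from Section 2.2):

```python
PF_DAY_FLOPS = 1e15 * 24 * 3600  # one PF-day = 8.64e19 floating point operations

def train_compute_pf_days(n_params, batch_tokens, steps):
    """Estimate non-embedding training compute C ~ 6*N*B*S, expressed in PF-days."""
    return 6.0 * n_params * batch_tokens * steps / PF_DAY_FLOPS

# Example: a 1.5B-parameter model, a batch of 2^19 tokens, and 2.5e5 parameter updates.
print(train_compute_pf_days(1.5e9, 2 ** 19, 2.5e5))
```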
## 2 Background and Methods
We train language models on WebText2, an extended version of the WebText [RWC + 19] dataset, tokenized using byte-pair encoding [SHB15] with a vocabulary size n vocab = 50257 . We optimize the autoregressive log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal performance metric. We record the loss on the WebText2 test distribution and on a selection of other text distributions. We primarily train decoder-only [LSP + 18, RNSS18] Transformer [VSP + 17] models, though we also train LSTM models and Universal Transformers [DGV + 18] for comparison.
## 2.1 Parameter and Compute Scaling of Transformers
We parameterize the Transformer architecture using hyperparameters n layer (number of layers), d model (dimension of the residual stream), d ff (dimension of the intermediate feed-forward layer), d attn (dimension of the attention output), and n heads (number of attention heads per layer). We include n ctx tokens in the input context, with n ctx = 1024 except where otherwise noted.
We use N to denote the model size, which we define as the number of non-embedding parameters
$$N \approx 2 d_{\rm model} n_{\rm layer} \left( 2 d_{\rm attn} + d_{\rm ff} \right) = 12\, n_{\rm layer} d_{\rm model}^{2} \quad \text{with the standard} \quad d_{\rm attn} = d_{\rm ff}/4 = d_{\rm model} \quad (2.1)$$
where we have excluded biases and other sub-leading terms. Our models also have n vocab d model parameters in an embedding matrix, and use n ctx d model parameters for positional embeddings, but we do not include these when discussing the 'model size' N ; we will see that this produces significantly cleaner scaling laws.
Evaluating a forward pass of the Transformer involves roughly
$$C_{\rm forward} \approx 2 N + 2\, n_{\rm layer} n_{\rm ctx} d_{\rm model} \quad (2.2)$$
add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.
Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
| Operation | Parameters | FLOPs per Token |
|-----------------------|------------------------------------------|------------------------------------------|
| Embed | (n_vocab + n_ctx) d_model | 4 d_model |
| Attention: QKV | n_layer d_model 3 d_attn | 2 n_layer d_model 3 d_attn |
| Attention: Mask | — | 2 n_layer n_ctx d_attn |
| Attention: Project | n_layer d_attn d_model | 2 n_layer d_attn d_model |
| Feedforward | n_layer 2 d_model d_ff | 2 n_layer 2 d_model d_ff |
| De-embed | — | 2 d_model n_vocab |
| Total (Non-Embedding) | N = 2 d_model n_layer (2 d_attn + d_ff) | C_forward = 2 N + 2 n_layer n_ctx d_attn |
For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute. Since we primarily study models where d_model ≫ n_ctx/12, we do not include context-dependent terms in our training compute estimate. Accounting for the backwards pass (approximately twice the compute of the forward pass), we then define the estimated non-embedding compute as C ≈ 6N floating point operations per training token.
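A minimal sketch of the parameter and per-token compute counts from Table 1 and Equation (2.2) (function names are ours; the (48, 1600) shape is the one cited in Figure 5):

```python
def non_embedding_params(n_layer, d_model, d_attn, d_ff):
    """N = 2 * d_model * n_layer * (2 * d_attn + d_ff), the Total (Non-Embedding) row of Table 1."""
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

def forward_flops_per_token(n_layer, d_model, d_attn, d_ff, n_ctx):
    """C_forward ~ 2N + 2 * n_layer * n_ctx * d_model add-multiply operations per token, Eq. (2.2)."""
    n = non_embedding_params(n_layer, d_model, d_attn, d_ff)
    return 2 * n + 2 * n_layer * n_ctx * d_model

# With the standard choices d_attn = d_model and d_ff = 4 * d_model, N ~ 12 * n_layer * d_model**2.
n_layer, d_model = 48, 1600
print(non_embedding_params(n_layer, d_model, d_model, 4 * d_model))                  # ~1.5e9
print(forward_flops_per_token(n_layer, d_model, d_model, 4 * d_model, n_ctx=1024))   # ~3.1e9
```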
## 2.2 Training Procedures
Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed 2.5 × 10^5 steps with a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than 1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate schedule with a 3000 step linear warmup followed by a cosine decay to zero.
## 2.3 Datasets
We train our models on an extended version of the WebText dataset described in [RWC+19]. The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and 1.62 × 10^10 words (as defined by wc). We then apply the reversible tokenizer described in [RWC+19], which yields 2.29 × 10^10 tokens. We reserve 6.6 × 10^8 of these tokens for use as a test set, and we also test on similarly prepared samples of Books Corpus [ZKZ+15], Common Crawl [Fou], English Wikipedia, and a collection of publicly-available Internet Books.
## 3 Empirical Results and Basic Power Laws
To characterize language model scaling we train a wide variety of models, varying a number of factors including:
- Model size (ranging from 768 to 1.5 billion non-embedding parameters)
- Dataset size (ranging from 22 million to 23 billion tokens)
- Shape (including depth, width, attention heads, and feed-forward dimension)
- Context length (1024 for most runs, though we also experiment with shorter contexts)
- Batch size (2^19 for most runs, but we also vary it to measure the critical batch size)
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Chart: Loss Increase vs. Architecture Parameters
### Overview
The image presents three line charts comparing the "Loss Increase" (y-axis) against different architectural parameters (x-axis) for neural networks. The charts explore the impact of "Feed-Forward Ratio", "Aspect Ratio", and "Attention Head Dimension" on model performance, measured by loss increase. Each chart uses different parameter settings (50M, 274M, 1.5B) or model dimensions (256, 512, 1024) as separate data series.
### Components/Axes
**General:**
* **Y-axis:** "Loss Increase" ranging from 0% to 10%.
* **X-axis:** Logarithmic scale (base 10) for all three charts.
**Chart 1: Feed-Forward Ratio**
* **X-axis:** "Feed-Forward Ratio (dff / dmodel)".
* X-axis markers: 10<sup>0</sup>, 10<sup>1</sup>
* **Parameter Setting:** 50M Parameters
* **Legend:** Located in the top-left corner of the entire image.
* Blue line: n<sub>head</sub> = 8
* Orange line: d<sub>model</sub> / n<sub>head</sub> = 64
**Chart 2: Aspect Ratio**
* **X-axis:** "Aspect Ratio (dmodel / nlayer)".
* X-axis markers: 10<sup>1</sup>, 10<sup>2</sup>, 10<sup>3</sup>
* **Parameter Settings:**
* Blue line: 50M Params
* Orange line: 274M Params
* Green line: 1.5B Params
* **Text Overlay:** "A wide range of architectures achieve similar performance" with vertical lines indicating the range.
**Chart 3: Attention Head Dimension**
* **X-axis:** "Attention Head Dimension (dmodel / nhead)".
* X-axis markers: 10<sup>1</sup>, 10<sup>2</sup>
* **Parameter Settings:** 25M Parameters
* **Legend:** Located in the top-right corner of the entire image.
* Blue line: d<sub>model</sub> = 256
* Orange line: d<sub>model</sub> = 512
* Green line: d<sub>model</sub> = 1024
* **Text Overlay:** "22% additional compute compensates for 1% loss increase" with a vertical line and arrow indicating the loss increase.
### Detailed Analysis
**Chart 1: Feed-Forward Ratio (50M Parameters)**
* **Blue Line (nhead = 8):** Starts at approximately 0.5% loss increase at x=10<sup>0</sup>, remains relatively flat until x=10<sup>1</sup>, then increases to approximately 2% at x=10<sup>1</sup>.
* **Orange Line (dmodel / nhead = 64):** Starts at approximately 0.7% loss increase at x=10<sup>0</sup>, decreases slightly to approximately 0% at x=10<sup>0.5</sup>, then increases sharply to approximately 8% at x=10<sup>1</sup>.
**Chart 2: Aspect Ratio**
* **Blue Line (50M Params):** Starts at approximately 2.5% loss increase at x=10<sup>1</sup>, decreases to approximately 0.2% at x=10<sup>2</sup>, then increases to approximately 2% at x=10<sup>3</sup>.
* **Orange Line (274M Params):** Starts at approximately 1% loss increase at x=10<sup>1</sup>, decreases to approximately 0% at x=10<sup>2</sup>, then increases to approximately 8% at x=10<sup>3</sup>.
* **Green Line (1.5B Params):** Starts at approximately 2% loss increase at x=10<sup>1</sup>, decreases to approximately 0.1% at x=10<sup>2</sup>, then increases to approximately 3% at x=10<sup>3</sup>.
**Chart 3: Attention Head Dimension (25M Parameters)**
* **Blue Line (dmodel = 256):** Starts at approximately 0.2% loss increase at x=10<sup>1</sup>, remains relatively flat until x=10<sup>2</sup>, then increases to approximately 1.5% at x=10<sup>2</sup>.
* **Orange Line (dmodel = 512):** Starts at approximately 0.1% loss increase at x=10<sup>1</sup>, remains relatively flat until x=10<sup>2</sup>, then increases to approximately 1% at x=10<sup>2</sup>.
* **Green Line (dmodel = 1024):** Starts at approximately 0.1% loss increase at x=10<sup>1</sup>, remains relatively flat until x=10<sup>2</sup>, then increases to approximately 0.5% at x=10<sup>2</sup>.
### Key Observations
* **Feed-Forward Ratio:** Increasing the feed-forward ratio significantly increases the loss, especially when dmodel / nhead = 64.
* **Aspect Ratio:** There's a performance sweet spot around an aspect ratio of 100 (10<sup>2</sup>), where the loss is minimized across different parameter settings.
* **Attention Head Dimension:** Increasing the attention head dimension generally leads to a slight increase in loss.
* **Parameter Settings:** The 274M parameter setting shows the most significant increase in loss with increasing aspect ratio.
### Interpretation
The charts suggest that the architecture of a neural network significantly impacts its performance, as measured by loss increase. The "Aspect Ratio" chart indicates that there is an optimal ratio for minimizing loss, and deviating from this ratio increases the loss. The "Feed-Forward Ratio" chart shows that increasing the feed-forward ratio can lead to a substantial increase in loss. The "Attention Head Dimension" chart suggests that increasing the attention head dimension has a less pronounced effect on loss compared to the other two parameters. The text overlays highlight that a wide range of architectures can achieve similar performance and that additional compute can compensate for loss increases. These findings can guide the design of more efficient and effective neural network architectures.
</details>
Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters N is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences in parameter counts are compensated for by using the fit to L(N) as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; an (n_layer, d_model) = (6, 4288) model reaches a loss within 3% of the (48, 1600) model used in [RWC+19].
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Chart: Test Loss vs. Parameters
### Overview
The image presents two line charts comparing the test loss of models with varying numbers of layers against the number of parameters. The left chart shows the relationship when parameters include embeddings, while the right chart excludes embeddings. The number of layers is represented by different colored lines.
### Components/Axes
**Left Chart:**
* **Title:** Parameters (with embedding)
* **X-axis:** Parameters (with embedding), logarithmic scale from 10^6 to 10^9
* **Y-axis:** Test Loss, linear scale from 2 to 7
* **Legend (top-left):**
* Dark Blue: 0 Layer
* Purple: 1 Layer
* Medium Purple: 2 Layers
* Pink: 3 Layers
* Light Orange: 6 Layers
* Orange: > 6 Layers
**Right Chart:**
* **Title:** Parameters (non-embedding)
* **X-axis:** Parameters (non-embedding), logarithmic scale from 10^3 to 10^9
* **Y-axis:** Test Loss, linear scale from 2 to 7
* **Legend (left):**
* Purple: 1 Layer
* Medium Purple: 2 Layers
* Pink: 3 Layers
* Light Orange: 6 Layers
* Orange: > 6 Layers
### Detailed Analysis
**Left Chart (with embedding):**
* **0 Layer (Dark Blue):** Starts at approximately 6.8 test loss at 10^6 parameters, remains relatively flat around 6.0 test loss until 10^9 parameters.
* **1 Layer (Purple):** Starts at approximately 7.0 test loss at 10^6 parameters, decreases to approximately 3.5 test loss at 10^9 parameters.
* **2 Layers (Medium Purple):** Starts at approximately 6.0 test loss at 10^6 parameters, decreases to approximately 3.0 test loss at 10^9 parameters.
* **3 Layers (Pink):** Starts at approximately 5.0 test loss at 10^6 parameters, decreases to approximately 2.7 test loss at 10^9 parameters.
* **6 Layers (Light Orange):** Starts at approximately 4.5 test loss at 10^6 parameters, decreases to approximately 2.5 test loss at 10^9 parameters.
* **> 6 Layers (Orange):** Starts at approximately 4.0 test loss at 10^6 parameters, decreases to approximately 2.3 test loss at 10^9 parameters.
**Right Chart (non-embedding):**
* **1 Layer (Purple):** Starts at approximately 6.5 test loss at 10^3 parameters, decreases to approximately 4.2 test loss at 10^9 parameters.
* **2 Layers (Medium Purple):** Starts at approximately 6.0 test loss at 10^3 parameters, decreases to approximately 3.5 test loss at 10^9 parameters.
* **3 Layers (Pink):** Starts at approximately 6.0 test loss at 10^3 parameters, decreases to approximately 3.0 test loss at 10^9 parameters.
* **6 Layers (Light Orange):** Starts at approximately 5.5 test loss at 10^3 parameters, decreases to approximately 2.5 test loss at 10^9 parameters.
* **> 6 Layers (Orange):** Starts at approximately 5.0 test loss at 10^3 parameters, decreases to approximately 2.3 test loss at 10^9 parameters.
### Key Observations
* In both charts, increasing the number of parameters generally leads to a decrease in test loss.
* The "0 Layer" model in the left chart (with embedding) shows minimal improvement in test loss as the number of parameters increases.
* The right chart (non-embedding) shows a steeper initial decrease in test loss for all models as the number of parameters increases from 10^3 to 10^6, compared to the left chart.
* The models with more layers (6 and >6) consistently achieve lower test loss compared to models with fewer layers (1, 2, and 3) in both charts.
### Interpretation
The charts suggest that increasing the number of layers and parameters in a model generally improves its performance, as indicated by the decrease in test loss. The inclusion of embeddings appears to shift the parameter scale, requiring more parameters to achieve similar test loss reductions compared to models without embeddings. The "0 Layer" model's flat performance in the left chart indicates that simply increasing parameters without adding layers does not significantly improve performance. The steeper initial decrease in test loss in the right chart suggests that the initial impact of increasing parameters is more pronounced when embeddings are not included. The models with more layers consistently outperform those with fewer layers, highlighting the importance of model depth in achieving better results.
</details>
Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameters, the performance of models with different depths converge to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend.
In this section we will display data along with empirically-motivated fits, deferring theoretical analysis to later sections.
## 3.1 Approximate Transformer Shape and Hyperparameter Independence
Transformer performance depends very weakly on the shape parameters n_layer, n_heads, and d_ff when we hold the total non-embedding parameter count N fixed. To establish these results we trained models with fixed size while varying a single hyperparameter. This was simplest for the case of n_heads. When varying n_layer, we simultaneously varied d_model while keeping N ≈ 12 n_layer d_model^2 fixed. Similarly, to vary d_ff at fixed model size we also simultaneously varied the d_model parameter, as required by the parameter counts in Table 1. Independence of n_layer would follow if deeper Transformers effectively behave as ensembles of shallower models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5.
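A sketch of the bookkeeping described above: to scan n_layer at fixed model size, pick d_model from N ≈ 12 n_layer d_model^2 (the helper and the loop are ours; the target N roughly matches the (48, 1600) model of Figure 5):

```python
import math

def d_model_for_fixed_n(n_target, n_layer):
    """Choose d_model so that N ~ 12 * n_layer * d_model**2 stays approximately equal to n_target."""
    return round(math.sqrt(n_target / (12 * n_layer)))

n_target = 12 * 48 * 1600 ** 2            # ~1.5e9 non-embedding parameters
for n_layer in (6, 12, 48, 207):
    print(n_layer, d_model_for_fixed_n(n_target, n_layer))
```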
## 3.2 Performance with Non-Embedding Parameter Count N
In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape (n_layer, d_model) = (2, 128) through billion-parameter models with shapes ranging from (6, 4288) through (207, 768). Here we have trained to near convergence on the full WebText2 dataset and observe no overfitting (except possibly for the very largest models).
As shown in Figure 1, we find a steady trend with non-embedding parameter count N , which can be fit to the first term of Equation (1.5), so that
$$L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N} \quad (3.1)$$
Figure 7
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Chart: Transformer vs LSTM Performance
### Overview
The image presents two line charts comparing the performance of Transformer and LSTM models. The left chart shows "Test Loss" versus "Parameters (non-embedding)", while the right chart shows "Per-token Test Loss" versus "Token Index in Context". The charts aim to illustrate how Transformers outperform LSTMs, especially with longer contexts and increased parameters.
### Components/Axes
**Left Chart:**
* **Title:** Transformers asymptotically outperform LSTMs due to improved use of long contexts
* **Y-axis:** Test Loss, with values ranging from 2.4 to 5.4.
* **X-axis:** Parameters (non-embedding), with a logarithmic scale from 10^5 to 10^9.
* **Data Series:**
* LSTMs: Represented by three lines:
* 1 Layer (light red)
* 2 Layers (red)
* 4 Layers (blue)
* Transformers: Represented by one line (blue).
**Right Chart:**
* **Title:** LSTM plateaus after <100 tokens. Transformer improves through the whole context.
* **Y-axis:** Per-token Test Loss, with values ranging from 2 to 6.
* **X-axis:** Token Index in Context, with a logarithmic scale from 10^0 to 10^3.
* **Data Series:**
* Parameters:
* 400K (red)
* 400K (blue)
* 2M (light red)
* 3M (light blue)
* 200M (light red)
* 300M (light blue)
### Detailed Analysis
**Left Chart:**
* **LSTMs (1 Layer):** Starts at approximately (10^5, 5.1) and decreases to approximately (10^9, 3.8).
* **LSTMs (2 Layers):** Starts at approximately (10^5, 5.2) and decreases to approximately (10^9, 3.5).
* **LSTMs (4 Layers):** Starts at approximately (10^5, 5.3) and decreases to approximately (10^9, 4.0).
* **Transformers:** Starts at approximately (10^5, 4.9) and decreases to approximately (10^9, 2.4).
**Right Chart:**
* **400K (red):** Starts at approximately (1, 6.2) and plateaus around 4.0 after 100 tokens.
* **400K (blue):** Starts at approximately (1, 5.9) and plateaus around 3.8 after 100 tokens.
* **2M (light red):** Starts at approximately (1, 5.7) and decreases to approximately 3.5 at 10^3.
* **3M (light blue):** Starts at approximately (1, 5.5) and decreases to approximately 3.0 at 10^3.
* **200M (light red):** Starts at approximately (1, 5.3) and decreases to approximately 2.8 at 10^3.
* **300M (light blue):** Starts at approximately (1, 5.1) and decreases to approximately 2.5 at 10^3.
### Key Observations
* In the left chart, Transformers consistently outperform LSTMs across all parameter ranges. The test loss for Transformers is significantly lower than that of LSTMs, especially as the number of parameters increases.
* In the right chart, LSTM models (400K parameters) plateau relatively quickly, while Transformer models (2M, 3M, 200M, 300M parameters) continue to improve (decrease in test loss) throughout the context.
* Increasing the number of layers in LSTMs does improve performance (lower test loss), but not to the same extent as using Transformers.
* Increasing the number of parameters in Transformers leads to a continuous decrease in test loss, indicating better performance with larger models.
### Interpretation
The data suggests that Transformers are more effective than LSTMs, particularly when dealing with long contexts and larger parameter sizes. The left chart demonstrates that Transformers achieve lower test loss compared to LSTMs for a given number of parameters. The right chart highlights that LSTMs plateau in performance after processing a limited number of tokens, while Transformers continue to improve as the context length increases. This indicates that Transformers are better at capturing long-range dependencies in the data. The charts support the claim that Transformers' architecture is better suited for tasks requiring the processing of long sequences, leading to improved performance compared to LSTMs.
</details>
To observe these trends it is crucial to study performance as a function of N ; if we instead use the total parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in recent work [LCG + 19].
Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.
## 3.2.1 Comparing to LSTMs and Universal Transformers
In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count N. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
We also compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17 in the appendix. These models re-use parameters, and so perform slightly better as a function of N , at the cost of additional compute per-parameter.
## 3.2.2 Generalization Among Data Distributions
We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8).
## 3.3 Performance with Dataset Size and Compute
We display empirical trends for the test loss as a function of dataset size D (in tokens) and training compute C in Figure 1.
For the trend with D we trained a model with (n_layer, n_embd) = (36, 1280) on fixed subsets of the WebText2 dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be fit with a simple power law
$$L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D} \quad (3.2)$$
in the dataset size. The data and fit appear in Figure 1.
The total amount of non-embedding compute used during training can be estimated as C = 6NBS, where B is the batch size, S is the number of parameter updates, and the factor of 6 accounts for the forward and backward passes. Thus for a given value of C we can scan over all models with various N to find the model with the best performance on step S = C/(6NB).
We chose the parameterization of L(N, D) in Equation (1.5) using three principles:
1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of L ( N,D ) (and all models of the loss) must naturally allow for such a rescaling.
2. Fixing D and sending N → ∞ , the overall loss should approach L ( D ) . Conversely, fixing N and sending D →∞ the loss must approach L ( N ) .
3. L ( N,D ) should be analytic at D = ∞ , so that it has a series expansion in 1 /D with integer powers. Theoretical support for this principle is significantly weaker than for the first two.
Our choice of L ( N,D ) satisfies the first requirement because we can rescale N c , D c with changes in the vocabulary. This also implies that the values of N c , D c have no fundamental meaning.
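As a check of the second principle against Equation (1.5), the two limits reduce to the single-variable laws:

$$\lim_{D \to \infty} L(N, D) = \left( \frac{N_c}{N} \right)^{\alpha_N} = L(N) , \qquad \lim_{N \to \infty} L(N, D) = \left( \frac{D_c}{D} \right)^{\alpha_D} = L(D) .$$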
Figure 8 Left: Generalization performance to other data distributions improves smoothly with model size, with only a small and very slowly growing offset from the WebText2 training distribution. Right: Generalization performance depends only on training distribution performance, and not on the phase of training. We compare generalization of converged models (points) to that of a single large model (dashed curves) as it trains.
Note that in these results the batch size B remains fixed for all models, which means that these empirical results are not truly optimal. We will account for this in later sections using an adjusted C_min to produce cleaner trends.
The result appears as the heavy black line on the left-hand plot in Figure 1. It can be fit with
$$L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$
The figure also includes images of individual learning curves to clarify when individual models are optimal. We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample efficiency improves with model size, and we also illustrate this directly in Figure 19 in the appendix.
## 4 Charting the Infinite Data Limit and Overfitting
In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will study the performance of a model of size N trained on a dataset with D tokens while varying N and D simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control.
## 4.1 Proposed L ( N,D ) Equation
We have chosen the parameterization (1.5) (repeated here for convenience):

$$L(N, D) = \left[ \left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D}$$

using three principles:

1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of L(N, D) (and all models of the loss) must naturally allow for such a rescaling.
2. Fixing D and sending N → ∞, the overall loss should approach L(D). Conversely, fixing N and sending D → ∞ the loss must approach L(N).
3. L(N, D) should be analytic at D = ∞, so that it has a series expansion in 1/D with integer powers. Theoretical support for this principle is significantly weaker than for the first two.

Our choice of L(N, D) satisfies the first requirement because we can rescale N_c, D_c with changes in the vocabulary. This also implies that the values of N_c, D_c have no fundamental meaning.
Figure 9 The early-stopped test loss L(N, D) depends predictably on the dataset size D and model size N according to Equation (1.5). Left: For large D, performance is a straight power law in N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit. (The reverse is also true, see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio N^{α_N/α_D}/D, as predicted in Equation (4.3). The line is our fit to that equation.
Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we expect that larger models should always perform better than smaller models. But with fixed finite D , we also do not expect any model to be capable of approaching the best possible loss (ie the entropy of text). Similarly, a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note that knowledge of L ( N ) at infinite D and L ( D ) at infinite N fully determines all the parameters in L ( N,D ) .
The third principle is more speculative. There is a simple and general reason one might expect overfitting to scale ∝ 1 /D at very large D . Overfitting should be related to the variance or the signal-to-noise ratio of the dataset [AS17], and this scales as 1 /D . This expectation should hold for any smooth loss function, since we expect to be able to expand the loss about the D →∞ limit. However, this argument assumes that 1 /D corrections dominate over other sources of variance, such as the finite batch size and other limits on the efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability.
Our third principle explains the asymmetry between the roles of N and D in Equation (1.5). Very similar symmetric expressions 4 are possible, but they would not have a 1 /D expansion with integer powers, and would require the introduction of an additional parameter.
In any case, we will see that our equation for L ( N,D ) fits the data well, which is the most important justification for our L ( N,D ) ansatz.
## 4.2 Results
We regularize all our models with 10% dropout, and by tracking test loss and stopping once it is no longer decreasing. The results are displayed in Figure 9, including a fit to the four parameters α N , α D , N c , D c in Equation (1.5):
Table 2 Fits to L(N, D)

| Parameter | α_N   | α_D   | N_c          | D_c          |
|-----------|-------|-------|--------------|--------------|
| Value     | 0.076 | 0.103 | 6.4 × 10^13  | 1.8 × 10^13  |
We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of 1024 , to about 2 × 10 7 tokens. With such a small dataset, an epoch consists of only 40 parameter updates. Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in Section 3, as here we are fitting the full L ( N,D ) rather than just L ( N, ∞ ) or L ( ∞ , D ) .
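As a quick sanity check on these numbers, the fitted form can be evaluated directly; the following minimal sketch plugs the Table 2 values into Equation (1.5), with the function name and example (N, D) inputs being our own illustrative assumptions.

```python
# Sketch: evaluate the L(N, D) ansatz of Equation (1.5) with the Table 2 fit.
# The function name and the example (N, D) values are illustrative assumptions.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13


def loss_nd(n_params: float, d_tokens: float) -> float:
    """Early-stopped test loss predicted by Equation (1.5)."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / d_tokens) ** ALPHA_D


# A 10^9-parameter model on the 22B-token WebText2 dataset, vs. its infinite-data limit.
print(loss_nd(1e9, 22e9))        # finite-data loss
print((N_C / 1e9) ** ALPHA_N)    # L(N, ∞)
```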
To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset, so we can take it as representative of D = ∞ . Thus we can compare finite D to the infinite data limit by
4 For example, one might have used L(N, D) = [(N_c/N)^{α_N} + (D_c/D)^{α_D}]^β, but this does not have a 1/D expansion.
Figure 10 The critical batch size B_crit follows a power law in the loss as performance increases, and does not depend directly on the model size. We find that the critical batch size approximately doubles for every 13% decrease in loss. B_crit is measured empirically from the data shown in Figure 18, but it is also roughly predicted by the gradient noise scale, as in [MKAT18].
defining
$$\delta L ( N , D ) \equiv \frac { L ( N , D ) } { L ( N , \infty ) } - 1 \quad ( 4 . 2 )$$
and studying it as a function of N, D. In fact, we see empirically that δL depends only on a specific combination of N and D, as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies
$$\delta L \approx \left( 1 + \left(\frac{N}{N_c}\right)^{\frac{\alpha_N}{\alpha_D}} \frac{D_c}{D} \right)^{\alpha_D} - 1 \quad (4.3)$$
Note that at large D this formula also has a series expansion in powers of 1 /D .
We estimate that the variation in the loss with different random seeds is roughly 0.02, which means that to avoid overfitting when training to within that threshold of convergence we require

$$D \gtrsim \left( 5 \times 10^{3} \right) N^{0.74} \quad (4.4)$$
With this relation, models smaller than 10^9 parameters can be trained with minimal overfitting on the 22B token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however that this does not typically represent maximally compute-efficient training. We should also emphasize that we have not optimized regularization (e.g. the dropout probability) while varying dataset and model size.
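For a rough sense of scale, the bound in Equation (4.4) can be evaluated directly; this is a minimal sketch, and the example model sizes are assumptions chosen for illustration.

```python
# Sketch: dataset size needed to keep overfitting below the ~0.02 seed-to-seed noise,
# per the D >~ (5e3) * N**0.74 relation above. The example model sizes are illustrative.
def min_tokens(n_params: float) -> float:
    return 5e3 * n_params ** 0.74


print(f"{min_tokens(1e9):.2e} tokens")    # ~2e10, roughly the 22B-token WebText2 scale
print(f"{min_tokens(1.5e9):.2e} tokens")  # ~3e10, slightly beyond it
```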
## 5 Scaling Laws with Model Size and Training Time
In this section we will demonstrate that a simple scaling law provides a good description for the loss as a function of model size N and training time. First we will explain how to use the results of [MKAT18] to define a universal training step S min , which accounts for the fact that most of our models have not been trained at an optimal batch size. Then we will demonstrate that we can fit the model size and training time dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation of training compute between model size and training time, and then confirm that prediction.
## 5.1 Adjustment for Training at B crit ( L )
A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also [SLA + 18, ZLN + 19]). It was argued that there is a critical batch size B crit for training; for B up to B crit the batch size can be increased with very minimal degradation in compute-efficiency, whereas for B > B crit increases in B result in diminishing returns. It was also argued that the gradient noise scale provides a simple
prediction for B_crit, and that neither depends directly on model size except through the value of the loss that has been attained. These results can be used to predict how training time and compute will vary with the batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch size B ≈ B_crit. Training at B ≫ B_crit minimizes the number of training steps, while B ≪ B_crit minimizes the use of compute.
More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training steps S and the number of data examples processed E = BS satisfy the simple relation
$$\left ( { \frac { S } { S _ { \min } } } - 1 \right ) \left ( { \frac { E } { E _ { \min } } } - 1 \right ) = 1 \quad ( 5 . 1 )$$
when training to any fixed value of the loss L . Here S min is the minimum number of steps necessary to reach L , while E min is the minimum number of data examples that must be processed.
We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation defines the critical batch size
$$B_{\mathrm{crit}}(L) \equiv \frac{E_{\min}}{S_{\min}} \quad (5.2)$$
which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal time/compute tradeoff, requiring 2 S min training steps and processing E = 2 E min data examples.
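To see where the factors of two come from (a one-line check using the definitions above): setting B = B_crit in E = BS gives E/E_min = S/S_min, so Equation (5.1) becomes

$$\left( \frac{S}{S_{\min}} - 1 \right)^2 = 1 \quad \Rightarrow \quad S = 2 S_{\min}, \qquad E = 2 E_{\min}.$$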
In Figure 10 we have plotted the critical batch size and gradient noise scale 5 as a function of training loss for two different models. We see that B crit ( L ) is independent of model size, and only depends on the loss L . So the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can be fit with a power-law in the loss
$$B_{\mathrm{crit}}(L) \approx \frac{B_*}{L^{1/\alpha_B}} \quad (5.3)$$

where B_* ≈ 2 × 10^8 and α_B ≈ 0.21.
We have chosen this parameterization for B crit ( L ) because as the loss approaches its minimum value L min , the gradient noise scale is expected to diverge, and we expect B crit to track this noise scale. We do not know L min , as we see no sign that our models are approaching it, but L min > 0 since the entropy of natural language is non-zero. Since apparently L min is much smaller than the values of L we have achieved, we used a parameterization where B crit diverges as L → 0 .
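As a small illustration of this fit (a sketch only; the helper name is ours, and the constants are the approximate fitted values quoted above):

```python
# Sketch: the critical-batch-size fit B_crit(L) = B_* / L**(1/alpha_B),
# using the approximate fitted constants quoted in the text.
B_STAR, ALPHA_B = 2e8, 0.21


def b_crit(loss: float) -> float:
    """Critical batch size (in tokens) at a given value of the loss."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


for L in (4.0, 3.5, 3.0, 2.5):
    print(L, f"{b_crit(L):.2e} tokens")  # grows as the loss falls
```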
We will use B_crit(L) to estimate the relation between the number of training steps S while training at batch size B = 2^19 tokens and the number of training steps while training at B ≫ B_crit. This is simply

$$S_{\min}(S) \equiv \frac{S}{1 + B_{\mathrm{crit}}(L)/B} \quad (5.4)$$

for any given target value L for the loss. This also defines a critical value of the compute needed to train to L with a model of size N if we were to train at B ≪ B_crit(L). This is

$$C_{\min}(C) \equiv \frac{C}{1 + B/B_{\mathrm{crit}}(L)} \quad (5.5)$$
where C = 6 NBS estimates the (non-embedding) compute used at batch size B .
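A minimal sketch of these two adjustments, assuming training ran at the fixed batch of 2^19 tokens quoted above and repeating the B_crit(L) fit from the earlier sketch; everything else here is illustrative.

```python
# Sketch: batch-size adjustments of Equations (5.4) and (5.5), assuming training ran
# at a fixed batch of B = 2**19 tokens. Repeats the B_crit(L) fit from the sketch above.
B_STAR, ALPHA_B = 2e8, 0.21
FIXED_B = 2 ** 19  # tokens per batch


def b_crit(loss: float) -> float:
    return B_STAR / loss ** (1.0 / ALPHA_B)


def s_min(steps: float, loss: float) -> float:
    """Steps that would have been needed at B >> B_crit (Equation 5.4)."""
    return steps / (1.0 + b_crit(loss) / FIXED_B)


def c_min(compute: float, loss: float) -> float:
    """Compute that would have been needed at B << B_crit (Equation 5.5)."""
    return compute / (1.0 + FIXED_B / b_crit(loss))
```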
## 5.2 Results for L ( N,S min ) and Performance with Model Size and Compute
Now we will use S min defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the loss on model size and training time in the infinite data limit. We will fit the stable, Adam-optimized training runs using Equation (1.6), repeated here for convenience:
$$L(N, S_{\min}) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S_{\min}}\right)^{\alpha_S} \quad (5.6)$$
for the loss. We include all training steps after the warmup period of the learning rate schedule, and find a fit to the data with the parameters:
5 Although the critical batch size roughly matches the gradient noise scale, we are using direct measurements of B_crit from Figures 18 and 10 for all our later analyses.
Figure 11 When we hold either total compute or number of training steps fixed, performance follows L ( N,S ) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance. Mediocre fits at small S are unsurprising, as the power-law equation for the learning curves breaks down very early in training.
Table 3 Fits to L(N, S)

| Parameter | α_N   | α_S  | N_c          | S_c         |
|-----------|-------|------|--------------|-------------|
| Value     | 0.077 | 0.76 | 6.5 × 10^13  | 2.1 × 10^3  |
With these parameters, we obtain the learning curve fits in Figure 4. Though the fits are imperfect, we believe they are quite compelling given the simplicity of Equation (5.6).
The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute C used in training, or the number of steps S . For the fits we use Equation (5.5) and (5.4) along with the parameters above and Equation (5.6).
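A minimal sketch of this fit, plugging the Table 3 values into Equation (5.6); the function name and example inputs are illustrative assumptions, not values from our runs.

```python
# Sketch: the L(N, S_min) fit of Equation (5.6) with the Table 3 parameters.
# The example model size and adjusted step count are illustrative assumptions.
ALPHA_N, ALPHA_S = 0.077, 0.76
N_C, S_C = 6.5e13, 2.1e3


def loss_ns(n_params: float, s_min_steps: float) -> float:
    """Loss after s_min_steps batch-adjusted steps, in the infinite-data limit."""
    return (N_C / n_params) ** ALPHA_N + (S_C / s_min_steps) ** ALPHA_S


print(loss_ns(1e8, 1e4))  # e.g. a 100M-parameter model after 10k adjusted steps
```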
The power-law dependence of the loss on S_min reflects the interplay of optimizer dynamics and the loss landscape. Since the fits are best late in training, when the loss may be approximately quadratic, the power-law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that the Hessian eigenvalue density is roughly independent of model size.
## 5.3 Lower Bound on Early Stopping Step
The results for L ( N,S min ) can be used to derive a lower-bound (and rough estimate) of the step at which early stopping should occur when training is data limited. It is motivated by the idea that finite and infinite D learning curves for a given model will be very similar until we reach S min ≈ S stop . Thus overfitting should be proportional to the correction from simply ending training at S stop . This will underestimate S stop , because in reality the test loss will decrease more slowly when we have a finite D , and therefore we will require more training steps to reach the optimal test loss at finite D . This line of reasoning leads to the inequality
$$S _ { s t o p } ( N , D ) \gtrsim \frac { S _ { c } } { [ L ( N , D ) - L ( N , \infty ) ] ^ { 1 / \alpha _ { S } } }$$
where L(N, ∞) is the converged loss, evaluated with infinite available data. This inequality and its comparison to the empirical data are displayed in Figure 16 in the appendix. In that figure, the values of S_stop and L(N, D) are empirical (though S_stop is adjusted to mimic training at B ≫ B_crit), while L(N, ∞) is computed from the fit to L(N, D) evaluated at D = ∞.
## 6 Optimal Allocation of the Compute Budget
We displayed the empirical trend of performance as a function of the computation used during training in the top-right of Figure 1. However, this result involved training at a fixed batch size B , whereas we know
Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a somewhat altered power law for L(C_min) when compared with the fully empirical results. The conspicuous lump at 10^-5 PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks in the power-law fits. It is the L(C_min) trend that we expect to provide a reliable extrapolation for larger compute.
that in fact we could train more efficiently 6 by training at the batch size B crit discussed in Section 5.1. Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more predictable trends.
In this section we will adjust for this oversight. More importantly, we will use the results of Section 5 to determine the optimal allocation of compute between model size N and the quantity of data processed during training, namely 2 B crit S min . We will determine this allocation both empirically and theoretically, by using the equation for L ( N,S min ) , and we will demonstrate that these methods agree.
## 6.1 Optimal Performance and Allocations
Let us first study the loss as a function of the optimally allocated compute from Equation (5.5). The result is plotted in Figure 13, along with a power-law fit. We see that as compared to the compute plot of Figure 1, the new fit with C min is somewhat improved.
Given L ( C min ) , it is natural to ask for the optimal model size N ( C min ) that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14. We observe that N ( C min )
6 One might ask why we did not simply train at B crit in the first place. The reason is that it depends not only on the model but also on the target value of the loss we wish to achieve, and so is a moving target.
Figure 14 Left: Each value of the compute budget C min has an associated optimal model size N . Optimal model size grows very rapidly with C min , increasing by 5x for each 10x increase in compute. The number of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x. Right: The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most of the growth in data examples processed can be used for increased batch sizes.
can be fit very well with a power-law where
$$N(C_{\min}) \propto (C_{\min})^{0.73}.$$
In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4).
By definition C_min ≡ 6 N B_crit S, and so we can use N(C_min) to extract further results. In particular, since prior fits show B ∝ L^{-4.8} and L ∝ C_min^{-0.05}, we can conclude that B_crit ∝ C_min^{0.24}. This leads us to conclude that the optimal number of steps will only grow very slowly with compute, as
$$S_{\min} \propto (C_{\min})^{0.03}, \quad (6.2)$$
matching the empirical results in Figure 14. In fact the measured exponent is sufficiently small that our results may even be consistent with an exponent of zero.
Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we should predominantly increase the model size N , while simultaneously scaling up the batch size via B ∝ B crit with negligible increase in the number of serial steps. Since compute-efficient training uses relatively few optimization steps, additional work on speeding up early training dynamics may be warranted.
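As a back-of-the-envelope illustration of this allocation (a sketch using the approximate exponents quoted above; nothing here beyond simple arithmetic):

```python
# Sketch: how a 10x larger compute budget is spent under compute-efficient training,
# using the approximate scalings N ∝ C^0.73, B_crit ∝ C^0.24, S_min ∝ C^0.03.
factor = 10.0
print("model size grows by  ", round(factor ** 0.73, 2))  # ~5.4x
print("batch size grows by  ", round(factor ** 0.24, 2))  # ~1.7x
print("serial steps grow by ", round(factor ** 0.03, 2))  # ~1.1x
```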
## 6.2 Predictions from L ( N,S min )
The results for L(C_min) and the allocations can be predicted from the L(N, S_min) equation obtained in Section 5. Given our equation for L(N, S_min), we can substitute S_min = C_min/(6NB) and then find the minimum of the loss as a function of N, while fixing the training compute. We carry out this procedure in detail in Appendix B, where we also provide some additional predictions.
For the loss as a function of training compute, we predict that
$$L ( C _ { \min } ) = \left ( \frac { C _ { c } ^ { \min } } { C _ { \min } } \right ) ^ { \alpha _ { C } ^ { \min } }$$
$$\alpha _ { C } ^ { \min } \equiv \frac { 1 } { 1 / \alpha _ { S } + 1 / \alpha _ { B } + 1 / \alpha _ { N } } \approx 0 . 0 5 4 \quad ( 6 . 4 )$$
in excellent agreement with the exponent of Figure 13. We also predict that
$$N(C_{\min}) \propto (C_{\min})^{\alpha_C^{\min}/\alpha_N} \approx (C_{\min})^{0.71} \quad (6.5)$$
which also matches the scaling of Figure 14 to within a few percent. Our scaling laws provide a predictive framework for the performance of language modeling.
Figure 15 Far beyond the model sizes we study empirically, we find a contradiction between our equations for L ( C min ) and L ( D ) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
## 6.3 Contradictions and a Conjecture
We observe no signs of deviation from straight power-law trends at large values of compute, data, or model size. Our trends must eventually level off, though, since natural language has non-zero entropy.
Indeed, the trends for compute-efficient training described in this section already contain an apparent contradiction. At scales several orders of magnitude above those documented here, the performance predicted by the L ( C min ) scaling law decreases below what should be possible given the slow growth in training data with compute. This implies that our scaling laws must break down before this point, but we conjecture that the intersection point has a deeper meaning: it provides an estimate of the point at which Transformer language models reach maximal performance.
Since the amount of data used by compute-efficient training grows slowly with the compute budget, the performance predicted by L ( C min ) eventually hits a lower bound set by the L ( D ) power law (see Figure 15). Let us work this out in more detail.
To keep overfitting under control, the results of Section 4 imply that we should scale the dataset size as
$$D \propto N^{0.74} \propto C_{\min}^{0.54} \quad (6.6)$$
where we have used the compute-efficient N ( C min ) from Figure 14.
Let us compare this to the data requirements of compute-efficient training. If we train at the critical batch size (i.e. C = 2 C min ) and never re-use data during training, we find that data usage grows with compute as
$$D(C_{\min}) = \frac{2 C_{\min}}{6 N(C_{\min})} \approx \left( 4 \times 10^{10}\ \text{tokens} \right) \left( C_{\min}/\text{PF-day} \right)^{0.26} \quad (6.7)$$
This is the maximum rate at which the dataset size can productively grow with compute, since it means that we are only training for a single epoch. But it grows the dataset much more slowly than in Equation (6.6). It appears to imply that compute-efficient training will eventually run into a problem with overfitting, even if the training process never re-uses any data!
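As a minimal sketch of how slowly this data requirement grows, the single-epoch usage of Equation (6.7) can be evaluated at a few budgets; the example compute values are chosen purely for illustration.

```python
# Sketch: single-epoch data usage under compute-efficient training, Equation (6.7).
# The example compute budgets (in PF-days) are illustrative.
def tokens_used(c_min_pf_days: float) -> float:
    return 4e10 * c_min_pf_days ** 0.26


for c in (1.0, 1e2, 1e4):
    print(f"C_min = {c:g} PF-days -> {tokens_used(c):.2e} tokens")
```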
According to Figure 1, we expect that when we are bottlenecked by the dataset size (i.e. by overfitting), the loss should scale as L(D) ∝ D^{-0.095}. This implies that the loss would scale with compute as L(D(C_min)) ∝ C_min^{-0.03} once we are data-limited. Once again, we have a contradiction, as this will eventually intersect with our prediction for L(C_min) from Figure 13, where we found a scaling L(C_min) ∝ C_min^{-0.050}.
The intersection point of L ( D ( C min )) and L ( C min ) occurs at
$$C^* \sim 10^{4}\ \text{PF-days}, \quad N^* \sim 10^{12}\ \text{parameters}, \quad D^* \sim 10^{12}\ \text{tokens}, \quad L^* \sim 1.7\ \text{nats/token} \quad (6.8)$$
though the numerical values are highly uncertain, varying by an order of magnitude in either direction depending on the precise values of the exponents from the power-law fits. The most obvious interpretation is that our scaling laws break down at or before we reach this point, which is still many orders of magnitude away in both compute and model size.
One might also conjecture that this intersection point has a deeper meaning. If we cannot increase the model size beyond N ∗ without qualitatively different data requirements, perhaps this means that once we reach C ∗ min and N ∗ , we have extracted all of the reliable information available in natural language data. In this interpretation, L ∗ would provide a rough estimate for the entropy-per-token 7 of natural language. In this scenario, we would expect the loss trend to level off at or before L ∗ .
We can guess at the functional form of L(C_min) as it levels off by considering a version of our training dataset with added noise. For example, we could append a random string of tokens to each context shown to the model to artificially boost the loss by a constant additive factor. Then, the distance from the noise floor, L − L_noise, would be a more meaningful performance metric, with even a small decrease in this distance potentially representing a significant boost in qualitative performance. Since the artificial noise would affect all of our trends equally, the critical point of Equation (6.8) would not change (aside from the absolute value of L*), and may be meaningful even if it occurs after the leveling off.
## 7 Related Work
Power laws can arise from a wide variety of sources [THK18]. Power-law scalings with model and dataset size in density estimation [Was06] and in random forest models [Bia12] may be connected with our results. These models suggest that power-law exponents may have a very rough interpretation as the inverse of the number of relevant features in the data.
Some early [BB01, Goo01] work found power-law scalings between performance and dataset size. More recent work [HNA + 17, HAD19] also investigated scaling between model size and data size; their work is perhaps the closest to ours in the literature 8 . Note, however, that [HNA + 17] found super-linear scaling of dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets [TL19] also appear to obey an approximate power-law relation between accuracy and model size. Very recent work [RRBS19b] studies scaling with both dataset size and model size for a variety of datasets, and fits an ansatz similar to ours.
EfficientNet [TL19] advocates scaling depth and width exponentially (with different coefficients) for optimal performance of image models, resulting in a power-law scaling of width as a function of depth. We find that for language models this power should be roughly one when scaling up (as width/depth should remain fixed). But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles of shallower models, which could potentially explain this finding. Earlier work [ZK16] has compared width and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies fix computation per data example, which tends to scale in proportion to the number of model parameters, whereas we investigate scaling with both model size and the quantity of training computation.
Various works [AS17, BHMM18] have investigated generalization in highly overparameterized models, finding a 'jamming transition' [GJS + 19] when the model size reaches the dataset size (this may require training many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do not observe such a transition, and find that the necessary training data scales sublinearly in the model size. Expansions in the model size, particularly at large width [JGH18, LXS + 19], may provide a useful framework for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions [ZLN + 19] in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum [Pap18, GKX19, GARD18].
## 8 Discussion
We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N , dataset size D , and optimized training computation C min , as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,C min are power-laws, there are diminishing returns with increasing scale.
7 Defining words using the wc utility, the WebText2 dataset has 1 . 4 tokens per word and 4 . 3 characters per token.
8 After this work was completed, [RRBS19a] also appeared, which makes similar predictions for the dependence of loss on both model and dataset size.
We were able to precisely model the dependence of the loss on N and D , and alternatively on N and S , when these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework. One might interpret these relations as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this end, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a 'statistical mechanics' underlying the 'thermodynamics' we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: 'more is different'. For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining [HCC + 18], which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization [SCP + 18], since large layers can be split between multiple workers with less serial dependency. Sparsity [CGRS19, GRK17] or branching (e.g. [KSH12]) may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.
## Acknowledgements
We would like to thank Shan Carter, Paul Christiano, Jack Clark, Ajeya Cotra, Ethan Dyer, Jason Eisner, Danny Hernandez, Jacob Hilton, Brice Menard, Chris Olah, and Ilya Sutskever for discussions and for feedback on drafts of this work.
## Appendices
## A Summary of Power Laws
For easier reference, we provide a summary below of the key trends described throughout the paper.
Table 4
| Parameters | Data | Compute | Batch Size | Equation |
|--------------|--------|------------|--------------|------------------------------------------------------------|
| N | ∞ | ∞ | Fixed | L(N) = (N_c / N)^{α_N} |
| ∞ | D | Early Stop | Fixed | L(D) = (D_c / D)^{α_D} |
| Optimal | ∞ | C | Fixed | L(C) = (C_c / C)^{α_C} (naive) |
| N_opt | D_opt | C_min | B ≪ B_crit | L(C_min) = (C_c^min / C_min)^{α_C^min} |
| N | D | Early Stop | Fixed | L(N, D) = [ (N_c / N)^{α_N / α_D} + D_c / D ]^{α_D} |
| N | ∞ | S steps | B | L(N, S) = (N_c / N)^{α_N} + (S_c / S_min(S, B))^{α_S} |
The empirical fitted values for these trends are:
Table 5
| Power Law | Scale (tokenization-dependent) |
|-------------------|------------------------------------------|
| α_N = 0.076 | N_c = 8.8 × 10^13 params (non-embed) |
| α_D = 0.095 | D_c = 5.4 × 10^13 tokens |
| α_C = 0.057 | C_c = 1.6 × 10^7 PF-days |
| α_C^min = 0.050 | C_c^min = 3.1 × 10^8 PF-days |
| α_B = 0.21 | B_* = 2.1 × 10^8 tokens |
| α_S = 0.76 | S_c = 2.1 × 10^3 steps |
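For illustration, the fitted forms above can be transcribed directly into a few lines of code; this is only a sketch with our own function names, using the Table 5 constants:

```python
# Fitted constants from Table 5 (tokenization-dependent).
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens

def loss_vs_model(n_params):
    """L(N): converged loss for a model with N non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_data(n_tokens):
    """L(D): loss for a large model early-stopped on D tokens."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_vs_model_and_data(n_params, n_tokens):
    """L(N, D) from Table 4, combining the model- and data-size bottlenecks."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Example: a 1e9-parameter model, with infinite data, data-limited, and combined.
print(round(loss_vs_model(1e9), 2),
      round(loss_vs_data(2e10), 2),
      round(loss_vs_model_and_data(1e9, 2e10), 2))
```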
The optimal parameters for compute efficient training are given by:
Table 6
| Compute-Efficient Value | Power Law | Scale |
|------------------------------------------------------|--------------|----------------------------|
| N_opt = N_e · C_min^{p_N} | p_N = 0.73 | N_e = 1.3 × 10^9 params |
| B ≪ B_crit = B_* / L^{1/α_B} = B_e · C_min^{p_B} | p_B = 0.24 | B_e = 2.0 × 10^6 tokens |
| S_min = S_e · C_min^{p_S} (lower bound) | p_S = 0.03 | S_e = 5.4 × 10^3 steps |
| D_opt = D_e · C_min^{p_D} (1 epoch) | p_D = 0.27 | D_e = 2 × 10^10 tokens |
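As a sketch of how these trends would be applied (our own function and variable names), the compute-efficient allocation for a given budget C_min, measured in PF-days, is:

```python
def efficient_allocation(c_min_pf_days):
    """Apply the Table 6 power laws to a compute budget C_min (in PF-days)."""
    n_opt = 1.3e9 * c_min_pf_days ** 0.73    # optimal non-embedding parameter count
    batch = 2.0e6 * c_min_pf_days ** 0.24    # batch size in tokens (critical batch size trend)
    s_min = 5.4e3 * c_min_pf_days ** 0.03    # lower bound on serial steps
    d_opt = 2.0e10 * c_min_pf_days ** 0.27   # tokens processed (a single epoch)
    return n_opt, batch, s_min, d_opt

# Example: a one PF-day budget.
print([f"{x:.1e}" for x in efficient_allocation(1.0)])   # ['1.3e+09', '2.0e+06', '5.4e+03', '2.0e+10']
```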
## B Empirical Model of Compute-Efficient Frontier
Throughout this appendix all values of C, S, and α C are adjusted for training at the critical batch size B crit . We have left off the 'adj' label to avoid cluttering the notation.
## B.1 Defining Equations
The power-law fit to the learning curves implies a simple prescription for compute-efficient training. In this appendix, we will derive the optimal performance, model size, and number of training steps as a function of
the compute budget. We start with Equation (1.6), repeated here for convenience:
$$L \left ( N , S \right ) = \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } + \left ( \frac { S _ { c } } { S } \right ) ^ { \alpha _ { S } } .$$
Here, S represents the number of parameter updates when training at the critical batch size [MKAT18], which was defined in Equation (5.2) 9 :
$$B \left ( L \right ) = \frac { B _ { * } } { L ^ { 1 / \alpha _ { B } } } .$$
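For orientation (our arithmetic, using B_* = 2.1 × 10^8 tokens and α_B = 0.21 from Appendix A), a loss of 2.5 nats/token corresponds to a critical batch size of roughly

$$B(2.5) = \frac{2.1 \times 10^8}{2.5^{1/0.21}} \approx 2.7 \times 10^6 \text{ tokens}.$$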
We would like to determine optimal training parameters for a fixed compute budget, so we replace S = C/ (6 NB ( L )) , where C is the number of FLOPs used in the training run:
$$L \left( N, C \right) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( 6 B_* S_c \frac{N}{L^{1/\alpha_B} C} \right)^{\alpha_S}.$$
Now, we set ∂L/∂N |_C = 0 to find the condition for optimality:
$$0 = \frac{\partial L}{\partial N}\bigg|_C = -\frac{\alpha_N}{N}\left(\frac{N_c}{N}\right)^{\alpha_N} + \frac{\alpha_S}{N}\left(6 B_* S_c \frac{N}{L^{1/\alpha_B} C}\right)^{\alpha_S}\left(1 - \frac{1}{\alpha_B}\frac{N}{L}\frac{\partial L}{\partial N}\right)$$
$$\Rightarrow \quad \frac{\alpha_N}{\alpha_S}\left(\frac{N_c}{N}\right)^{\alpha_N} = \left(6 B_* S_c \frac{N}{L^{1/\alpha_B} C}\right)^{\alpha_S}$$
Since ∂L/∂N = 0 at the optimum, the factor involving ∂L/∂N reduces to one and drops out of the condition. Equations (B.3) and (B.4) together determine the compute-efficient frontier.
## B.2 Efficient Training
Now we assemble the implications of (B.3) and (B.4). First, note that inserting (B.4) into (B.3) yields
$$L \left ( N _ { e f f } \left ( C \right ) , C \right ) = \left ( 1 + \frac { \alpha _ { N } } { \alpha _ { S } } \right ) L \left ( N _ { e f f } , \infty \right ) , \quad ( B . 5 )$$
which implies that for compute-efficient training, we should train to a fixed percentage α_N / α_S ≈ 10% above the converged loss. Next, let's determine how the optimal loss depends on the compute budget. Eliminating N yields a power-law dependence of performance on compute:
$$L \left( C \right) = \left( \frac{C_c}{C} \right)^{\alpha_C}$$
$$\alpha_C = 1 / \left( 1/\alpha_S + 1/\alpha_B + 1/\alpha_N \right) \approx 0.052$$
where the constant C_c is fixed in terms of the fit parameters by
$$C_c = 6 N_c B_* S_c \left(1 + \frac{\alpha_N}{\alpha_S}\right)^{1/\alpha_S + 1/\alpha_N} \left(\frac{\alpha_S}{\alpha_N}\right)^{1/\alpha_S}.$$
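Plugging the Table 5 exponents into the expression for α_C provides a quick check of the quoted value (our arithmetic):

$$\alpha_C = \frac{1}{1/0.76 + 1/0.21 + 1/0.076} \approx \frac{1}{1.3 + 4.8 + 13.2} \approx 0.052.$$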
Similarly, we can eliminate L to find N ( C ) :
$$\frac{N(C)}{N_c} = \left( \frac{C}{C_c} \right)^{\alpha_C / \alpha_N} \left( 1 + \frac{\alpha_N}{\alpha_S} \right)^{1/\alpha_N}$$
and
$$S(C) = \frac{C_c}{6 N_c B_*} \left( 1 + \frac{\alpha_N}{\alpha_S} \right)^{-1/\alpha_N} \left( \frac{C}{C_c} \right)^{\alpha_C / \alpha_S} \quad \text{(B.10)}$$
9 There is a slight ambiguity here: we can imagine training either at a constant batch size B ( L target ) , or we could instead train at a variable batch size ˜ B ( L ) , where ˜ B is the instantaneous critical batch size (as opposed to B , which is the averaged version). These two prescriptions result in the same number of steps, so we can ignore this subtlety (see [MKAT18]).
## B.3 Comparison to Inefficient
Typically, researchers train models until they appear to be close to convergence. In this section, we compare the efficient training procedure described above to this more typical setup. We define the convergence factor f as the percent deviation from the converged loss:
$$L \left ( N , C \right ) = \left ( 1 + f \right ) L \left ( N , \infty \right ) .$$
For compute-efficient training we have f = α N /α S ≈ 10% from the previous section, but researchers typically use a much smaller value. Here, we choose f ′ = 2% as an estimate. For a fixed value of the loss, we predict:
$$\frac{S_f}{S_{f'}} = \left( \frac{(1+f)/f}{(1+f')/f'} \right)^{1/\alpha_S} \approx 0.13$$
$$\frac{N_f}{N_{f'}} = \left( \frac{1+f}{1+f'} \right)^{1/\alpha_N} \approx 2.7$$
$$\frac{C_f}{C_{f'}} = \frac{N_f S_f}{N_{f'} S_{f'}} \approx 0.35$$
So compute-efficient training uses 7.7x fewer parameter updates, 2.7x more parameters, and 65% less compute to reach the same loss.
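These ratios follow directly from the L(N, S) ansatz; a short numerical sketch (with our own variable names), using the rounded exponents from Table 5, reproduces them to within rounding:

```python
alpha_N, alpha_S = 0.076, 0.76
f, f_prime = alpha_N / alpha_S, 0.02   # efficient (~10%) vs. typical (2%) convergence factor

# At fixed loss: S is proportional to ((1+f)/f)^(1/alpha_S), N to (1+f)^(1/alpha_N), and C to N*S.
step_ratio = (((1 + f) / f) / ((1 + f_prime) / f_prime)) ** (1 / alpha_S)
param_ratio = ((1 + f) / (1 + f_prime)) ** (1 / alpha_N)
compute_ratio = step_ratio * param_ratio

print(f"{1 / step_ratio:.1f}x fewer steps")     # ~7.5x with these rounded exponents
print(f"{param_ratio:.1f}x more parameters")    # ~2.7x
print(f"{1 - compute_ratio:.0%} less compute")  # ~64%
```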
## B.4 Suboptimal Model Sizes
We can solve A.1 to find an expression for the amount of compute needed to reach a given value of the loss L with a model of size N :
$$C(N, L) = \frac{6 B_* S_c N}{L^{1/\alpha_B}} \left[ L - \left( \frac{N_c}{N} \right)^{\alpha_N} \right]^{-1/\alpha_S}$$
Using A.6 and A.9, we can eliminate L in favor of N eff ( L ) , the model size which reaches L most efficiently. From there, we find an expression for the excess compute needed as a consequence of using a suboptimal model size:
$$\frac{C(N, L)}{C(N_{\rm eff}, L)} = \frac{N}{N_{\rm eff}} \left[ 1 + \frac{\alpha_S}{\alpha_N} \left( 1 - \left( \frac{N_{\rm eff}}{N} \right)^{\alpha_N} \right) \right]^{-1/\alpha_S}$$
The result is shown in Figure X. Models between 0.6x and 2.2x the optimal size can be used with only a 20% increase in compute budget. Using a smaller model is useful when accounting for the cost of inference. A larger model can be trained to the same level of performance in fewer steps, allowing for more parallelism and faster training if sufficient hardware is available (see Figure Y):
$$\frac{S(N, L)}{S(N_{\rm eff}, L)} = \left[ 1 + \frac{\alpha_S}{\alpha_N} \left( 1 - \left( \frac{N_{\rm eff}}{N} \right)^{\alpha_N} \right) \right]^{-1/\alpha_S}$$
A 2.2x larger model requires 45% fewer steps at a cost of 20% more training compute. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve after initial transient effects.
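A small numerical sketch (ours, using the Table 5 exponents) of the two ratios above reproduces the quoted figures:

```python
alpha_N, alpha_S = 0.076, 0.76

def excess_compute(r):
    """C(N, L) / C(N_eff, L) for a model r times the compute-optimal size."""
    return r * (1 + (alpha_S / alpha_N) * (1 - r ** -alpha_N)) ** (-1 / alpha_S)

def step_ratio(r):
    """S(N, L) / S(N_eff, L) for a model r times the compute-optimal size."""
    return (1 + (alpha_S / alpha_N) * (1 - r ** -alpha_N)) ** (-1 / alpha_S)

print(f"{excess_compute(0.6):.2f}")  # ~1.16: a 0.6x model needs ~16% extra compute
print(f"{excess_compute(2.2):.2f}")  # ~1.20: a 2.2x model needs ~20% extra compute
print(f"{step_ratio(2.2):.2f}")      # ~0.55: a 2.2x model needs ~45% fewer steps
```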
## C Caveats
In this section we list some potential caveats to our analysis.
- At present we do not have a solid theoretical understanding for any of our proposed scaling laws. The scaling relations with model size and compute are especially mysterious. It may be possible to understand scaling at very large D holding model size fixed [AS17], and also the shape of learning curves late in training, by modeling the loss with a noisy quadratic. But the scaling with D at very large model size still remains mysterious. Without a theory or a systematic understanding of the corrections to our scaling laws, it's difficult to determine in what circumstances they can be trusted.
Figure 16 Left: We characterize the step on which early stopping occurs, as a function of the extent of overfitting. The red line indicates a lower bound for early stopping that is derived in Section 5.3. Right: We display train and test loss for a series of 300M parameter models trained on different sized dataset subsamples. The test loss typically follows that of a run done with unrestricted data until diverging. Note that the degree of overfitting (as compared to the infinite data limit) is significantly overestimated by L test -L train (denoted by a black bar for each run).
- We are not especially confident in the prediction of B crit ( L ) for values of the loss far outside the range we have explored. Changes in B crit could have a significant impact on trade-offs between data parallelism and the number of serial training steps required, which would have a major impact on training time.
- We did not thoroughly investigate the small data regime, and our fits for L ( N,D ) were poor for the smallest values of D (where an epoch corresponded to only 40 steps). Furthermore, we did not experiment with regularization and data augmentation. Improvements in these could alter our results, quantitatively or qualitatively.
- We used the estimated training compute C ≈ 6 NBS , which did not include contributions proportional to n ctx (see Section 2.1). So our scalings with compute may be confounded in practice in the regime of very large n ctx , specifically where n ctx ≳ 12 d model .
- We tuned learning rates, and we experimented with learning rate schedules. But we may have neglected to tune other hyperparameters (e.g. initialization scale or momentum) that have an important effect on scaling.
- The optimal choice of learning rate is sensitive to the target loss. When training close to convergence, it may be necessary to use a smaller learning rate to avoid divergences. But when conducting a short training run (e.g. due to compute limitations), it may be possible to use a larger learning rate. We did not experiment with higher learning rates for training runs that did not proceed to convergence.
## D Supplemental Figures
## D.1 Early Stopping and Test vs Train
In section 5.3 we described the result shown in Figure 16, which provides a prediction for a lower bound on the early stopping step. We also show the train and test loss for a given model size when training on different sized datasets.
## D.2 Universal Transformers
We compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17. These models re-use parameters, and so perform slightly better as a function of N , but slightly worse as a function of compute C . We include several different possibilities for parameter re-use.
## D.3 Batch Size
We measure the critical batch size using the data displayed in figure 18. This made it possible to estimate B crit ( L ) in figure 10.
Figure 17 We compare recurrent Transformers [DGV + 18], which re-use parameters, to standard Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter count, but slightly worse when accounting for reuse and comparing per FLOP.
Figure 18 These figures demonstrate fits to Equation (5.1) for a large number of values of the loss L , and for two different Transformer model sizes. These fits were used to measure B crit ( L ) for Figure 10.
## D.4 Sample Efficiency vs Model Size
It is easy to see from figure 2 that larger models train faster, and are therefore more sample efficient. We provide another way of looking at this phenomenon in figure 19, which shows when different models reach various fixed values of the loss.
Figure 19 The number of minimum serial steps needed to reach any fixed value of the test loss decreases precipitously with model size. Sample efficiency (shown here for training far below the critical batch size) improves greatly as well, improving by a factor of almost 100 when comparing the smallest possible model to a very large one.
Figure 20 This figure provides information about the performance per token as a function of model size and training time. Left: Loss per token as a function of its position T in the 1024-token context. Loss scales predictably as a power-law in T . Right: Test loss per token as a function of training step.
Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve smoothly as model size increases. Training runs with shorter context n ctx = 8 (dashed lines) perform better on early tokens, since they can allocate all of their capacity to them.
## D.5 Context Dependence
The trends for loss as a function of model size are displayed for different tokens in the context in Figure 21. We see that models trained on n ctx = 1024 show steady improvement with model size on all but the first token.
Fixing model size, it appears that the loss scales as a power-law as a function of position T in the context, see Figure 20. This may be a consequence of underlying power-law correlations in language [EP94, ACDE12, LT16], or a more general feature of the model architecture and optimization. It provides some suggestion for the potential benefits (or lack thereof) from training on larger contexts. Not only do larger models converge to better performance at T = 1024 , but they also improve more quickly at early tokens, suggesting that larger models are more efficient at detecting patterns with less contextual information. In the right-hand plot we show how per-token performance varies for a fixed model as a function of the training step. The model begins by learning short-range information, and only learns longer-range correlations later in training.
We have also included models trained with a tiny context n ctx = 8 in order to compare with our longer context models. Even modestly sized models trained on n ctx = 8 can dominate our largest n ctx = 1024 models on very early tokens. This also suggests that further improvements should be possible with much larger models trained on large contexts.
## D.6 Learning Rate Schedules and Error Analysis
We experimented with a variety of learning rates and schedules. A host of schedules and resulting test performances for a small language model are plotted in Figure 22. We conclude that the choice of learning rate schedule is mostly irrelevant, as long as the total summed learning rate is sufficiently large, and the schedule includes a warmup period and a final decay to near-vanishing learning rate. Variations among
Figure 22 We test a variety of learning rate schedules including cosine decay, linear decay, as well as other faster/slower decay schedules on a 3 million parameter model, shown on the left. For these experiments we do not decay to zero, since we find that this tends to give a fixed improvement close to the end of training. We find that, as long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on learning rate. Run-to-run variation is at the level of 0.05 in the loss, so averaging multiple runs is necessary to validate performance changes smaller than this level.
Figure 23 The trend for performance as a function of parameter count, L ( N ) , is qualitatively fit better by a power-law than by other functions such as a logarithm.
schedules appear to be statistical noise, and provide a rough gauge for the scale of variation between different training runs. Experiments on larger models suggest that the variation in the final test loss between different random seeds is roughly constant in magnitude for different model sizes.
We found that larger models require a smaller learning rate to prevent divergence, while smaller models can tolerate a larger learning rate. To implement this, the following rule of thumb was used for most runs:
$$\text{LR}(N) \approx 0.003239 - 0.0001395 \, \log(N)$$
where N is the non-embedding parameter count.
We expect that this formula could be improved. There may be a dependence on network width, likely set by the initialization scale. The formula also breaks down for N > 10^10 parameters. Nevertheless, we found that it works sufficiently well for the models we considered.
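A tiny sketch (ours) of this rule of thumb, reading the logarithm as natural; the fact that the value approaches zero near 10^10 parameters is consistent with the breakdown noted above:

```python
import math

def lr_rule_of_thumb(n_params):
    """Heuristic maximum learning rate vs. non-embedding parameter count."""
    return 0.003239 - 0.0001395 * math.log(n_params)

for n in (1e6, 1e8, 1e10):
    print(f"N = {n:.0e}: LR ~ {lr_rule_of_thumb(n):.2e}")
# N = 1e+06: LR ~ 1.31e-03
# N = 1e+08: LR ~ 6.69e-04
# N = 1e+10: LR ~ 2.69e-05
```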
## D.7 Fit Details and Power Law Quality
We experimented with a number of functional forms for the fits to L ( N ) , L ( C ) , and L ( D ) ; the power-law fits were qualitatively much more accurate than other functions such as logarithms (see Figure 23).
For L ( C ) , we do not include small models with only 1 layer in the fit, as the transition from 1 to 2 layers causes a noticeable lump in the data. For L ( N ) we also do not include very small models with only 1 layer in the fit, and we exclude the largest models that have not trained fully to convergence. Fit parameters change marginally if we do include them, and the trend extrapolates well in both directions regardless.
## D.8 Generalization and Architecture
In figure 24 we show that generalization to other data distributions does not depend on network depth when we hold the total parameter count fixed. It seems to depend only on the performance on the training distribution.
Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 billion parameters. We observe no effect of depth on generalization; generalization performance depends primarily on training distribution performance. The 12-layer model overfit the Internet Books dataset and we show the early-stopped performance; we have not seen this surprising result in other experiments.
## List of Figures
| 1 | Summary of simple power laws | 3 |
|-----|----------------------------------------------------------------------------------------------|-----|
| 2 | Illustration of sample efficiency and compute efficiency | 4 |
| 3 | How to scale up model size, batch size, and serial steps | 4 |
| 4 | Performance when varying model and data size, or model and training steps, simultaneously | 5 |
| 5 | Weak dependence of performance on hyperparameter tuning | 8 |
| 6 | Comparison of performance trend when including or excluding embeddings | 8 |
| 7 | LSTM and Transformer performance comparison | 9 |
| 8 | Generalization to other test datasets | 10 |
| 9 | Universality of overfitting | 11 |
| 10 | Critical batch size | 12 |
| 11 | Performance versus compute budget or number of parameter updates | 14 |
| 12 | Training on suboptimal models | 15 |
| 13 | Comparison between empirical and adjusted compute trends | 15 |
| 14 | Optimal model size and serial number of steps versus compute budget | 16 |
| 15 | Contradiction between compute and data trends | 17 |
| 16 | Early stopping lower bound and training curves for overfit models | 23 |
| 17 | Universal transformers | 24 |
| 18 | Batch size scans | 24 |
| 19 | Another look at sample efficiency | 24 |
| 20 | Power-law dependence of performance on position in context | 25 |
| 21 | Performance at different context positions versus model size | 25 |
| 22 | Learning rate schedule scan | 26 |
| 23 | Comparison of Power-Law and Logarithmic Fits | 26 |
| 24 | Generalization versus depth | 27 |
## List of Tables
| Table | Description | Page |
|-------|-------------|------|
| 1 | Parameter and compute counts for Transformer | 7 |
| 2 | Fits to L(N, D) | 11 |
| 3 | Fits to L(N, S) | 14 |
| 4 | Key trend equations | 20 |
| 5 | Key parameters to trend fits | 20 |
| 6 | Trends for compute-efficient training | 20 |
## References
- [ACDE12] Eduardo G Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti. On the origin of long-range correlations in texts. Proceedings of the National Academy of Sciences, 109(29):11582-11587, 2012. 25
- [AS17] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv, 2017, 1710.03667. 11, 18, 22
- [BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting on association for computational linguistics, pages 26-33. Association for Computational Linguistics, 2001. 18
- [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv, 2018, 1812.11118. 18
- [Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(Apr):1063-1095, 2012. 18
- [CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URL http://arxiv.org/abs/1904.10509. 19
- [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv:1810.04805. 2
- [DGV + 18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018, 1807.03819. URL http://arxiv.org/abs/1807.03819. 6, 9, 23, 24
- [EP94] Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english. EPL (Europhysics Letters), 26(4):241, 1994. 25
- [Fou] The Common Crawl Foundation. Common crawl. URL http://commoncrawl.org. 7
- [GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. 2018, arXiv:1812.04754. 18
- [GJS + 19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv, 2019, 1901.01608. 18
- [GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. CoRR, abs/1901.10159, 2019, 1901.10159. URL http://arxiv.org/abs/1901.10159. 18
- [Goo01] Joshua Goodman. A bit of progress in language modeling. CoRR, cs.CL/0108005, 2001. URL http://arxiv.org/abs/cs.CL/0108005. 18
- [GRK17] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. openai.com, 2017. 19
- [HAD19] Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy: Computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP '19, pages 1-14, New York, NY, USA, 2019. ACM. doi:10.1145/3293883.3295710. 18
- [HCC + 18] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR , abs/1811.06965, 2018, 1811.06965. URL http://arxiv.org/abs/1811.06965 . 19
- [HNA + 17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017, 1712.00409. 18
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems , pages 8571-8580, 2018. 18
- [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014, 1412.6980. 7
- [Kom19] Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18
- [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 , NIPS'12, pages 1097-1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257 . 19
- [LCG + 19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019, 1909.11942. 9
- [LOG + 19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019, 1907.11692. URL http://arxiv.org/abs/1907.11692. 2
- [LSP + 18] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv:1801.10198 [cs] , 2018, 1801.10198. URL http://arxiv.org/abs/1801.10198 . 2, 6
- [LT16] Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics. arXiv preprint arXiv:1606.06737 , 2016. 25
- [LXS + 19] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, arXiv:1902.06720. 18
- [MKAT18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018, arXiv:1812.06162. 3, 5, 6, 12, 13, 21
- [Pap18] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. CoRR , abs/1811.07062, 2018, 1811.07062. URL http://arxiv.org/abs/1811.07062 . 18
- [RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018. 2, 6
- [RRBS19a] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, 1909.12673. 18
- [RRBS19b] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18
- [RSR + 19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019, arXiv:1910.10683. 2
- [RWC + 19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. openai.com , 2019. 2, 5, 6, 7, 8
- [SCP + 18] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers, 2018, 1811.02084. 19
- [SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR , 2015, 1508.07909. 6
- [SLA + 18] Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018, arXiv:1811.03600. 12
- [SS18] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR , abs/1804.04235, 2018, 1804.04235. URL http://arxiv.org/abs/1804.04235 . 7
- [THK18] Stefan Thurner, Rudolf Hanel, and Peter Klimek. Introduction to the theory of complex systems . Oxford University Press, 2018. 18
- [TL19] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019, 1905.11946. URL http://arxiv.org/abs/1905.11946. 18
- [VSP + 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6
- [VWB16] Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18
- [Was06] Larry Wasserman. All of nonparametric statistics . Springer Science & Business Media, 2006. 18
- [WPN + 19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2019, 1905.00537. 2
- [WRH17] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Jul 2017. doi:10.1109/cvpr.2017.323. 19
- [WYL19] Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional networks, 2019, 1906.02909. 19
- [YDY + 19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2019, arXiv:1906.08237. 2
- [ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. Proceedings of the British Machine Vision Conference 2016, 2016. doi:10.5244/c.30.87. 18
- [ZKZ + 15] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV) , Dec 2015. doi:10.1109/iccv.2015.11. 7
- [ZLN + 19] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. CoRR , abs/1907.04164, 2019, 1907.04164. URL http://arxiv.org/abs/1907.04164 . 12, 18