# Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
**Authors**: Jupinder Parmar, Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
## Abstract
As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost of pretraining has become intractable for all but the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing a model's abilities to improve further without training from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.
## 1 Introduction
Language modeling abilities have improved massively over the past few years (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2024; Team, 2024). While these advancements have enabled language models (LMs) to become highly skilled conversational agents (OpenAI, 2024; Anthropic, 2024; Team, 2024), they have come at increased computational cost, as pretraining has grown ever more expensive with both the number of model parameters (Team et al., 2024; DeepSeek-AI et al., 2024) and the pretraining dataset size (Touvron et al., 2023; Gemma Team, 2024; Parmar et al., 2024) continuing to grow in scale. With new LMs that set state-of-the-art accuracy released on a frequent basis, LMs developed only a couple of months earlier quickly become obsolete as their capabilities fall behind. This leaves model developers with the choice of either pretraining new LMs from scratch or reusing their existing LMs and updating them with new information to match current best LM abilities.
Due to the large computational cost that pretraining of modern LMs incurs, frequent complete retraining is intractable. This makes the reuse of already developed LMs via continued pretraining an attractive proposition. While most recent works (Ibrahim et al., 2024; Jang et al., 2022; Ke et al., 2023; Çağatay Yıldız et al., 2024) have recommended guidelines for continued pretraining when adapting language models to new data domains or distribution shifts, intuition or recommendations on how to improve a model's general purpose abilities from a previously finalized checkpoint with continued pretraining have not been widely explored. In this paper, we focus on this under-studied setting and identify strategies that allow already trained LMs to improve upon areas of weakness without experiencing degradations in other capabilities.
In our experiments, we start on top of a 15B parameter LM that has seen 8T tokens of pretraining data (Parmar et al., 2024). Experimenting with a well-trained model of this scale ensures that our findings will transfer to most settings and model sizes. We first identify the type of data distribution that should be used during continued pretraining and find that it is optimal to have two distributions, with the final one more heavily weighting data sources that relate to the abilities we want to improve in the model. Second, we determine which learning rate schedules enable the most efficient learning during continued pretraining and find that the most performant one strikes a balance between magnitude of learning rate and steepness of decay. Lastly, we show how the learning rate value at which we switch between data distributions affects downstream accuracy and identify the point at which this switch should be made.
These findings culminate in a recipe that can be used to perform continued pretraining to improve the capabilities of an existing LM. We demonstrate that this recipe is beneficial at continued training scales from 100B to 1 trillion tokens, illustrating its flexibility and robustness across a wide variety of settings. We hope this recipe will allow model providers to forgo regularly retraining models from scratch, as it makes it possible to reuse a trained model to attain improved capabilities.
## 2 Related Works
Continued training methods aim to take an already trained model and incorporate new data, adapt it for a given domain, or specialize it on a certain task (Rolnick et al., 2019; Caccia et al., 2021; Lesort et al., 2022; Gupta et al., 2023; Lin et al., 2024). The major challenge that arises during continued training is enabling a model to learn new information without forgetting previously attained knowledge or capabilities (Robins, 1995; French, 1999). The learning rate schedule and data distribution used during continued training (Gupta et al., 2023; Ibrahim et al., 2024; Winata et al., 2023; Scialom et al., 2022) have been shown to be particularly important in preventing such catastrophic forgetting.
For LMs, one major setting of continued training has been to embed more recent knowledge into the model by using data collected after the pretraining set was constructed (Jin et al., 2022; Jang et al., 2022, 2023; Loureiro et al., 2022; Qin et al., 2022). These studies found that experience replay (Chaudhry et al., 2019) and knowledge distillation (Hinton et al., 2015) are particularly effective. Continued training is also commonly used in LMs to adapt the model to data coming from a new domain (Ke et al., 2023; Gururangan et al., 2020; Wu et al., 2024). Many of these methods for domain-adaptive continued training update only a portion of the model's weights with the new data to ensure that previous knowledge is not lost. For instance, Wu et al. (2024) do so via an expansion of the transformer blocks, updating only the newly added weights.
More related to the setting we explore, several studies utilize continued pretraining to specialize an LM on a given task or domain (Zan et al., 2022; Yadav et al., 2023; Ma et al., 2023; Yang et al., 2024; Labrak et al., 2024). Despite investigating effective strategies for continued pretraining, these studies differ from ours as they do not aim to improve the general capabilities of LMs, train for far fewer tokens, and use much smaller model sizes. The main study offering a setting comparable to ours is Ibrahim et al. (2024), which provides a recipe, based on learning rate schedule and example replay recommendations, for maintaining general-purpose abilities during continued pretraining on data distribution shifts. Their experimental setting consists of a 10B parameter model that was pretrained for 300B tokens. Our study differs from Ibrahim et al. (2024) as we aim to improve the general capabilities of the LM further, and in our experimental setting we perform continued pretraining for up to 1T tokens with a 15B parameter model that was pretrained on 8T tokens.
## 3 Experimental Setup
The continued pretraining process is as follows: a model is first pretrained, then a data distribution and learning rate schedule are chosen, a continued pretraining run takes place, and finally the (hopefully improved) model is returned. Before delving into the experiments that define the continued training recipe, we detail the datasets and model architecture used.
### 3.1 Data Sources
#### 3.1.1 Pretraining
Our pretraining dataset consists of three different domains of data: English natural language data, multilingual natural language data, and source code data. Table 1 highlights the data sources that compose the pretraining set along with their respective token counts. In our English corpus, the Web Crawl data is sourced from Common Crawl (CC) snapshots, while the remaining categories are drawn from curated, high-quality sets. For instance, the miscellaneous category consists of BigScience ROOTS (Lachaux et al., 2020), Reddit, and Pile-Stories (Gao et al., 2020); the encyclopedia category contains Wikipedia and Stack Exchange; and scientific papers includes ArXiv and PubMed.
The multilingual dataset consists of 53 languages with the majority of examples being drawn from CC snapshots, although a small portion comes from machine translation parallel corpora (Schwenk et al., 2019; El-Kishky et al., 2019). Lastly, our source code data is drawn from permissively licensed GitHub repositories and totals over 43 languages.
| Data type | Data source | Tokens (B) |
| --- | --- | --- |
| English | Web Crawl | 5,106 |
| | Misc. | 179 |
| | News | 93 |
| | Scientific Papers | 82 |
| | Books | 80 |
| | Legal | 50 |
| | Encyclopedia | 31 |
| | Finance | 20 |
| Multilingual | Web Crawl | 2,229 |
| | Parallel corpora | 55 |
| Source Code | GitHub | 583 |
Table 1: The pretraining data composition. Appendices A.1 and A.2 break down the multilingual and coding languages.
We pretrain the model for 8T tokens. Given that current state-of-the-art LMs are pretrained on trillions of tokens, we want to experiment on top of a pretrained model that is emblematic of the models for which the continued pretraining recipe would be used.
#### 3.1.2 Continued Pretraining
As the most likely scenario in continued pretraining is that the available datasets are exactly those which made up the pretraining set, the vast majority of our continued training data blend comprises the pretraining data sources. The only new source of data is a set of question-and-answer (QA), alignment-style examples. Such examples have been shown to better extract stored knowledge within LMs (Allen-Zhu and Li, 2023). This set of QA data totals 2.8B tokens; Table 2 lists its constituent categories.
| Data type | Category | Tokens (B) |
| --- | --- | --- |
| QA | World Knowledge | 1.13 |
| | Reasoning | 0.92 |
| | STEM | 0.31 |
| | Chat | 0.26 |
| | Code | 0.19 |
Table 2: The five constituent categories of the QA, alignment-style data.
### 3.2 Model Architecture and Hyperparameters
We experiment using a 15B parameter decoder-only transformer (Vaswani et al., 2017) LM with causal attention masks. It has 3.2 billion embedding parameters and 12.5 billion non-embedding parameters. Additional architectural specifications include: 32 transformer layers, a hidden size of 6144, 48 attention heads, Rotary Position Embeddings (RoPE) (Su et al., 2023), squared ReLU activations in the MLP layers, a SentencePiece (Kudo and Richardson, 2018) tokenizer with a vocabulary size of 256k, no bias terms, and untied input-output embeddings. Additionally, we use grouped query attention (GQA) (Ainslie et al., 2023) with 8 KV heads.
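The stated embedding parameter count can be sanity-checked directly from these specifications, assuming the 256k vocabulary is exactly 256,000 entries:

```python
# With untied input and output embeddings, each matrix is vocab_size x hidden_size.
vocab_size, hidden_size = 256_000, 6_144
embedding_params = 2 * vocab_size * hidden_size  # input + output embedding matrices
print(f"{embedding_params / 1e9:.2f}B")  # ~3.15B, consistent with the stated 3.2B
```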
The model is pretrained with a sequence length of 4,096 and uses batch size rampup over the first 5% of pretraining tokens, starting from a batch size of 384 and building up to one of 1,152. We use a cosine learning rate schedule, with warmup of 16B tokens, to decay from a maximum learning rate (LR) of $\eta_{max}=4.5e\text{-}4$ to $\eta_{min}=4.5e\text{-}5$ . We train using the AdamW (Loshchilov and Hutter, 2019) optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , and a weight decay of 0.1. In continued pretraining, the only hyperparameter that is altered is the learning rate schedule.
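For illustration, the pretraining schedule just described can be sketched as follows. This is our own rendering from the reported hyperparameters (and assumes linear warmup), not the authors' training code:

```python
import math

def pretraining_lr(tokens_seen: float,
                   total_tokens: float = 8_000e9,  # 8T-token pretraining run
                   warmup_tokens: float = 16e9,    # 16B-token warmup
                   eta_max: float = 4.5e-4,
                   eta_min: float = 4.5e-5) -> float:
    """Cosine LR schedule with warmup, per the reported hyperparameters."""
    if tokens_seen < warmup_tokens:
        # Linear warmup up to the maximum learning rate.
        return eta_max * tokens_seen / warmup_tokens
    # Cosine decay from eta_max down to eta_min over the remaining tokens.
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))
```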
### 3.3 Evaluation
We evaluate the model using a representative set of tasks to test its change in abilities across the English, multilingual, and coding domains. To assess English capabilities, we evaluate on the widely-used MMLU (Hendrycks et al., 2020) and Hellaswag (Zellers et al., 2019) benchmarks. MMLU measures the modelâs world knowledge across 57 domains while Hellaswag assesses commonsense reasoning ability within natural language inference. For our multilingual evaluations, we use the Multilingual Grade School Mathematics (MGSM) (Shi et al., 2022) benchmark and specifically report the average accuracy across the language subset of Spanish, Japanese, and Thai, as they represent a high, medium, and low resource language respectively. Lastly, to assess the modelâs coding capabilities we utilize the Python code generation task of HumanEval (Chen et al., 2021) with evaluations reported in the pass@1 (Kulal et al., 2019) setting. In our results below, we report the average score across all four of these tasks with fully detailed evaluation scores shared in the Appendix.
## 4 Continued Pretraining Recipe
The experimental findings which constitute our continued pretraining recipe are shared below:
Recipe

- Start with a data distribution that is similar to the pretraining set but places larger weight on high-quality sources, before transitioning to a second distribution that incorporates QA data and upweights sources in areas of model weakness.
- The learning rate schedule should start from the $\eta_{min}$ of the pretrained model and decay with cosine annealing to $\frac{\eta_{min}}{100}$.
- The switch between data distributions should occur when the learning rate reaches $\frac{\eta_{max_{\text{ct}}}}{5}$, where $\eta_{max_{\text{ct}}}$ is the maximum LR of the continued pretraining run (i.e., the pretrained model's $\eta_{min}$).
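Putting the three bullets together, the recipe can be read as a single function of training progress. The sketch below is our schematic rendering, not code from the paper:

```python
import math

def recipe_step(tokens_seen: float, total_tokens: float,
                eta_min_pretrain: float) -> tuple[str, float]:
    """Return the (blend, LR) to use at a given point in continued pretraining."""
    eta_max_ct = eta_min_pretrain      # start from the pretrained model's eta_min
    eta_min_ct = eta_max_ct / 100.0    # decay 100x via cosine annealing
    lr = eta_min_ct + 0.5 * (eta_max_ct - eta_min_ct) * (
        1.0 + math.cos(math.pi * tokens_seen / total_tokens))
    # Train on the general blend (GB) until the LR falls to eta_max_ct / 5,
    # then switch to the QA blend (QB) for the remainder of the run.
    blend = "GB" if lr > eta_max_ct / 5.0 else "QB"
    return blend, lr
```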
## 5 Experiments
The results of the pretrained base model are shown in Table 3. The aim of our continued training recipe will be to define steps that maximally improve upon this benchmark. All detailed experiments perform continued pretraining for 300B tokens. Additionally, we note that in our experiments we choose to load the optimizer state from the pretrained model, as we found a negligible difference in evaluation accuracy between loading the optimizer state and initializing it from scratch. Thus, we expect the findings to hold whether or not eventual practitioners have the pretrained model's optimizer state available.
| Model | Average Accuracy |
| --- | --- |
| Pretrained | 48.9 |
Table 3: Model accuracy after 8T tokens of pretraining. Per-task evaluation scores are shared in Table 12; we find the model particularly struggles on tasks that assess STEM-based reasoning capabilities.
### 5.1 Data Distribution
![Figure 1 chart: weight (%) assigned to each data source (Web Crawl, Books, News Articles, Papers, Encyclopedia, Legal, Finance, Misc., Multilingual, Code) under the five candidate GB distributions](acl-style-files/figures/GB_distrs_big_name.png)
Figure 1: Breakdown of the various distributions considered for the General Blend (GB). We use Upweight Non Web w/ High Quality Web as the GB moving forward given its strong performance across all evaluation areas.
A crucial component of any training run is the data distribution: it defines the information a model sees and directly impacts the model's capabilities. As continued pretraining builds on top of a model that has already seen a given pretraining distribution, it is important to define a data distribution that allows the model to learn new concepts without deviating so far from the pretraining distribution that it begins to experience training instability and accuracy regression. Through a series of runs examining which compositions of data distributions best improve the abilities of a pretrained model, we identify general characteristics that can be applied across most continued pretraining scenarios. In these experiments, we use a learning rate schedule that starts from $\eta_{min}$ and decays to 0 with cosine annealing.
First, we examine whether the inclusion of QA data, which improves the ability of a model to extract stored knowledge (Allen-Zhu and Li, 2023), improves model accuracy. Coupled with this is the question of how best to incorporate the QA data, or more generally any dataset not contained within the pretraining data distribution, into the continued training run: immediately at the beginning and throughout the entirety of continued training, or reserved until the end of continued training following a curriculum learning setup (Soviany et al., 2022; Blakeney et al., 2024). We hypothesize that including new data sources at the beginning of continued pretraining allows the model to best learn the new information, but may cause learning instabilities that could be mitigated by showing the new dataset at the end of the run when the learning rate is less aggressive. To answer these questions, we compare continued training entirely with the pretraining data blend, entirely with a QA data blend, and with a mix of the two where we start with the pretraining blend and switch to the QA data blend late in the training run. The QA data blend in this scenario adds the QA dataset to the pretraining data distribution with a weight of 10%.
| Data blend | Average Accuracy |
| --- | --- |
| Pretraining | 51.5 |
| QA | 53.4 |
| Pretraining (250B), QA (50B) | 54.3 |
Table 4: Using two data distributions, with the QA data appearing in the latter, leads to the largest improvement via continued pretraining. Parentheses indicate the number of training tokens for each blend. Per-task evaluation scores are shared in Table 13.
Table 4 illustrates that the incorporation of QA data markedly outperforms solely using existing data from the pretraining set. Additionally, first using the pretraining data blend for the majority of training tokens before transitioning to the QA data blend at the end of continued pretraining exhibits improved accuracy compared to using the QA blend throughout the entirety of training. This indicates that continued pretraining runs should begin with a data distribution which more closely aligns to the pretraining one followed by a blend that then introduces new data. Moving forward, we refer to the initial blend as the general blend, GB, and the latter blend as the QA blend, QB, and discuss how they can be refined to realize further improvements.
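Mechanically, this two-distribution setup amounts to sampling each training example from a source-level categorical distribution that changes partway through the run. A minimal sketch with illustrative weights (the actual GB and QB weights are those shown in Figures 1 and 4):

```python
import random

# Illustrative source weights only; see Figures 1 and 4 for the actual values.
GB_WEIGHTS = {"high_quality_web": 0.15, "books": 0.11, "papers": 0.14,
              "encyclopedia": 0.09, "misc": 0.16, "multilingual": 0.15,
              "code": 0.15, "other": 0.05}
QB_WEIGHTS = {"web_crawl": 0.03, "books": 0.16, "papers": 0.18,
              "encyclopedia": 0.08, "legal": 0.08, "misc": 0.11,
              "multilingual": 0.03, "code": 0.15, "qa": 0.12, "other": 0.06}

def sample_source(tokens_seen: float, switch_tokens: float = 250e9) -> str:
    """Draw the data source for the next example: GB before the switch, QB after."""
    weights = GB_WEIGHTS if tokens_seen < switch_tokens else QB_WEIGHTS
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs)[0]
```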
We hypothesize that the optimal GB will be one which places greater emphasis on high-quality data sources and areas of model weakness, without deviating too far from the pretraining distribution. Such a blend will enhance knowledge in needed areas and prime the model for the QB blend without risking large training instabilities. Figure 1 illustrates the various GB distributions we consider; in addition to upweighting sources of interest, we either subset web crawl to just high-quality documents, identified as those in the bottom quartile of perplexity scores from a KenLM model (Heafield, 2011) trained on Wikipedia, or remove web crawl altogether. Experimenting with the various GB distributions for all 300B tokens of continued training, Table 5 shows that each improves upon the pretraining distribution. Even though it does not achieve the highest average accuracy, we choose Upweight Non Web with High Quality Web as the GB moving forward because, compared to the others, it most consistently achieves high scores across all considered tasks, as shown in Table 13.
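A sketch of this perplexity-based web filter, assuming the trained KenLM model has been saved to a hypothetical wikipedia.arpa file and using the standard kenlm Python bindings:

```python
import kenlm
import numpy as np

model = kenlm.Model("wikipedia.arpa")  # hypothetical path to the Wikipedia KenLM model

def high_quality_subset(documents: list[str]) -> list[str]:
    """Keep documents in the bottom quartile of KenLM perplexity (most Wikipedia-like)."""
    ppls = np.array([model.perplexity(doc) for doc in documents])
    cutoff = np.quantile(ppls, 0.25)
    return [doc for doc, ppl in zip(documents, ppls) if ppl <= cutoff]
```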
| GB candidate | Average Accuracy |
| --- | --- |
| Pretraining | 51.5 |
| Reweight Domains | 51.7 |
| Pretraining w/ High Quality Web | 52.5 |
| No Web | 52.9 |
| UW Non Web w/ High Quality Web | 52.0 |
Table 5: Evaluation results of various GB candidate distributions. Per-task evaluation scores are shared in Table 13.
With a GB distribution in place, we now look to define the QB distribution by first refining the weights placed on the sources within the QA data and then optimizing the QB distribution as a whole. In the initial QB distribution, the QA data was added as is, and this weighting is shown as QA blend 1 in Figure 2. Given that the pretrained model struggles on STEM tasks, we create two additional blends that both upweight the QA STEM data while either maintaining the original weight of QA world knowledge, blend 2, or QA chat, blend 3, data as seen in Figure 2. We choose to maintain the weight in world knowledge and chat information as such examples cover a broad range of topics and help better align model responses to questions respectively. Table 6 highlights that upon adding each of the QA blends to the initial QB distribution following 250B tokens of the identified GB, QA data that emphasizes both STEM and chat information leads to the best results.
![Figure 2 chart: weight (%) assigned to each QA category (Chat, Reasoning, STEM, Code, World Knowledge) under the three QA blends](acl-style-files/figures/QB_qa_distr_big_font.png)
Figure 2: Various distributions of QA data. We use Blend 3.
| QA blend | Average Accuracy |
| --- | --- |
| QA 1 | 54.3 |
| QA 2 (+STEM, +World Knowledge) | 53.0 |
| QA 3 (+STEM, +Chat) | 54.9 |
Table 6: Evaluation results of various QA blend candidates. Per-task evaluation scores are shared in Table 13.
We now incorporate the QA data within the overall QB distribution. In previous runs the QB distribution, aside from the QA dataset, exactly mirrored the pretraining set. We define a new series of distributions that more aggressively upweight sources in areas of model weakness and vary the amount of weight placed on the QA dataset, as seen in Figure 4. Table 7 details that the aggressive weighting in the QB is beneficial, and we use the QB termed QA blend moving forward. With refined GB and QB distributions, the average evaluation accuracy has improved from 48.9 for the pretrained model to 55.4, a 13% improvement.
| QB candidate | Average Accuracy |
| --- | --- |
| Pretraining blend w/ QA data | 54.3 |
| General blend w/ QA data | 54.2 |
| QA blend | 55.4 |
| QA blend w/ Upweighted STEM | 54.4 |
| QA blend w/ 1.5e QA data | 54.9 |
| QA blend w/ 3.5e QA data | 54.4 |
Table 7: Evaluation results of various QB candidate distributions. Per-task evaluation scores are shared in Table 13.
![Figure 3 chart: cosine LR decay over 300B tokens for Min LR = (1/10)*Max LR, (1/100)*Max LR, and 0, with the QA Blend phase shaded over the final 250B-300B tokens](acl-style-files/figures/just_decay_LRs.png)
Figure 3: Cosine decay schedules with a Max LR of $4.5e\text{-}5$ . Each schedule differently prioritizes LR magnitude and slope of decay.
![Figure 4 chart: weight (%) assigned to each data source, including QA, under the five candidate QB distributions](acl-style-files/figures/QB_distrs.png)
Figure 4: Breakdown of the various distributions considered for the QB. $N$e refers to $N$ epochs of the QA data. The final chosen distribution, shown as QA Blend, uses 2 epochs of the QA data.
### 5.2 Learning Rate Schedule
The learning rate schedule greatly impacts the training dynamics and efficacy of continued pretraining (Gupta et al., 2023; Ibrahim et al., 2024; Winata et al., 2023).
In our above continued pretraining experiments, the learning rate schedule starts at a maximum LR of $\eta_{max_{\text{ct}}}=4.5e\text{-}5$, which is equal to the pretrained model's $\eta_{min}$, and decays to a minimum LR of 0 using cosine annealing. As seen in Figure 3, a minimum LR of 0 facilitates a steep slope of decay, but the magnitude of the LR is severely impacted, especially over the tokens where the QB is used, which may impact the model's ability to extract full utility from the QA data. To understand the trade-off between these two characteristics of the learning rate schedule in continued pretraining runs, we experiment with two additional minimum learning rate values: $\frac{\eta_{max_{\text{ct}}}}{10}=4.5e\text{-}6$ and $\frac{\eta_{max_{\text{ct}}}}{100}=4.5e\text{-}7$.
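For concreteness, each of these schedules follows the standard cosine annealing form (our rendering, with $t$ the continued-training tokens seen and $T$ the continued-training horizon, here 300B):

$$\eta(t) = \eta_{min_{\text{ct}}} + \frac{1}{2}\left(\eta_{max_{\text{ct}}} - \eta_{min_{\text{ct}}}\right)\left(1 + \cos\left(\pi \frac{t}{T}\right)\right)$$

so that $\eta(0) = \eta_{max_{\text{ct}}}$ and $\eta(T) = \eta_{min_{\text{ct}}}$; a smaller $\eta_{min_{\text{ct}}}$ steepens the decay but lowers the LR magnitude over the final QB tokens.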
| LR schedule | Average Accuracy |
| --- | --- |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{10}$ | 54.8 |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{100}$ | 55.7 |
| Decay to 0 | 55.4 |
Table 8: Evaluation results of learning rate schedules with varying Min LR values. Per-task evaluation scores are shared in Table 14.
Table 8 highlights that it is in fact best to strike a middle ground between magnitude of LR and slope of decay, as a minimum LR of $\frac{\eta_{max_{\text{ct}}}}{100}$ achieves the best accuracy. Such a minimum LR value allows for a learning rate schedule that has reasonable decay over the QB tokens, unlike when using a minimum LR of $\frac{\eta_{max_{\text{ct}}}}{10}$ , without severely sacrificing on magnitude of LR, as was the case with a minimum LR of 0.
Experiments with varying learning rate warmup and maximum LR values led to accuracy regressions compared to the schedule detailed above. In addition, we ran ablations with a different annealing schedule, WSD (Hu et al., 2024); however, the results were not competitive with cosine annealing. Full details and results for both studies are shared in Appendix B.2.
### 5.3 Switch of Data Distributions
Until this point, we have been switching between the GB and the QB after 250B tokens of continued pretraining. We believe this to be sub-optimal, as it is unclear how switching between distributions after a fixed number of tokens can be translated to continued training runs of different token horizons. We hypothesize that the optimal point for switching between the data distributions depends upon the learning rate schedule. Figure 5 highlights how both the number of tokens and the learning rate values for the QB would differ if the distribution switch occurred at progressively smaller fractions of the maximum LR. As the fraction goes to 0, both the slope of decay and the magnitude of the learning rate shrink, meaning there is likely an optimal point in the learning rate curve where both characteristics remain conducive to learning, yet are not so aggressive that the data shift to the QB distribution causes training instability.
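Under the cosine form given in Section 5.2, the token count at which the LR first reaches a target fraction of $\eta_{max_{\text{ct}}}$ can be found by inverting the schedule. A small sketch of this arithmetic (ours, using the chosen schedule that decays from $4.5e\text{-}5$ to $4.5e\text{-}7$ over 300B tokens):

```python
import math

def switch_token(fraction: float,
                 total_tokens: float = 300e9,
                 eta_max: float = 4.5e-5,
                 eta_min: float = 4.5e-7) -> float:
    """Token count at which the cosine-annealed LR crosses fraction * eta_max."""
    # Invert eta(t) = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T)).
    f = (fraction * eta_max - eta_min) / (eta_max - eta_min)
    return total_tokens * math.acos(2.0 * f - 1.0) / math.pi

print(switch_token(1 / 5) / 1e9)  # ~213B tokens, consistent with Figure 5
```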
![Figure 5 chart: cosine LR decay over 300B tokens with shaded QB regions beginning where the LR crosses (1/2), (1/5), (1/10), and (1/50) of Max LR](acl-style-files/figures/distribution_switch_LRs_background.png)
Figure 5: How the number of QB tokens, the shaded region, varies based on different distribution switch points.
Table 9 highlights that switching between the GB and QB at $\frac{\eta_{max_{\text{ct}}}}{5}$ achieves the best accuracy and improves upon the heuristically chosen switch point by 0.4 points on average. Wanting to confirm this distribution switch point holds at differing amounts of continued pretraining tokens, we ran an ablation on a scale of 100B tokens and found that $\frac{\eta_{max_{\text{ct}}}}{5}$ again maximized the results as seen in Table 18.
| Switch point | Average Accuracy |
| --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 52.8 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 54.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 56.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 55.0 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 54.6 |
Table 9: Evaluation results of varying distribution switch points. Per-task evaluation scores are shared in Table 17.
This finalizes our continued pretraining recipe. We highlight the utility of this recipe as it allows the model to achieve an average accuracy of 56.1, which improves upon the natural baseline of continued training on the pretraining distribution, as shared in Table 4, by 9%.
## 6 Ablations
### 6.1 Varying Token Horizons
We show the efficacy of the identified continued pretraining recipe when used at varying numbers of continued training tokens. Table 10 illustrates that on continued training horizons from 100B to 1T tokens, the identified recipe consistently achieves improved evaluation results, realizing a 16% gain over the pretrained model when using 1T tokens of continued training. We do note that the slope of accuracy improvement from 300B to 1T tokens is lower than that from 100B to 300B tokens; we hypothesize that, because long-horizon continued training mainly reuses documents from the pretraining set, the repeated epochs over the same data sources have decreasing marginal utility.
| CT Tokens | MMLU | Average Accuracy |
| --- | --- | --- |
| 0B | 59.3 | 48.9 |
| 100B | 63.0 | 55.0 |
| 300B | 63.8 | 56.1 |
| 1T | 65.3 | 56.8 |
Table 10: Performance of the continued pretraining (CPT) recipe across different token horizons. Per-task evaluation scores are shared in Table 19.
### 6.2 Document Mining
In an effort to improve the utility of the data sources that are seen for multiple epochs in long horizon continued pretraining runs, we aim to find a subset of examples that are most helpful for model improvement. As the QA dataset was shown to significantly boost model accuracies, we hypothesize that restricting each pretraining data source to the set of documents which are most similar to the QA examples would be beneficial. To do so, we use the E5-large-v2 (Wang et al., 2022) text embedding model to obtain an embedding for each document in our pretraining and QA sets. Using the Faiss library (Johnson et al., 2017), we efficiently perform a 50-nearest neighbor search across all these embeddings to obtain the 50 most similar, non-QA documents to each example in the QA set. The identified subset of examples constitutes 60B tokens, and we term this approach document mining.
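A minimal sketch of this mining step, assuming the documents have already been embedded with E5-large-v2 into float32 numpy arrays and that cosine similarity over normalized embeddings is the similarity measure (the paper does not publish its exact code):

```python
import faiss
import numpy as np

def mine_documents(qa_embeddings: np.ndarray,
                   corpus_embeddings: np.ndarray,
                   k: int = 50) -> set[int]:
    """Return the indices of the k nearest non-QA documents to each QA example."""
    # Faiss expects contiguous float32 arrays; inner product over L2-normalized
    # vectors is equivalent to cosine similarity.
    faiss.normalize_L2(corpus_embeddings)
    faiss.normalize_L2(qa_embeddings)
    index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
    index.add(corpus_embeddings)
    _, neighbor_ids = index.search(qa_embeddings, k)
    # The union of neighbors across all QA examples forms the mined subset.
    return set(neighbor_ids.ravel().tolist())
```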
Table 11 shows a training run where we replace all non-QA data sources in the QB distribution solely with the examples identified via document mining. We find that these documents substantially improve the performance of the continued pretraining run and believe that document mining is a viable approach for extracting further utility from existing data sources.
| Run | MMLU | Avg. Accuracy |
| --- | --- | --- |
| CT 1T | 65.3 | 56.8 |
| CT 1T w/ Mined Docs | 66.6 | 57.9 |
Table 11: Mining examples related to QA documents further improves accuracy. Per-task evaluation scores are shared in Table 20.
## 7 Conclusion
We investigate how to effectively continue training LMs to improve upon their existing capabilities. Our experiments show that it is especially important to carefully define the data distribution and learning rate decay schedule used during continued pretraining so that the model can smoothly transition away from the pretraining distribution and better learn the newly emphasized data sources. With these findings, we propose a general recipe that model developers can use to perform continued pretraining on top of their own LMs, and show that for our base model we are able to improve average accuracy by over 18%. We hope that this will be a starting point to enable future LMs to be developed through the reuse of existing models rather than retraining from scratch.
## Limitations
In developing our continued pretraining recipe, we experiment only along the axes of data distributions and hyperparameter configurations. Although we did not include them within our study, there may be added benefit in exploring other aspects such as altering the learning algorithm. Additionally, given that our study is conducted on top of a model with a fixed configuration that was pretrained on a particular data distribution, our findings may not extrapolate well to settings highly divergent from the one used in this study. Finally, we limited our goal within continued pretraining to improving the general purpose capabilities of the pretrained model; however, there are many additional angles to model reuse, such as domain specialization and the efficient addition of new knowledge into existing models.
## References
- Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico LebrĂłn, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.
- Allen-Zhu and Li (2023) Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.1, knowledge storage and extraction. Preprint, arXiv:2309.14316.
- Anthropic (2024) Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku.
- Blakeney et al. (2024) Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, and Jonathan Frankle. 2024. Does your data spark joy? performance gains from domain upsampling at the end of training. Preprint, arXiv:2406.03476.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
- Caccia et al. (2021) Massimo Caccia, Pau Rodriguez, Oleksiy Ostapenko, Fabrice Normandin, Min Lin, Lucas Caccia, Issam Laradji, Irina Rish, Alexandre Lacoste, David Vazquez, and Laurent Charlin. 2021. Online fast adaptation and knowledge accumulation: a new approach to continual learning. Preprint, arXiv:2003.05856.
- Chaudhry et al. (2019) Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. Preprint, arXiv:1902.10486.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.
- DeepSeek-AI et al. (2024) DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024. Deepseek llm: Scaling open-source language models with longtermism. Preprint, arXiv:2401.02954.
- El-Kishky et al. (2019) Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2019. CCAligned: A massive collection of cross-lingual web-document pairs. arXiv preprint arXiv:1911.06154.
- French (1999) Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Gemma Team (2024) Google DeepMind Gemma Team. 2024. Gemma: Open Models Based on Gemini Research and Technology.
- Gupta et al. (2023) Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual pre-training of large language models: How to (re)warm your model? Preprint, arXiv:2308.04014.
- Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
- Heafield (2011) Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531.
- Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. Minicpm: Unveiling the potential of small language models with scalable training strategies. Preprint, arXiv:2404.06395.
- Ibrahim et al. (2024) Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. 2024. Simple and scalable strategies to continually pre-train large language models. Preprint, arXiv:2403.08763.
- Jang et al. (2023) Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2023. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models. Preprint, arXiv:2204.14211.
- Jang et al. (2022) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. 2022. Towards continual knowledge learning of language models. Preprint, arXiv:2110.03215.
- Jin et al. (2022) Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, and Xiang Ren. 2022. Lifelong pretraining: Continually adapting language models to emerging corpora. Preprint, arXiv:2110.08534.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus. Preprint, arXiv:1702.08734.
- Ke et al. (2023) Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of language models. Preprint, arXiv:2302.03241.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv preprint arXiv:1808.06226.
- Kulal et al. (2019) Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. 2019. Spoc: Search-based pseudocode to code. Preprint, arXiv:1906.04908.
- Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains. Preprint, arXiv:2402.10373.
- Lachaux et al. (2020) Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. Preprint, arXiv:2006.03511.
- Lesort et al. (2022) Timothée Lesort, Massimo Caccia, and Irina Rish. 2022. Understanding continual learning settings with data distribution drift analysis. Preprint, arXiv:2104.01678.
- Lin et al. (2024) Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. 2024. Rho-1: Not all tokens are what you need. Preprint, arXiv:2404.07965.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Preprint, arXiv:1711.05101.
- Loureiro et al. (2022) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. Timelms: Diachronic language models from twitter. Preprint, arXiv:2202.03829.
- Ma et al. (2023) Shirong Ma, Shen Huang, Shulin Huang, Xiaobin Wang, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. Ecomgpt-ct: Continual pre-training of e-commerce large language models with semi-structured data. Preprint, arXiv:2312.15696.
- OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Parmar et al. (2024) Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. 2024. Nemotron-4 15b technical report. Preprint, arXiv:2402.16819.
- Qin et al. (2022) Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. Elle: Efficient lifelong pre-training for emerging data. Preprint, arXiv:2203.06311.
- Robins (1995) Anthony V. Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci., 7:123–146.
- Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. 2019. Experience replay for continual learning. Preprint, arXiv:1811.11682.
- Schwenk et al. (2019) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944.
- Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. Preprint, arXiv:2205.12393.
- Shi et al. (2022) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. Preprint, arXiv:2210.03057.
- Soviany et al. (2022) Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey. Preprint, arXiv:2101.10382.
- Su et al. (2023) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. Roformer: Enhanced transformer with rotary position embedding. Preprint, arXiv:2104.09864.
- Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.
- Team et al. (2024) Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. 2024. Reka core, flash, and edge: A series of powerful multimodal language models. Preprint, arXiv:2404.12387.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
- Winata et al. (2023) Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, and Daniel Preotiuc-Pietro. 2023. Overcoming catastrophic forgetting in massively multilingual continual learning. Preprint, arXiv:2305.16252.
- Wu et al. (2024) Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. 2024. Llama pro: Progressive llama with block expansion. Preprint, arXiv:2401.02415.
- Yadav et al. (2023) Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, and Bing Xiang. 2023. Exploring continual learning for code generation models. Preprint, arXiv:2307.02435.
- Yang et al. (2024) Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. 2024. Pllama: An open-source large language model for plant science. Preprint, arXiv:2401.01600.
- Zan et al. (2022) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. Cert: Continual pre-training on sketches for library-oriented code generation. Preprint, arXiv:2206.06888.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In ACL.
- Çağatay Yıldız et al. (2024) Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, and Beyza Ermis. 2024. Investigating continual pretraining in large language models: Insights and implications. Preprint, arXiv:2402.17400.
## Appendix A Data
### A.1 Multilingual Data
The 53 multilingual languages contained within the pretraining set are: AR, AZ, BG, BN, CA, CS, DA, DE, EL, ES, ET, FA, FI, FR, GL, HE, HI, HR, HU, HY, ID, IS, IT, JA, KA, KK, KN, KO, LT, LV, MK, ML, MR, NE, NL, NO, PL, PT, RO, RU, SK, SL, SQ, SR, SV, TA, TE, TH, TR, UK, UR, VI, and ZH.
### A.2 Code Data
The 43 programming languages contained within our pretraining set are: assembly, c, c-sharp, common-lisp, cpp, css, cuda, dart, dockerfile, fortran, go, haskell, html, java, javascript, json, julia, jupyter-scripts, lua, makefile, markdown, mathematica, omniverse, pascal, perl, php, python, R, restructuredtext, ruby, rust, scala, shell, sql, swift, systemverilog, tex, typescript, verilog, vhdl, visual-basic, xml, and yaml.
## Appendix B Experiments
The evaluation results across all considered tasks are shared below for each of our experiments.
| Task | Accuracy |
| --- | --- |
| MMLU | 59.3 |
| HellaSwag | 80.4 |
| HumanEval | 31.1 |
| MGSM (ES, JA, TH) | 24.9 |
Table 12: Model accuracy after 8T tokens of pretraining. We find that the model struggles on STEM-based reasoning tasks, as indicated by its low scores on MGSM and the STEM subtasks of MMLU.
### B.1 Data Distribution
Table 13 shares the results across all tasks for each experiment mentioned within Section 5.1.
| Data Blend | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| Pretraining | 61.9 | 81.2 | 28.1 | 34.7 |
| QA | 62.0 | 78.7 | 32.9 | 40.1 |
| Pretraining (250B) + QA (50B) | 62.6 | 82.2 | 29.9 | 42.4 |
| Pretraining | 61.9 | 81.2 | 28.1 | 34.7 |
| Reweight Domains | 61.9 | 81.7 | 29.9 | 33.2 |
| Pretraining w/ High Quality Web | 62.2 | 80.9 | 34.1 | 32.9 |
| No Web | 62.3 | 81.8 | 29.9 | 37.7 |
| Upweight Non Web w/ High Quality Web | 62.6 | 81.4 | 31.7 | 32.1 |
| QA 1 | 63.0 | 82.4 | 29.9 | 41.9 |
| QA 2 (+STEM, +World Knowledge) | 63.9 | 82.3 | 29.3 | 36.7 |
| QA 3 (+STEM, +Chat) | 64.1 | 82.2 | 28.7 | 44.7 |
| QA | 64.2 | 82.4 | 30.5 | 44.5 |
| QA w/ Upweighted STEM | 64.1 | 82.3 | 28.1 | 42.9 |
| QA w/ 1.5e QA data | 64.1 | 82.2 | 28.7 | 44.7 |
| QA w/ 3.5e QA data | 64.4 | 82.4 | 27.4 | 43.3 |
Table 13: Per-task evaluation results of each experiment mentioned within Section 5.1 on defining data distributions for continued pretraining.
### B.2 Learning Rate Schedule
| LR Schedule | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.9 | 82.4 | 29.3 | 43.7 |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{100}$ | 64.2 | 82.2 | 31.1 | 45.2 |
| Decay to 0 | 64.2 | 82.4 | 30.5 | 44.5 |
Table 14: Per-task evaluation results of the experiments mentioned in Table 8 on identifying an appropriate learning rate decay schedule for continued pretraining.
In identifying a learning rate schedule for continued pretraining, we experiment with various degrees of warmup and values of $\eta_{max_{\text{ct}}}$ . The combinations we consider are: warmup from $\eta_{min}$ to $\eta_{max_{\text{ct}}}=1.5*\eta_{min}$ , warmup from $0.5*\eta_{min}$ to $\eta_{max_{\text{ct}}}=\eta_{min}$ , and warmup from 0 to what the expected learning rate value would be had the pretraining learning rate schedule been extended to incorporate the continued training tokens (i.e., from 8T to 8.3T). We use $\eta_{min}$ to specify the minimum learning rate value of the pretrained model, which is $4.5e\text{-}5$ . Figure 6 highlights each of these schedules, and we note that these combinations were chosen to quantify different degrees of aggressiveness when using warmup in a continued pretraining learning rate schedule.
Figure 6: Cosine decay schedule with the various levels of warmup which we experiment with.
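As a concrete reference, the following minimal sketch implements a generic warmup-then-cosine schedule covering the three variants above and the no-warmup baseline. The warmup length, total step count, and the "expected LR" peak value are illustrative assumptions, not our exact settings.

```python
import math

ETA_MIN = 4.5e-5  # minimum learning rate reached at the end of pretraining

def warmup_then_cosine(step, total_steps, warmup_steps, lr_start, lr_peak, lr_end):
    """Linear warmup from lr_start to lr_peak, then cosine decay to lr_end."""
    if warmup_steps and step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))

# The three warmup variants of Figure 6, plus the no-warmup baseline.
# warmup_steps and the "expected LR" peak are assumed values for illustration.
variants = {
    "warmup to 1.5 * eta_min": dict(warmup_steps=30_000, lr_start=ETA_MIN, lr_peak=1.5 * ETA_MIN),
    "warmup to eta_min": dict(warmup_steps=30_000, lr_start=0.5 * ETA_MIN, lr_peak=ETA_MIN),
    "warmup to expected LR": dict(warmup_steps=30_000, lr_start=0.0, lr_peak=3.0e-5),
    "no warmup (best found)": dict(warmup_steps=0, lr_start=ETA_MIN, lr_peak=ETA_MIN),
}
for name, cfg in variants.items():
    lr = warmup_then_cosine(step=10_000, total_steps=300_000, lr_end=ETA_MIN / 100, **cfg)
    print(f"{name}: LR at step 10k = {lr:.2e}")
```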
As highlighted in Table 15, we find that including any level of warmup within the continued training learning rate schedule causes regressions in evaluation accuracies, indicating that it is best to decay directly from $\eta_{min}$ .
| Warmup Setting | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg. |
| --- | --- | --- | --- | --- | --- |
| Warmup to $6.75e\text{-}5$ | 64.0 | 81.9 | 31.1 | 42.3 | 54.8 |
| Warmup to $4.5e\text{-}5$ | 64.0 | 82.1 | 32.9 | 41.5 | 55.1 |
| Warmup to Expected LR | 63.3 | 82.1 | 31.7 | 42.5 | 54.9 |
| No Warmup | 64.2 | 82.2 | 31.1 | 45.2 | 55.7 |
Table 15: Comparison of including warmup within learning rate schedules for continued pretraining. No warmup achieves the best evaluation results.
In addition to cosine annealing, we experiment with the WSD learning rate scheduler (Hu et al., 2024). Table 16 compares the best found setting of WSD with cosine annealing. The WSD schedule produces significantly lower evaluation accuracies than cosine annealing. We hypothesize that in continued pretraining, switching the decay schedule from the one used during pretraining is harmful. Hence, for models pretrained with cosine annealing, the learning rate schedule in continued training should also use cosine annealing.
| LR Schedule | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg. |
| --- | --- | --- | --- | --- | --- |
| WSD | 63.6 | 80.2 | 28.1 | 39.5 | 52.8 |
| Cosine Annealing | 64.2 | 82.2 | 31.1 | 45.2 | 55.7 |
Table 16: We find that WSD causes a significant regression in evaluation accuracy compared to cosine annealing. Both learning rate schedules were decayed to $\frac{\eta_{max_{\text{ct}}}}{100}$.
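For reference, a minimal sketch of a WSD-style (Warmup-Stable-Decay) schedule without the warmup phase follows. The linear decay shape and the 10% decay fraction are assumptions for illustration, not the exact configuration of Hu et al. (2024).

```python
def wsd_lr(step: int, total_steps: int, lr_stable: float, lr_end: float,
           decay_frac: float = 0.1) -> float:
    """Stable-then-decay: hold lr_stable for most of training, then decay
    linearly to lr_end over the final decay_frac of steps (assumed shape)."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return lr_stable
    progress = (step - decay_start) / (total_steps - decay_start)
    return lr_stable + (lr_end - lr_stable) * progress

# Example: hold 4.5e-5, then decay to 4.5e-7 over the last 10% of steps.
print(wsd_lr(295_000, 300_000, lr_stable=4.5e-5, lr_end=4.5e-7))
```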
### B.3 Switch of Data Distributions
Table 18 highlights that the findings of our experiments in Section 5.3 also hold at the continued training token horizon of 100B tokens. This indicates that regardless of the number of continued training tokens, transitioning between the GB and QB distributions at $\frac{\eta_{max_{\text{ct}}}}{5}$ is optimal.
| Switch Point | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 65.0 | 78.7 | 29.9 | 37.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 60.9 | 81.6 | 32.3 | 44.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 63.8 | 82.2 | 32.3 | 46.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.9 | 82.2 | 29.3 | 44.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 63.3 | 81.6 | 31.1 | 42.3 |
Table 17: Per-task evaluation results of the experiments mentioned in Table 9 on how to switch between data distributions in continued pretraining.
| Switch Point | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg. |
| --- | --- | --- | --- | --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 64.1 | 79.2 | 31.1 | 40.0 | 53.6 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 63.2 | 81.6 | 27.4 | 44.1 | 54.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 63.0 | 81.9 | 31.7 | 43.6 | 55.0 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.6 | 81.8 | 30.5 | 39.7 | 53.9 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 63.3 | 81.6 | 31.1 | 42.3 | 54.6 |
Table 18: Ablation of the data distribution switch experiments at a continued pretraining scale of 100B tokens. As found for the 300B token continued training horizon, switching distributions at $\frac{\eta_{max_{\text{ct}}}}{5}$ achieves the highest accuracy.
## Appendix C Ablations
### C.1 Varying Token Horizons
When extending the number of continued pretraining tokens to 1T, we found that our existing QB distribution would cause the small QA dataset to be trained on for a large number of epochs. To correct for this, we reduce the weight on the QA dataset so that it is trained on for no more than 4 epochs; a minimal sketch of this weight cap follows Figure 7. Figure 7 demonstrates the distribution of the QB when used at the scale of 1T continued pretraining tokens.
Figure 7: Distribution of the QB blend when extending the number of continued pretraining tokens to 1T.
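The weight cap follows from simple arithmetic, sketched below. The dataset and horizon sizes are hypothetical, and "horizon" here means the tokens trained on the QB distribution.

```python
def max_blend_weight(dataset_tokens: float, horizon_tokens: float,
                     max_epochs: float = 4.0) -> float:
    """Largest sampling weight for a dataset such that it is seen for at
    most max_epochs: epochs = weight * horizon_tokens / dataset_tokens."""
    return min(1.0, max_epochs * dataset_tokens / horizon_tokens)

# Hypothetical sizes: a 10B-token QA set within a 1T-token horizon can carry
# at most a 4% sampling weight to stay within four epochs.
print(max_blend_weight(dataset_tokens=10e9, horizon_tokens=1e12))  # 0.04
```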
| Token Horizon | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg. |
| --- | --- | --- | --- | --- | --- |
| 0B | 59.3 | 80.4 | 31.1 | 24.9 | 48.9 |
| 100B | 63.0 | 81.9 | 31.7 | 43.6 | 55.0 |
| 300B | 63.8 | 82.2 | 32.3 | 46.1 | 56.1 |
| 1T | 65.3 | 82.4 | 34.1 | 45.5 | 56.8 |
Table 19: Per-task evaluation results of the experiments mentioned in Table 10 on how the identified continued pretraining recipe performs at varying amounts of continued training tokens.
| Run | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| CT 1T | 65.3 | 82.4 | 34.1 | 45.5 |
| CT 1T w/ Mined Docs | 66.6 | 81.7 | 36.6 | 46.7 |
Table 20: Per-task evaluation results of the experiments mentioned in Table 11 on how document mining increases the utility of existing data sources in continued pretraining.