2407.07263v1
# Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
**Authors**: Jupinder Parmar, Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
> Correspondence to: jupinderp@nvidia.com
Abstract
As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost of pretraining has become intractable for all but the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing a model's abilities to improve further without training from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.
1 Introduction
Language modeling abilities have seen massive improvements over the past few years (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2024; Team, 2024). While these advancements have enabled language models (LMs) to become highly skilled conversational agents (OpenAI, 2024; Anthropic, 2024; Team, 2024), they have come with increased computational cost, as pretraining has become ever more expensive due to both the number of model parameters (Team et al., 2024; DeepSeek-AI et al., 2024) and pretraining dataset sizes (Touvron et al., 2023; Gemma Team, 2024; Parmar et al., 2024) continuing to grow in scale. With new LMs that set state-of-the-art accuracy being released on a frequent basis, LMs developed only a couple of months earlier are becoming obsolete as their capabilities are no longer up to par. This leaves model developers with the choice of either pretraining new LMs from scratch or reusing their existing LMs and updating them with new information in order to match current best LM abilities.
Due to the large computational cost that pretraining of modern LMs incurs, frequent complete retraining is intractable. This makes the reuse of already developed LMs via continued pretraining an attractive proposition. While most recent works (Ibrahim et al., 2024; Jang et al., 2022; Ke et al., 2023; Çağatay Yıldız et al., 2024) have recommended guidelines for continued pretraining when adapting language models to new data domains or distribution shifts, intuition or recommendations on how to improve a model's general-purpose abilities from a previously finalized checkpoint with continued pretraining have not been widely explored. In this paper, we focus on this under-studied setting and identify strategies that allow already trained LMs to improve upon areas of weakness without experiencing degradations in other capabilities.
In our experiments, we start on top of a 15B parameter LM that has seen 8T tokens of pretraining data (Parmar et al., 2024). Experimenting with a well-trained model of this scale ensures that our findings will be transferable to most settings and model sizes. We first identify the type of data distribution that should be used during continued pretraining and find that it is optimal to have two distributions, with the final one more heavily weighting data sources that relate to the abilities we want to improve in the model. Second, we examine which learning rate schedules enable the most efficient learning during continued pretraining and find that the most performant one strikes a balance between the magnitude of the learning rate and the steepness of its decay. Lastly, we show how the learning rate value at which we switch between data distributions affects downstream accuracy and identify the point at which this switch should be made.
These findings culminate in a recipe that can be used to perform continued pretraining to improve the capabilities of an existing LM. We demonstrate that this recipe is beneficial at continued training scales from 100B to 1 trillion tokens, illustrating its flexibility and robustness across a wide variety of settings. We hope that this recipe will allow model providers to forgo the need to regularly retrain models from scratch, as it makes it possible to reuse a trained model to attain improved capabilities.
2 Related Works
Continued training methods aim to take an already trained model and incorporate new data, adapt it for a given domain, or specialize it on a certain task (Rolnick et al., 2019; Caccia et al., 2021; Lesort et al., 2022; Gupta et al., 2023; Lin et al., 2024). The major challenge that arises during continued training is enabling a model to learn new information without forgetting previously attained knowledge or capabilities (Robins, 1995; French, 1999). The learning rate schedule and data distribution used during continued training (Gupta et al., 2023; Ibrahim et al., 2024; Winata et al., 2023; Scialom et al., 2022) have been shown to be particularly important in preventing such catastrophic forgetting.
For LMs, one major setting of continued training has been to embed more recent knowledge into the model by using data collected at a date later than when the pretraining set was constructed (Jin et al., 2022; Jang et al., 2022, 2023; Loureiro et al., 2022; Qin et al., 2022). These studies found that experience replay (Chaudhry et al., 2019) and knowledge distillation (Hinton et al., 2015) are particularly effective. Continued training is also commonly used in LMs to adapt the model to data coming from a new domain (Ke et al., 2023; Gururangan et al., 2020; Wu et al., 2024). Many of these methods for domain-adaptive continued training update only a portion of the model's weights with the new data to ensure that previous knowledge is not lost. For instance, Wu et al. (2024) do so via an expansion of the transformer blocks, updating only the newly added weights.
More related to the setting we explore, several studies utilize continued pretraining to specialize an LM on a given task or domain (Zan et al., 2022; Yadav et al., 2023; Ma et al., 2023; Yang et al., 2024; Labrak et al., 2024). Despite investigating effective strategies for continued pretraining, these studies differ from ours as they do not aim to improve the general capabilities of LMs, train for far fewer tokens, and use much smaller model sizes. The main study that offers a comparable setting to ours is Ibrahim et al. (2024), which provides a recipe, based on learning rate schedule and example replay recommendations, for maintaining general-purpose abilities during continued pretraining on data distribution shifts. Their experimental setting consists of a 10B parameter model that was pretrained for 300B tokens. Our study differs from Ibrahim et al. (2024) in that we aim to further improve the general capabilities of the LM, and in our experimental setting we perform continued pretraining for up to 1T tokens with a 15B parameter model that was pretrained on 8T tokens.
3 Experimental Setup
The continued pretraining process is as follows: a model is first pretrained, a data distribution and learning rate schedule are then chosen, a continued pretraining run takes place, and finally the (hopefully improved) model is returned. Before delving into the experiments that define the continued training recipe, we detail the datasets and model architecture that are used.
3.1 Data Sources
3.1.1 Pretraining
Our pretraining dataset consists of three different domains of data: English natural language data, multilingual natural language data, and source code data. Table 1 highlights the data sources that compose the pretraining set along with their respective token counts. In our English corpus, the Web Crawl data is sourced from Common Crawl (CC) snapshots while the remaining categories comprise curated, high-quality sets. For instance, the miscellaneous category consists of BigScience ROOTS (Lachaux et al., 2020), Reddit, and Pile-Stories (Gao et al., 2020); the encyclopedia category contains Wikipedia and Stack Exchange; and scientific papers includes ArXiv and PubMed.
The multilingual dataset consists of 53 languages with the majority of examples being drawn from CC snapshots, although a small portion comes from machine translation parallel corpora (Schwenk et al., 2019; El-Kishky et al., 2019). Lastly, our source code data is drawn from permissively licensed GitHub repositories and totals over 43 languages.
| Data type | Data source | Tokens (B) |
| --- | --- | --- |
| English | Web Crawl | 5,106 |
| English | Misc. | 179 |
| English | News | 93 |
| English | Scientific Papers | 82 |
| English | Books | 80 |
| English | Legal | 50 |
| English | Encyclopedia | 31 |
| English | Finance | 20 |
| Multilingual | Web Crawl | 2,229 |
| Multilingual | Parallel corpora | 55 |
| Source Code | GitHub | 583 |
Table 1: The pretraining data composition. Appendices A.1 and A.2 break down the multilingual and coding languages.
We pretrain the model for 8T tokens. Given that current state of the art LMs are pretrained for trillions of tokens, we want to experiment on top of a pretrained model that is emblematic of the type of models which the continued pretraining recipe would be used for.
3.1.2 Continued Pretraining
As the most likely scenario in continued pretraining is that the available datasets are exactly those that made up the pretraining set, the vast majority of our continued training data blend comprises the pretraining data sources. The only additional source of data is a set of question-and-answer (QA), alignment-style examples. Such examples have been shown to better extract stored knowledge within LMs (Allen-Zhu and Li, 2023). This set of QA data totals 2.8B tokens, and Table 2 highlights the categories of QA examples.
| Data type | Category | Tokens (B) |
| --- | --- | --- |
| QA | World Knowledge | 1.13 |
| QA | Reasoning | 0.92 |
| QA | STEM | 0.31 |
| QA | Chat | 0.26 |
| QA | Code | 0.19 |
Table 2: The five constituent categories of the QA, alignment style data.
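As an illustration of how such a new source can be folded into an existing blend, the hypothetical helper below (not from the paper) gives the QA set a fixed sampling share and rescales the remaining sources proportionally to their token counts:

```python
def blend_with_qa(source_tokens, qa_weight=0.10):
    """Assign the QA set a fixed sampling share (10%, the weight used in
    Section 5.1) and scale the existing sources proportionally to their
    token counts so the full distribution still sums to 1.

    source_tokens: dict mapping source name -> token count.
    Hypothetical helper for illustration only.
    """
    total = sum(source_tokens.values())
    weights = {name: (1 - qa_weight) * toks / total
               for name, toks in source_tokens.items()}
    weights["QA"] = qa_weight
    return weights
```

For example, `blend_with_qa({"Web Crawl": 5106, "GitHub": 583})` would shrink both existing sources by 10% to make room for the QA data.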
3.2 Model Architecture and Hyperparameters
We experiment using a 15B parameter decoder-only transformer (Vaswani et al., 2017) LM with causal attention masks. It has 3.2 billion embedding parameters and 12.5 billion non-embedding parameters. Additional architectural specifications include: 32 transformer layers, a hidden size of 6144, 48 attention heads, Rotary Position Embeddings (RoPE) (Su et al., 2023), squared ReLU activations in the MLP layers, a SentencePiece (Kudo and Richardson, 2018) tokenizer with a vocabulary size of 256k, no bias terms, and untied input-output embeddings. Additionally, we use grouped query attention (GQA) (Ainslie et al., 2023) with 8 KV heads.
The model is pretrained with a sequence length of 4,096 and uses batch size rampup over the first 5% of pretraining tokens, starting from a batch size of 384 and building up to one of 1,152. We use a cosine learning rate schedule, with warmup of 16B tokens, to decay from a maximum learning rate (LR) of $\eta_{max}=4.5e\text{-}4$ to $\eta_{min}=4.5e\text{-}5$ . We train using the AdamW (Loshchilov and Hutter, 2019) optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , and a weight decay of 0.1. In continued pretraining, the only hyperparameter that is altered is the learning rate schedule.
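For concreteness, the pretraining LR schedule above can be sketched as follows (linear warmup is an assumption; the text only states the warmup length):

```python
import math

def cosine_lr(tokens_seen, total_tokens=8e12, eta_max=4.5e-4,
              eta_min=4.5e-5, warmup_tokens=16e9):
    """Cosine decay from eta_max to eta_min over the pretraining token
    budget, with warmup over the first 16B tokens. Linear warmup is an
    assumption; the paper does not specify the warmup shape."""
    if tokens_seen < warmup_tokens:
        # Warm up linearly from 0 to eta_max.
        return eta_max * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at $\eta_{max}$ when warmup ends and reaches $\eta_{min}$ exactly at the final pretraining token.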
3.3 Evaluation
We evaluate the model using a representative set of tasks to test its change in abilities across the English, multilingual, and coding domains. To assess English capabilities, we evaluate on the widely used MMLU (Hendrycks et al., 2020) and Hellaswag (Zellers et al., 2019) benchmarks. MMLU measures the model's world knowledge across 57 domains while Hellaswag assesses commonsense reasoning ability within natural language inference. For our multilingual evaluations, we use the Multilingual Grade School Mathematics (MGSM) (Shi et al., 2022) benchmark and specifically report the average accuracy across the language subset of Spanish, Japanese, and Thai, as they represent a high-, medium-, and low-resource language respectively. Lastly, to assess the model's coding capabilities we utilize the Python code generation task of HumanEval (Chen et al., 2021) with evaluations reported in the pass@1 (Kulal et al., 2019) setting. In our results below, we report the average score across all four of these tasks, with fully detailed evaluation scores shared in the Appendix.
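HumanEval pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); a minimal sketch (for pass@1 it reduces to the fraction of correct generations):

```python
from math import comb

def pass_at_k(n, c, k=1):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c are correct, passes the unit tests."""
    if n - c < k:
        # Every possible draw of k samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```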
4 Continued Pretraining Recipe
The experimental findings which constitute our continued pretraining recipe are shared below:
Recipe

- Start with a data distribution that is similar to the pretraining set but places larger weight on high-quality sources, before transitioning to a second distribution that incorporates QA data and upweights sources in areas of model weakness.
- The learning rate schedule should start from the $\eta_{min}$ of the pretrained model and decay with cosine annealing to $\frac{\eta_{min}}{100}$.
- The switch between data distributions should occur at $\frac{\eta_{max}}{5}$ in the learning rate schedule.
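Assuming the continued-pretraining schedule keeps the same cosine form as pretraining, the token fraction at which the LR crosses $\frac{\eta_{max}}{5}$ (and hence where the distribution switch happens) can be found by inverting the decay; a sketch with hypothetical defaults:

```python
import math

def switch_fraction(eta_start=4.5e-5, eta_end=4.5e-7, switch_lr=9e-6):
    """Fraction of the continued-training token budget at which a cosine
    decay from eta_start to eta_end reaches switch_lr. Defaults follow the
    recipe: eta_start is the pretrained model's eta_min, eta_end is
    eta_start / 100, and switch_lr is eta_max / 5 = 9e-6.
    Hypothetical helper, assuming a pure cosine schedule with no warmup."""
    # Invert lr = eta_end + 0.5 * (eta_start - eta_end) * (1 + cos(pi * x))
    cos_term = 2.0 * (switch_lr - eta_end) / (eta_start - eta_end) - 1.0
    return math.acos(cos_term) / math.pi
```

Under these assumptions the switch lands at roughly 71% of the continued-training token budget, i.e. around 213B of a 300B-token run.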
5 Experiments
The results of the pretrained base model are shown in Table 3. The aim of our continued pretraining recipe is to define steps that maximally improve upon this benchmark. All detailed experiments perform continued pretraining for 300B tokens. Additionally, we note that in our experiments we choose to load the optimizer state from the pretrained model, as we found a negligible difference in evaluation accuracy between loading the optimizer state and initializing it from scratch. Thus, we expect our findings to hold whether or not eventual practitioners have the optimizer state of the pretrained model available.
| Model | Average Accuracy |
| --- | --- |
| Pretrained | 48.9 |
Table 3: Model accuracy after 8T tokens of pretraining. Per-task evaluation scores are shared in Table 12; the model particularly struggles on tasks that assess STEM-based reasoning capabilities.
5.1 Data Distribution
![Figure 1](acl-style-files/figures/GB_distrs_big_name.png)

Approximate weights (%) of each data source under the candidate GB distributions, as read from the chart:

| Data source | Pretraining | Reweight Domains | Pretraining w/ HQ Web | No Web | Upweight Non Web w/ HQ Web |
| --- | --- | --- | --- | --- | --- |
| Web Crawl | 46 | 53 | 46 | 0 | 12 |
| Books | 3 | 4 | 3 | 13 | 10 |
| News Articles | 5 | 5 | 4 | 5 | 4 |
| Papers | 3 | 4 | 3 | 16 | 13 |
| Encyclopedia | 1 | 1 | 1 | 11 | 9 |
| Legal | 1 | 1 | 1 | 2 | 2 |
| Finance | 1 | 1 | 1 | 5 | 4 |
| Misc. | 9 | 10 | 8 | 18 | 15 |
| Multilingual | 0 | 5 | 0 | 0 | 15 |
| Code | 15 | 15 | 15 | 0 | 15 |
Figure 1: Breakdown of the various distributions considered for the General Blend (GB). We use Upweight Non Web w/ High Quality Web as the GB moving forward given its strong performance across all evaluation areas.
A crucial component of any training run is the data distribution: it defines the information a model sees and directly impacts the model's capabilities. As continued pretraining builds on top of a model which has already seen a given pretraining distribution, it is important to define a data distribution which allows the model to learn new concepts without deviating so far from the pretraining distribution that the model begins to experience training instability and accuracy regression. Through a series of runs which examine what compositions of data distributions best improve the abilities of a pretrained model, we identify general characteristics that can be applied across most continued pretraining scenarios. In these experiments, we use a learning rate schedule that starts from $\eta_{min}$ and decays to 0 with cosine annealing.
First, we examine whether the inclusion of QA data, which improves the ability of a model to extract stored knowledge (Allen-Zhu and Li, 2023), improves model accuracy. Coupled with this question is another on how to best incorporate the QA data, or more generally any dataset not contained within the pretraining data distribution, into the continued training run: immediately at the beginning and throughout the entirety of continued training, or reserved until the end of continued training following a curriculum learning setup (Soviany et al., 2022; Blakeney et al., 2024). We hypothesize that inclusion of new data sources at the beginning of continued pretraining allows the model to best learn the new information, but may cause learning instabilities that could be mitigated by showing the new dataset at the end of the run when the learning rate is less aggressive. To answer these questions, we compare continued training entirely with the pretraining data blend, entirely with a QA data blend, and with a mix of the two where we start with the pretraining blend and switch to the QA data blend late in the training run. The QA data blend in this scenario adds the QA dataset to the pretraining data distribution with a weight of 10%.
| Data blend | Average Accuracy |
| --- | --- |
| Pretraining | 51.5 |
| QA | 53.4 |
| Pretraining (250B), QA (50B) | 54.3 |
Table 4: Using two data distributions, with the QA data appearing in the latter, leads to the largest improvement via continued pretraining. Parentheses indicate the number of training tokens for each blend. Per-task evaluation scores are shared in Table 13.
Table 4 illustrates that the incorporation of QA data markedly outperforms solely using existing data from the pretraining set. Additionally, first using the pretraining data blend for the majority of training tokens before transitioning to the QA data blend at the end of continued pretraining exhibits improved accuracy compared to using the QA blend throughout the entirety of training. This indicates that continued pretraining runs should begin with a data distribution which more closely aligns to the pretraining one followed by a blend that then introduces new data. Moving forward, we refer to the initial blend as the general blend, GB, and the latter blend as the QA blend, QB, and discuss how they can be refined to realize further improvements.
We hypothesize that the optimal GB will be one which places greater emphasis on high quality data sources and areas of model weakness, without deviating too far from the pretraining distribution. Such a blend will enhance knowledge in needed areas and prime the model for the QB blend without worry of experiencing large training instabilities. Figure 1 illustrates the various GB distributions we consider; in addition to upweighting sources of interest, we either subset web crawl to just high quality documents, as identified by being in the bottom quartile of perplexity scores from a KenLM model (Heafield, 2011) trained on Wikipedia, or remove web crawl altogether. Experimenting with the various GB distributions for all 300B tokens of continued training, Table 5 shows that each improves upon the pretraining distribution. Even though it does not achieve the highest average accuracy, we choose Upweight Non Web with High Quality Web as the GB moving forward, because compared to others, it most consistently achieves high scores across all considered tasks as shown in Table 13.
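The perplexity-quartile filtering described above can be sketched generically; `high_quality_subset` is a hypothetical helper that assumes each document has already been scored (e.g. by a KenLM model trained on Wikipedia):

```python
import statistics

def high_quality_subset(docs_with_ppl):
    """Keep documents whose perplexity falls in the bottom quartile
    (lowest 25%, i.e. closest to the reference LM); stands in for the
    KenLM-based web-crawl filtering described above.

    docs_with_ppl: list of (doc, perplexity) pairs, already scored.
    """
    cutoff = statistics.quantiles([p for _, p in docs_with_ppl], n=4)[0]
    return [doc for doc, p in docs_with_ppl if p <= cutoff]
```

Lower perplexity under the Wikipedia-trained reference model is taken as a proxy for higher document quality, so the filter keeps the lowest-perplexity quartile.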
| GB candidate | Average Accuracy |
| --- | --- |
| Pretraining | 51.5 |
| Reweight Domains | 51.7 |
| Pretraining w/ High Quality Web | 52.5 |
| No Web | 52.9 |
| UW Non Web w/ High Quality Web | 52.0 |
Table 5: Evaluation results of various GB candidate distributions. Per-task evaluation scores are shared in Table 13.
With a GB distribution in place, we now look to define the QB distribution by first refining the weights placed on the sources within the QA data and then optimizing the QB distribution as a whole. In the initial QB distribution, the QA data was added as is; this weighting is shown as QA blend 1 in Figure 2. Given that the pretrained model struggles on STEM tasks, we create two additional blends that both upweight the QA STEM data while maintaining the original weight of either the QA world knowledge data (blend 2) or the QA chat data (blend 3), as seen in Figure 2. We choose to maintain the weight on world knowledge and chat information because such examples cover a broad range of topics and help better align model responses to questions, respectively. Table 6 highlights that upon adding each of the QA blends to the initial QB distribution following 250B tokens of the identified GB, QA data that emphasizes both STEM and chat information leads to the best results.
![Figure 2](acl-style-files/figures/QB_qa_distr_big_font.png)

Approximate weights (%) of each QA category under the three candidate QA blends, as read from the chart:

| QA category | Blend 1 (Balanced) | Blend 2 (+STEM, +World Knowledge) | Blend 3 (+STEM, +Chat) |
| --- | --- | --- | --- |
| Chat | 9 | 8 | 9 |
| Reasoning | 35 | 31 | 32 |
| STEM | 5 | 11 | 10 |
| Code | 8 | 7 | 6 |
| World Knowledge | 42 | 43 | 40 |
Figure 2: Various distributions of QA data. We use Blend 3.
| QA blend | Average Accuracy |
| --- | --- |
| QA 1 | 54.3 |
| QA 2 (+STEM, +World Knowledge) | 53.0 |
| QA 3 (+STEM, +Chat) | 54.9 |
Table 6: Evaluation results of various QA blend candidates. Per-task evaluation scores are shared in Table 13.
We now incorporate the QA data within the overall QB distribution. In previous runs, the QB distribution, aside from the QA dataset, exactly mirrored the pretraining set. We define a new series of distributions based on more aggressive upweighting of sources in areas of model weakness and on the amount of weight placed on the QA dataset, as seen in Figure 4. Table 7 shows that the aggressive weighting in the QB is beneficial, and we use the QB termed QA blend moving forward. With refined GB and QB distributions, the average evaluation accuracy has improved from 48.9 for the pretrained model to 55.4, a 13% improvement.
| QB candidate | Average Accuracy |
| --- | --- |
| Pretraining blend w/ QA data | 54.3 |
| General blend w/ QA data | 54.2 |
| QA | 55.4 |
| QA w/ Upweighted STEM | 54.4 |
| QA w/ 1.5e QA data | 54.9 |
| QA w/ 3.5e QA data | 54.4 |
Table 7: Evaluation results of various QB candidate distributions. Per-task evaluation scores are shared in Table 13.
![Figure 3](acl-style-files/figures/just_decay_LRs.png)

The plot shows three cosine decay schedules over 300B tokens, each starting from an LR of 4.5e-5 and decaying to a minimum LR of (1/100)·Max LR, (1/1000)·Max LR, or 0, respectively; a shaded region marks where the QA blend is used (tokens 250B to 300B).
Figure 3: Cosine decay schedules with a Max LR of $4.5e\text{-}5$ . Each schedule differently prioritizes LR magnitude and slope of decay.
<details>
<summary>acl-style-files/figures/QB_distrs.png Details</summary>

### Visual Description
## Bar Chart: Weight (%) by Data Source and QA Method
### Overview
The chart visualizes the distribution of "Weight (%)" across data sources (e.g., Web Crawl, Books, Papers) for five candidate blend distributions (e.g., General Blend w/ QA, QA Blend w/ Upweight STEM). Each data source has five bars, one per blend, with weights ranging from 0% to 35%.
### Components/Axes
- **X-axis (Data Source)**: Categories include Web Crawl, Books, News Articles, Papers, Encyclopedia, Legal, Finance, Misc., Multilingual, Code, and QA.
- **Y-axis (Weight %)**: Scale from 0% to 35%, with increments of 5%.
- **Legend**: Five blends with distinct colors:
1. **General Blend w/ QA** (light blue)
2. **QA Blend** (gray)
3. **QA Blend w/ Upweight STEM** (dark green)
4. **QA Blend w/ 1.5e QA** (teal)
5. **QA Blend w/ 3.5e QA** (dark blue)
### Detailed Analysis
Approximate weights (%) per data source and blend:

| Data Source | General Blend w/ QA | QA Blend | QA Blend w/ Upweight STEM | QA Blend w/ 1.5e QA | QA Blend w/ 3.5e QA |
| --- | --- | --- | --- | --- | --- |
| Web Crawl | ~12 | ~2 | ~1 | ~2.5 | ~2.5 |
| Books | ~10 | ~10 | ~15 | ~16 | ~15 |
| News Articles | ~4 | ~2 | ~1 | ~2.5 | ~2.5 |
| Papers | ~13 | ~18 | ~30 | ~18 | ~16 |
| Encyclopedia | ~9 | ~8 | ~13 | ~8 | ~7 |
| Legal | ~2 | ~8 | ~5 | ~8 | ~7 |
| Finance | ~4 | ~2 | ~1 | ~3 | ~2 |
| Misc. | ~15 | ~7 | ~10 | ~11 | ~10 |
| Multilingual | ~3 | ~3 | ~3 | ~5 | ~3 |
| Code | ~15 | ~15 | ~15 | ~15 | ~12 |
| QA | ~12 | ~12 | ~12 | ~10 | ~20 |
### Key Observations
1. **Papers** dominate in **QA Blend w/ Upweight STEM** (~30%), reflecting that blend's focus on technical/STEM content.
2. The **QA** data source carries its highest weight in **QA Blend w/ 3.5e QA** (~20%), consistent with that blend epoching the QA data most heavily.
3. **Web Crawl** receives substantial weight only in the **General Blend w/ QA** (~12%); all QB variants sharply downweight web data (~1-2.5%).
4. **Code** is weighted fairly consistently (~12-15%) across all blends.
### Interpretation
The blends differ mainly in how aggressively they shift weight away from Web Crawl and toward higher-quality sources (Books, Papers, Encyclopedia) and QA data. The Upweight STEM variant concentrates weight on Papers (~30%), while the 1.5e and 3.5e variants vary the number of epochs taken over the QA data, trading weight between the QA source and the remaining domains.
</details>
Figure 4: Breakdown of the various distributions considered for the QB. $N$e refers to $N$ epochs of the QA data. The final chosen distribution is shown as QA Blend, which used 2 epochs of QA data.
5.2 Learning Rate Schedule
The learning rate schedule greatly impacts the training dynamics and efficacy of continued pretraining (Gupta et al., 2023; Ibrahim et al., 2024; Winata et al., 2023).
In our above continued pretraining experiments, the learning rate schedule starts at a maximum LR of $\eta_{max_{\text{ct}}}=4.5e\text{-}5$, which is equal to $\eta_{min}$, and decays to a minimum LR of 0 using cosine annealing. As seen in Figure 3, a minimum LR of 0 facilitates a steep slope of decay, but the magnitude of the LR is severely reduced, especially over the tokens where the QB is used, which may limit the model's ability to extract full utility from the QA data. To understand the trade-off between these two characteristics of the learning rate schedule in continued pretraining runs, we experiment with two additional minimum learning rate values: $\frac{\eta_{max_{\text{ct}}}}{10}=4.5e\text{-}6$ and $\frac{\eta_{max_{\text{ct}}}}{100}=4.5e\text{-}7$.
| Min LR | Avg. Accuracy |
| --- | --- |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{10}$ | 54.8 |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{100}$ | 55.7 |
| Decay to 0 | 55.4 |
Table 8: Evaluation results of learning rate schedules with varying Min LR values. Per-task evaluation scores are shared in Table 14.
Table 8 highlights that it is in fact best to strike a middle ground between LR magnitude and slope of decay, as a minimum LR of $\frac{\eta_{max_{\text{ct}}}}{100}$ achieves the best accuracy. This minimum LR allows for a learning rate schedule with reasonable decay over the QB tokens, unlike a minimum LR of $\frac{\eta_{max_{\text{ct}}}}{10}$, without severely sacrificing LR magnitude, as was the case with a minimum LR of 0.
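For concreteness, the cosine annealing schedule discussed here can be sketched as follows; the step granularity and function name are illustrative assumptions, while the Max and Min LR values are taken from the text (with the minimum of $\frac{\eta_{max_{\text{ct}}}}{100}$ corresponding to the best-performing schedule):

```python
import math

def cosine_lr(step, total_steps, max_lr=4.5e-5, min_lr=4.5e-7):
    """Cosine annealing from max_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Setting `min_lr=0.0` reproduces the baseline schedule that decays fully to zero.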
Experiments with varying learning rate warmup and maximum LR values led to accuracy regressions compared to the schedule detailed above. In addition, we ran ablations with a different annealing schedule, WSD (Hu et al., 2024); however, the results were not competitive with cosine annealing. Full details and results for both studies are shared in Appendix B.2.
5.3 Switch of Data Distributions
Until this point, we have switched between the GB and the QB after 250B tokens of continued pretraining. We believe this to be sub-optimal, as it is unclear how switching between distributions after a fixed number of tokens translates to continued training runs with different token horizons. We hypothesize that the optimal point for switching between the data distributions depends upon the learning rate schedule. Figure 5 highlights how both the number of tokens and the learning rate values for the QB would differ if the distribution switch occurred at progressively smaller fractions of the maximum LR. As the fraction goes to 0, both the slope of decay and the magnitude of the learning rate shrink. There is therefore likely an optimal point on the learning rate curve where both characteristics remain conducive to learning, yet the LR is not so aggressive that the data shift to the QB distribution causes training instability.
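Under a cosine schedule, the point at which the LR first falls to a given fraction of the maximum can be computed in closed form by inverting the schedule. The sketch below (function and argument names are illustrative) assumes decay to a minimum LR of 0 over the full token horizon:

```python
import math

def switch_point(total_tokens, fraction, max_lr=4.5e-5, min_lr=0.0):
    """Token count at which a cosine decay from max_lr to min_lr first
    reaches fraction * max_lr. Inverts:
    lr(t) = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * t / T))."""
    target = fraction * max_lr
    cos_val = 2.0 * (target - min_lr) / (max_lr - min_lr) - 1.0
    return total_tokens / math.pi * math.acos(cos_val)
```

For example, over a 300B-token horizon the schedule reaches half of the maximum LR at the 150B-token mark, consistent with the crossings shown in Figure 5.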
<details>
<summary>acl-style-files/figures/distribution_switch_LRs_background.png Details</summary>

### Visual Description
## Line Graph: Learning Rate (LR) vs. Tokens (B)
### Overview
The image is a line graph showing the continued pretraining learning rate schedule: a single monotonically decreasing curve of LR against tokens, with shaded vertical regions and dashed horizontal reference lines. The legend identifies four candidate switch points defined as fractions of the maximum LR.
### Components/Axes
- **X-axis (Tokens (B))**: Linear scale from 0 to 300.
- **Y-axis (LR)**: From 0 to 5e-5.
- **Legend**: Located in the top-right corner, with four entries:
- **Switch at (1/2)*Max LR**: Lightest shade (pale blue).
- **Switch at (1/5)*Max LR**: Light blue.
- **Switch at (1/10)*Max LR**: Medium blue.
- **Switch at (1/50)*Max LR**: Dark blue.
- **Dashed Horizontal Lines**: At LR values of 2.25e-5, 9e-6, 4.5e-6, and 9e-7, i.e., 1/2, 1/5, 1/10, and 1/50 of the Max LR of 4.5e-5.
### Detailed Analysis
- **Curve**: A smooth cosine decay starts at ~4.5e-5 at 0 tokens and approaches 0 as tokens approach 300B. It crosses the dashed reference lines at approximately:
- **2.25e-5 (1/2 Max LR)**: ~150B tokens.
- **9e-6 (1/5 Max LR)**: ~200B tokens.
- **4.5e-6 (1/10 Max LR)**: ~250B tokens.
- **9e-7 (1/50 Max LR)**: ~300B tokens.
- **Shaded Regions**: Each shaded band marks the QB token span implied by its switch point, beginning where the curve crosses the corresponding fraction of the Max LR and extending to the end of training; darker shades correspond to later switches and fewer QB tokens.
### Key Observations
1. **LR Decay**: The LR decreases along a cosine schedule as tokens increase.
2. **Switch Points**: Switching later in the schedule leaves fewer tokens, and lower LR values, for the QB; for example, switching at 1/2 of the Max LR gives the QB roughly 150B tokens, while switching at 1/10 leaves roughly 50B.
3. **Dashed Lines**: These act as reference points marking the LR values at which each candidate switch would occur.
### Interpretation
The figure visualizes the trade-off discussed in the text: as the switch fraction goes to 0, both the magnitude and the slope of the LR over the QB tokens shrink. The switch point must therefore leave enough learning rate, and enough tokens, for the QB to be learned effectively, without occurring so early, at so high an LR, that the distribution shift causes training instability.
</details>
Figure 5: How the number of QB tokens (the shaded region) varies based on different distribution switch points.
Table 9 highlights that switching between the GB and QB at $\frac{\eta_{max_{\text{ct}}}}{5}$ achieves the best accuracy and improves upon the heuristically chosen switch point by 0.4 points on average. To confirm that this distribution switch point holds at differing amounts of continued pretraining tokens, we ran an ablation at a scale of 100B tokens and found that $\frac{\eta_{max_{\text{ct}}}}{5}$ again maximized results, as seen in Table 18.
| Switch Point | Avg. Accuracy |
| --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 52.8 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 54.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 56.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 55.0 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 54.6 |
Table 9: Evaluation results of varying distribution switch points. Per-task evaluation scores are shared in Table 17.
This finalizes our continued pretraining recipe. We highlight the utility of this recipe as it allows the model to achieve an average accuracy of 56.1, which improves upon the natural baseline of continued training on the pretraining distribution, as shared in Table 4, by 9%.
6 Ablations
6.1 Varying Token Horizons
We show the efficacy of the identified continued pretraining recipe when used at varying numbers of continued training tokens. Table 10 illustrates that on continued training horizons from 100B to 1T tokens, the identified recipe consistently achieves improved evaluation results, realizing a 16% gain over the pretrained model when using 1T tokens of continued training. We do note that the slope of accuracy improvement from 300B to 1T tokens is lower than that from 100B to 300B tokens. We hypothesize that because long-horizon continued training mainly reuses documents from the pretraining set, the repeated epochs over the same data sources have decreasing marginal utility.
| Token Horizon | MMLU | Avg. Accuracy |
| --- | --- | --- |
| 0B | 59.3 | 48.9 |
| 100B | 63.0 | 55.0 |
| 300B | 63.8 | 56.1 |
| 1T | 65.3 | 56.8 |
Table 10: Performance of the continued pretraining (CPT) recipe across different token horizons. Per-task evaluation scores are shared in Table 19.
6.2 Document Mining
In an effort to improve the utility of the data sources that are seen for multiple epochs in long horizon continued pretraining runs, we aim to find a subset of examples that are most helpful for model improvement. As the QA dataset was shown to significantly boost model accuracies, we hypothesize that restricting each pretraining data source to the set of documents which are most similar to the QA examples would be beneficial. To do so, we use the E5-large-v2 (Wang et al., 2022) text embedding model to obtain an embedding for each document in our pretraining and QA sets. Using the Faiss library (Johnson et al., 2017), we efficiently perform a 50-nearest neighbor search across all these embeddings to obtain the 50 most similar, non-QA documents to each example in the QA set. The identified subset of examples constitutes 60B tokens, and we term this approach document mining.
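The nearest-neighbor step of document mining can be illustrated with a small sketch. The paper uses E5-large-v2 embeddings and Faiss for efficient billion-scale search; this hypothetical numpy version shows the same cosine-similarity k-NN logic at toy scale, with the function name and interface assumed for illustration:

```python
import numpy as np

def mine_documents(doc_embs, qa_embs, k=50):
    """For each QA embedding, find the k most similar document embeddings
    (by cosine similarity) and return the deduplicated set of doc indices."""
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    q = qa_embs / np.linalg.norm(qa_embs, axis=1, keepdims=True)
    sims = q @ d.T                          # (num_qa, num_docs) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return np.unique(top_k)
```

The deduplicated union of the per-example neighbor sets corresponds to the mined subset of documents described above.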
Table 11 shows a training run where we replace all non-QA data sources in the QB distribution solely with the examples identified via document mining. We find that these documents substantially improve the performance of the continued pretraining run and believe that document mining is a viable approach for extracting further utility from existing data sources.
| Setting | MMLU | Avg. Accuracy |
| --- | --- | --- |
| CT 1T | 65.3 | 56.8 |
| CT 1T w/ Mined Docs | 66.6 | 57.9 |
Table 11: Mining examples related to QA documents further improves accuracy. Per-task evaluation scores are shared in Table 20.
7 Conclusion
We investigate how to effectively continue training LMs to improve upon their existing capabilities. Our experiments show that it is especially important to carefully define the data distribution and learning rate decay schedule used during continued pretraining so that the model is able to smoothly transition away from the pretraining distribution and better learn the newly emphasized data sources. With these findings we propose a general recipe that model developers can use in order to perform continued pretraining on top of their own LMs and show that for our base model, we are able to improve cumulative accuracy by over 18%. We hope that this will be a starting point to enable future LMs to be developed through the reuse of existing models rather than retraining from scratch.
Limitations
In the development of our continued pretraining recipe, we only experiment along the axes of data distributions and hyperparameter configurations. Although we did not include them within our study, there may be added benefit in exploring other aspects, such as altering the learning algorithm. Additionally, given that our study is conducted on top of a model with a particular configuration, pretrained on a particular data distribution, our results are unlikely to extrapolate well to settings highly divergent from the one used in this study. Finally, we limited our goal within continued pretraining to improving the general purpose capabilities of the pretrained model; however, there are many additional angles to model reuse, such as domain specialization and the efficient addition of new knowledge into existing models.
References
- Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.
- Allen-Zhu and Li (2023) Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.1, knowledge storage and extraction. Preprint, arXiv:2309.14316.
- Anthropic (2024) Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku.
- Blakeney et al. (2024) Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, and Jonathan Frankle. 2024. Does your data spark joy? performance gains from domain upsampling at the end of training. Preprint, arXiv:2406.03476.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
- Caccia et al. (2021) Massimo Caccia, Pau Rodriguez, Oleksiy Ostapenko, Fabrice Normandin, Min Lin, Lucas Caccia, Issam Laradji, Irina Rish, Alexandre Lacoste, David Vazquez, and Laurent Charlin. 2021. Online fast adaptation and knowledge accumulation: a new approach to continual learning. Preprint, arXiv:2003.05856.
- Chaudhry et al. (2019) Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. Preprint, arXiv:1902.10486.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.
- DeepSeek-AI et al. (2024) DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024. Deepseek llm: Scaling open-source language models with longtermism. Preprint, arXiv:2401.02954.
- El-Kishky et al. (2019) Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2019. Ccaligned: A massive collection of cross-lingual web-document pairs. arXiv preprint arXiv:1911.06154.
- French (1999) Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Gemma Team (2024) Google DeepMind Gemma Team. 2024. Gemma: Open Models Based on Gemini Research and Technology.
- Gupta et al. (2023) Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual pre-training of large language models: How to (re)warm your model? Preprint, arXiv:2308.04014.
- Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
- Heafield (2011) Kenneth Heafield. 2011. Kenlm: Faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation, pages 187–197.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531.
- Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. Minicpm: Unveiling the potential of small language models with scalable training strategies. Preprint, arXiv:2404.06395.
- Ibrahim et al. (2024) Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. 2024. Simple and scalable strategies to continually pre-train large language models. Preprint, arXiv:2403.08763.
- Jang et al. (2023) Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2023. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models. Preprint, arXiv:2204.14211.
- Jang et al. (2022) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. 2022. Towards continual knowledge learning of language models. Preprint, arXiv:2110.03215.
- Jin et al. (2022) Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, and Xiang Ren. 2022. Lifelong pretraining: Continually adapting language models to emerging corpora. Preprint, arXiv:2110.08534.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus. Preprint, arXiv:1702.08734.
- Ke et al. (2023) Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of language models. Preprint, arXiv:2302.03241.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv preprint arXiv:1808.06226.
- Kulal et al. (2019) Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. 2019. Spoc: Search-based pseudocode to code. Preprint, arXiv:1906.04908.
- Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains. Preprint, arXiv:2402.10373.
- Lachaux et al. (2020) Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. Preprint, arXiv:2006.03511.
- Lesort et al. (2022) Timothée Lesort, Massimo Caccia, and Irina Rish. 2022. Understanding continual learning settings with data distribution drift analysis. Preprint, arXiv:2104.01678.
- Lin et al. (2024) Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. 2024. Rho-1: Not all tokens are what you need. Preprint, arXiv:2404.07965.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Preprint, arXiv:1711.05101.
- Loureiro et al. (2022) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. Timelms: Diachronic language models from twitter. Preprint, arXiv:2202.03829.
- Ma et al. (2023) Shirong Ma, Shen Huang, Shulin Huang, Xiaobin Wang, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. Ecomgpt-ct: Continual pre-training of e-commerce large language models with semi-structured data. Preprint, arXiv:2312.15696.
- OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Parmar et al. (2024) Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. 2024. Nemotron-4 15b technical report. Preprint, arXiv:2402.16819.
- Qin et al. (2022) Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. Elle: Efficient lifelong pre-training for emerging data. Preprint, arXiv:2203.06311.
- Robins (1995) Anthony V. Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci., 7:123â146.
- Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. 2019. Experience replay for continual learning. Preprint, arXiv:1811.11682.
- Schwenk et al. (2019) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019. Ccmatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944.
- Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. Preprint, arXiv:2205.12393.
- Shi et al. (2022) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. Preprint, arXiv:2210.03057.
- Soviany et al. (2022) Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey. Preprint, arXiv:2101.10382.
- Su et al. (2023) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. Roformer: Enhanced transformer with rotary position embedding. Preprint, arXiv:2104.09864.
- Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.
- Team et al. (2024) Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. 2024. Reka core, flash, and edge: A series of powerful multimodal language models. Preprint, arXiv:2404.12387.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
- Winata et al. (2023) Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, and Daniel Preotiuc-Pietro. 2023. Overcoming catastrophic forgetting in massively multilingual continual learning. Preprint, arXiv:2305.16252.
- Wu et al. (2024) Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. 2024. Llama pro: Progressive llama with block expansion. Preprint, arXiv:2401.02415.
- Yadav et al. (2023) Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, and Bing Xiang. 2023. Exploring continual learning for code generation models. Preprint, arXiv:2307.02435.
- Yang et al. (2024) Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. 2024. Pllama: An open-source large language model for plant science. Preprint, arXiv:2401.01600.
- Zan et al. (2022) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. Cert: Continual pre-training on sketches for library-oriented code generation. Preprint, arXiv:2206.06888.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In ACL.
- Çağatay Yıldız et al. (2024) Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, and Beyza Ermis. 2024. Investigating continual pretraining in large language models: Insights and implications. Preprint, arXiv:2402.17400.
Appendix A Data
A.1 Multilingual Data
The 53 multilingual languages contained within the pretraining set are: AR, AZ, BG, BN, CA, CS, DA, DE, EL, ES, ET, FA, FI, FR, GL, HE, HI, HR, HU, HY, ID, IS, IT, JA, KA, KK, KN, KO, LT, LV, MK, ML, MR, NE, NL, NO, PL, PT, RO, RU, SK, SL, SQ, SR, SV, TA, TE, TH, TR, UK, UR, VI, and ZH.
A.2 Code Data
The 43 programming languages contained within our pretraining set are: assembly, c, c-sharp, common-lisp, cpp, css, cuda, dart, dockerfile, fortran, go, haskell, html, java, javascript, json, julia, jupyter-scripts, lua, makefile, markdown, mathematica, omniverse, pascal, perl, php, python, R, restructuredtext, ruby, rust, scala, shell, sql, swift, systemverilog, tex, typescript, verilog, vhdl, visual-basic, xml, and yaml.
Appendix B Experiments
The evaluation results across all considered tasks are shared below for each of our experiments.
| Task | Accuracy |
| --- | --- |
| MMLU | 59.3 |
| HellaSwag | 80.4 |
| HumanEval | 31.1 |
| MGSM (ES, JA, TH) | 24.9 |
Table 12: Model accuracy after 8T tokens of pretraining. We find that the model struggles on STEM-based reasoning tasks, as evidenced by its low scores on MGSM and the STEM subtasks of MMLU.
B.1 Data Distribution
Table 13 shares the results across all tasks for each experiment mentioned within Section 5.1.
| Data Blend | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| Pretraining | 61.9 | 81.2 | 28.1 | 34.7 |
| QA | 62.0 | 78.7 | 32.9 | 40.1 |
| Pretraining (250B) + QA (50B) | 62.6 | 82.2 | 29.9 | 42.4 |
| Pretraining | 61.9 | 81.2 | 28.1 | 34.7 |
| Reweight Domains | 61.9 | 81.7 | 29.9 | 33.2 |
| Pretraining w/ High Quality Web | 62.2 | 80.9 | 34.1 | 32.9 |
| No Web | 62.3 | 81.8 | 29.9 | 37.7 |
| Upweight Non Web w/ High Quality Web | 62.6 | 81.4 | 31.7 | 32.1 |
| QA 1 | 63.0 | 82.4 | 29.9 | 41.9 |
| QA 2 (+STEM, +World Knowledge) | 63.9 | 82.3 | 29.3 | 36.7 |
| QA 3 (+STEM, +Chat) | 64.1 | 82.2 | 28.7 | 44.7 |
| QA | 64.2 | 82.4 | 30.5 | 44.5 |
| QA w/ Upweighted STEM | 64.1 | 82.3 | 28.1 | 42.9 |
| QA w/ 1.5e QA data | 64.1 | 82.2 | 28.7 | 44.7 |
| QA w/ 3.5e QA data | 64.4 | 82.4 | 27.4 | 43.3 |
Table 13: Per-task evaluation results of each experiment mentioned within Section 5.1 on defining data distributions for continued pretraining.
B.2 Learning Rate Schedule
| Decay Schedule | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.9 | 82.4 | 29.3 | 43.7 |
| Decay to $\frac{\eta_{max_{\text{ct}}}}{100}$ | 64.2 | 82.2 | 31.1 | 45.2 |
| Decay to 0 | 64.2 | 82.4 | 30.5 | 44.5 |
Table 14: Per-task evaluation results of the experiments mentioned in Table 8 on identifying an appropriate learning rate decay schedule for continued pretraining.
In identifying a learning rate schedule for continued pretraining, we experiment with various degrees of warmup and values of $\eta_{max_{\text{ct}}}$ . We use $\eta_{min}$ to denote the minimum learning rate of the pretrained model, which is $4.5e\text{-}5$ . The combinations we consider are: warmup from $\eta_{min}$ to $\eta_{max_{\text{ct}}}=1.5*\eta_{min}$ , warmup from $0.5*\eta_{min}$ to $\eta_{max_{\text{ct}}}=\eta_{min}$ , and warmup from 0 to the learning rate that the pretraining schedule would have reached had it been extended to cover the continued-training tokens (i.e., from 8T to 8.3T tokens). Figure 6 illustrates each of these schedules; the combinations were chosen to quantify different degrees of aggressiveness in applying warmup within a continued pretraining learning rate schedule.
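As a minimal sketch (our own helper, not the authors' implementation), the warmup-plus-cosine-decay schedules compared here can be expressed as follows; the step counts are illustrative, with one "step" standing in for an arbitrary unit of training progress:

```python
import math


def cosine_schedule(step, total_steps, lr_start, lr_peak, lr_end=0.0,
                    warmup_steps=0):
    """Linear warmup from lr_start to lr_peak, then cosine decay to lr_end."""
    if warmup_steps > 0 and step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))


# No-warmup setting: decay directly from eta_min = 4.5e-5 to eta_min / 100.
eta_min = 4.5e-5
lr = cosine_schedule(step=0, total_steps=300, lr_start=eta_min,
                     lr_peak=eta_min, lr_end=eta_min / 100, warmup_steps=0)
```

With `warmup_steps=0`, the schedule starts at `lr_peak` immediately, matching the no-warmup configuration that the experiments below find preferable.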
(Figure: learning-rate curves over the 300B continued-training tokens for the three warmup targets, 6.75e-5, 4.5e-5, and the expected extended-pretraining LR, each followed by cosine decay; the QA blend is applied over the final 50B tokens as all schedules decay toward their minimum.)
Figure 6: Cosine decay schedule with the various levels of warmup which we experiment with.
As highlighted in Table 15, we find that including any level of warmup within the continued training learning rate schedule causes regressions in evaluation accuracies, indicating that it is best to decay directly from $\eta_{min}$ .
| Warmup | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg |
| --- | --- | --- | --- | --- | --- |
| Warmup to $6.75e\text{-}5$ | 64.0 | 81.9 | 31.1 | 42.3 | 54.8 |
| Warmup to $4.5e\text{-}5$ | 64.0 | 82.1 | 32.9 | 41.5 | 55.1 |
| Warmup to Expected LR | 63.3 | 82.1 | 31.7 | 42.5 | 54.9 |
| No Warmup | 64.2 | 82.2 | 31.1 | 45.2 | 55.7 |
Table 15: Comparison of including warmup within learning rate schedules for continued pretraining. No warmup achieves the best evaluation results.
In addition to cosine annealing, we experiment with the WSD learning rate scheduler (Hu et al., 2024). Table 16 compares the best-found WSD setting against cosine annealing. The WSD schedule produces significantly lower evaluation accuracies, and we hypothesize that in continued pretraining, switching the decay schedule away from the one used during pretraining is harmful. Hence, for models pretrained with cosine annealing, the continued-training learning rate schedule should also use cosine annealing.
| Schedule | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg |
| --- | --- | --- | --- | --- | --- |
| WSD | 63.6 | 80.2 | 28.1 | 39.5 | 52.8 |
| Cosine Annealing | 64.2 | 82.2 | 31.1 | 45.2 | 55.7 |
Table 16: We find that WSD causes significant regression in evaluation accuracy compared to cosine annealing. Both learning rate schedules were decayed to $\frac{\eta_{max_{\text{ct}}}}{100}$ .
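For reference, a minimal sketch of a WSD-style (warmup-stable-decay) schedule; we implement the decay phase as a linear ramp for simplicity (Hu et al., 2024 describe several decay variants), and all parameter names are ours:

```python
def wsd_schedule(step, total_steps, lr_peak, lr_end,
                 warmup_frac=0.0, decay_frac=0.2):
    """Warmup-Stable-Decay: warm up, hold lr_peak for most of training,
    then decay linearly to lr_end over the final decay_frac of steps."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_steps:
        return lr_peak * step / warmup_steps  # linear warmup phase
    if step < decay_start:
        return lr_peak  # stable phase at the peak learning rate
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return lr_peak + (lr_end - lr_peak) * progress  # linear decay phase
```

Unlike cosine annealing, the learning rate stays flat at `lr_peak` until the decay window begins, which is the structural difference the comparison in Table 16 probes.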
B.3 Switch of Data Distributions
Table 18 highlights that the findings of our experiments in Section 5.3 also hold at the continued training token horizon of 100B tokens. This indicates that regardless of the number of continued training tokens, transitioning between the GB and QB distributions at $\frac{\eta_{max_{\text{ct}}}}{5}$ is optimal.
| Distribution Switch | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 65.0 | 78.7 | 29.9 | 37.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 60.9 | 81.6 | 32.3 | 44.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 63.8 | 82.2 | 32.3 | 46.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.9 | 82.2 | 29.3 | 44.7 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 63.3 | 81.6 | 31.1 | 42.3 |
Table 17: Per-task evaluation results of the experiments mentioned in Table 9 on how to switch between data distributions in continued pretraining.
| Distribution Switch | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg |
| --- | --- | --- | --- | --- | --- |
| At $\eta_{max_{\text{ct}}}$ (from step 0) | 64.1 | 79.2 | 31.1 | 40.0 | 53.6 |
| At $\frac{\eta_{max_{\text{ct}}}}{2}$ | 63.2 | 81.6 | 27.4 | 44.1 | 54.1 |
| At $\frac{\eta_{max_{\text{ct}}}}{5}$ | 63.0 | 81.9 | 31.7 | 43.6 | 55.0 |
| At $\frac{\eta_{max_{\text{ct}}}}{10}$ | 63.6 | 81.8 | 30.5 | 39.7 | 53.9 |
| At $\frac{\eta_{max_{\text{ct}}}}{50}$ | 63.3 | 81.6 | 31.1 | 42.3 | 54.6 |
Table 18: Ablation of the data distribution switch experiments at a continued pretraining scale of 100B tokens. As found for the 300B token continued training horizon, switching distributions at $\frac{\eta_{max_{\text{ct}}}}{5}$ achieves the highest accuracy.
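Given a cosine schedule decaying from $\eta_{max_{\text{ct}}}$ to $\frac{\eta_{max_{\text{ct}}}}{100}$, the step at which the learning rate first falls to a given fraction of its peak, and hence where the GB-to-QB switch would occur, can be found by inverting the schedule. A sketch (our own helper, not from the paper; one "step" here stands for 1B tokens):

```python
import math


def switch_step(total_steps, lr_peak, lr_end, frac=5):
    """Return the step at which a cosine schedule decaying from lr_peak to
    lr_end first falls to lr_peak / frac (the proposed GB -> QB switch point)."""
    target = lr_peak / frac
    # Invert lr(t) = lr_end + 0.5 * (lr_peak - lr_end) * (1 + cos(pi * t / T)).
    cos_val = 2.0 * (target - lr_end) / (lr_peak - lr_end) - 1.0
    return int(total_steps * math.acos(cos_val) / math.pi)


# eta_max_ct = 4.5e-5 decayed to eta_max_ct / 100 over a 300B-token horizon.
switch = switch_step(300, lr_peak=4.5e-5, lr_end=4.5e-7, frac=5)
```

Under these assumptions the switch lands roughly 70% of the way through the continued-training horizon, i.e., the QB distribution occupies the final stretch of the learning rate decay.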
Appendix C Ablations
C.1 Varying Token Horizons
When extending the number of continued pretraining tokens to 1T, we found that our existing QB distribution would cause the small QA dataset to be trained on for a large number of epochs. To correct for this, we reduced the weight on the QA dataset so that it is trained on for no more than 4 epochs. Figure 7 shows the QB distribution used at the 1T continued-pretraining token scale.
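The epoch cap described above can be sketched as a simple blend reweighting; the dataset names and token counts below are illustrative, not the paper's actual blend:

```python
def cap_epochs(weights, dataset_tokens, total_tokens, max_epochs=4.0):
    """Cap each dataset's blend weight so it is epoched at most max_epochs
    times over total_tokens, then scale up the uncapped datasets so the
    blend weights still sum to 1 (single-pass approximation: assumes the
    rescaling does not push any uncapped dataset over its own cap)."""
    caps = {n: max_epochs * dataset_tokens[n] / total_tokens for n in weights}
    capped = {n for n, w in weights.items() if w > caps[n]}
    fixed_mass = sum(caps[n] for n in capped)
    free_mass = sum(w for n, w in weights.items() if n not in capped)
    scale = (1.0 - fixed_mass) / free_mass
    return {n: caps[n] if n in capped else w * scale
            for n, w in weights.items()}


# Illustrative blend: a 10B-token QA set in a 1T-token run is capped at
# 4 * 10B / 1T = 4% of the blend, with the freed mass spread elsewhere.
blend = cap_epochs(
    weights={"qa": 0.12, "code": 0.15, "other": 0.73},
    dataset_tokens={"qa": 10e9, "code": 200e9, "other": 900e9},
    total_tokens=1e12,
)
```

The redistribution is proportional to the remaining datasets' original weights, which is one natural choice; other renormalization schemes would also satisfy the epoch constraint.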
(Figure: per-domain weights of the original QA blend versus the 1T-token QA blend across the Web Crawl, Books, News Articles, Papers, Encyclopedia, Legal, Finance, Misc., Multilingual, Code, and QA domains; the 1T blend reduces the QA weight from roughly 12% to 4% and redistributes that mass across the remaining domains, most notably Code and Legal.)
Figure 7: Distribution of the QB blend when extending the number of continued pretraining tokens to 1T.
| Continued Training Tokens | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) | Avg |
| --- | --- | --- | --- | --- | --- |
| 0B | 59.3 | 80.4 | 31.1 | 24.9 | 48.9 |
| 100B | 63.0 | 81.9 | 31.7 | 43.6 | 55.0 |
| 300B | 63.8 | 82.2 | 32.3 | 46.1 | 56.1 |
| 1T | 65.3 | 82.4 | 34.1 | 45.5 | 56.8 |
Table 19: Per-task evaluation results of the experiments mentioned in Table 11 on how the identified continued pretraining recipe performs at varying amounts of continued training tokens.
| Setting | MMLU | HellaSwag | HumanEval | MGSM (ES, JA, TH) |
| --- | --- | --- | --- | --- |
| CT 1T | 65.3 | 82.4 | 34.1 | 45.5 |
| CT 1T w/ Mined Docs | 66.6 | 81.7 | 36.6 | 46.7 |
Table 20: Per-task evaluation results of the experiments mentioned in Table 11 on how document mining increases the utility of existing data sources in continued pretraining.