# How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
**Authors**:
- Mohamed El Amine Seddik (Technology Innovation Institute, Abu Dhabi, UAE)
- Suei-Wen Chen (NYU Abu Dhabi, Abu Dhabi, UAE)
- Soufiane Hayou (Simons Institute, Berkeley, USA)
- Pierre Youssef (NYU Abu Dhabi, Abu Dhabi, UAE)
- Merouane Debbah (Khalifa University of Science and Technology, Abu Dhabi, UAE)
Abstract
The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.
1 Introduction
The large-scale adoption of large language models (e.g. ChatGPT (OpenAI, 2024)) will inevitably lead to enormous amounts of synthetic (generated) data “polluting” the original human-created web data. Since language models are trained on web data, this raises concerns about the impact of this synthetic data on the next generations of LLMs. One can think of a train-generate loop where models from the current generation generate data that contaminate existing web data, and the next-generation models are trained on this contaminated data. This loop was studied in several works (Shumailov et al., 2023; Alemohammad et al., 2023; Briesch et al., 2023) where the authors conclude that synthetic data generally hurts performance as the number of generate-train iterations increases, and that (naturally) the impact on model performance is linked to the amount of real data in the training set. A particular phenomenon, coined model collapse (Shumailov et al., 2023), refers to the model’s tendency to produce limited or repetitive outputs, so that recursive training on such outputs forgets the tails of the original underlying distribution of real data. This was further studied in (Guo et al., 2023) where the authors show that recursive training on synthetic data leads to a “self-consuming” loop that affects linguistic diversity.
Intuitively, model collapse results from the distribution shift that occurs when training generative models recursively on synthetic data from previous-generation models. Shumailov et al. (2023) discussed two main sources of model collapse: 1) statistical approximation error, which stems from the fact that generative models are trained on a finite number of samples, so the learned model cannot capture all the information about the original distribution; and 2) functional approximation error, which results from the models being insufficiently expressive in real implementations, even though neural networks are universal function approximators from a theoretical standpoint. The authors provide further theoretical intuition to characterize the effect of these approximation errors, relying on simple mathematical models such as a one-dimensional Gaussian distribution.
In this paper, we aim to provide a rigorous theoretical framework to understand the effects of recursive training with synthetic data. In particular, we focus on the statistical approximation error and introduce a simple next-token-prediction language model to characterize model collapse. Our model allows us to gain insights into the behaviour of the self-consuming train-generate loop leading to model collapse. From a theoretical standpoint, we consider two main recursive training scenarios:
- Fully Synthetic: Training with data sampled from the previous generation model.
- Partially Synthetic: Training with a mixture of data sampled from the previous generation model and original data.
We demonstrate that model collapse always occurs in the first scenario and characterize the rates at which it occurs. Furthermore, in the second scenario, we provide an upper bound on the sample size of generated data below which model collapse can eventually be attenuated. Our results are further confirmed through simulations of general scenarios with the introduced statistical model, as well as with realistic GPT2-style language models on real data. Our findings suggest that the amount of generated data should be considerably smaller compared to the original data to avoid model collapse.
Related work:
With the adoption of generative Large Language and Vision models, the amount of synthetic data on the web is growing at an unprecedented rate – see for example (del Rio-Chanona et al., 2023) where the authors conducted an empirical study of the amount of synthetic data via activity monitoring on Stack Overflow, and (Alemohammad et al., 2023) where the authors showed that a dataset used to train Stable Diffusion contains synthetic data (Schuhmann et al., 2022). In fact, practitioners are willingly using synthetic data to train next-generation models (Ben Allal et al., 2024; Gunasekar et al., 2023; Chen et al., 2024).
As mentioned above, several works studied the recursive training loop where next-generation models are trained on synthetic data generated from previous-generation models. Shumailov et al. (2023) studied model collapse, a phenomenon that occurs in recursive training where the quality of model outputs tends to degrade by becoming, e.g., repetitive. A similar phenomenon was studied in (Alemohammad et al., 2023), where the authors call it Model Autophagy Disorder (MAD). Another empirical study by (Briesch et al., 2023) examined this same self-consuming phenomenon and observed that the degeneracy rate (naturally) depends on the amount of fresh data in the training sample. In the same direction, several works showed that incorporating synthetic data in training can hurt the performance of trained diffusion models (Bohacek & Farid, 2023; Martínez et al., 2023a; b).
Only a few works have tackled this question from a theoretical perspective. Shumailov et al. (2023) considered a simple recursive Gaussian distribution to provide an intuition as to why model collapse occurs, but no training is considered in that work. In (Fu et al., 2024), the authors studied recursive training of diffusion models (generally used to learn distributions over images) and obtained an upper bound on the total variation distance between the distribution of the original data and that of the synthetic data after $T$ generations. Dohmatob et al. (2024b) studied the impact of synthetic data on scaling laws and, in a simple setting of linear regression, Dohmatob et al. (2024a) studied the behaviour of the test error across generations and showed a linear dependence of the degradation on the generation number.
In this work, we are interested in characterizing the distribution shift in synthetic data generated with a language model, as the number of generations increases. We consider a linear Softmax classifier for next-token prediction and study the distribution of the learned conditional probabilities as the number of generations increases. The closest work to ours is (Fu et al., 2024) where the authors study the distribution of synthetic data generated by a diffusion model instead, whereas the statistical model we are considering is closer in spirit to the realm of language models.
The remainder of the paper is organized as follows. Section 2 presents our theoretical setup where we introduce our statistical model and the considered recursive training scenarios. Our main theoretical results are presented in Section 3. We further present some experiments in Section 4.1 to support our findings. Finally, Section 5 concludes the paper.
2 Theoretical Setup
2.1 Statistical Language Model
We consider a language model with vocabulary size $s$ and context length $\ell$ , and we further denote by $c$ the number of possible contexts, which is at most $s^{\ell}$ . We suppose that the language data is generated from some unknown conditional probabilities given the contexts. That is, the probability of the next token being $k∈[s]:=\{1,...,s\}$ given some context ${\bm{j}}=(j_{1},...,j_{\ell})∈[c]$ is denoted by
$$
\displaystyle p_{{\bm{j}}k}:={\mathbb{P}}\{Y=k\mid X={\bm{j}}\},
$$
where $X$ and $Y$ denote discrete random variables representing a context and the next-token respectively. In practice, we do not have access to the true conditional probabilities $p_{{\bm{j}}k}$ but rather a (large) corpus sampled according to $p_{{\bm{j}}k}$ . In other words, we are given a dataset $\{({\bm{x}}_{l},{\bm{y}}_{l})\}_{l∈[M]}$ of $M$ samples of contexts and next-token pairs represented by ${\bm{x}}_{l}∈\{{\bm{e}}_{1},...,{\bm{e}}_{c}\}$ and ${\bm{y}}_{l}∈\{{\bm{e}}_{1},...,{\bm{e}}_{s}\}$ where ${\bm{e}}_{i}$ ’s denote the canonical vectors.
Given this dataset, we consider approximating the underlying conditional probabilities via the Softmax classifier, which entails minimizing the categorical cross-entropy loss function:
$$
\displaystyle\operatorname*{arg\,min}_{{\mathbf{W}}=[{\bm{w}}_{1},\ldots,{\bm{w}}_{s}]\in{\mathbb{R}}^{c\times s}}-\frac{1}{M}\sum_{l=1}^{M}{\bm{y}}_{l}^{\top}\log\sigma\left({\mathbf{W}}^{\top}{\bm{x}}_{l}\right), \tag{1}
$$
where $\sigma({\bm{v}})=\frac{\exp({\bm{v}})}{\sum_{k=1}^{s}\exp(v_{k})}$ is the Softmax function and the functions $\exp$ and $\log$ are applied entry-wise. Note that in current state-of-the-art language models, the ${\bm{x}}_{l}$ ’s are context representations computed via transformer models, whereas, in our setting, we choose to work with one-hot embeddings as representations for tractable theoretical analysis. Solving the above objective (see Appendix B.1) yields the estimated conditional probability $\hat{p}_{{\bm{j}}k}$ of $p_{{\bm{j}}k}$ , which can be expressed as the following empirical mean:
$$
\displaystyle\hat{p}_{{\bm{j}}k}=\frac{\exp(\hat{{\bm{w}}}_{k}^{\top}{\bm{e}}_{\bm{j}})}{\sum_{i=1}^{s}\exp(\hat{{\bm{w}}}_{i}^{\top}{\bm{e}}_{\bm{j}})}=\frac{1}{|{\mathcal{C}}_{\bm{j}}|}\sum_{l\in{\mathcal{C}}_{\bm{j}}}y_{lk}\quad\text{with}\quad{\mathcal{C}}_{\bm{j}}=\left\{l\in[M]\mid{\bm{x}}_{l}={\bm{e}}_{\bm{j}}\right\}. \tag{2}
$$
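To make the estimator (2) concrete, here is a minimal sketch in Python/NumPy (all names and parameter values are illustrative, not taken from the paper): it samples a corpus of context/next-token pairs from a toy ground-truth distribution and computes $\hat{p}_{{\bm{j}}k}$ as the empirical next-token frequency within each context.

```python
import numpy as np

rng = np.random.default_rng(0)

s, c, M = 5, 3, 10_000                       # vocabulary size, number of contexts, corpus size
p_true = rng.dirichlet(np.ones(s), size=c)   # ground-truth p_{jk}, one row per context

# Sample a corpus {(x_l, y_l)}: a context index and a next token per sample.
contexts = rng.integers(0, c, size=M)
tokens = np.array([rng.choice(s, p=p_true[j]) for j in contexts])

# Empirical estimator (2): within each context j, p_hat[j, k] is the
# fraction of samples whose next token equals k.
p_hat = np.zeros((c, s))
for j in range(c):
    idx = contexts == j                      # the set C_j of samples with context j
    p_hat[j] = np.bincount(tokens[idx], minlength=s) / idx.sum()

print(np.abs(p_hat - p_true).max())          # small for large M
```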
The estimated conditional probabilities $\hat{p}_{{\bm{j}}k}$ are the result of training on the original data. These conditional probabilities can be used to generate new synthetic data, which can be used (with or without additional fresh data from the original dataset) to train the next-generation model. In this paper, we are interested in characterizing the behaviour of $\hat{p}_{{\bm{j}}k}$ in this recursive training loop which we will formally define in the next section. Hereafter, without loss of generality, we consider a fixed context ${\bm{j}}$ with $N:=|{\mathcal{C}}_{\bm{j}}|$ training samples $\{({\bm{e}}_{j},{\bm{y}}_{l})\}_{l}$ of context-next-token pairs, where ${\bm{y}}_{l}∈{\mathbb{R}}^{s}$ are independent multinomial random variables with one trial and parameter
$$
\displaystyle{\bm{p}}=(p_{1},\ldots,p_{s}):=(p_{{\bm{j}}1},\ldots,p_{{\bm{j}}s})\in[0,1]^{s}. \tag{3}
$$
From (2), we notice that $\hat{p}_{{\bm{j}}k}$ is an estimate of $p_{{\bm{j}}k}$ and we further denote
$$
\displaystyle{\bm{p}}^{(1)}=(\hat{p}_{1},\ldots,\hat{p}_{s}):=(\hat{p}_{{\bm{j}}1},\ldots,\hat{p}_{{\bm{j}}s})\in[0,1]^{s}. \tag{4}
$$
As such, ${\bm{p}}^{(0)}:={\bm{p}}$ corresponds to the ground-truth distribution whereas ${\bm{p}}^{(1)}$ denotes the first-generation model trained on the real data $\{({\bm{e}}_{j},{\bm{y}}_{l})\}_{l∈{\mathcal{C}}_{\bm{j}}}$ .
2.2 Recursive Training
Figure 1: Evolution of ${\bm{p}}^{(m)}$ in the Fully Synthetic setting for vocabulary size $s=3$ , context length $\ell=4$ , total contexts $c=s^{\ell}=81$ and sample size $n=1000$ . The initial distribution ${\bm{p}}^{(0)}$ is some random distribution over tokens. The trained conditional distributions converge towards Dirac measures over generations illustrating total collapse in Definition 1.
In this section, we introduce the notation for recursive training. At generation $m≥ 1$ , suppose that we have samples $\{{\bm{y}}^{(t)}_{l}\}_{l}$ generated by some past models ${\bm{p}}^{(t)}$ for $t∈\{0,...,m-1\}$ respectively. As such, similarly to (2), the model at generation $m$ is expressed as
$$
\displaystyle{\bm{p}}^{(m)}:=\frac{1}{n^{(m)}}\sum_{t=0}^{m-1}\sum_{l=1}^{n^{(m)}_{t}}{\bm{y}}^{(t)}_{l}. \tag{5}
$$
In other words, ${\bm{p}}^{(m)}$ is obtained by training the model on a mixture of real and synthetic data generated from previously trained models. Here, $n^{(m)}_{t}$ stands for the number of samples used to train the $m$ -th generation model that are generated by model ${\bm{p}}^{(t)}$ , and $n^{(m)}=n^{(m)}_{0}+\cdots+n^{(m)}_{m-1}$ is the total number of training samples used to train ${\bm{p}}^{(m)}$ . Note that we do not generate samples from ${\bm{p}}={\bm{p}}^{(0)}$ but rather we are given a corpus $\{{\bm{y}}^{(0)}_{l}\}_{l}$ which represents the original data.
Note that by definition all the models ${\bm{p}}^{(m)}$ are unbiased, i.e. $\mathbb{E}{\bm{p}}^{(m)}={\bm{p}}$ , which means that they recover the original distribution in case of infinite sample sizes. However, in practical settings the sample sizes are finite, and therefore ${\bm{p}}^{(m)}$ may deviate from ${\bm{p}}$ since its variance can be large for small values of $n_{t}^{(m)}$ . In the remainder of the paper, we present analytical results to rigorously quantify the impact of sample sizes $n^{(m)}_{t}$ on the variance of ${\bm{p}}^{(m)}$ and how they affect the learned distribution to eventually cause model collapse (Shumailov et al., 2023), a degenerative process affecting future generation models by either losing information about the tails of the initial distribution or inducing distribution shifts over generations.
Specifically, we aim to quantify the rate of such deterioration, and to this end, we introduce a stricter version of model collapse that we call total collapse and define as follows.
**Definition 1 (Total Collapse)**
*We say that total collapse occurs in the recursive training process if ${\bm{p}}^{(m)}$ converges to some Dirac mass $\delta_{i}$ for some token $i∈[s]$ as $m$ grows.*
Total collapse refers to the case where, under recursive training, the trained model ${\bm{p}}^{(m)}$ completely loses information about the original distribution ${\bm{p}}$ over generations, leading to poor linguistic diversity. Note that here we define total collapse with respect to a single pre-fixed context. This can be generalized to the event where total collapse occurs for all contexts. However, the theoretical analysis in this case would require more refined treatment of several statistical quantities to obtain uniform bounds over contexts. We leave this for future work. This phenomenon is illustrated in Figure 1 where we see convergence towards the vertices of the probability simplex, i.e. Dirac measures, over generations. With this definition of total collapse, we provide a quantitative analysis of two cases of recursive training:
- Fully Synthetic: Training with synthetic data from the last model. Each generation ${\bm{p}}^{(m)}$ is trained only on data generated by the previous model ${\bm{p}}^{(m-1)}$ . More precisely, for some fixed $n∈{\mathbb{N}}$ we let $n^{(m)}_{t}=n·\mathbb{1}\{t=m-1\}$ for all $m≥ 1$ and $0≤ t<m$ .
- Partially Synthetic: Training with a mixture of real data and synthetic data from the last model. Each generation ${\bm{p}}^{(m)}$ is trained on a mixture of real data and synthetic data generated by the previous model ${\bm{p}}^{(m-1)}$ . More precisely, for some fixed $n∈{\mathbb{N}}$ we let $n^{(m)}_{0}=N$ , $n^{(m)}_{m-1}=n$ and $n^{(m)}_{t}=0$ for all $m≥ 2$ and $0<t<m-1$ , and $n^{(1)}_{0}=N$ .
Fully Synthetic corresponds to the theoretical setting considered in (Shumailov et al., 2023). However, in that paper, only a Gaussian distribution was analyzed, instead of discrete distributions over all possible tokens as language models entail. We point out that this setting is unlikely to happen in real-world applications but serves as the worst-case scenario.
On the other hand, the Partially Synthetic setting is more realistic for future generation models as it corresponds to training on a mixture of real and synthetic data. We consider this setting to assess whether it is possible to avoid collapse by having a fraction of the original data in the training mixture. We answer this question positively in the next section. Moreover, we show through simulations in Section 4.1 that conclusions from our theoretical analysis on both Fully Synthetic and Partially Synthetic hold beyond these simple settings, such as training on a mixture of all generations or even using realistic transformer models.
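Both scenarios can be simulated directly from the statistical model: each generation is the empirical distribution of multinomial draws from the previous generation, possibly mixed with the $N$ fixed real samples. Below is a minimal NumPy sketch of both update rules (parameter values are illustrative, not those used in the paper's figures).

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, N, n_gen = 10, 100, 100, 50

p0 = rng.dirichlet(np.ones(s))              # ground-truth p = p^(0)
real_counts = rng.multinomial(N, p0)        # fixed real corpus of N samples
p1 = real_counts / N                        # first-generation model p^(1)

def next_gen_fully_synthetic(p_prev):
    # p^(m) = empirical distribution of n samples drawn from p^(m-1)
    return rng.multinomial(n, p_prev) / n

def next_gen_partially_synthetic(p_prev):
    # p^(m) = empirical distribution of the N real samples plus
    # n synthetic samples drawn from p^(m-1)
    synth_counts = rng.multinomial(n, p_prev)
    return (real_counts + synth_counts) / (N + n)

p_full, p_mix = p1.copy(), p1.copy()
for m in range(2, n_gen + 1):
    p_full = next_gen_fully_synthetic(p_full)
    p_mix = next_gen_partially_synthetic(p_mix)

print("fully synthetic     ||p^(m)||_inf =", p_full.max())  # tends to 1 (collapse)
print("partially synthetic ||p^(m)||_inf =", p_mix.max())   # stays close to ||p^(1)||_inf
```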
3 Main Results
To investigate the model collapse phenomenon and the rate at which it occurs, we define the following statistical quantities that capture the randomness of ${\bm{p}}^{(m)}$ :
$$
\displaystyle\|{\bm{p}}^{(m)}\|_{\infty}:=\max_{i\in[s]}p^{(m)}_{i},\quad\sigma_{m}:=\|{\bm{p}}^{(m)}\|_{2}^{2}=\sum_{i=1}^{s}{p^{(m)}_{i}}^{2}\quad\text{and}\quad S_{m}:=\mathbb{E}[\sigma_{m}].
$$
These quantities measure how far away ${\bm{p}}^{(m)}$ is from some Dirac mass (Total Collapse, Definition 1). Since the maximum value of all three quantities is $1$ , the closer they are to $1$ , the closer ${\bm{p}}^{(m)}$ is to some Dirac mass, and the less diverse ${\bm{p}}^{(m)}$ is as a language model. As such, $\|{\bm{p}}^{(m)}\|_{\infty}$ or $\sigma_{m}$ being equal to $1$ is equivalent to ${\bm{p}}^{(m)}$ being a Dirac mass, which reflects total collapse. To quantify the distribution shift incurred by the aforementioned recursive training scenarios, we further consider the $1$ -norm between two distributions $\mu,\nu∈\mathbb{R}^{s}$ defined by $\|\mu-\nu\|_{1}:=\sum_{i=1}^{s}|\mu(i)-\nu(i)|$ . We point out that the $1$ -norm is twice the total variation distance $\|\mu-\nu\|_{\text{TV}}$ , which is another commonly used metric for probability distributions and was considered by Fu et al. (2024) for studying model collapse in the case of diffusion models.
We assume that the initial distribution ${\bm{p}}^{(0)}$ is nontrivial, specifically $S_{0}<1$ and $\|{\bm{p}}^{(0)}\|_{∞}<1$ . Under this assumption, we provide results on the rate of total collapse and further quantify distribution shift under recursive training. All the proofs are presented in Appendix B.
3.1 Fully Synthetic: Training on synthetic data
We start by describing the recursive training process in which only synthetic data from the last generation model are used. This process can be viewed as a Markov chain on the set of probability measures $\Delta^{s-1}\cap\frac{1}{n}\mathbb{N}^{s}$ on $[s]$ with denominator $n$ , where
$$
\Delta^{s-1}:=\{{\bm{v}}\in\mathbb{R}^{s}:v_{1}+v_{2}+\dots+v_{s}=1,\ v_{i}\geq 0\textrm{ for all }i\}
$$
is the probability simplex. Since the probability of reaching $\delta_{i}$ is positive provided $p_{i}>0$ , this random walk which starts at ${\bm{p}}^{(0)}$ has absorbing states $\{\delta_{i}:i∈[s]\text{ s.t. }p_{i}>0\}$ . As a result, the random walk converges to one of the absorbing states almost surely (Kemeny et al., 1969), which means that total collapse is bound to happen in the Fully Synthetic setting.
To characterize the rate at which total collapse occurs, let us denote by $T:=\inf\{t∈{\mathbb{N}}:\|{\bm{p}}^{(t)}\|_{\infty}=1\}≥ 1$ the random time at which the model first becomes a Dirac mass, and let $\rho_{m}:={\mathbb{P}}\left(\|{\bm{p}}^{(m)}\|_{\infty}=1\right)$ denote the probability that the $m$ -th generation has collapsed. Our first result, presented in Theorem 1, provides the rate of convergence via $S_{m}$ , $\rho_{m}$ and $\mathbb{E}[T]$ .
**Theorem 1 (Control on Total Collapse)**
*Consider the Fully Synthetic setting and let $\tilde{s}:=|\text{supp}({\bm{p}})|$ denote the support size of ${\bm{p}}$ , namely $\tilde{s}:=|\{i∈[s]:p_{i}>0\}|$ .
1. The expected sum of squared probabilities $S_{m}$ is given by
$$
S_{m}=1-\left(1-\frac{1}{n}\right)^{m}(1-S_{0}). \tag{6}
$$
2. The probability $\rho_{m}$ that total collapse has occurred by generation $m$ satisfies
$$
1-n(1-S_{0})(1-1/n)^{m}\leq\rho_{m}\leq 1-\frac{1-S_{0}}{1-1/\tilde{s}}(1-1/n)^{m}. \tag{7}
$$
3. The generation $T$ at which total collapse happens satisfies
$$
1+\frac{1-S_{0}}{1-1/\tilde{s}}(n-1)\leq\mathbb{E}[T]\leq 1+(1-S_{0})n(n-1). \tag{8}
$$*
In essence, Theorem 1 describes the behavior of ${\bm{p}}^{(m)}$ as a function of the model generation $m$ , the sample size $n$ , and the dispersion of the initial distribution $S_{0}$ . Specifically, we draw the following observations:
- Effect of the number of generations $m$ : As $m$ increases, $S_{m}$ and $\rho_{m}$ tend to $1$ as per (6) and (7), making total collapse increasingly more likely. Note also that this dependence is exponential in $m$ , making total collapse in this case relatively fast.
- Effect of “synthetic” sample size $n$ : The smaller $n$ is, the more likely ${\bm{p}}^{(m)}$ is to have collapsed for a given generation $m$ as per the bounds on $\rho_{m}$ in (7), and the faster total collapse is expected to happen as suggested by (8). The dependence in this case is polynomial in $n$ .
- Effect of $S_{0}$ : Larger values of $S_{0}$ correspond to faster collapse as suggested by (8); namely, starting from an original distribution that is not diverse enough speeds up total collapse. The upper and lower bounds on $\mathbb{E}[T]$ are both linear in $(1-S_{0})$ .
We point out that when the number of contexts $c$ is large, the number of samples $n$ per context would be fairly small, which suggests that total collapse, given a single context, is expected to happen fairly quickly. The bounds on $\mathbb{E}[T]$ state that the expected collapse time is at least of order $n$ and at most of order $n^{2}$ , though experiments (see Figure 2) suggest that the upper bound is not sharp and should be close to ${\mathcal{O}}(n)$ instead. On another note, we characterize below in Proposition 1 the limiting distribution as $m$ gets to infinity, which demonstrates the direction of total collapse.
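The closed form (6) can be checked against a Monte Carlo simulation of the Fully Synthetic chain. The sketch below (NumPy, with illustrative parameters) compares the empirical average of $\sigma_{m}$ across independent runs with $1-(1-1/n)^{m}(1-S_{0})$.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, n_gen, n_runs = 20, 50, 30, 2000

p0 = rng.dirichlet(np.ones(s))
S0 = np.sum(p0 ** 2)

sigma = np.zeros((n_runs, n_gen + 1))
for r in range(n_runs):
    p = p0.copy()
    sigma[r, 0] = S0
    for m in range(1, n_gen + 1):
        p = rng.multinomial(n, p) / n          # fully synthetic update
        sigma[r, m] = np.sum(p ** 2)

m = np.arange(n_gen + 1)
S_theory = 1 - (1 - 1 / n) ** m * (1 - S0)     # formula (6)
print(np.abs(sigma.mean(axis=0) - S_theory).max())   # small (Monte Carlo error only)
```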
Figure 2: Fully Synthetic case for different initial distributions ${\bm{p}}^{(0)}$ . Total collapse time is plotted as a function of the initial distribution ${\bm{p}}$ and the sample size $n$ . (Top) The initial distribution ${\bm{p}}$ with different values of $S_{0}$ and support size $\tilde{s}$ . The $x$ -axis represents tokens $i∈\{1,2,...,600\}$ while the $y$ -axis represents the probabilities in log scale. (Bottom) Each cross represents the average total collapse time over $100$ runs for a particular sample size $n∈\{10,50,100,150,...,400\}$ . The red dashed line depicts the lower bound on $\mathbb{E}[T]$ given by (8).
**Proposition 1**
*In the Fully Synthetic case, we have $\displaystyle{\mathbb{P}}\left(\lim_{m→∞}{\bm{p}}^{(m)}=\delta_{i}\right)=p_{i}$ for all $i∈[s]$ .*
Proposition 1 describes the limiting distribution when total collapse occurs in terms of the initial probabilities $p_{i}$ over tokens. Specifically, the resulting Dirac mass $\delta_{i}$ is likely to be supported on some token $i$ with high initial probability $p_{i}$ . This formally supports the description of early and late model collapse in Shumailov et al. (2023): In the early phase of recursive training, the tails of the original distribution disappear because the probability $p^{(m)}_{j}$ of outputting unlikely tokens $j$ (those $j$ ’s for which $p_{j}$ is small) will decrease as ${\bm{p}}^{(m)}$ converges to some Dirac mass $\delta_{i}$ . After many generations, all but one $p^{(m)}_{j}$ will go to $0$ , exhibiting late model collapse where all the randomness of the original distribution is lost.
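Proposition 1 can likewise be checked by Monte Carlo: running the Fully Synthetic chain until it reaches a Dirac mass and recording the absorbing token recovers the initial probabilities $p_{i}$ as absorption frequencies. A minimal sketch (NumPy, illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, n_runs = 5, 30, 5000

p0 = rng.dirichlet(np.ones(s))

absorbed = np.zeros(s)
for _ in range(n_runs):
    p = p0.copy()
    while p.max() < 1.0:                  # iterate until p^(m) is a Dirac mass
        p = rng.multinomial(n, p) / n     # fully synthetic update
    absorbed[np.argmax(p)] += 1

print(p0)                  # initial probabilities p_i
print(absorbed / n_runs)   # empirical absorption frequencies, close to p_i
```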
3.2 Partially Synthetic: Handling model collapse with real data
As described in the previous section, total collapse is unavoidable when training solely on synthetic data. In this section, we consider the case in which real data is incorporated at each generation. In this case, ${\bm{p}}^{(m)}$ can be simply seen as a weighted average of ${\bm{p}}^{(1)}$ and an estimate of ${\bm{p}}^{(m-1)}$ with $n$ samples. We also assume that ${\bm{p}}^{(1)}$ is nontrivial, thereby implying that, on average, ${\bm{p}}^{(m)}$ will not be a Dirac mass. In what follows, we quantify the variability in the data distribution across different generations, as well as the distribution drift $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ from the first-generation model. We particularly show that collapse can be avoided if enough real data is injected into the recursive training process.
**Theorem 2 (Model Variation)**
*Consider the Partially Synthetic setting. For $m≥ 1$ we have
$$
\displaystyle S_{m+1}=\dfrac{\frac{1}{N}\left[1+2\alpha-\left(1-\frac{1}{N}\right)\alpha\beta^{m}\right]}{1+(1+1/N)\alpha}+\dfrac{(1-\frac{1}{N})S_{0}}{1+(1+1/N)\alpha}\left[1+\alpha-\frac{\alpha\beta^{m}}{N}\right], \tag{9}
$$
where $\alpha:=\frac{n}{N+n}$ and $\beta:=\alpha\left[(1+\frac{1}{N})\alpha-\frac{1}{N}\right]$ .*
Theorem 2 provides a control of the variance $S_{m}$ in the Partially Synthetic setting. Essentially, when $n\ll N$ , we have $S_{m}≈\frac{1}{N}+(1-\frac{1}{N})S_{0}≈ S_{0}$ , which is not surprising since the training data for each generation model are dominated by real data in this case. However, even when the number of synthetic data is much larger than the original dataset (i.e. $n\gg N$ ), we have $\alpha≈ 1≈\beta$ and hence $S_{m+1}≈ 1/N+(2+1/N)^{-1}(1-1/N)(2-1/N)S_{0}$ , which approaches $S_{0}$ for large $N$ .
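To illustrate the two regimes discussed above, the following sketch (plain Python, with illustrative parameter values) evaluates the closed form (9) for $S_{m+1}$ across several synthetic sample sizes $n$ at fixed $N$ and $S_{0}$.

```python
def S_next(m, S0, n, N):
    """Closed form (9) for S_{m+1} in the Partially Synthetic setting."""
    alpha = n / (N + n)
    beta = alpha * ((1 + 1 / N) * alpha - 1 / N)
    denom = 1 + (1 + 1 / N) * alpha
    term1 = (1 / N) * (1 + 2 * alpha - (1 - 1 / N) * alpha * beta ** m) / denom
    term2 = (1 - 1 / N) * S0 / denom * (1 + alpha - alpha * beta ** m / N)
    return term1 + term2

S0, N, m = 0.1, 1000, 20
for n in (1, 10, 100, 10_000):
    # stays close to S0 both when n << N and, for large N, when n >> N
    print(n, S_next(m, S0, n, N))
```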
To further refine our analysis, we present a result that directly controls the deviation $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ between the conditional distributions from first and $m$ -th generations. This allows us to have a quantitative control over the distribution shift. When $n$ is sufficiently small, we have a sharper control over this deviation by exploiting the concentration results from (Mardia et al., 2020). Essentially, this allows us to estimate the maximum number of synthetic samples $n$ to ensure that the distribution ${\bm{p}}^{(m)}$ stays close to ${\bm{p}}^{(1)}$ .
Figure 3: Partially Synthetic case with different sample sizes $n$ . A hundred experiments were run for $50$ generations for $N=100$ and different values of $n$ . Each yellow line represents the evolution of $\sigma_{m}$ (top row) or $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ (bottom row) in one experiment, with the red line being the empirical mean across $100$ runs. The blue dashed lines plot the formula for $S_{m}$ given by (9). The initial distribution ${\bm{p}}$ satisfies $s=600$ , $\tilde{s}=52$ , and $S_{0}=0.1$ .
**Theorem 3 (Model Deviation)**
*Consider the Partially Synthetic setting and define
$$
\displaystyle G_{n}(s):=\begin{cases}C_{1}se^{\frac{C_{0}n}{2e}}&\text{if}\quad\frac{C_{0}}{e}n+2\leq s;\\
C_{1}s\left(\frac{C_{0}n}{s}\right)^{s/2}&\text{if}\quad\frac{C_{0}}{4}n+2\leq s<\frac{C_{0}}{e}n+2;\\
(2^{s}-2)&\text{if}\quad s<\frac{C_{0}}{4}n+2,\end{cases} \tag{10}
$$
where $C_{0}=\frac{e^{3}}{2\pi}≈ 3.19$ and $C_{1}=\frac{6e}{\pi^{3/2}}≈ 2.93$ , and $s$ is the vocabulary size. Then, for $m≥ 2$ ,
$$
\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}<\frac{1}{N}\sqrt{\frac{\pi n}{2}}G_{n}(s).
$$*
We point out that the upper bound on the deviation $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ is independent of the generation $m$ . Since the deviation from ${\bm{p}}^{(0)}$ to ${\bm{p}}^{(1)}$ is inevitable and independent of $n$ , we state the result in terms of $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ , from which $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(0)}\|_{1}$ can be estimated by the triangle inequality. Theorem 3 allows us to estimate the maximum number of synthetic samples $n$ that can be used if we want ${\bm{p}}^{(m)}$ to stay $\epsilon$ -close to ${\bm{p}}^{(1)}$ in $L^{1}$ norm. For example, when $C_{0}n/e+2≤ s$ , for any $\epsilon>0$ we can take
$$
n\leq 2\pi e^{-2}\min\left[s-2,\log\left(\frac{\sqrt{2}\pi N\epsilon}{6es}\right)\right] \tag{11}
$$
in order for $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ to be less than $\epsilon$ . In other words, to ensure a small $L^{1}$ deviation, $n$ should be taken to be logarithmic in the ratio $N\epsilon/s$ , which highlights that the amount of synthetic data should be exponentially smaller than the amount of real data in order to ensure that ${\bm{p}}^{(m)}$ remains close to ${\bm{p}}^{(1)}$ . We show through simulations the effect of the sample size $n$ and the initial distribution ${\bm{p}}$ . In Figure 3, we show that if we fix the initial distribution and increase the amount of synthetic data $n$ , the dispersion $\sigma_{m}$ stays relatively constant while the distribution drift $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ increases. In contrast, when we fix $n$ and increase the value of $S_{0}$ , $\sigma_{m}$ increases but $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ actually decreases, as depicted in Figure 4. This behavior can be explained by Theorem 4 in Appendix B, which is more general than Theorem 3 in the sense that it captures the dependence of $\mathbb{E}\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ on the randomness of ${\bm{p}}$ and hence provides a sharper bound on the expected deviation.
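As a concrete illustration of bound (11), the sketch below (plain Python; the values of $N$, $s$ and $\epsilon$ are purely illustrative) computes the largest admissible synthetic sample size $n$ for a few real-data sizes $N$, showing the logarithmic growth in $N\epsilon/s$.

```python
import math

def max_synthetic_samples(N, s, eps):
    """Upper bound (11) on n ensuring E||p^(m) - p^(1)||_1 < eps
    (valid in the regime C_0*n/e + 2 <= s); may be negative/vacuous
    if the ratio N*eps/s is too small."""
    return 2 * math.pi * math.exp(-2) * min(
        s - 2,
        math.log(math.sqrt(2) * math.pi * N * eps / (6 * math.e * s)),
    )

s, eps = 50_000, 0.1
for N in (10**7, 10**8, 10**9):
    print(N, max_synthetic_samples(N, s, eps))
```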
Figure 4: Partially Synthetic case with different initial distributions ${\bm{p}}$ . A hundred experiments were run for $50$ generations with $n=10$ , $N=100$ and different initial distributions ${\bm{p}}$ shown in the top row. Notice that as $S_{0}$ increases, the deviation $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ decreases, as suggested by inequality (19).
4 Experiments
4.1 Transformer-based Models
To support our findings in a realistic setting, we conduct experiments with a decoder-only generative model for text generation. For this model, we consider the aforementioned Fully Synthetic and Partially Synthetic settings using the model parameters in Appendix C. We use a simple character-level tokenizer on the tiny Shakespeare dataset (https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), yielding a vocabulary of size $s=65$ . We first train a model for $2000$ iterations on this dataset and consider the trained model as the ground-truth distribution ${\bm{p}}^{(0)}$ . The next-generation models are trained recursively using the exact same architecture and training parameters. This setting allows us to reduce the effect of the functional approximation error, thereby focusing only on the effect of the statistical approximation error.
The results of these experiments are summarized in Figure 5. The top left plot therein depicts the deviation $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ which is averaged over $300$ random contexts generated by the ground-truth generative model ${\bm{p}}^{(0)}$ . From this plot, we clearly see that $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ diverges over generations in the Fully Synthetic setting, while mixing with real data ensures stability over generations.
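For reference, the deviation in the top-left plot can be computed by averaging the $1$-norm between the two models' next-token distributions over a batch of contexts sampled from the ground-truth model. A minimal PyTorch sketch, where `model_ref`, `model_m` and `contexts` are hypothetical names for a first-generation model, an $m$-th generation model, and a batch of context token ids (the interface is an assumption, not the paper's code):

```python
import torch

@torch.no_grad()
def l1_deviation(model_ref, model_m, contexts):
    """Average ||p^(m)(.|x) - p^(1)(.|x)||_1 over a batch of contexts.

    Assumes each model maps a (batch, seq_len) tensor of token ids to
    (batch, seq_len, vocab) logits, with next-token probabilities read
    off the last position.
    """
    p_ref = torch.softmax(model_ref(contexts)[:, -1, :], dim=-1)
    p_m = torch.softmax(model_m(contexts)[:, -1, :], dim=-1)
    return (p_m - p_ref).abs().sum(dim=-1).mean().item()
```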
The effect of synthetic data is further noticeable in the validation loss of the next-generation models, where we see an overfitting effect in the Fully Synthetic setting. We point out that this effect might also be associated with the functional approximation error and was rigorously studied by Dohmatob et al. (2024a) in the case of linear regression, where the authors have shown that synthetic data affect the usual scaling laws. We believe that similar conclusions can be obtained with our statistical model by incorporating the functional approximation error; this can be achieved, for instance, by supposing that the context embeddings ${\bm{e}}_{i}$ ’s are high-dimensional Gaussian vectors instead of canonical vectors, thereby introducing the embedding dimension as a parameter controlling model complexity.
<details>
<summary>2404.05090v1/x2.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Chart Analysis
## Image Description
The image is a line chart visualizing the norm of the difference between two probability distributions (`||p^(m) - p^(1)||₁`) across generations (`m`). The chart includes three data series with distinct line styles and colors, along with shaded confidence intervals.
---
### Labels and Axis Titles
- **X-axis**: Labeled "generation m" with integer markers from 1 to 9.
- **Y-axis**: Labeled "||p^(m) - p^(1)||₁" with values ranging from 0.4 to 0.8 in increments of 0.1.
---
### Legend
- **Location**: Top-right corner of the chart.
- **Entries**:
1. **Red (solid line)**: "synthetic"
2. **Orange (dashed line)**: "synthetic 50% - real 50%"
3. **Green (dotted line)**: "synthetic 20% - real 80%"
---
### Key Trends and Data Points
#### 1. Synthetic (Red, Solid Line)
- **Trend**: Steadily increasing from generation 1 to 9.
- **Data Points**:
- Generation 1: ~0.4
- Generation 3: ~0.55
- Generation 5: ~0.65
- Generation 7: ~0.75
- Generation 9: ~0.8
- **Confidence Interval**: Shaded red area widens progressively, indicating increasing uncertainty.
#### 2. Synthetic 50% - Real 50% (Orange, Dashed Line)
- **Trend**: Relatively flat with minor fluctuations.
- **Data Points**:
- Generation 1: ~0.45
- Generation 5: ~0.45
- Generation 9: ~0.45
- **Confidence Interval**: Narrow orange shading, suggesting stable performance.
#### 3. Synthetic 20% - Real 80% (Green, Dotted Line)
- **Trend**: Nearly flat with slight oscillations.
- **Data Points**:
- Generation 1: ~0.42
- Generation 5: ~0.42
- Generation 9: ~0.42
- **Confidence Interval**: Very narrow green shading, indicating high stability.
---
### Spatial Grounding and Validation
- **Legend Colors**:
- Red matches the solid line (synthetic).
- Orange matches the dashed line (synthetic 50% - real 50%).
- Green matches the dotted line (synthetic 20% - real 80%).
- **Axis Markers**:
- X-axis markers (1–9) align with generation labels.
- Y-axis markers (0.4–0.8) correspond to the norm values.
---
### Component Isolation
1. **Main Chart**:
- Three lines with distinct styles/colors.
- Shaded regions represent confidence intervals.
2. **Legend**:
- Positioned independently in the top-right corner.
3. **No Additional Components**: No headers, footers, or secondary axes.
---
### Conclusion
The chart demonstrates that the synthetic distribution diverges significantly from the baseline (`p^(1)`) over generations, while hybrid approaches (50% synthetic, 20% synthetic) maintain stability. Confidence intervals highlight the uncertainty in the synthetic case.
</details>
<details>
<summary>2404.05090v1/x3.png Details</summary>

### Visual Description
# Technical Document Analysis: Synthetic Validation Loss Chart
## Title
- **Title**: `synthetic`
## Axes
- **Y-Axis**:
- Label: `val loss`
- Range: `1.25` to `2.50` (in increments of `0.25`)
- **X-Axis**:
- Label: `iterations`
- Scale: Logarithmic (`10^0` to `10^1`)
- Ticks: `10^0`, `10^1`
## Legend
- **Placement**: Lower-left corner
- **Entries**:
- `Gen 1`: Blue (`#0000FF`)
- `Gen 2`: Orange (`#FFA500`)
- `Gen 3`: Green (`#008000`)
- `Gen 4`: Red (`#FF0000`)
- `Gen 5`: Purple (`#800080`)
- `Gen 6`: Brown (`#A52A2A`)
- `Gen 7`: Pink (`#FFC0CB`)
- `Gen 8`: Gray (`#808080`)
- `Gen 9`: Yellow (`#FFFF00`)
- `Gen 10`: Cyan (`#00FFFF`)
## Key Trends
1. **Initial Convergence**:
- All lines start at approximately `2.50` val loss at `10^0` iterations.
- Lines diverge slightly during early iterations (`10^0` to `10^0.5`), with Gen 1 (blue) and Gen 2 (orange) showing the steepest initial decline.
2. **Mid-Iteration Behavior**:
- By `10^0.5` iterations, lines begin to converge toward a common trajectory.
- Gen 10 (cyan) and Gen 9 (yellow) exhibit the slowest rate of decline post-`10^0.5`.
3. **Final Iterations**:
- All lines plateau near `1.50–1.75` val loss by `10^1` iterations.
- Gen 1 (blue) achieves the lowest final val loss (~`1.45`), while Gen 10 (cyan) remains the highest (~`1.70`).
## Component Isolation
- **Header**: Title (`synthetic`) centered at the top.
- **Main Chart**:
- 10 distinct lines representing generations.
- Logarithmic x-axis emphasizes exponential iteration progression.
- **Footer**: Legend box with generation labels and color mappings.
## Spatial Grounding & Color Verification
- **Legend Accuracy**:
- Confirmed: Each line color matches its legend entry (e.g., Gen 3 = green, Gen 7 = pink).
- Example: Gen 6 (brown) line aligns with the brown legend marker.
## Data Extraction
- **Val Loss Values** (approximate, based on visual interpolation):
| Generation | Iterations (`10^0`) | Iterations (`10^1`) |
|------------|----------------------|----------------------|
| Gen 1 | 2.48 | 1.45 |
| Gen 2 | 2.47 | 1.52 |
| Gen 3 | 2.46 | 1.58 |
| Gen 4 | 2.45 | 1.60 |
| Gen 5 | 2.44 | 1.62 |
| Gen 6 | 2.43 | 1.65 |
| Gen 7 | 2.42 | 1.68 |
| Gen 8 | 2.41 | 1.70 |
| Gen 9 | 2.40 | 1.72 |
| Gen 10 | 2.39 | 1.75 |
## Notes
- **Logarithmic Scale Impact**: The x-axis compression emphasizes early iteration differences, while later iterations appear compressed.
- **Convergence Pattern**: Lines exhibit diminishing returns in val loss reduction as iterations increase, suggesting optimization saturation.
</details>
<details>
<summary>2404.05090v1/x4.png Details</summary>

(Figure panel "synthetic 50% - real 50%": validation loss vs. iterations on a logarithmic x-axis, generations 1–10.)
</details>
<details>
<summary>2404.05090v1/x5.png Details</summary>

(Figure panel "synthetic 20% - real 80%": validation loss vs. iterations on a logarithmic x-axis, generations 1–10.)
</details>
Figure 5: Experiments with a GPT2-type generative model. The top left plot depicts the deviation $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ as the generation index $m$ varies, for training on synthetic data only and on a mixture of real and synthetic data. The three other plots show the behavior of the validation loss over generations. Essentially, training solely on synthetic data causes model collapse and affects the usual scaling laws (Dohmatob et al., 2024a).
4.2 Additional Experiments with the Statistical Model
To further investigate and demonstrate the generality of our theoretical findings, we present empirical results on two more scenarios that better represent recursive training in real-world settings, as follows:
- Most Recent Models: Each generation $p^{(m)}$ is trained on synthetic data from the most recent $K$ models for a fixed window size $K$ . More precisely, for some fixed $n∈{\mathbb{N}}$ we let $n^{(m)}_{t}=\lfloor n/K\rfloor·\mathbb{1}\{\max(0,m-K)≤ t≤ m-1\}$ for all $m≥ 1$ . When $K=1$ , this case degenerates to the Fully Synthetic setting.
- Randomly Sampled Data: Each generation $p^{(m)}$ is trained on a mixture of synthetic data from possibly all the previous models and the real data. More precisely, the $m$ -th generation model is trained on $n=n^{(m)}_{0}+\cdots+n^{(m)}_{m-1}$ samples with
$$
n^{(m)}_{t}:=\sum_{i=1}^{n}\mathbb{1}\{g_{i}=t\},\quad(t=0,\cdots,m-1),
$$
samples from the $t$ -th generation, where $\{g_{i}\}_{i∈[n]}$ are independent and uniformly distributed on $\{0,1,\dots,m-1\}$ indexing previous-generation models. This setting describes the scenario where data generated by all past models are mixed in a pool from which the training data for the next generation is collected. A simulation sketch of both sampling schemes is given below.
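To make the two sampling schemes concrete, here is a minimal NumPy sketch of the corresponding recursions in the statistical model (the helper names and the uniform toy initial distribution are ours and purely illustrative; this is not the code used for Figure 6):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_counts(dist, k):
    """Empirical counts of k i.i.d. draws from the distribution `dist`."""
    draws = rng.choice(len(dist), size=k, p=dist)
    return np.bincount(draws, minlength=len(dist))

def next_generation(history, counts):
    """Pool counts[t] fresh samples from p^(t) for every past generation t and
    return the empirical distribution p^(m) of the pooled data."""
    agg = np.zeros(len(history[0]))
    for t, c in enumerate(counts):
        if c > 0:
            agg += sample_counts(history[t], int(c))
    return agg / agg.sum()

def run_chain(p0, n=10, generations=500, window=None):
    """window=K    -> Most Recent Models (K=1 reduces to Fully Synthetic);
       window=None -> Randomly Sampled Data (sources uniform over all past models)."""
    history = [np.asarray(p0, dtype=float)]      # history[0] is the real distribution
    for m in range(1, generations + 1):
        if window is None:
            g = rng.integers(0, m, size=n)       # g_i uniform on {0, ..., m-1}
            counts = np.bincount(g, minlength=m)
        else:
            counts = np.zeros(m, dtype=int)
            counts[max(0, m - window):m] = n // window
        history.append(next_generation(history, counts))
    return history

# Toy illustration: sigma_m = sum_i (p_i^(m))^2 and the deviation ||p^(m) - p^(1)||_1.
p = np.full(100, 1 / 100)
chain = run_chain(p, n=10, generations=500, window=5)
sigma = [np.sum(q**2) for q in chain[1:]]
dev = [np.abs(q - chain[1]).sum() for q in chain[1:]]
```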
<details>
<summary>2404.05090v1/extracted/5522140/figs/mixed_data_window.png Details</summary>

(Figure: evolution of σₘ and ‖p^(m) − p^(1)‖₁ across generations, and the distribution of the total collapse time T, for window sizes 1, 5, 10, and m.)
</details>
Figure 6: Experiment for Most Recent Models and Randomly Sampled Data. A hundred experiments were run for $500$ generations for different window sizes and a fixed sample size $n=10$ . In the first two rows, each yellow line represents the evolution of $\sigma_{m}$ and $\|{\bm{p}}^{(m)}-{\bm{p}}^{(1)}\|_{1}$ in one experiment respectively, with the red line being the empirical mean across $100$ runs. The bottom row plots the histogram of total collapse time $T$ with the red line being the empirical mean. The initial distribution ${\bm{p}}$ satisfies $s=600$ , $\tilde{s}=52$ and $S_{0}=0.1$ .
Experiments for these two cases are shown in Figure 6. Column $1$ corresponds to Fully Synthetic, columns $2$ and $3$ to Most Recent Models, and column $4$ shows the case of Randomly Sampled Data. As we can see, when the window size $K$ is increased, total collapse is delayed but still eventually happens. Observe that there are multiple visible horizontal yellow lines in the second row for window size $1$ . This is because ${\bm{p}}^{(m)}$ can converge to any of the Dirac masses $\delta_{i}$ provided $p_{i}>0$ by Proposition 1. A similar phenomenon can be observed in the second and third columns. On the other hand, in column $4$ the model does deteriorate, but the deviation plateaus very quickly. In particular, the $100$ runs produce only one Dirac mass over the span of $500$ generations, suggesting that randomly sampling from all the past models is qualitatively different from sampling from the most recent $K$ models with a fixed $K$ . We remark that for $K>1$ , even if ${\bm{p}}^{(m)}$ is some Dirac mass, ${\bm{p}}^{(m+1)}$ can still regain randomness from models before generation $m$ , but with a fixed window size all the randomness eventually disappears.
5 Discussions & Conclusion
In this paper, we studied model collapse in language models through a simple statistical model. We provided theoretical analysis when training with only synthetic data and when adding real data from the original distribution. Our results demonstrate that model collapse always happens when the model is trained solely on synthetic data, whereas controlling the deviation from the initial distribution requires a careful choice of the amount of synthetic data injected into the training set. We also provided experiments showing that these findings extend beyond the simple theoretical settings.
Our current results describe only the statistical approximation error since all generation models are unbiased in our theoretical framework. However, as we discussed in the previous section, this framework can be extended to account for the functional approximation error by considering high-dimensional Gaussian vectors as context embeddings instead of canonical vectors. Another possible extension is to study the effect of in-context learning (Wu et al., 2023) on model collapse, which is a key feature of transformer-based models.
Despite the simple setting of our current investigation, we believe that it lays the groundwork for better understanding and mitigation of model collapse in language models, thereby opening the way for the development of more general theoretical frameworks to study the dynamics of next-generation language models.
References
- Alemohammad et al. (2023) Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go mad, 2023.
- Ben Allal et al. (2024) Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. Cosmopedia, 2024. URL https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
- Bohacek & Farid (2023) Matyas Bohacek and Hany Farid. Nepotistically trained generative-AI models collapse, 2023.
- Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop, 2023.
- Chen et al. (2024) Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
- del Rio-Chanona et al. (2023) Maria del Rio-Chanona, Nadzeya Laurentsyeva, and Johannes Wachs. Are large language models a threat to digital public goods? Evidence from activity on Stack Overflow, 2023.
- Dohmatob et al. (2024a) Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression, 2024a.
- Dohmatob et al. (2024b) Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws, 2024b.
- Fu et al. (2024) Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, and Dacheng Tao. Towards theoretical understandings of self-consuming generative models, 2024.
- Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
- Guo et al. (2023) Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text, 2023.
- Kemeny et al. (1969) John G Kemeny, J Laurie Snell, et al. Finite Markov chains, volume 26. Van Nostrand, Princeton, NJ, 1969.
- Mardia et al. (2020) Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D Nowak, and Tsachy Weissman. Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types. Information and Inference: A Journal of the IMA, 9(4):813–850, 2020.
- Martínez et al. (2023a) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence (ai) and the internet: Heading towards evolution or degradation?, 2023a.
- Martínez et al. (2023b) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet, 2023b.
- OpenAI (2024) OpenAI. GPT-4 technical report, 2024.
- Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
- Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget, 2023.
- Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the $\ell_{1}$ deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
- Wu et al. (2023) Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? arXiv preprint arXiv:2310.08391, 2023.
Appendix A Technical Lemma
For completeness, we include the concentration results from (Mardia et al., 2020) which we used to prove Theorem 3. For a probability distribution ${\bm{p}}$ on $[s]$ define
$$
\pi_{\bm{p}}:=\max_{A\subseteq[s]}\min\{{\bm{p}}(A),1-{\bm{p}}(A)\} \tag{12}
$$
and notice that $\pi_{\bm{p}}≤\frac{1}{2}$ . Define the function $\varphi:[0,1/2)→{\mathbb{R}}$ by
$$
\varphi(x):=\frac{1}{1-2x}\log\frac{1-x}{x}
$$
and extend $\varphi$ by continuity to $\varphi(1/2):=2$ . Observe that $\varphi$ is strictly decreasing and $2≤\varphi(x)<∞$ . The following lemma concerns the concentration of the empirical distribution and captures the dependence on the sample size $n$ and the dimension $s$ , as well as the structure of the underlying probability distribution via $\varphi(\pi_{\bm{p}})$ .
**Lemma 1**
*Let ${\bm{p}}$ be a probability distribution on $[s]$ and $\hat{{\bm{p}}}$ be the associated empirical distribution obtained from $n$ samples. Then for $\epsilon>0$ we have
$$
{\mathbb{P}}\left(\|\hat{{\bm{p}}}-{\bm{p}}\|_{1}\geq\epsilon\right)\leq\exp\left(-\frac{n\varphi(\pi_{\bm{p}})\epsilon^{2}}{4}\right)G_{n}(s),
$$
where $C_{0}=\frac{e^{3}}{2\pi}$ , $C_{1}=\frac{6e}{\pi^{3/2}}$ and
$$
G_{n}(s)=\begin{cases}C_{1}s\left(C_{0}n/s\right)^{s/2}&\text{ if }\frac{C_{0}n}{4}+2\leq s\leq\frac{C_{0}n}{e}+2;\\
C_{1}se^{\frac{C_{0}n}{2e}}&\text{ if }\frac{C_{0}n}{e}+2\leq s;\\
(2^{s}-2)&\text{ if }s\leq\frac{C_{0}n}{4}+2.\end{cases} \tag{13}
$$*
The upper bound
$$
{\mathbb{P}}\left(\|\hat{{\bm{p}}}-{\bm{p}}\|_{1}\geq\epsilon\right)\leq(2^{s}-2)\exp\left(-\frac{n\varphi(\pi_{\bm{p}})\epsilon^{2}}{4}\right)
$$
in fact holds for all values of $s$ and $n$ (Weissman et al., 2003), but this bound is very poor when $s/n$ is large, which is commonly the case for language models. Lemma 1 provides an improvement in the regime of small sample size $n\lesssim s$ .
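As a rough numerical illustration of how the bound behaves, here is a small Python sketch (ours, for illustration only) that evaluates $\varphi$, $\pi_{\bm{p}}$ (by brute force, hence only viable for very small $s$) and $G_{n}(s)$:

```python
import numpy as np
from itertools import combinations

C0 = np.e**3 / (2 * np.pi)
C1 = 6 * np.e / np.pi**1.5

def phi(x):
    """phi(x) = log((1-x)/x)/(1-2x), extended by continuity to phi(1/2) = 2."""
    return 2.0 if np.isclose(x, 0.5) else float(np.log((1 - x) / x) / (1 - 2 * x))

def pi_p(p):
    """pi_p = max_A min(p(A), 1-p(A)), by brute force over subsets (small s only)."""
    s = len(p)
    best = 0.0
    for k in range(1, s // 2 + 1):
        for A in combinations(range(s), k):
            pA = float(sum(p[i] for i in A))
            best = max(best, min(pA, 1 - pA))
    return best

def G(n, s):
    """Prefactor G_n(s) from (13), with its three regimes."""
    if s <= C0 * n / 4 + 2:
        return 2.0**s - 2
    if s <= C0 * n / np.e + 2:
        return C1 * s * (C0 * n / s) ** (s / 2)
    return C1 * s * np.exp(C0 * n / (2 * np.e))

def l1_tail_bound(p, n, eps):
    """Upper bound on P(||p_hat - p||_1 >= eps) given by Lemma 1."""
    return G(n, len(p)) * np.exp(-n * phi(pi_p(p)) * eps**2 / 4)

p = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
print(l1_tail_bound(p, n=100, eps=0.4))   # roughly 0.02 for this toy example
```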
Appendix B General Results and Proofs
B.1 Derivation of Formula (2)
*Proof.*
The categorical cross-entropy loss reads as
$$
{\mathcal{L}}({\bm{w}}_{1},\dots,{\bm{w}}_{s})=-\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{s}y_{ik}\log\left(\frac{\exp({\bm{w}}_{k}^{\top}{\bm{x}}_{i})}{\sum_{j=1}^{s}\exp({\bm{w}}_{j}^{\top}{\bm{x}}_{i})}\right).
$$
The gradient of the loss is expressed as
$$
\frac{\partial{\mathcal{L}}}{\partial{\bm{w}}_{k}}=\frac{1}{M}\sum_{i=1}^{M}\sum_{\ell=1}^{s}y_{i\ell}\frac{\frac{\partial}{\partial{\bm{w}}_{k}}\left[\sum_{j\neq\ell}\exp(({\bm{w}}_{j}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})\right]}{1+\sum_{j\neq\ell}\exp(({\bm{w}}_{j}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})},
$$
where
$$
\frac{\partial}{\partial{\bm{w}}_{k}}\left[\sum_{j\neq\ell}\exp(({\bm{w}}_{j}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})\right]=\begin{cases}-\sum_{j\neq k}\exp(({\bm{w}}_{j}-{\bm{w}}_{k})^{\top}{\bm{x}}_{i})\,{\bm{x}}_{i}&\text{if }k=\ell,\\
\exp(({\bm{w}}_{k}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})\,{\bm{x}}_{i}&\text{if }k\neq\ell.\end{cases}
$$
Therefore,
$$
\frac{\partial{\mathcal{L}}}{\partial{\bm{w}}_{k}}=\frac{1}{M}\sum_{i=1}^{M}\left\{y_{ik}\frac{-\sum_{j\neq k}\exp(({\bm{w}}_{j}-{\bm{w}}_{k})^{\top}{\bm{x}}_{i})}{1+\sum_{j\neq k}\exp(({\bm{w}}_{j}-{\bm{w}}_{k})^{\top}{\bm{x}}_{i})}+\sum_{\ell\neq k}y_{i\ell}\frac{\exp(({\bm{w}}_{k}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})}{1+\sum_{j\neq\ell}\exp(({\bm{w}}_{j}-{\bm{w}}_{\ell})^{\top}{\bm{x}}_{i})}\right\}{\bm{x}}_{i}.
$$
Finally, solving for $\frac{∂{\mathcal{L}}}{∂{\bm{w}}_{k}}={\bm{0}}$ yields (2). ∎
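As a quick numerical sanity check (an illustrative snippet of ours, not part of the derivation), the expanded gradient above can be compared against the familiar softmax-minus-one-hot form it simplifies to:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, s = 6, 3, 4                                # samples, input dimension, classes
X = rng.normal(size=(M, d))
Y = np.eye(s)[rng.integers(0, s, size=M)]        # one-hot labels y_{ik}
W = rng.normal(size=(s, d))                      # rows are w_1, ..., w_s

def grad_expanded(W, X, Y):
    """Gradient of the loss in the expanded form derived above."""
    G = np.zeros_like(W)
    for k in range(s):
        for i in range(M):
            num_k = sum(np.exp((W[j] - W[k]) @ X[i]) for j in range(s) if j != k)
            term = -Y[i, k] * num_k / (1 + num_k)
            for l in range(s):
                if l != k:
                    den_l = 1 + sum(np.exp((W[j] - W[l]) @ X[i]) for j in range(s) if j != l)
                    term += Y[i, l] * np.exp((W[k] - W[l]) @ X[i]) / den_l
            G[k] += term * X[i]
    return G / M

def grad_softmax(W, X, Y):
    """Equivalent closed form: average of (softmax(W x_i) - y_i) x_i^T over i."""
    Z = X @ W.T
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return (P - Y).T @ X / M

assert np.allclose(grad_expanded(W, X, Y), grad_softmax(W, X, Y))
```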
B.2 Proofs for the Fully Synthetic Case
*Proof of Theorem 1.*
Write $p^{(m)}_{i}=X^{(m)}_{i}/n$ where $X^{(m)}_{i}\sim B(n,p^{(m-1)}_{i})$ is binomial when conditioned on ${\bm{p}}^{(m-1)}$ . So we have
$$
\mathbb{E}\left[{p^{(m)}_{i}}^{2}\mid{\bm{p}}^{(m-1)}\right]=\frac{1}{n}p^{(m-1)}_{i}+\left(1-\frac{1}{n}\right){p^{(m-1)}_{i}}^{2},
$$
and by the law of total expectation
$$
S_{m}=\frac{1}{n}+\left(1-\frac{1}{n}\right)S_{m-1}=1-\left(1-\frac{1}{n}\right)^{m}(1-S_{0}). \tag{14}
$$
Note that $S_{m}\nearrow 1$ as $m→∞$ and
$$
S_{m}=\mathbb{E}\left[\sum_{i}p^{(m)}_{i}p^{(m)}_{i}\right]\leq\mathbb{E}\left[\left(\sum_{i}p^{(m)}_{i}\right)\|{\bm{p}}^{(m)}\|_{\infty}\right]=\mathbb{E}\|{\bm{p}}^{(m)}\|_{\infty}.
$$
Recall that $\rho_{m}:={\mathbb{P}}\left(\|{\bm{p}}^{(m)}\|_{∞}=1\right)$ and $T:=\inf\{m∈{\mathbb{N}}:\|{\bm{p}}^{(m)}\|_{∞}=1\}≥ 1$ . Since $\sigma_{m}≥ 1/\tilde{s}$ and $\|{\bm{p}}^{(m)}\|_{∞}∈\{\frac{1}{n},...,\frac{n-1}{n},1\}$ , we have
$$
1\cdot\rho_{m}+\frac{1}{\tilde{s}}(1-\rho_{m})\leq\mathbb{E}\sigma_{m}=S_{m}\leq\mathbb{E}\|{\bm{p}}^{(m)}\|_{\infty}\leq 1\cdot\rho_{m}+\left(1-\frac{1}{n}\right)(1-\rho_{m}),
$$
from which (7) follows thanks to (14). For $k=1,2,...$ we have ${\mathbb{P}}(T>k)={\mathbb{P}}(\|{\bm{p}}^{(k)}\|_{∞}<1)=1-\rho_{k}$ and thus
$$
\frac{1-S_{0}}{1-1/\tilde{s}}(1-1/n)^{k}\leq{\mathbb{P}}(T>k)\leq n(1-S_{0})(1-1/n)^{k},
$$
which establishes (8) because $\mathbb{E}[T]=\sum_{k=0}^{∞}{\mathbb{P}}(T>k)$ . ∎
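For intuition, the closed form (14) and the behaviour of the collapse time $T$ can be checked by simulation; the sketch below (ours, with an arbitrary toy initial distribution) does this for the fully synthetic setting:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(p, n, m_max, runs=1000):
    """Monte Carlo estimate of S_m = E[sum_i (p_i^(m))^2] and of the collapse
    time T = inf{m : ||p^(m)||_inf = 1} in the fully synthetic setting."""
    S_hat = np.zeros(m_max)
    T = np.full(runs, m_max + 1)                  # censored value if no collapse occurs
    for r in range(runs):
        q = np.asarray(p, dtype=float)
        for m in range(1, m_max + 1):
            draws = rng.choice(len(q), size=n, p=q)      # n samples from p^(m-1)
            q = np.bincount(draws, minlength=len(q)) / n  # p^(m)
            S_hat[m - 1] += np.sum(q**2)
            if T[r] > m_max and q.max() == 1.0:
                T[r] = m
    return S_hat / runs, T

p = np.full(20, 1 / 20)                            # toy distribution with S_0 = 0.05
n, m_max = 10, 200
S_hat, T = simulate(p, n, m_max)
S_theory = 1 - (1 - 1 / n) ** np.arange(1, m_max + 1) * (1 - np.sum(p**2))   # eq. (14)
print(np.max(np.abs(S_hat - S_theory)))            # small, up to Monte Carlo noise
print(T.mean())                                     # compare with the bounds in (8)
```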
*Proof of Proposition 1.*
Fix $i∈[s]$ . For $m≥ 1$ consider the events $E_{m}:=\{T≤ m\}$ and $F_{m}:=\{{\bm{p}}_{i}^{(m)}=1\}$ . Then ${\bm{p}}_{i}^{(m)}∈\{0,1\}$ on $E_{m}$ and $F_{m}\subseteq E_{m}$ . Observe that
$$
E_{m}\nearrow\cup_{m}E_{m}\quad\text{and}\quad F_{m}\nearrow\cup_{m}F_{m}=\{\lim_{m\to\infty}{\bm{p}}^{(m)}=\delta_{i}\},
$$
where ${\mathbb{P}}(\cup_{m}E_{m})=1$ by Theorem 1 and in particular ${\mathbb{P}}(E_{m})>0$ for $m$ large. Thus
$$
p_{i}=\lim_{m→∞}\mathbb{E}[p_{i}^{(m)}]=\lim_{m→∞}\left[{\mathbb{P}}(E_{m})\,\mathbb{E}[p_{i}^{(m)}\mid E_{m}]+{\mathbb{P}}(E_{m}^{c})\,\mathbb{E}[p_{i}^{(m)}\mid E_{m}^{c}]\right]=\lim_{m→∞}{\mathbb{P}}(F_{m})={\mathbb{P}}\left(\lim_{m→∞}{\bm{p}}^{(m)}=\delta_{i}\right),
$$
where the third equality holds because, on $E_{m}$ , the variable $p_{i}^{(m)}$ is the indicator of $F_{m}$ and ${\mathbb{P}}(E_{m}^{c})\to 0$ . This proves the assertion. ∎
B.3 Proofs for the Partially Synthetic Case
*Proof of Theorem 2.*
Let $\nu^{(m)}_{i}:=\mathbb{E}\left[{p^{(m)}_{i}}^{2}\right]$ , $N^{\prime}:=N+n$ and ${\bm{y}}^{(m)}_{i}=(y^{(m)}_{1,i},...,y^{(m)}_{s,i})$ . Then
$$
\nu^{(1)}_{i}=\frac{p_{i}}{N}+\left(1-\frac{1}{N}\right)p_{i}^{2}
$$
and for $m≥ 2$
$$
\begin{aligned}
\nu^{(m)}_{i}&=\frac{1}{N^{\prime 2}}\mathbb{E}\left[\left(\sum_{k=1}^{N}y^{(0)}_{i,k}\right)^{2}+\left(\sum_{k=1}^{n}y^{(m-1)}_{i,k}\right)^{2}+2\sum_{j=1}^{N}\sum_{k=1}^{n}y^{(0)}_{i,j}y^{(m-1)}_{i,k}\right]\\
&=\frac{1}{N^{\prime 2}}\left[Np_{i}+(N^{2}-N)p_{i}^{2}+np_{i}+(n^{2}-n)\nu^{(m-1)}_{i}\right]+\frac{2}{N^{\prime 2}}\sum_{j=1}^{N}\sum_{k=1}^{n}\mathbb{E}\left[y^{(0)}_{i,j}y^{(m-1)}_{i,k}\right]\\
&=\frac{p_{i}}{N^{\prime}}+\frac{N^{2}-N}{N^{\prime 2}}p_{i}^{2}+\frac{n^{2}-n}{N^{\prime 2}}\nu^{(m-1)}_{i}+\frac{2}{N^{\prime 2}}\sum_{j=1}^{N}\sum_{k=1}^{n}\mathbb{E}\left[y^{(0)}_{i,j}y^{(m-1)}_{i,k}\right].
\end{aligned}
$$
For $m≥ 2$ , since $y^{(m-1)}_{i,k}$ is conditionally independent of $y^{(0)}_{i,j}$ given $p^{(m-1)}_{i}$ , by conditioning on $y^{(0)}_{i,j}$ and $p^{(m-1)}_{i}$ we have
$$
\mathbb{E}\left[y^{(0)}_{i,j}y^{(m-1)}_{i,k}\right]=\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-1)}_{i}\right]
$$
and hence for $m≥ 2$ ,
$$
\nu^{(m)}_{i}=\frac{p_{i}}{N^{\prime}}+\frac{N^{2}-N}{N^{\prime 2}}p_{i}^{2}+\frac{n^{2}-n}{N^{\prime 2}}\nu^{(m-1)}_{i}+\frac{2n}{N^{\prime 2}}\sum_{j=1}^{N}\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-1)}_{i}\right]. \tag{15}
$$
By the definition of $p^{(m)}_{i}$ , for $m≥ 3$
$$
\begin{aligned}
\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-1)}_{i}\right]&=\frac{1}{N^{\prime}}\mathbb{E}\left[y^{(0)}_{i,j}\sum_{k=1}^{N}y^{(0)}_{i,k}\right]+\frac{1}{N^{\prime}}\mathbb{E}\left[y^{(0)}_{i,j}\sum_{k=1}^{n}y^{(m-2)}_{i,k}\right]\\
&=\frac{p_{i}+(N-1)p_{i}^{2}}{N^{\prime}}+\frac{n}{N^{\prime}}\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-2)}_{i}\right]
\end{aligned}
$$
and
$$
\mathbb{E}\left[y^{(0)}_{i,j}p^{(1)}_{i}\right]=\frac{1}{N}p_{i}+\left(1-\frac{1}{N}\right)p_{i}^{2}=\frac{p_{i}+(N-1)p_{i}^{2}}{N}
$$
which gives
$$
\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-1)}_{i}\right]=\frac{1-\left(\frac{n}{N^{\prime}}\right)^{m-2}}{1-\frac{n}{N^{\prime}}}\cdot\frac{p_{i}+(N-1)p_{i}^{2}}{N^{\prime}}+\left(\frac{n}{N^{\prime}}\right)^{m-2}\mathbb{E}\left[y^{(0)}_{i,j}p^{(1)}_{i}\right].
$$
Setting $\alpha:=n/N^{\prime}$ , we have $N=(1-\alpha)N^{\prime}$ and hence
$$
\mathbb{E}\left[y^{(0)}_{i,j}p^{(m-1)}_{i}\right]=\frac{p_{i}+(N-1)p_{i}^{2}}{N}
$$
for all $m≥ 2$ . Plugging this back into the expression (15) for $\nu^{(m)}_{i}$ gives
$$
\nu^{(m)}_{i}=\frac{(1-\alpha)(1+2\alpha)}{N}p_{i}+\left(1-\frac{1}{N}\right)(1-\alpha^{2})p_{i}^{2}+\alpha\left[\left(1+\frac{1}{N}\right)\alpha-\frac{1}{N}\right]\nu^{(m-1)}_{i}.
$$
Let $\beta:=\alpha\left[\left(1+\frac{1}{N}\right)\alpha-\frac{1}{N}\right]$ . Unrolling this recursion, for $m≥ 1$
$$
\nu^{(m+1)}_{i}=\frac{1-\beta^{m}}{1-\beta}\left[\frac{(1-\alpha)(1+2\alpha)}{N}p_{i}+\left(1-\frac{1}{N}\right)(1-\alpha^{2})p_{i}^{2}\right]+\beta^{m}\nu^{(1)}_{i}
$$
and summing across $i∈[s]$ gives the expression for $S_{m+1}$ . Note that $0<\beta<\alpha<1$ , which gives both an upper bound and a lower bound on $\nu^{(m)}_{i}$ . In particular, for $m≥ 2$
$$
\nu^{(m)}_{i}<\frac{1}{1+(1+1/N)\alpha}\left[\frac{1}{N}(1+2\alpha)p_{i}+\left(1-\frac{1}{N}\right)(1+\alpha)p_{i}^{2}\right]
$$
and therefore
$$
S_{m}<\frac{1}{1+(1+1/N)\alpha}\left[\frac{1}{N}(1+2\alpha)+\left(1-\frac{1}{N}\right)(1+\alpha)S_{0}\right]=1-\gamma,
$$
where
$$
\gamma:=\frac{\left(1-\frac{1}{N}\right)(1+\alpha)(1-S_{0})}{1+(1+1/N)\alpha}\in\left[\frac{1+\alpha}{2+3\alpha}(1-S_{0}),1-S_{0}\right]
$$
since $0≤ 1/N≤ 1/2$ . ∎
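As a sanity check on the recursion for $\nu^{(m)}_{i}$ derived above, the following Monte Carlo sketch (ours; the toy distribution and sample sizes are arbitrary) compares simulated second moments with the recursion:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_nu(p, N, n, m_max, runs=10000):
    """Monte Carlo estimate of nu_i^(m) = E[(p_i^(m))^2] when every generation reuses
    the same N real samples and adds n fresh synthetic samples from the previous one."""
    nu = np.zeros((m_max, len(p)))
    for _ in range(runs):
        real = np.bincount(rng.choice(len(p), size=N, p=p), minlength=len(p))
        q = real / N                                        # p^(1)
        nu[0] += q**2
        for m in range(2, m_max + 1):
            synth = np.bincount(rng.choice(len(p), size=n, p=q), minlength=len(p))
            q = (real + synth) / (N + n)                    # p^(m)
            nu[m - 1] += q**2
    return nu / runs

p = np.array([0.5, 0.3, 0.2])
N, n, m_max = 20, 10, 6
nu_hat = simulate_nu(p, N, n, m_max)

# Recursion derived above: nu^(m) = A p + B p^2 + beta nu^(m-1) for m >= 2.
alpha = n / (N + n)
A = (1 - alpha) * (1 + 2 * alpha) / N
B = (1 - 1 / N) * (1 - alpha**2)
beta = alpha * ((1 + 1 / N) * alpha - 1 / N)
nu_rec = [p / N + (1 - 1 / N) * p**2]                       # nu^(1)
for _ in range(2, m_max + 1):
    nu_rec.append(A * p + B * p**2 + beta * nu_rec[-1])
print(np.max(np.abs(nu_hat - np.array(nu_rec))))            # small, up to Monte Carlo noise
```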
We state a more general result which implies Theorem 3.
**Theorem 4**
*Let $G_{n}(s)$ and $\varphi$ be as in Lemma 1. Then for $k≥ 1$ we have
$$
\mathbb{E}\|{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}\|_{1}<\frac{1}{N}\sqrt{\frac{n\pi}{\varphi(\zeta)}}G_{n}(s),
$$
where
$$
\zeta=\frac{1}{2}-\left(\frac{1}{2}-2\mathbb{E}\left[\max_{A\subseteq[s]}{\bm{p}}^{(1)}(A){\bm{p}}^{(1)}([s]\setminus A)\right]\right)\left(\frac{N}{N+n}\right)^{2}.
$$*
*Proof of Theorem 4.*
Since all generations share the same original data source, we can write
$$
{\bm{p}}^{(k+1)}=\frac{N}{N+n}{\bm{p}}^{(1)}+\frac{n}{N+n}\widehat{{\bm{p}}^{(k)}},\quad\text{where}\quad\widehat{{\bm{p}}^{(k)}}=\frac{1}{n}\sum_{i=1}^{n}{\bm{y}}^{(k)}_{i} \tag{17}
$$
and $\{{\bm{y}}^{(k)}_{i}\}_{i=1,...,n}$ are i.i.d. multinomial with parameter ${\bm{p}}^{(k)}$ and one trial $(k≥ 1)$ . This gives
$$
{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}=\frac{n}{N+n}\left[\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(1)}\right],
$$
and applying the triangle inequality yields
$$
\|{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}\|_{1}=\frac{n}{N+n}\left\|\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(1)}\right\|_{1}\leq\frac{n}{N+n}\left\|\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(k)}\right\|_{1}+\frac{n}{N+n}\left\|{\bm{p}}^{(k)}-{\bm{p}}^{(1)}\right\|_{1}.
$$
Taking the expectation and solving the recursion gives for $k≥ 1$
$$
\mathbb{E}\|{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}\|_{1}\leq\sum_{j=1}^{k}\left(\frac{n}{N+n}\right)^{k+1-j}\mathbb{E}\left\|\widehat{{\bm{p}}^{(j)}}-{\bm{p}}^{(j)}\right\|_{1}.
$$
From Lemma 1, we have
$$
{\mathbb{P}}\left(\left\|\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(k)}\right\|_{1}\geq t\,\Big{|}\,{\bm{p}}^{(k)}\right)\leq G_{n}(s)e^{-n\varphi(\pi_{k})t^{2}/4}
$$
where $\pi_{k}:=\pi_{{\bm{p}}^{(k)}}$ . Integrating over $t∈[0,∞)$ , we get
$$
\mathbb{E}\left[\left\|\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(k)}\right\|_{1}\,\Big{|}\,{\bm{p}}^{(k)}\right]\leq G_{n}(s)\sqrt{\frac{\pi}{n\varphi(\pi_{k})}},
$$
and by Jensen’s inequality and the concavity of $x\mapsto\varphi(x)^{-1/2}$ ,
$$
\mathbb{E}\left\|\widehat{{\bm{p}}^{(k)}}-{\bm{p}}^{(k)}\right\|_{1}\leq\sqrt{\frac{\pi}{n}}G_{n}(s)\,\mathbb{E}\left[\varphi(\pi_{k})^{-1/2}\right]\leq\sqrt{\frac{\pi}{n\varphi(\mathbb{E}\pi_{k})}}G_{n}(s).
$$
Thus
$$
\mathbb{E}\|{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}\|_{1}\leq\sqrt{\frac{\pi}{n}}G_{n}(s)\sum_{j=1}^{k}\left(\frac{n}{N+n}\right)^{k+1-j}\frac{1}{\sqrt{\varphi(\mathbb{E}\pi_{j})}}. \tag{19}
$$
It remains to upper bound $\mathbb{E}\pi_{j}$ since $\varphi$ is decreasing. For $A\subseteq[s]$ let $A^{c}$ denote its complement. Then by (17), for $A\subseteq[s]$ we have
$$
\begin{aligned}
{\bm{p}}^{(k+1)}(A){\bm{p}}^{(k+1)}(A^{c})&=\frac{N^{2}}{(N+n)^{2}}{\bm{p}}^{(1)}(A){\bm{p}}^{(1)}(A^{c})+\frac{n^{2}}{(N+n)^{2}}\widehat{{\bm{p}}^{(k)}}(A)\widehat{{\bm{p}}^{(k)}}(A^{c})\\
&\quad+\frac{Nn}{(N+n)^{2}}\left[{\bm{p}}^{(1)}(A^{c})\widehat{{\bm{p}}^{(k)}}(A)+{\bm{p}}^{(1)}(A)\widehat{{\bm{p}}^{(k)}}(A^{c})\right]\\
&\leq\frac{N^{2}}{(N+n)^{2}}{\bm{p}}^{(1)}(A){\bm{p}}^{(1)}(A^{c})+\frac{n^{2}}{(N+n)^{2}}\widehat{{\bm{p}}^{(k)}}(A)\widehat{{\bm{p}}^{(k)}}(A^{c})+\frac{Nn}{(N+n)^{2}}.
\end{aligned}
$$
Let $\lambda_{k}:=\max_{A\subseteq[s]}{\bm{p}}^{(k)}(A){\bm{p}}^{(k)}(A^{c})$ , so the inequality above implies
$$
\begin{aligned}
\mathbb{E}[\lambda_{k+1}\mid{\bm{p}}^{(1)},{\bm{p}}^{(k)}]&\leq\frac{N^{2}}{(N+n)^{2}}\lambda_{1}+\frac{n^{2}}{(N+n)^{2}}\mathbb{E}\left[\max_{A\subseteq[s]}\widehat{{\bm{p}}^{(k)}}(A)\widehat{{\bm{p}}^{(k)}}(A^{c})\,\Big{|}\,{\bm{p}}^{(k)}\right]+\frac{Nn}{(N+n)^{2}}\\
&\leq\frac{N^{2}}{(N+n)^{2}}\lambda_{1}+\frac{n^{2}}{4(N+n)^{2}}+\frac{Nn}{(N+n)^{2}}
\end{aligned}
$$
and thus
$$
\mathbb{E}[\lambda_{k+1}]\leq\frac{N^{2}\mathbb{E}[\lambda_{1}]+Nn+n^{2}/4}{(N+n)^{2}}=\frac{1}{4}-\left(\frac{1}{4}-\mathbb{E}[\lambda_{1}]\right)\left(\frac{N}{N+n}\right)^{2}.
$$
Observe that for $0≤ x≤ 1$ we have $\min(x,1-x)≤ 2x(1-x)$ , so $\mathbb{E}[\pi_{k}]≤ 2\mathbb{E}[\lambda_{k}]$ . Writing
$$
\zeta:=\frac{1}{2}-\left(\frac{1}{2}-2\mathbb{E}[\lambda_{1}]\right)\left(\frac{N}{N+n}\right)^{2}
$$
we have $\mathbb{E}[\pi_{k}]≤\zeta$ , so from (19) we see that
$$
\mathbb{E}\|{\bm{p}}^{(k+1)}-{\bm{p}}^{(1)}\|_{1}\leq\sqrt{\frac{\pi}{n\varphi(\zeta)}}G_{n}(s)\sum_{j=1}^{k}\left(\frac{n}{N+n}\right)^{k+1-j}<\frac{1}{N}\sqrt{\frac{n\pi}{\varphi(\zeta)}}G_{n}(s).
$$
∎
*Proof of Theorem 3.*
Theorem 3 is an immediate consequence of Theorem 4 by replacing $\varphi$ with its minimum value $2$ . ∎
Appendix C Architecture & training parameters for GPT2 experiments
All generation models share the same vanilla GPT2-type architecture (nanoGPT, https://github.com/karpathy/nanoGPT), a decoder-only generative model, with the configuration and training parameters summarized in Table 1. These parameters were chosen to achieve the best validation loss when training the first-generation model on data produced by the ground-truth generative model ${\bm{p}}^{(0)}$ as defined in Section 4.1.
| Parameter | Value |
| --- | --- |
| Context length | $128$ |
| Embedding dimension | $256$ |
| Number of layers | $8$ |
| Number of self-attention heads | $4$ |
| Vocabulary size | $65$ |
| Dropout | $0.2$ |
| Learning rate | $10^{-3}$ |
| Batch size | $256$ |
| Max iterations | $2000$ |
Table 1: Architecture and training parameters in the setting of Section 4.1.
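For concreteness, a nanoGPT-style model with these hyperparameters would be instantiated roughly as follows (a sketch assuming the `GPTConfig`/`GPT` interface of the nanoGPT repository; field names may vary between versions):

```python
# Sketch only: assumes the GPTConfig / GPT classes defined in model.py of
# https://github.com/karpathy/nanoGPT; exact field names may differ by version.
from model import GPT, GPTConfig

config = GPTConfig(
    block_size=128,   # context length
    vocab_size=65,    # character-level vocabulary
    n_layer=8,        # transformer layers
    n_head=4,         # self-attention heads
    n_embd=256,       # embedding dimension
    dropout=0.2,
)
model = GPT(config)
# The training loop (not shown) uses learning rate 1e-3, batch size 256 and
# at most 2000 iterations, matching Table 1.
```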