2505.20674v2

# Pretraining Language Models to Ponder in Continuous Space

> Zhouhan Lin is the corresponding author.

## Abstract

Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.

## 1 Introduction

In the pursuit of improving model performance, scaling up model parameters and data sizes has long been the most widely adopted and accepted approach (Kaplan et al., 2020; Brown et al., 2020; Liu et al., 2024).
However, this approach faces several bottlenecks, including the exhaustion of high-quality data (Villalobos et al., 2022; Muennighoff et al., 2023), the observed saturation in scaling laws (Hackenburg et al., 2025; Hoffmann et al., 2022a) and substantial communication overhead in distributed pre-training that grows super-linearly with model size (Narayanan et al., 2021; Pati et al., 2023; Li et al., 2024). On the other hand, if we look at humans, the growth of human capabilities does not stem from simply increasing the number of neurons in the brain. Instead, when faced with complex problems, humans often enhance their problem-solving abilities by repeatedly pondering, engaging in deep cognitive processing before articulating their thoughts. Analogously, in large language models, the most relevant research direction is test-time scaling. In particular, following the advancements in o1 and R1 (Jaech et al., 2024; DeepSeek-AI et al., 2025), generating long chains of thought (CoT) has emerged as the mainstream approach for scaling test-time computation. However, CoT-based methods also exhibit several drawbacks: they often require curated human-annotated datasets (Allen-Zhu & Li, 2023) and carefully designed reinforcement learning algorithms (Pang et al., 2025). Moreover, small models rarely benefit from CoT (Li et al., 2023), and the upper bound of performance remains constrained by the base pretrained model (Yue et al., 2025). Additionally, current language models employing CoT are still confined to discrete language spaces with fixed vocabularies, which, according to recent studies (Fedorenko et al., 2024; Hao et al., 2024; Pfau et al., 2024), are primarily optimized for communication rather than for internal computational thinking. To overcome these challenges and inspired by human pondering, we introduce the Pondering Language Model (Pondering LM), which relies solely on self-supervised learning. 
Pondering LMs can be naturally learned through pretraining on large-scale general corpora, without the need for human-annotated datasets or reinforcement learning. During pondering, instead of generating a discrete token sampled from the prediction distribution, the model produces a weighted sum of all token embeddings based on the predicted probabilities. This generated embedding is then fed back into the language model, allowing it to iteratively refine its predictions. As the weighted embedding is continuous, Pondering LMs overcome the expressive limitations of discrete token vocabularies and enable fully differentiable, end-to-end pretraining via gradient descent. Furthermore, by performing more computations per parameter, Pondering LMs achieve higher parameter knowledge density (Allen-Zhu & Li, 2024), potentially reducing communication costs at scale.

Experimentally, by introducing the pondering process during pretraining, our Pondering GPT-2, LLaMA, and Pythia models achieve pretraining perplexity comparable to that of vanilla models with twice as many parameters. Furthermore, Pondering Pythia models significantly outperform the official Pythia models across nine popular downstream tasks: PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. Notably, increasing the number of pondering steps consistently enhances model performance, underscoring the substantial potential of this approach. Moreover, our method is orthogonal to traditional scaling strategies, including parameter scaling and inference-time scaling via CoT, and thus can complement existing techniques, potentially introducing a third scaling axis to enhance model performance.
<details> <summary>x1.png Details</summary>

![c0c8f818](/v1/image/c0c8f8187ca189852fe946970967d1b6ed745e9eea79b56010c9fd7a83c47c77)

Diagram: Pondering Language Model architecture. Labeled elements: Input Embedding (blue), Word Embed (beige), Predicted Probability (green), the dimensions "Vocab Size" and "Hidden Size", and an "×N" annotation near the word embeddings.

</details>

```python
import torch
import torch.nn as nn


class PonderingLanguageModel(nn.Module):
    def __init__(self, lm, v, h, k):
        super().__init__()  # required before registering submodules and parameters
        self.lm = lm  # language model: maps embeddings to probabilities over the vocabulary
        self.vocab_size = v
        self.hidden_dim = h
        self.pondering_steps = k
        # Vocabulary embedding matrix of shape (v, h)
        self.embedding = nn.Parameter(torch.randn(v, h), requires_grad=True)

    def forward(self, input_tokens):
        input_embedding = self.embedding[input_tokens]
        # Iterative pondering
        for t in range(self.pondering_steps):
            predicted_prob = self.lm(input_embedding)
            # Probability-weighted sum of all token embeddings
            pondering_embedding = torch.matmul(predicted_prob, self.embedding)
            # Residual connection: accumulate onto the input embeddings
            input_embedding = input_embedding + pondering_embedding
        # Final forward pass
        final_prob = self.lm(input_embedding)
        return final_prob
```

Figure 1: Overview of the Pondering Language Model.
Given input token embeddings, the base LM produces a probability distribution over the vocabulary, which is used to compute a continuous “pondering embedding” via a weighted sum of all token embeddings. This embedding is then added residually to the original input embeddings and fed back into the LM. By repeating this process for $k$ steps within a single token prediction, the model iteratively refines its output distributions. The accompanying pseudocode illustrates the implementation details.

## 2 Pondering Language Model

In this section, we introduce our proposed Pondering Language Model (Figure 1), which integrates a pondering mechanism into language models via pretraining. Given that pretraining fundamentally constitutes a language modeling task, we briefly review this task before detailing our proposed model.

Language Modeling. Given a sequence of tokens $X=[x_{1},x_{2},\dots,x_{n}]$, the primary objective of language modeling is typically to maximize the likelihood of predicting each token based on its preceding tokens. Formally, this is expressed through the joint probability factorization:

$$
P(x_{1},x_{2},\dots,x_{n})=\prod_{t=1}^{n}P(x_{t}\mid x_{<t}) \tag{1}
$$

Current language models first map tokens to input embeddings $\mathbf{E}^{0}=[\mathbf{e}^{0}_{1},\mathbf{e}^{0}_{2},\dots,\mathbf{e}^{0}_{n}]$, where each embedding $\mathbf{e}^{0}_{i}\in\mathbb{R}^{d}$ is selected from a vocabulary embedding matrix $\mathbf{V}=[\mathbf{e}_{1},\mathbf{e}_{2},\dots,\mathbf{e}_{|V|}]$, with vocabulary size $|V|$ and hidden dimension $d$. The language model then generates output probabilities $\mathbf{P}$ for predicting the next token at each position:

$$
\mathbf{P}=\text{LM}(\mathbf{E}^{0}),\quad\mathbf{P}\in\mathbb{R}^{n\times|V|} \tag{2}
$$

The cross-entropy loss is computed directly from these predicted probabilities $\mathbf{P}$ to pretrain the language model.

Pondering Mechanism.
In our proposed method, instead of directly using the predicted output probabilities $\mathbf{P}$ to compute the cross-entropy loss, we utilize these probabilities as weights to sum embeddings of all candidate tokens, forming what we call a "pondering embedding". Given the probability distribution $\mathbf{p}\in\mathbb{R}^{|V|}$ at each position, the pondering embedding $\mathbf{t}$ is:

$$
\mathbf{t}=\sum_{i=1}^{|V|}p_{i}\mathbf{e}_{i},\quad p_{i}\in\mathbb{R},\ \mathbf{e}_{i}\in\mathbb{R}^{d} \tag{3}
$$

For computational efficiency, pondering embeddings $\mathbf{t}$ for all positions can be calculated simultaneously via matrix multiplication:

$$
\mathbf{T}=\mathbf{P}\mathbf{V},\quad\mathbf{T}\in\mathbb{R}^{n\times d} \tag{4}
$$

In practice, we use only the top-K tokens with highest probabilities at each position to compute the pondering embedding, reducing complexity from $\mathcal{O}(n|V|d)$ to $\mathcal{O}(nKd)$. With $K=100\ll|V|$, this does not degrade LM performance and makes the matrix multiplication overhead negligible within the overall LM computations.

Through these pondering embeddings, we effectively map predicted probabilities back into the embedding space, preserving embedding information from all possible candidate tokens. To maintain the information from the original input embeddings, we integrate the pondering embeddings using a residual connection:

$$
\mathbf{E}^{1}=\mathbf{E}^{0}+\mathbf{T}=[\mathbf{e}^{0}_{1}+\mathbf{t}_{1},\mathbf{e}^{0}_{2}+\mathbf{t}_{2},\dots,\mathbf{e}^{0}_{n}+\mathbf{t}_{n}] \tag{5}
$$

We then feed the updated embeddings $\mathbf{E}^{1}$ back into the same language model to obtain refined output probabilities:

$$
\mathbf{P}^{1}=\text{LM}(\mathbf{E}^{1}),\quad\mathbf{P}^{1}\in\mathbb{R}^{n\times|V|} \tag{6}
$$

After obtaining $\mathbf{P}^{1}$, we can iteratively repeat the previous process to achieve multi-step pondering.
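As a minimal sketch of the top-K variant of Equations 3 to 5, the snippet below computes pondering embeddings from only the K most probable tokens at each position. It is written in NumPy for brevity rather than in the PyTorch of Figure 1. The function name `topk_pondering_embedding` and its arguments are hypothetical, and the choice to use the top-K probabilities without renormalizing them is an assumption, since the text does not specify it.

```python
import numpy as np


def topk_pondering_embedding(probs, vocab_emb, top_k=100):
    """Approximate T = P V (Eq. 4) using only the top-k tokens per position.

    probs:     (n, |V|) predicted probabilities, one row per position
    vocab_emb: (|V|, d) vocabulary embedding matrix V
    returns:   (n, d) pondering embeddings

    Assumption: the selected probabilities are used as-is (not renormalized).
    """
    vocab_size = probs.shape[-1]
    k = min(top_k, vocab_size)
    # Indices of the k largest probabilities in each row (order within the k is arbitrary).
    idx = np.argpartition(-probs, k - 1, axis=-1)[:, :k]      # (n, k)
    top_p = np.take_along_axis(probs, idx, axis=-1)           # (n, k)
    top_e = vocab_emb[idx]                                    # (n, k, d)
    # Weighted sum over the k selected embeddings: O(n k d) instead of O(n |V| d).
    return np.einsum("nk,nkd->nd", top_p, top_e)


def pondering_update(input_emb, probs, vocab_emb, top_k=100):
    """Residual update E^1 = E^0 + T (Eq. 5)."""
    return input_emb + topk_pondering_embedding(probs, vocab_emb, top_k)
```

When `top_k` equals the vocabulary size, the result coincides with the full product $\mathbf{P}\mathbf{V}$, which gives a simple sanity check for the approximation.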
Specifically, given a predefined number of pondering steps $k$ (unless otherwise specified, we set $k=3$ for subsequent experiments), we iteratively compute new pondering embeddings and integrate them with the original input embeddings using residual connections, feeding the result back into the same language model until $k$ steps are reached:

$$
\begin{aligned}
\mathbf{E}^{0} &=[\mathbf{e}^{0}_{1},\mathbf{e}^{0}_{2},\dots,\mathbf{e}^{0}_{n}],\quad\mathbf{P}^{0}=\text{LM}(\mathbf{E}^{0}),\quad\mathbf{T}^{1}=\mathbf{P}^{0}\mathbf{V} \\
\mathbf{E}^{1} &=\mathbf{E}^{0}+\mathbf{T}^{1}=[\mathbf{e}^{0}_{1}+\mathbf{t}^{1}_{1},\mathbf{e}^{0}_{2}+\mathbf{t}^{1}_{2},\dots,\mathbf{e}^{0}_{n}+\mathbf{t}^{1}_{n}],\quad\mathbf{P}^{1}=\text{LM}(\mathbf{E}^{1}),\quad\mathbf{T}^{2}=\mathbf{P}^{1}\mathbf{V} \\
&\ \ \vdots \\
\mathbf{E}^{k} &=\mathbf{E}^{0}+\sum_{i=1}^{k}\mathbf{T}^{i}=[\mathbf{e}^{0}_{1}+\mathbf{t}^{1}_{1}+\dots+\mathbf{t}^{k}_{1},\dots,\mathbf{e}^{0}_{n}+\mathbf{t}^{1}_{n}+\dots+\mathbf{t}^{k}_{n}],\quad\mathbf{P}^{k}=\text{LM}(\mathbf{E}^{k})
\end{aligned} \tag{7}
$$

This iterative pondering mechanism progressively refines the model’s predictions. Finally, we can use the refined output probabilities $\mathbf{P}^{k}$ after $k$ pondering steps to compute the cross-entropy loss and optimize the language model to perform $k$-step pondering.

## 3 Experiments

Our experiments consist of four parts. First, we validate the scaling curves of pondering models on widely used GPT-2 and LLaMA architectures. Second, we perform large-scale pretraining of PonderingPythia models on the Pile dataset and compare their scaling curves and language modeling capabilities with those of the official Pythia suite (Biderman et al., 2023).
Third, we evaluate the downstream task performance of PonderingPythia models, including nine popular general tasks and an instruction-following task, and compare the results with the official Pythia, OPT (Zhang et al., 2022), Bloom (Le Scao et al., 2023), GPT-Neo (Black et al., 2021), and TinyLlama (Zhang et al., 2024) models. Finally, we investigate the impact of the number of pondering steps on pretraining perplexity.

### 3.1 Small Scale Validation on GPT-2 and LLaMA

We apply our proposed method to two popular Transformer architectures, GPT-2 and LLaMA, to investigate its general applicability and effectiveness. Specifically, we plot and compare scaling curves between vanilla GPT-2, vanilla LLaMA, and their corresponding pondering versions, referred to as PonderingGPT and PonderingLLaMA, respectively.

Experimental Settings. We train all models from scratch on a subset of the Pile dataset with the same tokenizer. The model sizes range from 405M to 1.4B parameters, with a context length fixed at 2048 tokens. The number of training tokens for each model is chosen to approximately match the scaling laws proposed by Chinchilla (Hoffmann et al., 2022b). The detailed configurations of the models, including sizes, learning rates, and batch sizes, are specified in Table 1. These hyperparameters primarily follow the GPT-3 specifications (Brown et al., 2020). However, unlike GPT-3, we untie the input and output embedding matrices.

Table 1: Model sizes and hyperparameters for scaling experiments.

| params | $\mathrm{n_{layers}}$ | $\mathrm{d_{model}}$ | $\mathrm{n_{heads}}$ | learning rate | batch size (in tokens) | tokens |
| --- | --- | --- | --- | --- | --- | --- |
| 405M | 24 | 1024 | 16 | 3e-4 | 0.5M | 7B |
| 834M | 24 | 1536 | 24 | 2.5e-4 | 0.5M | 15B |
| 1.4B | 24 | 2048 | 32 | 2e-4 | 0.5M | 26B |

<details> <summary>x2.png Details</summary>

![fdb1525a](/v1/image/fdb1525a89625948f7e0fb0b5d60fa2d880e89a702241cc05099898e9becbaae)

Two-panel line plot. Top panel: loss (y-axis, roughly 2.1 to 2.6) against model size and training tokens (x-axis ticks 405M\*7B, 834M\*15B, 1.4B\*26B) for GPT, LLaMA, Pondering GPT, and Pondering LLaMA, with dashed fitted curves for GPT and LLaMA and the annotations "2.01x" and "2.26x". Bottom panel: ΔLoss (y-axis, roughly 0.05 to 0.15) over the same x-axis.

</details>

Figure 2: (top) Scaling curves of GPT-2, LLaMA, and their corresponding pondering models. (bottom) Relative improvements of RoPE + RMSNorm + SwiGLU MLP and Pondering.

Results. Figure 2 (top) illustrates the scaling curves of the validation loss on the Pile subset for vanilla GPT-2, vanilla LLaMA, and their pondering counterparts. Our results show that incorporating pondering significantly improves the performance for both GPT-2 and LLaMA across the entire size range tested (405M to 1.4B parameters). For instance, by fitting scaling curves, we find that PonderingGPT-834M achieves a loss comparable to a vanilla GPT-2 model trained with approximately $2.01\times$ the parameters $\times$ tokens, while PonderingLLaMA-834M matches the loss of vanilla LLaMA trained with roughly $2.26\times$ the parameters $\times$ tokens.

Furthermore, Figure 2 (bottom) shows the relative validation loss improvements (denoted as $\Delta$ loss) of the architectural modifications inherent in LLaMA, namely rotary positional embeddings (RoPE) (Su et al., 2024), RMSNorm (Zhang & Sennrich, 2019), and SwiGLU activation, in comparison to GPT-2, alongside the relative improvements obtained by our pondering method for both architectures. We observe that while the relative improvement due to RoPE, RMSNorm, and SwiGLU MLP diminishes as model size and data scale increase (reaching approximately 0.02 for the 1.4B parameter model), the relative improvement provided by pondering consistently remains around 0.1 across the entire tested scale.
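The "$2.01\times$ parameters $\times$ tokens" style of comparison in the Results paragraph can be illustrated with a small sketch: fit a curve to the vanilla models' losses, then ask how large a budget that curve would need to reach the pondering model's loss. The power-law form $L = a\,C^{-b}$, the function names, and all loss values below are assumptions for illustration only; the text does not state which functional form was fitted, and the losses are synthetic rather than taken from the paper. Only the three budgets come from Table 1.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def equivalent_compute_multiplier(a, b, pondering_loss, pondering_compute):
    """Budget the fitted vanilla curve needs to reach pondering_loss,
    expressed as a multiple of the pondering model's own budget."""
    matched_compute = (a / pondering_loss) ** (1.0 / b)
    return matched_compute / pondering_compute


# Budgets (parameters x tokens) from Table 1.
budgets = [405e6 * 7e9, 834e6 * 15e9, 1.4e9 * 26e9]

# Synthetic vanilla losses generated from hypothetical coefficients (not paper data).
TRUE_A, TRUE_B = 12.0, 0.04
vanilla_loss = [TRUE_A * c ** (-TRUE_B) for c in budgets]
a, b = fit_power_law(budgets, vanilla_loss)

# Hypothetical pondering model at the middle budget whose loss equals what the
# vanilla curve would reach with twice that budget.
pondering_loss = TRUE_A * (2.0 * budgets[1]) ** (-TRUE_B)
multiplier = equivalent_compute_multiplier(a, b, pondering_loss, budgets[1])
print(f"equivalent budget multiplier: {multiplier:.2f}x")
```

With real measurements, `vanilla_loss` and `pondering_loss` would be replaced by observed validation losses, and the fitted form might need an additional irreducible-loss term.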
<details> <summary>x3.png Details</summary>

![93bc80d7](/v1/image/93bc80d73042c4f8d4022d610642293631d8217ec0337dfeda8ff602fa16a4aa)

Line plot of loss (y-axis, roughly 1.8 to 2.5) against #Parameters on a log scale (x-axis ticks 200M, 500M, 1B, 2B, 3B, 7B) for Pythia (blue) and PonderingPythia (green), with an arrow annotated "37% params".

</details>

<details> <summary>x4.png Details</summary>

![23892500](/v1/image/238925007929e6c7ff2cad095ccf4eec2faff7c4a87945e0bc93ed6fea77d528)

Line plot of loss (y-axis, roughly 1.95 to 2.25) against #Training tokens on a log scale (x-axis ticks 60B, 100B, 200B, 280B) for Pythia-1B (blue) and PonderingPythia-1B (green), with a marker annotated "41% training tokens".

</details>

<details> <summary>x5.png Details</summary>

![5b9f9fa2](/v1/image/5b9f9fa26a6cf584e9c4afc33227847f5c24183f5da8fbaa2d93ddf25303d3be)

Line plot of loss (y-axis, roughly 2.55 to 2.8) against FLOPs in EFLOPs (x-axis, 100 to 400) for Vanilla Pythia-70M (blue) and PonderingPythia-70M (green).

</details>

Figure 3: Language modeling loss when scaling parameter count and training tokens. PonderingPythia achieves comparable performance to the official Pythia-6.9B while using only 37% of the parameters, and matches the performance of the official Pythia-1B with just 41% of the training tokens.

Table 2: Language Modeling Perplexity (ppl). Lower is better. The values in parentheses indicate the improvement ( $\downarrow$ ) compared to the corresponding Pythia model. For simplicity, we refer to PonderingPythia as Ponder.

| Model | Pile | Wikitext | Lambada OpenAI | Lambada Standard |
| --- | --- | --- | --- | --- |
| Pythia-70M | $18.36$ | $57.03$ | $141.94$ | $967.72$ |
| Ponder-70M | $12.68$ ( $\downarrow$ 5.68) | $34.07$ ( $\downarrow$ 22.96) | $39.19$ ( $\downarrow$ 102.75) | $145.51$ ( $\downarrow$ 822.21) |
| Pythia-160M | $12.55$ | $33.43$ | $38.15$ | $186.51$ |
| Ponder-160M | $9.87$ ( $\downarrow$ 2.68) | $23.69$ ( $\downarrow$ 9.74) | $15.96$ ( $\downarrow$ 22.19) | $41.24$ ( $\downarrow$ 145.27) |
| Pythia-410M | $8.85$ | $20.11$ | $10.86$ | $31.54$ |
| Ponder-410M | $8.15$ ( $\downarrow$ 0.70) | $17.78$ ( $\downarrow$ 2.33) | $8.55$ ( $\downarrow$ 2.31) | $17.48$ ( $\downarrow$ 14.06) |
| Pythia-1B | $7.85$ | $16.45$ | $7.91$ | $17.41$ |
| Ponder-1B | $7.15$ ( $\downarrow$ 0.70) | $14.61$ ( $\downarrow$ 1.84) | $6.02$ ( $\downarrow$ 1.89) | $10.64$ ( $\downarrow$ 6.77) |
| Pythia-1.4B | $7.24$ | $14.72$ | $6.08$ | $10.87$ |
| Ponder-1.4B | $6.82$ ( $\downarrow$ 0.42) | $13.59$ ( $\downarrow$ 1.13) | $5.37$ ( $\downarrow$ 0.71) | $8.88$ ( $\downarrow$ 1.99) |
| Pythia-2.8B | $6.63$ | $12.69$ | $5.04$ | $8.23$ |
| Ponder-2.8B | $6.21$ ( $\downarrow$ 0.42) | $11.74$ ( $\downarrow$ 0.95) | $4.15$ ( $\downarrow$ 0.89) | $6.17$ ( $\downarrow$ 2.06) |
| Pythia-6.9B | $6.29$ | $11.41$ | $4.45$ | $6.95$ |

### 3.2 Large-Scale Pretraining on Pile

We further validate the effectiveness of our pondering method by conducting large-scale pretraining experiments on the entire Pile dataset (300B tokens) (Gao et al., 2020). We train a new model, named PonderingPythia, from scratch using exactly the same architectural components (parallel attention and MLP layers, rotary embeddings with 1/4 head dimensions), the same tokenizer, and the same training hyperparameters (optimizer settings, learning rate schedule, batch size, and context length) as the original Pythia models. We then compare PonderingPythia models’ scaling curves and language modeling capabilities with the official Pythia model suite.
3.2.1 Scaling Curves

We plot the scaling curves of PonderingPythia and the official Pythia models along three axes: parameter count, training tokens, and training FLOPs. As depicted in Figure 3 (left), the fitted curves show that a 2.5B-parameter PonderingPythia model achieves a validation loss comparable to the 6.9B-parameter official Pythia model, while requiring only 37% of the parameters. In Figure 3 (middle), we evaluate the 1B PonderingPythia and official Pythia models at intervals of 20B training tokens. The fitted curves indicate that PonderingPythia trained with 115B tokens achieves performance comparable to the official Pythia model trained with 280B tokens, consuming only 41% of the training tokens. In Figure 3 (right), we report the language modeling loss of vanilla Pythia-70M and PonderingPythia-70M under the same computational budget during pretraining. (Footnote: to match the training FLOPs of vanilla Pythia-70M with PonderingPythia-70M, we trained the vanilla model for 4 epochs.) PonderingPythia-70M consistently outperforms vanilla Pythia-70M when trained with the same number of FLOPs.

3.2.2 Language Modeling Ability Evaluation

We measure perplexity on several language modeling datasets to reflect general language modeling capabilities. Specifically, we report perplexity scores on the Pile validation set, Wikitext, and the Lambada dataset (both OpenAI and standard versions), detailed in Table 2. The results demonstrate significant perplexity improvements across all datasets and model sizes. Notably, the perplexity achieved by the PonderingPythia-2.8B model is even better than that of the official Pythia-6.9B model, which is nearly 2.5 times larger.

3.3 Downstream Tasks Evaluation

We use the PonderingPythia models pretrained in the previous subsection to conduct downstream task evaluations.
<details> <summary>x6.png Details</summary> ![8944392f](/v1/image/8944392f2b3dd38fcff3ed61dceb4f725d1c26d25e922a471d16963a09615a2d) ### Visual Description ## Radar Chart: Performance Comparison of Pythia-1B and PonderingPythia-1B ### Overview The image is a radar chart comparing the performance of two language models, **Pythia-1B** (blue line) and **PonderingPythia-1B** (green line), across eight categories: **Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities**. The chart uses a circular layout with radial axes scaled from 0 to 3. The legend at the bottom identifies the models by color, and average scores are provided: Pythia-1B (1.89) and PonderingPythia-1B (2.22). --- ### Components/Axes - **Axes (Categories)**: - **Writing** (top) - **Roleplay** (top-right) - **Reasoning** (right) - **Math** (bottom-right) - **Coding** (bottom) - **Extraction** (bottom-left) - **STEM** (left) - **Humanities** (top-left) - **Legend**: - **Pythia-1B**: Blue line (avg score = 1.89) - **PonderingPythia-1B**: Green line (avg score = 2.22) - **Scale**: Radial axes range from 0 to 3, with increments of 0.5. --- ### Detailed Analysis #### Pythia-1B (Blue Line) - **Writing**: ~2.0 - **Roleplay**: ~2.0 - **Reasoning**: ~1.5 - **Math**: ~1.0 - **Coding**: ~1.0 - **Extraction**: ~1.5 - **STEM**: ~2.0 - **Humanities**: ~2.0 #### PonderingPythia-1B (Green Line) - **Writing**: ~3.0 - **Roleplay**: ~2.5 - **Reasoning**: ~1.5 - **Math**: ~2.0 - **Coding**: ~2.0 - **Extraction**: ~2.5 - **STEM**: ~3.0 - **Humanities**: ~2.5 --- ### Key Observations 1. **PonderingPythia-1B outperforms Pythia-1B in most categories**: - **STEM** and **Writing** show the largest gaps (3.0 vs. 2.0 for both). - **Extraction** (2.5 vs. 1.5) and **Humanities** (2.5 vs. 2.0) also favor PonderingPythia-1B. 2. **Shared weaknesses**: - Both models score lowest in **Math** and **Coding** (1.0 for Pythia-1B, 2.0 for PonderingPythia-1B). 3. 
**Reasoning**: Both models score similarly (~1.5), suggesting this is a shared limitation. 4. **Average scores**: PonderingPythia-1B’s higher average (2.22 vs. 1.89) confirms its overall superiority. --- ### Interpretation - **Performance Trends**: PonderingPythia-1B demonstrates stronger capabilities in **STEM** and **Writing**, likely due to enhanced reasoning or domain-specific training. Its performance in **Math** and **Coding** improves moderately compared to Pythia-1B, but both models struggle in these areas, indicating a potential gap in computational task handling. - **Reasoning as a Bottleneck**: The similar low scores in **Reasoning** (~1.5 for both) suggest this is a critical area for improvement across models. - **Implications**: PonderingPythia-1B’s higher average score (2.22) positions it as a more versatile model, particularly for tasks requiring **STEM** expertise or **Writing** proficiency. However, both models require optimization in **Math** and **Coding** to address fundamental weaknesses. --- ### Spatial Grounding & Verification - **Legend Position**: Bottom-center, clearly associating colors with models. - **Color Consistency**: Blue (Pythia-1B) and green (PonderingPythia-1B) lines match the legend without ambiguity. - **Axis Placement**: Categories are evenly spaced around the radar, with **Writing** at the top and **Math** at the bottom-right. ### Uncertainties - Exact values are approximate due to the absence of gridlines or numerical labels on the radial axes. - The chart does not specify confidence intervals or error margins for the scores. 
</details> <details> <summary>x7.png Details</summary> ![95b47403](/v1/image/95b47403a777996f727d39844e0d54bc5c7d8341e7cc38dd9a4612433c01ef0a) ### Visual Description ## Radar Chart: Comparative Performance of Pythia-1.4B and PonderingPythia-1.4B ### Overview The image is a radar chart comparing the performance of two language models, Pythia-1.4B (blue line) and PonderingPythia-1.4B (teal line), across eight categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities. The chart uses a circular layout with radial axes scaled from 0 to 3.5. Average scores are provided: Pythia-1.4B (2.2) and PonderingPythia-1.4B (2.75). --- ### Components/Axes - **Categories (Axes):** - Writing (top) - Roleplay (top-right) - Reasoning (right) - Math (bottom-right) - Coding (bottom) - Extraction (bottom-left) - STEM (left) - Humanities (top-left) - **Legend:** - **Blue line:** Pythia-1.4B (avg score = 2.2) - **Teal line:** PonderingPythia-1.4B (avg score = 2.75) - **Axis Markers:** - Radial scale increments: 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5 --- ### Detailed Analysis 1. **PonderingPythia-1.4B (Teal Line):** - **Highest Scores:** - Roleplay (~3.5) - Humanities (~3.2) - Writing (~3.0) - **Lowest Score:** - Coding (~1.2) - **Trend:** Dominates in creative/linguistic tasks (Roleplay, Humanities, Writing) but underperforms in technical tasks (Coding, Math). 2. **Pythia-1.4B (Blue Line):** - **Highest Score:** - Reasoning (~2.8) - **Lowest Score:** - Math (~1.5) - **Trend:** Stronger in analytical tasks (Reasoning) but weaker in Math and Coding compared to PonderingPythia. 3. **Shared Patterns:** - Both models score highest in Reasoning and Humanities. - Both struggle with Math and Coding, though PonderingPythia performs slightly better in Math. --- ### Key Observations - **Performance Gap:** PonderingPythia-1.4B consistently outperforms Pythia-1.4B across most categories, with an average score 0.55 points higher. 
- **Outliers:** - PonderingPythia’s extreme strength in Roleplay (~3.5) vs. Pythia’s moderate score (~2.5). - Pythia’s slight edge in Reasoning (~2.8 vs. ~2.5). - **Weaknesses:** Both models score below 2.0 in Math and Coding, suggesting systemic limitations in technical reasoning. --- ### Interpretation The data suggests that **PonderingPythia-1.4B** is optimized for creative and linguistic tasks (e.g., Roleplay, Humanities), while **Pythia-1.4B** excels in analytical reasoning. However, both models share critical weaknesses in Math and Coding, indicating potential gaps in training data or architectural design for technical problem-solving. The disparity in Roleplay performance highlights PonderingPythia’s specialization in narrative generation, whereas Pythia’s balanced but lower scores suggest a more generalized but less specialized capability. **Critical Insight:** The chart underscores the trade-off between specialization (PonderingPythia) and generalization (Pythia), with neither model achieving high performance in all domains. This aligns with Peircean principles of abductive reasoning: the models’ strengths and weaknesses reflect their design priorities, leaving room for hybrid approaches to address systemic gaps. </details>

Figure 4: Instruction-following abilities evaluated on MT-Bench. PonderingPythia-1B and 1.4B consistently outperform their corresponding official Pythia models across all subtasks.

3.3.1 General Downstream Tasks

We consider various widely used benchmarks, including the tasks originally used by Pythia: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), ARC-Easy and ARC-Challenge (Clark et al., 2018), and SciQ (Welbl et al., 2017). Additionally, we include HellaSwag (Zellers et al., 2019) for commonsense reasoning and RACE (Lai et al., 2017) for reading comprehension. We evaluate both zero-shot and five-shot learning performance using the LM evaluation harness (Gao et al., 2023).
Detailed results are shown in Table 3. Across all evaluated model sizes, PonderingPythia consistently and significantly outperforms the official Pythia models, as well as comparable OPT, GPT-Neo, and Bloom models. Remarkably, with only 1/10 of the training data (300B tokens) and fewer parameters (1B vs. 1.1B), our PonderingPythia-1B achieves results comparable to, or even surpassing, TinyLlama, which uses a more advanced LLaMA architecture and 3T tokens. PonderingPythia-2.8B also surpasses Pythia-6.9B, which is nearly 2.5 times its size.

Table 3: Five-shot and zero-shot evaluations on downstream NLP tasks. All pretrained model weights used for comparison are obtained from their official repositories. We refer to PonderingPythia as Ponder for simplicity. $\Delta$ acc is compared to the official Pythia models.

| Model (#training tokens) | Lambada OpenAI | ARC-E | Lambada Standard | ARC-C | WinoGrande | PIQA | HellaSwag | SciQ | RACE | Avg acc / $\Delta$ acc $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5-shot | | | | | | | | | | |
| Pythia-70M (300B) | 11.9 | 36.7 | 9.2 | 17.1 | 50.5 | 58.7 | 26.7 | 57.8 | 25.1 | 32.6 |
| Ponder-70M (300B) | 28.4 | 44.6 | 21.1 | 18.6 | 52.2 | 62.4 | 28.4 | 75.7 | 26.9 | 39.8 / +7.2 |
| Pythia-160M (300B) | 24.9 | 44.7 | 19.0 | 18.4 | 50.4 | 63.5 | 28.6 | 76.4 | 27.8 | 39.3 |
| OPT-125M (300B) | 29.8 | 42.3 | 26.4 | 19.7 | 51.3 | 63.4 | 29.0 | 78.8 | 30.1 | 41.2 |
| GPTneo-125M (300B) | 29.3 | 43.9 | 23.0 | 19.1 | 51.9 | 63.9 | 28.5 | 79.9 | 27.9 | 40.8 |
| Ponder-160M (300B) | 39.9 | 50.1 | 31.2 | 22.1 | 51.2 | 65.8 | 31.9 | 85.4 | 30.1 | 45.3 / +6.0 |
| Pythia-410M (300B) | 43.9 | 54.7 | 32.8 | 22.3 | 53.4 | 68.0 | 33.8 | 88.9 | 30.4 | 47.6 |
| OPT-350M (300B) | 38.3 | 45.4 | 32.1 | 20.5 | 53.0 | 65.8 | 31.9 | 85.7 | 29.5 | 44.7 |
| Bloom-560M (366B) | 29.4 | 50.2 | 29.7 | 21.9 | 52.7 | 64.2 | 31.4 | 88.0 | 30.0 | 44.2 |
| Ponder-410M (300B) | 48.9 | 58.7 | 43.7 | 26.1 | 54.0 | 70.5 | 37.3 | 91.0 | 32.4 | 51.4 / +3.8 |
| Pythia-1B (300B) | 48.3 | 58.6 | 35.8 | 25.4 | 52.8 | 71.3 | 37.7 | 91.6 | 31.7 | 50.4 |
| OPT-1.3B (300B) | 54.0 | 60.4 | 49.0 | 26.9 | 56.9 | 72.4 | 38.5 | 91.8 | 35.4 | 52.7 |
| GPTneo-1.3B (300B) | 49.9 | 59.9 | 44.5 | 25.9 | 56.6 | 71.6 | 38.6 | 86.1 | 34.5 | 51.9 |
| Bloom-1.1B (366B) | 36.3 | 54.9 | 37.4 | 24.9 | 53.4 | 67.6 | 34.8 | 88.7 | 33.0 | 47.9 |
| Ponder-1B (300B) | 57.7 | 63.2 | 52.5 | 28.6 | 58.6 | 73.3 | 41.9 | 93.4 | 36.3 | 56.2 / +5.8 |
| Pythia-1.4B (300B) | 54.5 | 63.1 | 44.5 | 28.8 | 57.1 | 71.0 | 40.5 | 92.4 | 34.6 | 54.1 |
| Bloom-1.7B (366B) | 42.5 | 58.8 | 41.5 | 26.2 | 57.7 | 68.7 | 37.6 | 91.9 | 33.5 | 50.9 |
| Ponder-1.4B (300B) | 59.2 | 67.5 | 49.9 | 32.4 | 60.4 | 73.5 | 44.2 | 94.3 | 37.1 | 57.6 / +3.5 |
| Tinyllama-1.1B (3T) | 53.8 | 64.8 | 45.0 | 31.1 | 59.4 | 73.8 | 44.9 | 94.0 | 36.4 | 55.9 |
| OPT-2.7B (300B) | 60.2 | 64.7 | 55.0 | 29.8 | 62.2 | 75.1 | 46.1 | 93.0 | 37.5 | 58.2 |
| GPTneo-2.7B (300B) | 56.0 | 64.0 | 51.6 | 30.1 | 59.6 | 73.9 | 42.4 | 93.3 | 35.5 | 56.3 |
| Bloom-3B (366B) | 46.2 | 63.8 | 47.1 | 31.7 | 57.8 | 70.8 | 41.4 | 93.4 | 34.6 | 54.1 |
| Pythia-2.8B (300B) | 59.0 | 67.0 | 50.7 | 31.0 | 61.1 | 74.4 | 45.3 | 93.7 | 35.9 | 57.6 |
| Pythia-6.9B (300B) | 62.5 | 69.6 | 54.8 | 35.6 | 62.9 | 76.6 | 48.0 | 94.6 | 36.7 | 60.1 |
| Ponder-2.8B (300B) | 64.2 | 70.6 | 58.7 | 35.8 | 65.3 | 76.7 | 49.0 | 94.3 | 39.0 | 61.5 / +3.9 |
| 0-shot | | | | | | | | | | |
| Pythia-70M (300B) | 18.7 | 36.7 | 13.5 | 18.8 | 51.1 | 59.9 | 26.6 | 59.9 | 23.9 | 34.3 |
| Ponder-70M (300B) | 33.5 | 42.5 | 24.1 | 19.1 | 51.1 | 61.5 | 28.3 | 70.3 | 27.8 | 39.8 / +5.5 |
| Pythia-160M (300B) | 33.0 | 43.8 | 21.4 | 18.9 | 52.0 | 62.2 | 28.4 | 73.7 | 27.9 | 40.1 |
| OPT-125M (300B) | 37.9 | 43.5 | 29.0 | 19.0 | 50.3 | 62.8 | 29.2 | 75.2 | 30.1 | 41.9 |
| GPTneo-125M (300B) | 37.5 | 43.9 | 26.0 | 19.1 | 50.4 | 63.0 | 28.7 | 76.5 | 27.5 | 41.4 |
| Ponder-160M (300B) | 44.3 | 46.8 | 33.8 | 20.5 | 52.3 | 63.9 | 31.9 | 76.8 | 29.6 | 44.4 / +4.3 |
| Pythia-410M (300B) | 51.4 | 52.2 | 36.4 | 21.4 | 53.8 | 66.9 | 33.7 | 81.5 | 30.9 | 47.6 |
| OPT-350M (300B) | 45.2 | 44.0 | 35.8 | 20.7 | 52.3 | 64.5 | 32.0 | 74.9 | 29.8 | 44.4 |
| Bloom-560M (366B) | 34.3 | 47.5 | 33.3 | 22.4 | 51.5 | 63.8 | 31.5 | 80.3 | 30.5 | 43.9 |
| Ponder-410M (300B) | 56.9 | 51.9 | 45.3 | 22.6 | 56.0 | 68.7 | 37.0 | 81.4 | 33.8 | 50.4 / +2.8 |
| Pythia-1B (300B) | 55.9 | 56.8 | 42.0 | 24.2 | 52.5 | 70.5 | 37.7 | 83.3 | 32.7 | 50.6 |
| OPT-1.3B (300B) | 57.9 | 57.1 | 52.5 | 23.4 | 59.7 | 71.8 | 41.6 | 84.3 | 34.3 | 53.6 |
| GPTneo-1.3B (300B) | 57.1 | 56.2 | 45.3 | 23.2 | 55.0 | 71.2 | 38.6 | 86.1 | 34.5 | 51.9 |
| Bloom-1.1B (366B) | 42.6 | 51.5 | 42.9 | 23.6 | 54.9 | 67.3 | 34.5 | 83.6 | 32.6 | 48.2 |
| Ponder-1B (300B) | 62.3 | 60.5 | 51.9 | 27.0 | 56.5 | 72.2 | 41.8 | 87.4 | 35.4 | 55.0 / +4.4 |
| Pythia-1.4B (300B) | 61.6 | 60.4 | 49.7 | 25.9 | 57.5 | 70.8 | 40.4 | 86.4 | 34.1 | 54.1 |
| Bloom-1.7B (366B) | 46.2 | 56.4 | 44.5 | 23.7 | 56.8 | 68.5 | 37.5 | 85.0 | 33.2 | 50.2 |
| Ponder-1.4B (300B) | 65.2 | 62.0 | 53.8 | 27.0 | 60.1 | 72.6 | 44.0 | 89.0 | 35.2 | 56.5 / +2.4 |
| Tinyllama-1.1B (3T) | 58.8 | 60.3 | 49.3 | 28.0 | 59.0 | 73.3 | 45.0 | 88.9 | 36.4 | 55.4 |
| OPT-2.7B (300B) | 63.5 | 60.8 | 56.0 | 26.8 | 61.2 | 73.8 | 45.9 | 85.8 | 36.2 | 56.7 |
| GPTneo-2.7B (300B) | 62.1 | 61.2 | 51.6 | 27.5 | 57.8 | 72.3 | 42.7 | 89.3 | 35.1 | 55.5 |
| Bloom-3B (366B) | 51.7 | 59.4 | 50.9 | 28.0 | 58.7 | 70.8 | 41.4 | 88.8 | 35.2 | 53.9 |
| Pythia-2.8B (300B) | 64.6 | 64.4 | 54.3 | 29.5 | 60.2 | 73.8 | 45.4 | 88.5 | 34.9 | 57.3 |
| Pythia-6.9B (300B) | 67.2 | 67.3 | 55.9 | 31.4 | 61.0 | 75.2 | 48.1 | 89.3 | 36.9 | 59.1 |
| Ponder-2.8B (300B) | 68.9 | 66.5 | 60.8 | 32.5 | 63.6 | 75.0 | 48.6 | 91.0 | 36.5 | 60.4 / +3.1 |

3.3.2 Instruction-following Ability Evaluation

To assess the instruction-following capability of our model, we further fine-tuned PonderingPythia-1B and 1.4B, as well as the corresponding
official Pythia models, on the Alpaca dataset using the same settings (Taori et al., 2023). The fine-tuned models were evaluated with MT-Bench (Zheng et al., 2023), a popular multi-turn question benchmark. The experimental results are shown in Figure 4. As illustrated, both PonderingPythia-1B and 1.4B consistently outperform their official Pythia counterparts across all subtasks, achieving average improvements of 0.33 and 0.55, respectively. The marginal gains on Coding and Math tasks may be attributed to the limited coding and mathematical abilities of the Pythia models. <details> <summary>x8.png Details</summary> ![49b45459](/v1/image/49b454596c1266c394ae59e8d8a54ed58b20bb9caa49278e5b9c780b05f5ff07) ### Visual Description ## Line Graph: Loss vs. Pondering Steps ### Overview The image depicts a line graph illustrating the relationship between "Pondering Steps" (x-axis) and "Loss" (y-axis). The graph shows a downward trend in loss as the number of pondering steps increases, starting from a baseline of 0 steps. ### Components/Axes - **X-axis (Horizontal)**: - Label: "Pondering Steps (0 = Baseline)" - Scale: 0 to 10, with ticks at 0, 1, 3, 5, and 10. - Units: Discrete steps (no explicit units provided). - **Y-axis (Vertical)**: - Label: "Loss" - Scale: 2.58 to 2.72, with ticks at 2.58, 2.60, 2.62, 2.64, 2.66, 2.68, 2.70, 2.72. - Units: Numerical values (likely a metric like error or cost). - **Legend**: Not visible in the image. - **Line and Data Points**: - A single green line connects five data points. - Data points are marked with green dots. ### Detailed Analysis - **Data Points**: - (0, 2.71): Baseline loss at 0 steps. - (1, 2.68): Loss decreases by 0.03 after 1 step. - (3, 2.63): Loss decreases by 0.05 after 3 steps. - (5, 2.61): Loss decreases by 0.02 after 5 steps. - (10, 2.59): Loss decreases by 0.02 after 10 steps. - **Trend**: - The line slopes downward consistently from left to right, indicating a negative correlation between pondering steps and loss. 
- The rate of decrease slows over time (e.g., 0.03 drop between 0–1 steps vs. 0.02 drops between 5–10 steps). ### Key Observations 1. **Steady Decline**: Loss decreases monotonically as pondering steps increase. 2. **Diminishing Returns**: The rate of loss reduction slows after the first few steps. 3. **Baseline Significance**: The highest loss (2.71) occurs at 0 steps, suggesting the baseline is suboptimal. ### Interpretation The graph demonstrates that increasing the number of pondering steps reduces loss, implying that deeper analysis or reflection improves outcomes. The baseline (0 steps) represents the worst performance, while the trend suggests that even small increments in pondering steps yield measurable improvements. The diminishing returns after 5 steps may indicate a saturation point or diminishing marginal utility of additional steps. This could inform strategies for optimizing decision-making processes or resource allocation in scenarios requiring iterative analysis. **Note**: No legend is present, but the green line and data points are consistent in color. All axis labels and data points are explicitly extracted from the image. </details> Figure 5: Increasing the number of pondering steps consistently reduces loss on the Pile validation set. 3.4 Impact of Pondering Steps To further investigate the effect of pondering steps on model performance, we trained several 160M-parameter Pythia models from scratch with different numbers of pondering steps: 0 (baseline), 1, 3, 5, and 10. These models were pretrained on a 30B-token subset of the Pile dataset. The results, presented in Figure 5, clearly show that increasing the number of pondering steps consistently reduces the language modeling loss on the Pile validation set. This demonstrates the significant potential and scalability of our method, suggesting that model performance can be further improved by increasing the depth of pondering steps. 
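To make the role of the step count concrete, the following toy sketch runs the pondering loop described in the abstract: each pondering step turns the predicted distribution into a probability-weighted sum of all token embeddings and feeds it back for another forward pass. Everything here is illustrative. `toy_forward` is a hypothetical stand-in for a full language model, the embedding values are made up, and adding the pondering embedding to the original input embedding is an assumption about how the feedback is wired, not a detail confirmed in this section.

```python
import math

# Toy embedding table: 4 "tokens", 3 dimensions (illustrative values only).
EMBED = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_forward(x):
    """Hypothetical stand-in for an LM forward pass: tied-embedding logits."""
    return [sum(xi * ei for xi, ei in zip(x, row)) for row in EMBED]

def pondering_embedding(probs):
    """Probability-weighted sum of all token embeddings."""
    dim = len(EMBED[0])
    return [sum(p * row[d] for p, row in zip(probs, EMBED)) for d in range(dim)]

def ponder(x0, steps):
    """Run `steps` pondering passes, then one final pass for the prediction."""
    x, passes = list(x0), 0
    for _ in range(steps):
        probs = softmax(toy_forward(x))
        passes += 1
        e = pondering_embedding(probs)
        # Assumption: feed back by adding to the original input embedding.
        x = [a + b for a, b in zip(x0, e)]
    final = softmax(toy_forward(x))
    passes += 1
    return final, passes

for k in (0, 1, 3):
    probs, passes = ponder(EMBED[0], k)
    print(k, passes, [round(p, 3) for p in probs])
```

In this sketch a model with k pondering steps performs k + 1 forward passes per token, so the gains in Figure 5 come with a per-token compute cost that grows with the number of steps.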
4 Limitations and Future Work

4.1 Limitations

There are two limitations to our work. Firstly, due to computational constraints, we only scaled our method up to a 2.8B-parameter model trained on 300B tokens from the Pile dataset. It would be interesting to extend our approach to larger models and larger-scale datasets in the future. Secondly, although our results demonstrate that the proposed method scales better than vanilla models under the same training FLOPs (Figure 3), it also introduces additional inference overhead (increasing roughly linearly with the number of pondering steps), similar to test-time scaling methods.

4.2 Future Work

There are several promising directions for future work. Firstly, our proposed method is not limited to decoder-only architectures or language modeling; it has the potential to be applied to a wide range of model types and domains. For example, it could be adapted to state-space models such as Mamba (Gu & Dao, 2023), encoder models, or RWKV (Peng et al., 2023), as well as extended to other areas. Another promising direction is the introduction of token-adaptive pondering, which may significantly reduce computation and further enhance our method. It would also be interesting to investigate the interpretability of the pondering process, such as how the model "thinks" during pondering, the semantics of the pondering embedding, and whether the model learns to reflect on its predictions through pondering. Finally, exploring the combination of our method with orthogonal approaches such as CoT and other test-time scaling methods could also be an interesting direction.

5 Related Work

5.1 Test-Time Compute Scaling

Scaling test-time computation has emerged as an influential approach, providing an effective alternative to merely increasing model parameters (Snell et al., 2024).
Current methods for enhancing test-time computation primarily fall into two categories: parallel scaling and sequential scaling (Zeng et al., 2024; Muennighoff et al., 2025). Parallel scaling methods generate multiple candidate solutions simultaneously and select the best one according to some evaluation criterion. Prominent examples of this paradigm include Best-of-N search (Cobbe et al., 2021; Sun et al., 2024; Gui et al., 2024; Amini et al., 2024; Sessa et al., 2024) and Majority Vote (Wang et al., 2022). The primary difference among these parallel approaches lies in their strategy for selecting the final outcome. Nevertheless, parallel scaling approaches often have difficulty accurately identifying the best candidate among the generated possibilities (Stroebl et al., 2024; Hassid et al., 2024). Sequential scaling methods, conversely, focus on progressively refining the model’s reasoning through multiple iterative steps. Representative methods include CoT (Wei et al., 2022; Nye et al., 2021) and iterative rethinking and revision (Huang et al., 2022; Min et al., 2024; Madaan et al., 2024; Wang et al., 2024b; Lee et al., 2025; Hou et al., 2025; Muennighoff et al., 2025; Li et al., 2025). Following the emergence of OpenAI o1 and DeepSeek R1, leveraging extensive chain-of-thought reasoning to scale test-time computation has become increasingly popular. Nonetheless, existing methods often come with limitations, including reliance on specially curated datasets (Allen-Zhu & Li, 2023), extended context windows (Zhu et al., 2025), and complex reinforcement learning strategies (Pang et al., 2025). In contrast, our method is learned through self-supervised pretraining alone and circumvents these limitations.

5.2 Latent Thinking in Language Models

Latent thinking in LMs typically refers to the hidden computational processes within transformer architectures (Yang et al., 2024; Biran et al., 2024).
Prior works exploring latent thinking can generally be divided into two categories based on their implementation methods:

Adding Additional Tokens: Wang et al. (2024a) proposed predicting a planning token as a discrete latent variable prior to generating reasoning steps. Similarly, Pfau et al. (2024) investigated filler tokens (e.g., “…”) and found them effective for parallelizable problems, although they also highlighted limitations compared to CoT methods, potentially constraining scalability to more complex reasoning tasks. Zhou et al. (2024) pretrained models by randomly inserting a learnable token into the training corpus, demonstrating improvements across various tasks. Furthermore, Quiet-STaR employs reinforcement learning to train models to generate intermediate rationales at each token, thereby enhancing reasoning capabilities (Zelikman et al., 2024).

Reusing Hidden States: Early Universal Transformers (Csordás et al., ) reused hidden states through a transition network, aiming to move the encoder-decoder transformer architecture toward Turing completeness. A more recent method (Geiping et al., 2025) iterates over multiple layers of a language model to refine intermediate hidden states, treating this process as a form of latent thinking. Giannou et al. (2023) proposed the looped transformer, which recycles output hidden states back into input embeddings for algorithmic tasks. Similarly, Hao et al. (2024) fine-tuned models to reason directly within continuous latent spaces, using final hidden states as embeddings to achieve reasoning without explicit CoT. In contrast, our method relies on pondering embeddings derived from the predicted probabilities of LMs rather than on hidden states.

6 Conclusion

In this paper, we introduce the pondering process into language models solely through self-supervised learning. Pondering LM can be naturally learned through pretraining on large-scale general corpora.
Our extensive experiments across three widely adopted architectures (GPT-2, Pythia, and LLaMA) highlight the effectiveness and generality of our proposed method. Notably, our PonderingPythia consistently outperforms the official Pythia model on language modeling tasks, scaling curves, downstream tasks, and instruction-following abilities when pretrained on the large-scale Pile dataset. As increasing the number of pondering steps further improves language model performance, we posit that our approach introduces a promising new dimension along which language model capabilities can be scaled.

References

- Allen-Zhu & Li (2023) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.2, knowledge manipulation. arXiv preprint arXiv:2309.14402, 2023.
- Allen-Zhu & Li (2024) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.3, knowledge capacity scaling laws. arXiv preprint arXiv:2404.05405, 2024.
- Amini et al. (2024) Amini, A., Vieira, T., Ash, E., and Cotterell, R. Variational best-of-n alignment. arXiv preprint arXiv:2407.06057, 2024.
- Biderman et al. (2023) Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Biran et al. (2024) Biran, E., Gottesman, D., Yang, S., Geva, M., and Globerson, A. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775, 2024.
- Bisk et al. (2020) Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.
- Black et al. (2021) Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Clark et al. (2018) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. 2021. URL https://arxiv.org/abs/2110.14168.
- (11) Csordás, R., Irie, K., Schmidhuber, J., Potts, C., and Manning, C. D. Moeut: Mixture-of-experts universal transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Guo, D., Yang, D., and et al., H. Z. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. 2025. URL https://arxiv.org/abs/2501.12948.
- Fedorenko et al. (2024) Fedorenko, E., Piantadosi, S. T., and Gibson, E. A. Language is primarily a tool for communication rather than thought. Nature, 630(8017):575–586, 2024.
- Gao et al. (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Gao et al. (2023) Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
- Geiping et al. (2025) Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., and Goldstein, T. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
- Giannou et al. (2023) Giannou, A., Rajput, S., Sohn, J.-y., Lee, K., Lee, J. D., and Papailiopoulos, D. Looped transformers as programmable computers. In International Conference on Machine Learning, pp. 11398–11442. PMLR, 2023.
- Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gui et al. (2024) Gui, L., Gârbacea, C., and Veitch, V. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024.
- Hackenburg et al. (2025) Hackenburg, K., Tappin, B. M., Röttger, P., Hale, S. A., Bright, J., and Margetts, H. Scaling language model size yields diminishing returns for single-message political persuasion. Proceedings of the National Academy of Sciences, 122(10):e2413443122, 2025.
- Hao et al. (2024) Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- Hassid et al. (2024) Hassid, M., Remez, T., Gehring, J., Schwartz, R., and Adi, Y. The larger the better? improved llm code-generation via budget reallocation. arXiv preprint arXiv:2404.00725, 2024.
- Hoffmann et al. (2022a) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022a.
- Hoffmann et al. (2022b) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022b.
- Hou et al. (2025) Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651, 2025.
- Huang et al. (2022) Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
- Jaech et al. (2024) Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Kaplan et al. (2020) Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Lai et al. (2017) Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
- Le Scao et al. (2023) Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. 2023.
- Lee et al. (2025) Lee, K.-H., Fischer, I., Wu, Y.-H., Marwood, D., Baluja, S., Schuurmans, D., and Chen, X. Evolving deeper llm thinking. arXiv preprint arXiv:2501.09891, 2025.
- Li et al. (2025) Li, D., Cao, S., Griggs, T., Liu, S., Mo, X., Patil, S. G., Zaharia, M., Gonzalez, J. E., and Stoica, I. Llms can easily learn to reason from demonstrations structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025.
- Li et al. (2023) Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K.-W., and Choi, Y. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2665–2679, Toronto, Canada, jul 2023. Association for Computational Linguistics.
- Li et al. (2024) Li, W., Liu, X., Li, Y., Jin, Y., Tian, H., Zhong, Z., Liu, G., Zhang, Y., and Chen, K. Understanding communication characteristics of distributed training. In Proceedings of the 8th Asia-Pacific Workshop on Networking, pp. 1–8, 2024.
- Liu et al. (2024) Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- Madaan et al. (2024) Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
- Min et al. (2024) Min, Y., Chen, Z., Jiang, J., Chen, J., Deng, J., Hu, Y., Tang, Y., Wang, J., Cheng, X., Song, H., et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024.
- Muennighoff et al. (2023) Muennighoff, N., Rush, A., Barak, B., Le Scao, T., Tazi, N., Piktus, A., Pyysalo, S., Wolf, T., and Raffel, C. A. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376, 2023.
- Muennighoff et al. (2025) Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the international conference for high performance computing, networking, storage and analysis, pp. 1–15, 2021.
- Nye et al. (2021) Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- Pang et al. (2025) Pang, B., Dong, H., Xu, J., Savarese, S., Zhou, Y., and Xiong, C. Bolt: Bootstrap long chain-of-thought in language models without distillation. arXiv preprint arXiv:2502.03860, 2025.
- Paperno et al. (2016) Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
- Pati et al. (2023) Pati, S., Aga, S., Islam, M., Jayasena, N., and Sinclair, M. D. Computation vs. communication scaling for future transformers on future hardware. arXiv preprint arXiv:2302.02825, 2023.
- Peng et al. (2023) Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
- Pfau et al. (2024) Pfau, J., Merrill, W., and Bowman, S. R. Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.
- Sakaguchi et al. (2021) Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y.
Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021. - Sessa et al. (2024) Sessa, P. G., Dadashi, R., Hussenot, L., Ferret, J., Vieillard, N., Ramé, A., Shariari, B., Perrin, S., Friesen, A., Cideron, G., et al. Bond: Aligning llms with best-of-n distillation. arXiv preprint arXiv:2407.14622, 2024. - Snell et al. (2024) Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024. - Stroebl et al. (2024) Stroebl, B., Kapoor, S., and Narayanan, A. Inference scaling flaws: The limits of llm resampling with imperfect verifiers. arXiv preprint arXiv:2411.17501, 2024. - Su et al. (2024) Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. - Sun et al. (2024) Sun, H., Haider, M., Zhang, R., Yang, H., Qiu, J., Yin, M., Wang, M., Bartlett, P., and Zanette, A. Fast best-of-n decoding via speculative rejection. arXiv preprint arXiv:2410.20290, 2024. - Taori et al. (2023) Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. - Villalobos et al. (2022) Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M., and Ho, A. Will we run out of data? an analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325, 1, 2022. - Wang et al. (2022) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. - Wang et al. (2024a) Wang, X., Caccia, L., Ostapenko, O., Yuan, X., Wang, W. Y., and Sordoni, A. Guiding language model reasoning with planning tokens, 2024a. 
URL https://arxiv.org/abs/2310.05707. - Wang et al. (2024b) Wang, Y., Wu, Y., Wei, Z., Jegelka, S., and Wang, Y. A theoretical understanding of self-correction through in-context alignment. arXiv preprint arXiv:2405.18634, 2024b. - Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. - Welbl et al. (2017) Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017. - Yang et al. (2024) Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024. - Yue et al. (2025) Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025. - Zelikman et al. (2024) Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629. - Zellers et al. (2019) Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019. - Zeng et al. (2024) Zeng, Z., Cheng, Q., Yin, Z., Wang, B., Li, S., Zhou, Y., Guo, Q., Huang, X., and Qiu, X. Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective. arXiv preprint arXiv:2412.14135, 2024. - Zhang & Sennrich (2019) Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. - Zhang et al. (2024) Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model. 
arXiv preprint arXiv:2401.02385, 2024. - Zhang et al. (2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. - Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023. - Zhou et al. (2024) Zhou, J., Pang, L., Shen, H., and Cheng, X. Think before you speak: Cultivating communication skills of large language models via inner monologue, 2024. URL https://arxiv.org/abs/2311.07445. - Zhu et al. (2025) Zhu, D., Wei, X., Zhao, G., Wu, W., Zou, H., Ran, J., Wang, X., Sun, L., Zhang, X., and Li, S. Chain-of-thought matters: Improving long-context language models with reasoning path supervision. arXiv preprint arXiv:2502.20790, 2025.
