# Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning
## Abstract
Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving this phenomenon and, based on these insights, proposes a more principled method that achieves similar performance gains. First, we find that the improvements arise from a redistribution of activations in the LLM's MLP layers: near-zero activations become less frequent while large-magnitude activations increase. This redistribution enhances the model's representational capacity by suppressing weak signals and promoting stronger, more informative ones. Building on this insight, we propose the Activation Redistribution Module (ARM), a lightweight inference-time technique that modifies activations directly without altering the input sequence. ARM adaptively identifies near-zero activations after the non-linear function and shifts them outward, implicitly reproducing the beneficial effects of meaningless tokens in a controlled manner. Extensive experiments across diverse benchmarks and model architectures show that ARM consistently improves LLM performance on reasoning tasks while requiring only a few lines of code to implement. Our findings deliver both a clear mechanistic explanation for the unexpected benefits of meaningless tokens and a simple yet effective technique that harnesses activation redistribution to further improve LLM performance. The code has been released at ARM-Meaningless-tokens.
## 1 Introduction
Large language models (LLMs) are known to be sensitive to subtle variations in their inputs, which makes it important to understand how tokens influence predictions (Guan et al., 2025; Errica et al., 2024; Zhuo et al., 2024). In this paper, we present a surprising and counterintuitive finding, which we name the meaningless-token effect: inserting long sequences of meaningless tokens, such as repeated punctuation or separators, into prompts can consistently improve the performance of LLMs, particularly on reasoning tasks. Contrary to the common intuition that long, irrelevant token sequences act as noise and are therefore useless or even harmful during inference (Jiang et al., 2024; Guan et al., 2025), our experiments reveal the opposite. When long sequences of meaningless tokens are appended before query prompts, models that previously struggled with certain problems can produce correct solutions, as illustrated in the left panel of Figure 1 (see more examples in Appendix J). This effect occurs consistently across tasks and models, pointing to a counterintuitive behavior of LLMs that warrants deeper investigation.
Figure 1: The left panel illustrates how the meaningless-token effect can improve model performance. The middle panel shows the changes occurring in the attention module after introducing meaningless tokens. The right panel depicts the redistribution of activations that results from adding these tokens.
This unexpected result raises fundamental questions about how LLMs process input and what aspects of their internal computation are being affected. Why should tokens that convey no meaning lead to measurable performance gains? Are they simply acting as noise, or do they restructure representations in a systematic way that supports better reasoning? To answer these questions, we move beyond surface-level observations and conduct a detailed investigation of the mechanisms behind this effect. Our analysis shows that the influence of meaningless tokens arises primarily in the first layer, and that their effect on meaningful tokens can be approximated as an affine transformation of the attention outputs. As shown in the middle schematic of Figure 1, this transformation shifts the distribution of activations in the MLP: the proportion of near-zero activations decreases, while more activations are pushed outward toward larger positive and negative values. The rightmost plot in Figure 1 visualizes this process. We hypothesize that this redistribution fosters richer exploration, enhancing reasoning performance, and we clarify the mechanism by decomposing the transformation into coefficient and bias terms. Our theoretical analysis shows how each component shapes activation variance and induces the observed distributional shift.
Building on these insights, we propose ARM (Activation Redistribution Module), a lightweight alternative to explicit meaningless-token insertion. ARM requires only a few lines of code modification and no additional training. It automatically identifies a proportion of near-zero activations after the non-linear function and shifts their values outward, yielding a smoother and less sparse activation distribution. In doing so, ARM reproduces the beneficial effects of meaningless tokens without altering the input sequence and consistently improves LLM performance on reasoning and related tasks. In summary, the key findings and contributions of our work are:
- We uncover a meaningless-token effect in LLMs: inserting meaningless tokens, far from being harmful, systematically improves reasoning in LLMs. This runs counter to the common assumption that such tokens only add noise.
- Through theoretical and empirical analysis, we show that these tokens induce an activation redistribution effect in the first-layer MLP, reducing near-zero activations and increasing variance.
- Building on this understanding, we present ARM, a lightweight inference-time instantiation to demonstrate that the phenomenon can be directly harnessed.
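To make the core ARM operation concrete, the shift it applies can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the threshold `tau` and shift `delta` are hypothetical hyperparameters, and the selection rule is simplified to a fixed magnitude threshold.

```python
def arm_shift(activations, tau=0.05, delta=0.1):
    """Shift near-zero post-activation values outward, away from zero.

    `tau` (the "near-zero" magnitude threshold) and `delta` (the outward
    shift) are illustrative values, not the paper's tuned hyperparameters.
    Exact zeros carry no sign, so this sketch leaves them untouched.
    """
    shifted = []
    for a in activations:
        if 0.0 < abs(a) < tau:
            # push the activation outward while preserving its sign
            shifted.append(a + delta if a > 0 else a - delta)
        else:
            shifted.append(a)
    return shifted
```

In a real model this operation would be applied to the MLP hidden states immediately after the SiLU/GeLU non-linearity (for example via a forward hook), leaving the input sequence itself untouched.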
## 2 Observation: Inserting Meaningless Tokens Induces an Affine Transformation on Meaningful Token Representations
We observe that meaningless tokens, such as a sequence of slashes (“/”) of appropriate length, can enhance the performance of LLMs, particularly on reasoning tasks (varying the token length, type, and position affects performance, as shown in Appendix F). As shown in Table 1, when we insert a fixed-length sequence of meaningless tokens between the system prompt and the question, all evaluated models exhibit performance improvements on Math-500 and AIME2024 to different degrees. This consistent improvement suggests that the inserted meaningless tokens are not simply ignored or detrimental to the models; rather, they exert a positive influence, likely through non-trivial interactions with the models’ internal representations. To investigate this phenomenon, we start our analysis from the attention module. The formula of attention is:
Table 1: Performance on mathematical reasoning datasets with and without meaningless tokens across different models. “w/o” denotes the absence of meaningless tokens, while “w/” denotes their presence. Each model is evaluated five times and the average result is reported.
| Model | Math-500 (w/o) | Math-500 (w/) | AIME2024 (w/o) | AIME2024 (w/) |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | 63.9 | 65.9 | 14.4 | 17.5 |
| Qwen2.5-Math-7B | 72.3 | 74.6 | 23.1 | 23.3 |
| DS-R1-Qwen-7B | 52.7 | 53.1 | 3.2 | 4.4 |
| DS-Math-7B-instruct | 39.5 | 42.1 | 7.8 | 12.3 |
| Llama-3.1-8B-Instruct | 41.8 | 42.1 | 7.9 | 9.9 |
| Qwen-2.5-32B-Instruct | 81.3 | 81.7 | 17.6 | 22.8 |
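The insertion setup behind Table 1 is trivial to reproduce. A minimal sketch: the slash token follows the paper, while the count `n` below is an arbitrary placeholder (length effects are studied in Appendix F).

```python
def insert_meaningless_tokens(system_prompt: str, question: str,
                              token: str = "/", n: int = 200) -> str:
    """Place a run of `n` meaningless tokens between the system prompt and
    the question. `n` is illustrative; the paper varies it in Appendix F."""
    return f"{system_prompt}\n{token * n}\n{question}"
```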
Figure 2: The x-axis shows token indices. Subsequent tokens assign lower average attention weights to the original prompt overall, while meaningless tokens receive similarly near-zero weights. We show additional average attention weights in Appendix I and layer-wise analyses in Section F.4.
$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$ , where $Q$ , $K$ , $V$ are the query, key, and value vectors respectively, and $d_{k}$ is the dimensionality of the keys/queries. From this equation, adding extra tokens introduces additional terms into the softmax normalization, enlarging its denominator. Although the new tokens typically receive small weights, their presence redistributes probability mass and reduces the relative share of attention allocated to the original tokens. To probe the underlying cause, we directly compare the input’s attention weights with and without meaningless tokens while keeping token indices aligned in the first layer. For every token, we compute the mean of its column below the diagonal of the attention matrix, which measures the extent to which each token receives attention from all downstream tokens (Bogdan et al., 2025). When a string of meaningless tokens is present, the model assigns only small weights to each of them, intuitively indicating that the model pays little attention to them (see Figure 2, bottom row). The top row of Figure 2 presents a direct comparison of the attention to meaningful tokens without (blue) or with meaningless tokens (red; meaningless-token indices are removed from the visualization to allow a direct comparison). Among meaningful tokens, the average attention decreases in the meaningless-token condition, driven especially by reduced high-attention spikes. The attention weights of the original prompt after inserting meaningless tokens are: $W^{\prime}=\lambda\cdot W_{attn}$ , where $W_{attn}=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)$ are the attention weights after softmax, and $\lambda<1$ is the common ratio by which the attention weights on the original prompt shrink after adding meaningless tokens.
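The per-token measurement used above (the mean of a token's column below the diagonal) can be sketched directly; the 4×4 matrix below is a toy causal attention matrix, not taken from any model.

```python
def received_attention(attn, j):
    """Mean attention weight that token j receives from all later tokens:
    the average of column j below the diagonal of a causal attention matrix."""
    col = [attn[i][j] for i in range(j + 1, len(attn))]
    return sum(col) / len(col) if col else 0.0

# toy 4x4 causal attention matrix (each row sums to 1); values are illustrative
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.4, 0.3, 0.2, 0.1],
]
```

For the first token this gives the average of 0.6, 0.5, and 0.4, i.e. 0.5; the last token receives no downstream attention and scores 0.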
Then, the attention output for each token not only obtains the weighted combination of the original tokens, but also includes attention weights and values from the meaningless tokens. Thus, the attention output can be expressed as:
$$
\text{Attn\_Output}_{new}=W_{j}^{\prime}V_{j}+W_{i}V_{i}, \tag{1}
$$
Figure 3: After adding meaningless tokens, each token vector is affinely transformed: blue points show the original vectors, and red points show them after the addition. Arrows indicate the direction of change.
where $\text{Attn\_Output}_{new}$ corresponds to the output of the attention mechanism for each token in the original prompt, $W_{j}^{\prime}$ and $V_{j}$ are the attention weights and value vectors of the original prompt, and $W_{i}$ and $V_{i}$ are the attention weights and value vectors of the meaningless tokens. Because the meaningless tokens are repeated in long sequences and contribute no semantic information, their value vectors are identical and their attention weights are similarly small in magnitude. Therefore, as shown in Equation 1, the term $W_{i}V_{i}$ primarily shifts the final attention output along an approximately uniform direction as the contributions accumulate, without introducing diverse semantic components. In this formula, $W_{j}V_{j}$ is the original attention output, and we denote $W_{i}V_{i}$ by $\Sigma_{\sigma}$ . As a result, the attention output of meaningful tokens after adding meaningless tokens can be seen as an affine transformation:
$$
\text{Attn\_Output}_{new}=\lambda\cdot\text{Attn\_Output}+\Sigma_{\sigma}, \tag{2}
$$
where $\text{Attn\_Output}$ is $W_{j}V_{j}$ . Following this equation, the introduction of meaningless tokens turns the attention output of meaningful tokens into an affine function, consisting of a scaled original term ( $\lambda\cdot\text{Attn\_Output}$ ) and an additional bias ( $\Sigma_{\sigma}$ ). Figure 3 illustrates this transformation. After the attention module, the affine-transformed output passes through RMSNorm and serves as the input to the MLP. In the next section, we examine in detail how this transformation propagates through the subsequent MLP layers and shapes the model’s overall activation distribution.
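A small numerical check makes Equation 2 concrete. Assuming, as in the derivation, that every meaningless token shares one value and that the original weights shrink by a common ratio $\lambda$ as the softmax denominator grows, the new output equals $\lambda$ times the old output plus a constant bias. The scores and values below are toy numbers.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy 1-D setup: three meaningful tokens, then k identical meaningless tokens.
scores = [2.0, 1.0, 0.5]          # attention logits of the meaningful tokens
values = [1.0, -0.5, 2.0]         # their (scalar) value states
noise_score, noise_value, k = -2.0, 0.3, 10

w_old = softmax(scores)
w_new = softmax(scores + [noise_score] * k)

out_old = sum(w * v for w, v in zip(w_old, values))
out_new = (sum(w * v for w, v in zip(w_new[:3], values))
           + sum(w_new[3:]) * noise_value)

lam = w_new[0] / w_old[0]              # common shrink ratio of original weights
sigma = sum(w_new[3:]) * noise_value   # accumulated bias from meaningless tokens
# out_new coincides with the affine form lam * out_old + sigma
```

The shrink ratio is identical across the meaningful tokens because every original weight is divided by the same enlarged denominator, which is exactly why the transformation is affine rather than token-dependent.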
## 3 Analysis: Why the Affine Transformation Improves Reasoning Performance
Having established in the previous section that the meaningless-token effect induces scaling and bias terms that produce an affine transformation of the attention output, we next examine how this transformation propagates through the subsequent MLP modules and affects reasoning. In Equation 2, we decompose the transformation into two primary components: the scaling factor $\lambda$ , which controls the magnitude of activations, and the bias factor $\Sigma_{\sigma}$ , a bounded zero-mean term reflecting the variation in attention outputs before and after meaningless-token insertion, which introduces structured shifts in the activation distribution. Together, these two factors determine how the transformed attention representations shape the dynamics of the MLP layers.
### 3.1 The Affine Transformation Influences the Output of the Gate Layer
Key Takeaway
We demonstrate that applying an affine transformation, through both scaling and bias factors, systematically increases the variance of the gate layer’s output.
In this part, we show that these two factors increase the variance of the gate projection layer in the MLP. As discussed above, because these tokens have low attention weights and nearly identical values, they shift the RMSNorm input almost entirely along a single direction by a small margin; consequently, RMSNorm largely absorbs this change, producing only a minor numerical adjustment without adding semantic information. The two factors act through different mechanisms. For the scaling factor, before entering the MLP the attention output undergoes the output projection and residual connection, which can be written as $x(\lambda)=\text{res}+\lambda\,U A$ , where $A$ is the attention output and $U$ the projection weights. Treating $\lambda$ as a functional variable, the RMSNorm output becomes $y(\lambda)=\text{RMSNorm}(x(\lambda))$ . For the $j$ -th gate dimension, $z_{j}(\lambda)=w_{j}^{\top}y(\lambda)$ ; writing $g_{j}(\lambda)=\mathrm{d}z_{j}/\mathrm{d}\lambda$ for its sensitivity, a small variation $\Delta\lambda$ leads to the variance change of this dimension:
$$
\text{Var}[z_{j}(\lambda+\Delta\lambda)]=\text{Var}[z_{j}(\lambda)]+2\text{Cov}(z_{j}(\lambda),g_{j}(\lambda))\Delta\lambda+\text{Var}[g_{j}(\lambda)]\Delta\lambda^{2}, \tag{3}
$$
the third term in Equation 3 remains strictly positive for all admissible parameters. Moreover, as $\Delta\lambda$ increases, this term grows monotonically and asymptotically dominates the second term, thereby guaranteeing a strictly increasing overall variance. We analyze the range of $\Delta\lambda$ in Appendix E. In the case of the bias factor, we model the perturbation as stochastic noise that is bounded, zero-mean, and statistically independent of the original attention output across all dimensions; it contributes an additional variance component and interacts non-trivially with the subsequent RMSNorm operation. Formally, after noise injection, the RMSNorm input can be written as $x=x_{0}+W\Sigma_{\sigma}$ , where $W$ is the linear projection applied to the bias before RMSNorm. After normalization, the covariance of the output can be expressed as:
$$
\text{Cov}(y)=J_{q}\,\text{Cov}(x)\,J_{q}^{\top}+o(\|x-x_{0}\|^{2}), \tag{4}
$$
where $x_{0}$ is the mean expansion point and $J_{q}$ is the Jacobian matrix of the RMSNorm mapping. Since the variance of the added perturbation is very small, the higher-order terms can be disregarded. In this case, the bias factor shifts the input of RMSNorm and increases the covariance $\mathrm{Cov}(y)$ . Subsequently, the input to the activation function can be written as $z=W_{gate}(x+W\Sigma_{\sigma})$ . Based on the properties of covariance, the variance of the $j$ -th dimension is given by:
$$
\text{Var}[z_{j}]\approx e_{j}^{\top}W_{gate}\left[J_{q}\,\text{Cov}(x)\,J_{q}^{\top}\right]W_{gate}^{\top}e_{j}, \tag{5}
$$
since the projection of the vector onto the tangent space is almost never zero in LLMs’ high-dimensional spaces, the resulting variance is strictly greater than zero. From this, we deduce that both factors increase the variance of the output. In general, the scaling factor increases variance by amplifying inter-sample differences, whereas the bias factor increases variance by enlarging the covariance structure across dimensions.
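Equation 3 is an exact identity when $z_{j}$ responds linearly to $\Delta\lambda$, which a quick numerical check confirms. The sample values below are arbitrary toy numbers, with $g$ chosen positively correlated with $z$ (here simply $g=2z$) so the variance visibly grows.

```python
def var(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cov(xs, ys):
    """Population covariance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)

z = [0.2, -0.1, 0.4, 0.0, -0.3]   # toy samples of z_j(lambda)
g = [0.4, -0.2, 0.8, 0.0, -0.6]   # toy sensitivities g_j = dz_j/dlambda (= 2*z)
d = 0.25                          # Delta lambda

lhs = var([zi + gi * d for zi, gi in zip(z, g)])    # Var[z_j(lambda + d)]
rhs = var(z) + 2 * cov(z, g) * d + var(g) * d ** 2  # right side of Equation 3
```

With positively correlated $z$ and $g$, both the covariance term and the always-positive quadratic term push the variance up, matching the argument above.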
### 3.2 Variance change leads to activation redistribution
Key Takeaway
Our analysis shows that an increase in the input variance of activation functions broadens and reshapes the output activation distribution by raising both its mean and its variance.
As the variance of the gate layer outputs grows under perturbations, the subsequent activation function further reshapes these signals by compressing values near zero. This motivates redistributing near-zero activations. For each sample in the hidden state, applying a second-order Taylor expansion to the activation function $\phi$ , its output is:
$$
\phi(\mu+\sigma)=\phi(\mu)+\phi^{\prime}(\mu)\sigma+\frac{1}{2}\phi^{\prime\prime}(\mu)\sigma^{2}+o(|\sigma|^{3}), \tag{6}
$$
where $\sigma$ can represent either $\Delta\lambda$ from the scaling factor or $\Sigma_{\sigma}$ from the bias factor. We denote the input to the activation function as $z=\mu+\sigma$ . For the $j$ -th dimension of the hidden state, the expectation and variance of the activation output can be expressed as:
$$
\mathbb{E}[\phi(z_{j})]=\mathbb{E}[\phi(\mu_{j})]+\mathbb{E}[\phi^{\prime}(\mu_{j})\sigma]+\mathbb{E}\!\left[\tfrac{1}{2}\phi^{\prime\prime}(\mu_{j})\sigma^{2}\right]+o(\mathbb{E}|\sigma|^{3}), \tag{7}
$$
$$
\text{Var}[\phi(z_{j})]=\phi^{\prime}(\mu_{j})^{2}\,\text{Var}_{j}+o(\text{Var}_{j}^{2}). \tag{8}
$$
From the above equations, we infer that distributional changes map to variations in the expectation and variance. On a single dimension, activations shift in both directions: from Equation 6, higher-order terms are negligible and the first derivative of GeLU/SiLU near zero is positive, so, since perturbations take both signs, the perturbed activations also fluctuate around zero. From Equation 7, $\mathbb{E}[\sigma^{2}]=\text{Var}_{j}$ . For the bias factor, the zero-mean perturbation removes the first-order term; for the scaling factor, expanding at the population mean gives $\mathbb{E}[\phi^{\prime}(z_{j})g_{j}]=0$ , again canceling the first-order term. The second derivative near zero is strictly positive. From Equation 8, $\text{Var}_{j}$ increases, and so does the variance of the activation histogram, since the function is nearly linear near zero. In summary, the scaling and bias factors jointly enlarge the activation variance, expressed as:
$$
\text{Var}_{j}\approx\mathbb{E}[\text{Var}_{j}^{(\Sigma_{\sigma})}]+\text{Var}(g_{j}^{\lambda}). \tag{9}
$$
Figure 4: Histograms of activation frequencies after the activation function in the MLP; the sub-figures compare the four metrics before and after the transformation.
The first term represents the expected variance of the $j$-th hidden state under the influence of the bias factor. Since the bias factor varies across individual cases, taking the expectation is necessary to capture its overall impact. The second term corresponds to the variance induced by the scaling factors, which inherently reflects the aggregate change. Combining them, the overall variance of the outputs of the nonlinear activation functions increases, the mean shifts upward, and the activation distribution becomes broader, manifesting as heavier tails and a thinner center. More details of the above analysis and the related proof are given in Appendix E. Moreover, we presume that this redistribution benefits reasoning tasks because reasoning-critical tokens (digits, operators, conjunctions) have a higher fraction of near-zero activations; elevating their activation levels strengthens their representations and improves reasoning performance (see Section 6 for details).
### 3.3 Verification of activation redistribution
To verify whether the activation redistribution pattern in Section 3.2 indeed occurs in LLMs, Figure 4 illustrates the activation distribution after the first-layer MLP, explicitly comparing states before and after the transformation defined in Equation 2. We also comprehensively assess the transformation of activation states using several quantitative indicators, including:
- Relative Sparsity: Defined as the proportion of activations after the transformation whose values fall below the pre-transformation threshold.
- L1 Norm: The sum of the absolute activation values; smaller values indicate higher sparsity.
- L2 Norm: A measure of the overall magnitude of activations.
- Gini Coefficient: An indicator of the smoothness of the histogram distribution, where smaller absolute values correspond to smoother distributions.
From Figure 4, we observe that after the transformation, the frequency of near-zero activations decreases while the frequency of high-magnitude activations increases, so the activation distribution becomes both less sparse and smoother. Specifically, the relative sparsity consistently decreases across all three models while the L1 and L2 norms increase, consistent with the phenomenon described above.
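As a rough illustration, the four indicators can be computed from a flattened activation tensor as follows. This is a NumPy sketch: the fixed `threshold` standing in for the pre-transformation sparsity threshold and the particular Gini normalization are simplifications, not the paper's exact definitions.

```python
import numpy as np

def activation_metrics(acts, threshold=0.01):
    # acts: flat array of post-activation values (assumed not all zero)
    a = np.abs(np.asarray(acts, dtype=float))
    sparsity = float(np.mean(a < threshold))      # fraction of near-zero activations
    l1 = float(a.sum())                            # L1 norm: total activation mass
    l2 = float(np.sqrt((a ** 2).sum()))            # L2 norm: overall magnitude
    # Gini coefficient of the absolute activations (0 = perfectly even spread)
    s = np.sort(a)
    n = s.size
    cum = np.cumsum(s)
    gini = float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)
    return {"sparsity": sparsity, "l1": l1, "l2": l2, "gini": gini}
```

Under this reading, the transformation should lower `sparsity` while raising `l1` and `l2`, matching the trends reported for all three models.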
## 4 Method: Activation Redistribution Module
<details>
<summary>x5.png Details</summary>

### Visual Description
The diagram shows the first transformer layer: Q/K/V pass through RMSNorm and RoPE into the attention block, whose output feeds the MLP. Within the MLP, the `gate` projection is followed by the SiLU/GeLU activation, the ARM block then modifies the resulting activations, and the result is multiplied with the `up` projection before the `down` projection. Two histograms on the right compare the activation distribution before redistribution (tall, narrow peak at zero) and after redistribution (lower peak at zero, values spread over a wider range), with arrows indicating mass shifted from the peak to outer bins.
</details>
```python
def forward(self, x):
    # First-layer MLP forward pass with ARM inserted after the activation function
    activation = self.act_fn(self.gate_proj(x))
    # Our function: redistribute near-zero activations
    activation_alter = self.arm(activation.clone())
    down_proj = self.down_proj(activation_alter * self.up_proj(x))
    return down_proj
```
Figure 5: The upper panel illustrates the first-layer LLM architecture with ARM, while the lower panel presents the corresponding ARM code in the MLP module.
Inspired by the finding that meaningless tokens can shift meaningful activations and boost LLM performance, we propose ARM, a simple method that replaces explicit meaningless tokens with an implicit mechanism adjusting the MLP activation distribution after the activation function. Our approach has two steps: first, adaptively identify a proportion of near-zero activations based on the model and input; then, extrapolate them outward to redistribute the activation pattern. The top half of Figure 5 shows the first-layer MLP with ARM, where selected activations around zero are shifted outward, reducing their frequency and increasing that of larger-magnitude activations. The bottom half of Figure 5 presents the ARM-specific code, a lightweight function inserted into the first-layer MLP without affecting inference speed; as shown in Appendix D, ARM's time complexity is negligible within the MLP context. The significance of ARM is twofold. First, it adds further evidence supporting our theoretical analysis in Section 3: by directly replacing explicit meaningless-token insertion with implicit activation redistribution, ARM yields similar reasoning improvements across models and benchmarks, strengthening our theoretical framework. Second, ARM serves as a lightweight inference-time technique for boosting reasoning that is not only robustly effective on its own (see the experiments in Section 5) but also compatible with existing inference-time scaling methods (see Appendix G.3).
### 4.1 Selecting an Appropriate Change Proportion
Our method first selects a proportion of activations to be modified. Because different models exhibit varying sensitivities to meaningless tokens, we propose a dynamic strategy that adjusts the fraction of near-zero activations to be altered during inference. To determine this proportion, we measure the dispersion of activations around zero. Specifically, we define a neighborhood $\epsilon$ based on the activation distribution to decide which activations count as "close to zero". We adopt the Median Absolute Deviation (MAD) as our dispersion metric, since MAD is robust to outliers and better captures the core of the distribution. The threshold is given by $\epsilon = \kappa \cdot \text{MAD} \cdot c$, where $\kappa$ is a consistency constant and $c$ is a hyperparameter controlling the width of the near-zero range. Next, we compute the fraction $p$ of activations falling within $[-\epsilon, \epsilon]$, which represents the proportion of activations we consider near zero. The fraction we modify is then $\text{fraction} = \text{clip}(p, p_{\text{min}}, p_{\text{max}})$, where $p_{\text{min}}$ and $p_{\text{max}}$ bound the scale from becoming either too small or excessively large. In our experiments, we set $p_{\text{min}} = 0.02$ and $p_{\text{max}} = 0.25$.
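The threshold-and-clip procedure above can be sketched as follows. This is a minimal NumPy sketch: the default $\kappa = 1.4826$ (the usual consistency constant for normally distributed data) and $c = 1.0$ are assumed values, not the paper's tuned settings.

```python
import numpy as np

def change_fraction(activations, c=1.0, p_min=0.02, p_max=0.25, kappa=1.4826):
    # MAD-based near-zero threshold: eps = kappa * MAD * c
    mad = np.median(np.abs(activations - np.median(activations)))
    eps = kappa * mad * c
    # Fraction of activations inside [-eps, eps], clipped to [p_min, p_max]
    p = np.mean(np.abs(activations) <= eps)
    return float(np.clip(p, p_min, p_max))
```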
### 4.2 Redistribution of Activation Values
After selecting the elements, we preserve each element's sign and adjust only its magnitude. Specifically, we add a positive or negative value depending on the element's sign. To keep the modified values within a reasonable range, the range is defined as follows:
$$
R=\begin{cases}[0,\,Q_{p_{1}}(\text{Activations})],&\text{sign}=+1,\\[6.0pt]
[\min(\text{Activations}),\,0],&\text{sign}=-1.\end{cases} \tag{10}
$$
where $R$ is the range of modified values. We set the lower bound to the minimum activation value when $\text{sign}=-1$, since activation functions such as SiLU and GeLU typically attain their smallest values on the negative side. For the upper bound when $\text{sign}=+1$, we use $Q_{p_{1}}(\text{Activations})$, the value at the $p_{1}$-th percentile of the activation distribution; $p_{1}$ is a hyperparameter whose value depends on the distribution of activations and on $c$. Finally, we draw a random value from $R$ and add it to the activation. In this way, we adaptively adjust an appropriate proportion of activations, enriching the distribution with more effective values. We show how to choose the hyperparameters in Appendix H.
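A minimal sketch of this redistribution step, under two assumptions not fixed by the text: that the selected elements are the `fraction` smallest-magnitude activations, and that `p1` is the percentile hyperparameter (here defaulted to 90, an illustrative value).

```python
import numpy as np

rng = np.random.default_rng(0)

def redistribute(activations, fraction, p1=90.0):
    # Assumption: select the `fraction` of activations closest to zero
    acts = activations.copy()
    n_change = max(1, int(fraction * acts.size))
    idx = np.argsort(np.abs(acts))[:n_change]
    pos_bound = np.percentile(acts, p1)  # upper bound Q_{p1} for positive elements
    neg_bound = acts.min()               # lower bound for negative elements
    for i in idx:
        if acts[i] >= 0:
            # Add a value drawn from [0, Q_{p1}]: sign preserved, magnitude grows
            acts[i] += rng.uniform(0.0, max(pos_bound, 0.0))
        else:
            # Add a value drawn from [min, 0): sign preserved, magnitude grows
            acts[i] += rng.uniform(min(neg_bound, 0.0), 0.0)
    return acts
```

Because positive elements only receive non-negative offsets and negative elements only non-positive ones, signs are preserved while the near-zero mass is pushed outward.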
Table 2: After adding ARM to the first-layer MLP, we report reasoning-task performance for six models, using a dash (‘–’) for accuracies below 5% to indicate incapability.
| Model | Setting | GPQA Diamond | Math-500 | AIME 2024 | AIME 2025 | LiveCodeBench | HumanEval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | Pass@1 | Pass@1 | Pass@1 | Pass@1 | Pass@1 | Pass@1 |
| Qwen2.5 Math-1.5B | Baseline | 27.3 | 63.8 | 14.4 | 6.7 | – | 6.1 |
| | ARM | 28.8 | 67.0 | 18.9 | 10.0 | – | 8.5 |
| | Improve Rate (%) | 1.5 $\uparrow$ | 3.2 $\uparrow$ | 4.5 $\uparrow$ | 3.3 $\uparrow$ | – | 2.4 $\uparrow$ |
| Qwen2.5 Math-7B | Baseline | 30.3 | 72.4 | 23.3 | 10.0 | – | 15.2 |
| | ARM | 34.9 | 73.4 | 25.6 | 13.3 | – | 17.7 |
| | Improve Rate (%) | 4.6 $\uparrow$ | 1.0 $\uparrow$ | 2.3 $\uparrow$ | 3.3 $\uparrow$ | – | 2.5 $\uparrow$ |
| Qwen2.5 7B-Instruct | Baseline | 28.3 | 61.4 | 20.0 | 10.0 | 29.7 | 43.9 |
| | ARM | 29.8 | 62.4 | 20.0 | 23.3 | 31.9 | 47.6 |
| | Improve Rate (%) | 1.5 $\uparrow$ | 1.0 $\uparrow$ | 0 | 13.3 $\uparrow$ | 2.2 $\uparrow$ | 3.7 $\uparrow$ |
| Qwen2.5 32B-Instruct | Baseline | 35.4 | 82.6 | 16.7 | 20.0 | 49.5 | 50.0 |
| | ARM | 35.9 | 82.6 | 18.8 | 26.7 | 49.5 | 51.2 |
| | Improve Rate (%) | 0.5 $\uparrow$ | 0 | 2.1 $\uparrow$ | 6.7 $\uparrow$ | 0 | 1.2 $\uparrow$ |
| Llama3.1 8B-Instruct | Baseline | 28.3 | 43.0 | 11.1 | – | 11.9 | 45.7 |
| | ARM | 31.3 | 45.8 | 13.3 | – | 17.0 | 47.6 |
| | Improve Rate (%) | 3.0 $\uparrow$ | 2.8 $\uparrow$ | 2.2 $\uparrow$ | – | 5.1 $\uparrow$ | 1.9 $\uparrow$ |
| Gemma3 4b-it | Baseline | 34.3 | 72.6 | 13.3 | 20.0 | 20.2 | 17.1 |
| | ARM | 35.9 | 74.0 | 17.8 | 23.3 | 20.6 | 20.7 |
| | Improve Rate (%) | 1.5 $\uparrow$ | 1.4 $\uparrow$ | 4.5 $\uparrow$ | 3.3 $\uparrow$ | 0.4 $\uparrow$ | 3.6 $\uparrow$ |
| Gemma3 27b-it | Baseline | 33.3 | 85.4 | 25.6 | 26.7 | 31.9 | 9.1 |
| | ARM | 33.8 | 86.2 | 31.1 | 30.0 | 34.2 | 11.6 |
| | Improve Rate (%) | 0.5 $\uparrow$ | 0.8 $\uparrow$ | 4.4 $\uparrow$ | 3.3 $\uparrow$ | 2.3 $\uparrow$ | 2.5 $\uparrow$ |
## 5 Experiments
We evaluate our method on reasoning and non-reasoning tasks using seven models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-Instruct-7B, Qwen2.5-Instruct-32B (qwe, 2025), Llama3.1-8B-Instruct (gra, 2024), Gemma3-4b-it, and Gemma3-27b-it (gem, 2025). All models use default generation parameters. For reasoning tasks, we cover three skill areas: (1) General: GPQA (Rein et al., 2024), a challenging expert-authored multiple-choice dataset; (2) Math & Text Reasoning: MATH-500 (Lightman et al., 2023), AIME’24 (AIME, 2024), and AIME’25 (AIME, 2025); (3) Agent & Coding: LiveCodeBench (Jain et al., 2024) and HumanEval (Chen et al., 2021). For non-reasoning tasks, we use GSM8K (Cobbe et al., 2021), ARC-E (Clark et al., 2018), ARC-C (Clark et al., 2018), MMLU (Hendrycks et al., 2021), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), and OpenBookQA (Mihaylov et al., 2018).
### 5.1 Experiment Results Analysis
For reasoning tasks, Table 2 reports pass@1 accuracy across multiple benchmarks. Our method consistently improves performance across most models and datasets, with the effect more pronounced in smaller models (e.g., Qwen2.5-Math-7B shows larger gains than Qwen2.5-32B-Instruct). On challenging benchmarks, however, improvements are limited when models lack sufficient capacity or when baseline accuracy is near saturation. For non-reasoning tasks (see Table 3(b)), applying ARM to the first-layer MLP yields little change. We attribute this to their largely factual nature: the models already possess the necessary knowledge and response formats, so minimal reasoning is required. By contrast, for reasoning tasks, altering early activations helps reorganize knowledge, strengthens intermediate representations, and facilitates more effective and consistent reasoning.
### 5.2 Comparison of Meaningless tokens and ARM
In Table 3(a), we directly compare our proposed ARM method with the strategy of inserting a suitable number of meaningless tokens. The results show that both approaches improve model performance and neither requires post-training, making them lightweight interventions that yield robust gains. However, because ARM directly exploits the fundamental principle driving the meaningless-token effect, it delivers more stable results. While the meaningless-token effect is pervasive, our experiments show that it depends heavily on the specific choice of token length and placement, and may therefore be unstable or hard to generalize across tasks. ARM instead provides a principled, model-internal mechanism that directly reshapes the activation distribution within the MLP, yielding more consistent gains without heuristic token engineering. In sum, while inserting a string of meaningless tokens at the prompt level may look like a promising prompt-tuning adjustment on the surface, it carries an instability that ARM eliminates. This contrast highlights the trade-off between ease of use and robustness, and further underscores the value of ARM as a systematic method for enhancing the reasoning ability of large language models.
<details>
<summary>x6.png Details</summary>

### Visual Description
Three grouped bar charts compare Baseline and ARM for Qwen2.5-Math-1.5B, Gemma3-4b-it, and Qwen2.5-Math-7B: pass@3 on Math-500 (accuracy roughly 0.72–0.84), pass@3 on AIME2024 (roughly 0.22–0.37), and 2-gram diversity score (roughly 0.44–0.55). In every panel the ARM bars are higher than the corresponding Baseline bars, with accuracy gains of about 0.01–0.03.
</details>
Figure 6: The first two figures show pass@3 on Math-500 and AIME2024 for three models with and without ARM, and the last shows their 2-gram diversity under both conditions.
Table 3: Table (a) compares the performance of meaningless tokens and ARM, and Table (b) reports ARM’s results on non-reasoning tasks.
(a) Pass@1 on Math-500 and AIME2024 with meaningless tokens (Mless) or ARM.
| Model | Setting | Math-500 | AIME2024 |
| --- | --- | --- | --- |
| Qwen2.5 Math-7B | Baseline | 72.4 | 23.3 |
| | Mless | 75.0 | 24.4 |
| | ARM | 73.4 | 25.6 |
| Llama3.1 8B-Instruct | Baseline | 43.0 | 11.1 |
| | Mless | 44.9 | 13.3 |
| | ARM | 45.8 | 13.3 |
(b) Performance of models with ARM on non-reasoning tasks. Additional results are in Appendix G.
| Model | Setting | GSM8K | ARC-E | HellaSwag |
| --- | --- | --- | --- | --- |
| Qwen2.5 Math-1.5B | Baseline | 78.0 | 39.3 | 39.1 |
| | ARM | 78.6 | 39.3 | 39.5 |
| | Improve Rate (%) | 0.6 $\uparrow$ | 0 | 0.4 $\uparrow$ |
| Llama3.1 8B-Instruct | Baseline | 80.0 | 46.6 | 56.8 |
| | ARM | 82.4 | 47.1 | 57.3 |
| | Improve Rate (%) | 2.4 $\uparrow$ | 0.5 $\uparrow$ | 0.5 $\uparrow$ |
### 5.3 Exploration capabilities after ARM
As discussed earlier, we hypothesize that redistributing activations enables the model to explore the reasoning space more effectively. To test this hypothesis, we evaluate the model’s pass@3 performance on the Math-500 and AIME2024 benchmarks as well as its 2-gram diversity. As shown in Figure 6, applying activation redistribution consistently yields higher pass@3 scores compared to the baselines on both tasks. In addition, the 2-gram diversity under ARM is also greater than that without ARM. These findings indicate that activation redistribution not only improves the likelihood of arriving at correct solutions within multiple samples but also promotes more diverse reasoning paths. This dual effect suggests that ARM enhances both the effectiveness and the breadth of the model’s internal reasoning processes, reinforcing our hypothesis that carefully manipulating internal activations can expand a model’s reasoning capacity without additional training or parameter growth.
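For reference, 2-gram diversity is commonly computed as the distinct-2 ratio: unique bigrams divided by total bigrams in the generated text. The paper does not spell out its exact tokenization, so the whitespace splitting below is an assumption.

```python
def two_gram_diversity(text):
    # Distinct-2: unique bigrams / total bigrams over whitespace tokens
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0
```

Higher values indicate less repetitive, more varied outputs, which is why this score serves as a proxy for the breadth of reasoning paths explored.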
## 6 Discussion: Why Activation Redistribution Enhances LLM Reasoning Performance
<details>
<summary>x7.png Details</summary>

### Visual Description
A grouped bar chart compares the mean proportion of near-zero activations (y-axis, 0.00–0.14) for four token categories (teal: digit, light green: operator, light blue: conjunction, purple: other) in **Qwen2.5-7B-Math** and **Llama3.1-8B-Instruct**. In both models the "other" category shows the lowest near-zero proportion (about 0.048 for Qwen, about 0.095 for Llama), while digits, operators, and conjunctions are noticeably higher.
</details>
Figure 7: Percentage of near-zero activations across the four token types in the Math-500 dataset.
We provide one possible explanation for why redistributing the near-zero activations improves the reasoning performance of LLMs. We categorize all tokens in Math-500 into four classes: digits, conjunctions, operators, and other tokens. For each class, we compute the average proportion of activations falling within the near-zero range, which reflects how many dimensions of the hidden representation remain nearly inactive. The results are presented in Figure 7. As shown, other tokens exhibit the lowest near-zero proportion, while digits, operators, and conjunctions show substantially higher proportions, meaning that a large share of the high-frequency near-zero activations after the activation function derive from these tokens. This suggests that although these tokens are crucial for reasoning, their information is insufficiently activated by the model. Our observation is consistent with the findings of Huan et al. (2025), which highlight the increasing importance of conjunctions after reinforcement learning, and also aligns with the recognized role of digits and operators in reasoning tasks such as mathematics and coding. Consequently, redistributing activations around zero enhances the representation of under-activated yet semantically important tokens, improving reasoning performance.
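The analysis above can be sketched as follows: classify each token into one of the four categories and average the per-token fraction of near-zero activation dimensions. The category word lists and the near-zero threshold `eps` below are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

OPERATORS = set("+-*/=<>^%")
CONJUNCTIONS = {"and", "or", "but", "so", "because", "therefore", "thus", "hence"}

def token_class(token):
    # Illustrative four-way categorization (word lists are assumptions)
    if token.isdigit():
        return "digit"
    if token in OPERATORS:
        return "operator"
    if token.lower() in CONJUNCTIONS:
        return "conjunction"
    return "other"

def near_zero_ratio_by_class(tokens, activations, eps=0.01):
    # activations: one post-activation vector per token
    per_class = {}
    for tok, act in zip(tokens, activations):
        frac = float(np.mean(np.abs(act) < eps))  # near-zero dims for this token
        per_class.setdefault(token_class(tok), []).append(frac)
    return {cls: sum(v) / len(v) for cls, v in per_class.items()}
```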
## 7 Related Work
Recent studies notice that symbols in an LLM’s input may affect their internal mechanism. Sun et al. (2024) show large activations for separators, periods, or newlines, suggesting that these tokens carry model biases. Razzhigaev et al. (2025) find that commas are essential for contextual memory, while Chauhan et al. (2025) and Min et al. (2024) highlight punctuation as attention sinks, memory aids, and semantic cues. Moreover, Chadimová et al. (2024) show that replacing words with meaningless tokens can reduce cognitive biases, whereas Li et al. (2024) report that such “glitch tokens” may also cause misunderstandings, refusals, or irrelevant outputs. Our work adds explanation to the puzzling downstream benefits that the inclusion of a string of meaningless tokens contributes to reasoning performance and shows how deep investigations of the underlying mechanisms can lead to improved inference solutions. We provide an extended discussion of related works in Appendix B.
## 8 Conclusion
In this paper, we report the meaningless-token effect: inserting long sequences of meaningless tokens improves model performance, particularly on reasoning tasks. Our analysis suggests that it stems from the fact that meaningless tokens induce an affine transformation on meaningful tokens, thereby redistributing their activations and enabling key information to be more effectively utilized. Building on this insight, we introduce ARM, a lightweight and training-free method for activation redistribution, which strengthens our analysis and serves as a practical approach for consistently improving LLM performance on reasoning tasks.
## Ethics Statement
All datasets used in this work are publicly available and contain no sensitive information. Our method enhances LLM reasoning without introducing new data collection or human interaction. While stronger reasoning ability may be misused, we emphasize that this work is intended for beneficial research and responsible applications.
## Reproducibility Statement
We will release our code and data once the paper is published. The appendix includes detailed experimental setups and hyperparameters so that others can reproduce our results. We also encourage the community to follow good research practices when using our code and data, to help maintain the reliability and transparency of future work.
## References
- gra (2024) The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- gem (2025) Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786.
- qwe (2025) Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- AIME (2024) AIME. Aime problems and solutions, 2024. URL https://aime24.aimedicine.info/.
- AIME (2025) AIME. Aime problems and solutions, 2025. URL https://artofproblemsolving.com/wiki/index.php/AIMEProblemsandSolutions.
- Bogdan et al. (2025) Paul C Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter? arXiv preprint arXiv:2506.19143, 2025.
- Chadimová et al. (2024) Milena Chadimová, Eduard Jurášek, and Tomáš Kliegr. Meaningless is better: hashing bias-inducing words in llm prompts improves performance in logical reasoning and statistical learning. arXiv preprint arXiv:2411.17304, 2024.
- Chauhan et al. (2025) Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, and Nandi Schoots. Punctuation and predicates in language models. arXiv preprint arXiv:2508.14067, 2025.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Dhanraj & Eliasmith (2025) Varun Dhanraj and Chris Eliasmith. Improving rule-based reasoning in llms via neurosymbolic representations. arXiv preprint, 2025.
- Errica et al. (2024) Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering. arXiv preprint arXiv:2406.12334, 2024.
- Guan et al. (2025) Bryan Guan, Tanya Roosta, Peyman Passban, and Mehdi Rezagholizadeh. The order effect: Investigating prompt sensitivity to input order in llms. arXiv preprint arXiv:2502.04134, 2025.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- Højer et al. (2025) Bertram Højer, Oliver Jarvis, and Stefan Heinrich. Improving reasoning performance in large language models via representation engineering. arXiv preprint arXiv:2504.19483, 2025.
- Huan et al. (2025) Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Jiang et al. (2024) Ming Jiang, Tingting Huang, Biao Guo, Yao Lu, and Feng Zhang. Enhancing robustness in large language models: Prompting for mitigating the impact of irrelevant information. In International Conference on Neural Information Processing, pp. 207–222. Springer, 2024.
- Kaul et al. (2024) Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. arXiv preprint arXiv:2410.17174, 2024.
- Kawasaki et al. (2024) Amelia Kawasaki, Andrew Davis, and Houssam Abbas. Defending large language models against attacks with residual stream activation analysis. arXiv preprint arXiv:2406.03230, 2024.
- Li et al. (2024) Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection. Proceedings of the ACM on Software Engineering, 1(FSE):2075–2097, 2024.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Liu et al. (2024) Weize Liu, Yinlong Xu, Hongxia Xu, Jintai Chen, Xuming Hu, and Jian Wu. Unraveling babel: Exploring multilingual activation patterns of llms and their applications. arXiv preprint arXiv:2402.16367, 2024.
- London & Kanade (2025) Charles London and Varun Kanade. Pause tokens strictly increase the expressivity of constant-depth transformers. arXiv preprint arXiv:2505.21024, 2025.
- Luo et al. (2025) Yifan Luo, Zhennan Zhou, and Bin Dong. Inversescope: Scalable activation inversion for interpreting large language models. arXiv preprint arXiv:2506.07406, 2025.
- Luo et al. (2024) Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, and Maosong Sun. Sparsing law: Towards large language models with greater activation sparsity. arXiv preprint arXiv:2411.02335, 2024.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
- Min et al. (2024) Junghyun Min, Minho Lee, Woochul Lee, and Yeonsoo Lee. Punctuation restoration improves structure understanding without supervision. arXiv preprint arXiv:2402.08382, 2024.
- Owen et al. (2025) Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, and Fabian Güra. A refined analysis of massive activations in llms. arXiv preprint arXiv:2503.22329, 2025.
- Pfau et al. (2024) Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.
- Pham & Nguyen (2024) Van-Cuong Pham and Thien Huu Nguyen. Householder pseudo-rotation: A novel approach to activation editing in llms with direction-magnitude perspective. arXiv preprint arXiv:2409.10053, 2024.
- Rai & Yao (2024) Daking Rai and Ziyu Yao. An investigation of neuron activation as a unified lens to explain chain-of-thought eliciting arithmetic reasoning of llms. arXiv preprint arXiv:2406.12288, 2024.
- Razzhigaev et al. (2025) Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, and Andrey Kuznetsov. Llm-microscope: Uncovering the hidden role of punctuation in context memory of transformers. arXiv preprint arXiv:2502.15007, 2025.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Sheng et al. (2025) Yu Sheng, Linjing Li, and Daniel Dajun Zeng. Learning theorem rationale for improving the mathematical reasoning capability of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 15151–15159, 2025.
- Shi et al. (2024) Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Hang Gao, Fan Yang, Ruixiang Tang, and Yongfeng Zhang. Robustness-aware automatic prompt optimization. arXiv preprint arXiv:2412.18196, 2024.
- Sun et al. (2024) Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
- Turner et al. (2023) Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
- Voita et al. (2023) Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827, 2023.
- Wang et al. (2025) Yudong Wang, Damai Dai, Zhe Yang, Jingyuan Ma, and Zhifang Sui. Exploring activation patterns of parameters in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 25416–25424, 2025.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Zhao et al. (2025) Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. Analyzing the rapid generalization of sft via the perspective of attention head activation patterns. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16980–16992, 2025.
- Zhou et al. (2024) Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024.
- Zhuo et al. (2024) Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. Prosa: Assessing and understanding the prompt sensitivity of llms. arXiv preprint arXiv:2410.12405, 2024.
## Contents
- 1 Introduction
- 2 Observation: Inserting Meaningless Tokens Induces an Affine Transformation on Meaningful Token Representations
- 3 Analysis: Why Affine Transformations Improve Reasoning Performance
  - 3.1 Affine transformations influence the output of the gate layer
  - 3.2 Variance change leads to activation redistribution
  - 3.3 Verification of activation redistribution
- 4 Method: Activation Redistribution Module
  - 4.1 Selecting an appropriate change proportion
  - 4.2 Redistribution of activation values
- 5 Experiments
  - 5.1 Experiment results analysis
  - 5.2 Comparison of meaningless tokens and ARM
  - 5.3 Exploration capabilities after ARM
- 6 Discussion: Why Activation Redistribution Enhances LLM Reasoning Performance
- 7 Related Work
- 8 Conclusion
- A Disclosure of LLM Usage
- B Related Work
  - B.1 Meaningless tokens in LLMs
  - B.2 Activation analysis in LLMs
- C Limitations
- D Time Complexity
- E Proof
  - E.1 Scaling factor causes variance rise
  - E.2 Bias factor causes variance rise
- F More Analysis
  - F.1 The impact of the length of inserted meaningless tokens
  - F.2 The impact of the position of inserted meaningless tokens
  - F.3 The impact of the type of inserted meaningless tokens
  - F.4 Why we only analyze the first layer
  - F.5 Effectiveness of repeated meaningful tokens
  - F.6 Why a random sentence is useless
  - F.7 The optimal hyperparameter range
- G More Experiments
  - G.1 Results on non-reasoning tasks
  - G.2 Results on base models
  - G.3 Inference-time trick comparison
- H Examples of hyperparameters
- I More average activation weights
- J Examples of model output changes after adding meaningless tokens
## Appendix A Disclosure of LLM Usage
This paper used LLMs to assist with grammar checking.
## Appendix B Related Work
### B.1 Meaningless Tokens in LLMs
Recent studies have shown that seemingly meaningless tokens, such as punctuation marks, play a non-trivial role in information propagation and reasoning within large language models (LLMs). For example, Sun et al. (2024) report that LLMs exhibit large activations in response to separators, periods, or newline characters, suggesting that these tokens can serve as carriers of model biases. Similarly, Razzhigaev et al. (2025) demonstrate that tokens such as commas act as crucial elements in maintaining contextual memory: removing them significantly degrades performance on context-understanding tasks. Chauhan et al. (2025) further argue that punctuation may function as attention sinks or assist the memory mechanism, while Min et al. (2024) highlight its value in semantic construction, enabling models to better capture contextual structure. In addition, Chadimová et al. (2024) show that substituting certain words with meaningless tokens can mitigate cognitive biases in LLMs. Conversely, Li et al. (2024) illustrate that meaningless “glitch tokens” can induce misunderstandings, refusals, or irrelevant generations. However, these works primarily examine the effects of individual meaningless tokens, without considering the broader impact of longer meaningless token sequences.
More recently, several studies have explored the role of long meaningless token sequences and reported their surprising positive influence on LLM performance. For instance, Zhou et al. (2024) find that appending meaningless tokens to the end of prompts can trigger or defend against jailbreak behaviors. Similarly, Shi et al. (2024) show that adding long meaningless sequences after a sentence can improve model performance on certain tasks. Pfau et al. (2024) and London & Kanade (2025) report that substituting meaningful tokens with filler-like tokens (e.g., ‘…’) in the training data preserves the model’s ability to solve questions, suggesting that even without meaningful tokens the model can perform implicit computation. Meanwhile, there are also some methods to improve the reasoning performance of LLMs (Dhanraj & Eliasmith, 2025; Højer et al., 2025; Sheng et al., 2025). Despite these empirical findings and methods, there is still a lack of systematic analysis explaining why meaningless tokens, especially in longer sequences, can play such a counterintuitive yet beneficial role in shaping LLM reasoning behavior.
### B.2 Activations Analysis in LLMs
Activation analysis is a popular method for explaining the mechanics of LLMs (Wang et al., 2025; Kawasaki et al., 2024; Pham & Nguyen, 2024; Rai & Yao, 2024). Owen et al. (2025) supplement Sun et al. (2024) by analyzing the activations after the MLP to study how massive values influence bias and large attention. Wang et al. (2025) test hidden states across all layers to examine the importance of parameters in different layers. Zhao et al. (2025) use activations to determine whether an attention head is activated after training. Kaul et al. (2024) analyze attention activations and find that almost all activations focus on the first tokens; they also analyze high activations in the FFN. Luo et al. (2024) systematically study the magnitude law and influencing factors of activation sparsity in decoder Transformer architectures, showing that different activation functions (ReLU vs. SiLU) lead to drastically different sparsity trends during training. In Liu et al. (2024), activation refers to the output behavior of the expert selector: instead of a single neuron activating, the analysis investigates which expert module each token is routed to. Turner et al. (2023) propose steering middle-layer activations to improve model outputs. Voita et al. (2023) use the OPT model to analyze FFN neurons. Luo et al. (2025) use activations to understand the semantic information encoded in LLMs. However, most papers analyze activations using activation scores, hidden states, or broader definitions of activation. Few works directly examine the activations right after the non-linear activation functions in the MLP.
## Appendix C Limitations
Different meaningless tokens lead to varying performance outcomes. We only know that this difference arises from their differing degrees of transformation, but the underlying reason why different tokens cause such phenomena remains unclear. Meanwhile, we assume that meaningless tokens can be identified by LLMs in the first layer. Therefore, in our analysis, we focus only on their impact on meaningful tokens and how this interaction influences model performance. As such, we ignore the meaningless tokens themselves. Future work can further investigate the results when explicitly considering meaningless tokens. We restrict our analysis to the first layer, as it is the only layer where the attention scores exhibit a clear phenomenon (see Figure 2). Future work may extend this investigation to examine whether similar effects arise in deeper layers.
## Appendix D Time Complexity
In this section, we analyze the time complexity of our method within the MLP. In the first layer's MLP, let $B$ denote the batch size, $S$ the sequence length, $D_{f}$ the feed-forward dimension, and $D_{model}$ the model dimension. The MLP consists of the gate projection, up projection, and down projection, each with time complexity $O(2BSD_{f}D_{model})$ ; dropping constant factors, the total for the MLP is:
$$
\text{T}_{mlp}=O(BSD_{f}D_{model}), \tag{11}
$$
The ARM module performs the following operations: computing the MAD, comparing against the threshold, computing the change proportion $p$ , and selecting the elements to be modified. Each of these is an elementwise pass (or a per-token reduction) over the activation tensor, so their combined time complexity is $O(BSD_{f})$ . The time complexity of ARM is therefore:
$$
\text{T}_{ARM}=O(BSD_{f}), \tag{12}
$$
The ratio of ARM's cost to the MLP's cost is on the order of $\frac{1}{D_{model}}$ . When $D_{model}=4096$ , accounting for the constant factor of the projections, this ratio is approximately $\frac{1}{2\times 4096}\approx 1.2\times 10^{-4}$ , i.e., on the order of one ten-thousandth. We therefore consider the overhead of ARM within the MLP layer negligible.
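To make the $O(BSD_{f})$ cost concrete, the following is a minimal NumPy sketch of an ARM-style pass over the post-activation tensor. The function name `arm_sketch`, the threshold multiplier `tau`, and the shift `delta` are illustrative assumptions, not the paper's exact procedure; the point is that every step is elementwise or a per-token reduction over $D_{f}$ , with no dependence on $D_{model}$ .

```python
import numpy as np

def arm_sketch(acts: np.ndarray, tau: float = 0.5, delta: float = 0.1):
    """Hypothetical ARM-style pass over post-activation values of shape (B, S, D_f)."""
    # 1) Median absolute deviation (MAD) per token: a robust scale estimate.
    med = np.median(acts, axis=-1, keepdims=True)
    mad = np.median(np.abs(acts - med), axis=-1, keepdims=True)
    # 2) Threshold comparison: flag near-zero activations.
    near_zero = np.abs(acts) < tau * mad
    # 3) Change proportion p: fraction of flagged elements per token.
    p = near_zero.mean(axis=-1, keepdims=True)
    # 4) Shift flagged elements outward, away from zero (so |output| >= |input|).
    shifted = np.where(near_zero, acts + np.sign(acts) * delta, acts)
    return shifted, p
```

None of these operations touch a $D_{model}$ -sized dimension, which is why the overhead is negligible next to the MLP's projections.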
## Appendix E Proof
### E.1 Scaling factor causes variance rise
**Lemma E.1**
*In LLMs, RMSNorm uses $\varepsilon>0$ ; hence $J_{q}(x_{0})$ is bounded and $\|x_{0}\|\geq\varepsilon$*
For every $\lambda$ , we have:
$$
x(\lambda)=r+\lambda UA,y(\lambda)=\text{RMS}(x(\lambda)),z_{j}(\lambda)=w_{j}^{\top}y(\lambda), \tag{13}
$$
For every $\Delta\lambda$ , we have:
$$
z_{j}(\lambda+\Delta\lambda)\approx z_{j}(\lambda)+g_{j}(\lambda)\Delta\lambda,g_{j}(\lambda)=w_{j}^{\top}J_{q}(x(\lambda))UA, \tag{14}
$$
For $\text{Var}_{j}$ , using the first-order expansion above, we have:
$$
\Delta\mathrm{Var}_{j}\triangleq\mathrm{Var}\!\big[z_{j}(\lambda+\Delta\lambda)\big]-\mathrm{Var}\!\big[z_{j}(\lambda)\big]=2\,\mathrm{Cov}(z_{j},g_{j})\,\Delta\lambda+\mathrm{Var}(g_{j})\,(\Delta\lambda)^{2},
$$
so that
$$
\Delta\mathrm{Var}_{j}\;\geq\;-2\bigl|\mathrm{Cov}(z_{j},g_{j})\bigr|\,|\Delta\lambda|+\mathrm{Var}(g_{j})\,(\Delta\lambda)^{2},
$$
which is strictly positive whenever
$$
|\Delta\lambda|\;>\;\frac{2\bigl|\mathrm{Cov}(z_{j},g_{j})\bigr|}{\mathrm{Var}(g_{j})}.
$$
Accounting for the higher-order remainder of the expansion, we also have:
$$
\Delta\mathrm{Var}_{j}\;\geq\;-2\bigl|\mathrm{Cov}(z_{j},g_{j})\bigr|\,\bigl|\Delta\lambda\bigr|+\mathrm{Var}(g_{j})\,(\Delta\lambda)^{2}-\frac{K}{6}\,\bigl|\Delta\lambda\bigr|^{3}. \tag{15}
$$
Here $K$ is an upper bound on the third-order remainder term of $\text{Var}[z_{j}(\lambda)]$ , which yields the range:
$$
\frac{2\bigl|\mathrm{Cov}(z_{j},g_{j})\bigr|}{\mathrm{Var}(g_{j})}\leq|\Delta\lambda|\leq\frac{3\,\mathrm{Var}(g_{j})}{K}. \tag{16}
$$
For every $|\Delta\lambda|$ in this range, we have $\Delta\text{Var}_{j}>0$ : the quadratic term dominates the negative linear term, while the cubic remainder stays controlled.
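As a numerical sanity check of the left end of this range, the sketch below draws synthetic correlated samples for $z_{j}$ and its sensitivity $g_{j}$ (illustrative Gaussian assumptions, not the model's actual statistics) and verifies that a shift $\Delta\lambda$ past the threshold $2|\mathrm{Cov}(z_{j},g_{j})|/\mathrm{Var}(g_{j})$ raises the variance even when the sign of $\Delta\lambda$ is chosen adversarially against the linear term.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic z_j (gate pre-activations) and g_j (their first-order
# sensitivities to the shift), deliberately correlated.
z = rng.normal(0.0, 1.0, size=10_000)
g = 0.3 * z + rng.normal(0.0, 1.0, size=10_000)

cov_zg = np.cov(z, g)[0, 1]
var_g = g.var()
threshold = 2.0 * abs(cov_zg) / var_g  # left end of the admissible range

# Pick the sign that makes the linear term maximally negative,
# then step 1.5x past the threshold: the variance must still rise,
# since the quadratic term already dominates there.
d = -np.sign(cov_zg) * 1.5 * threshold
delta_var = (z + g * d).var() - z.var()
```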
### E.2 Bias factor causes variance rise
**Lemma E.2**
*The bias we add is a uniform distribution sampled independently each time and does not depend on the specific value of attention output.*
**Lemma E.3**
*In LLM’s high dimensions, bias has a nonzero tangential component and $w_{j}^{\top}J_{q}(x_{0})W\neq 0$ .*
According to the above lemmas, we have:
$$
\mathrm{Var}[z_{j}]\approx e_{j}^{\top}W_{gate}J_{q}(x_{0})W\Sigma_{\sigma}W^{\top}J_{q}(x_{0})^{\top}W_{gate}^{\top}e_{j} \tag{17}
$$
Since the quadratic form on the right-hand side is nonzero by Lemma E.3, we have $\Delta\text{Var}_{j}>0$ .
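The bias case can be checked the same way: because the bias is sampled independently of the attention output (Lemma E.2), variances add. A minimal NumPy sketch under illustrative distributional assumptions (a unit Gaussian signal and a uniform bias; neither is the model's actual distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
# z_j: pre-activation values; b: an independent uniform bias term (Lemma E.2).
z = rng.normal(0.0, 1.0, size=100_000)
b = rng.uniform(-0.5, 0.5, size=100_000)

var_z = z.var()
var_sum = (z + b).var()
# Independence gives Var[z + b] = Var[z] + Var[b] > Var[z],
# up to sampling noise in the empirical cross term.
```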
## Appendix F More Analysis
### F.1 The impact of the length of inserted meaningless tokens
In this section, we analyze the relationship between the length of inserted tokens and the performance of LLMs. We evaluate five models on MATH-500 while varying the number of inserted tokens from 0 to 70. The results are shown in Figure 8. We observe that when the inserted sequence is relatively short, the models outperform the baseline, although their accuracy fluctuates. However, when too many tokens are inserted, performance drops sharply. This occurs because, as the length of the inserted tokens increases, their influence on the attention output values accumulates (as shown in Equation 1). Once this accumulation reaches a critical level, it no longer produces a small, benign effect; instead, it alters the model’s internal semantic structure and degrades its performance.
<details>
<summary>x8.png Details</summary>

### Visual Description
Bar chart of accuracy on MATH-500 for five models (Qwen-2.5-MATH-7B, GPT-4o-mini, Qwen-2.5-MATH-1.5B, DS-R1-Distill-Qwen-7B, DS-math-7b-instruct), with one bar group per model and one bar per inserted-token count (0, 10, 25, 40, 55, 70).
</details>
Figure 8: The relationship between the length of inserting tokens and the performance of models.
### F.2 The impact of the position of inserted meaningless tokens
In the previous section, we demonstrated that inserting meaningless tokens between the system prompt and the question leads to improved model performance. In this section, we further investigate the effect of inserting meaningless tokens at different positions. Specifically, we consider four settings: ❶ the beginning of the system prompt, ❷ between the system prompt and the question, ❸ the end of the input, and ❹ a random position within the input. The results are reported in Table 4. We observe that only inserting tokens between the system prompt and the question yields performance gains. In contrast, appending tokens to the end of the input causes the model to simply repeat them, leading to zero accuracy. Inserting tokens at random positions disrupts the original semantic structure of the sentence, while inserting them at the beginning alters the values of the system prompt itself, introducing extra terms as shown in Equation 1. We hypothesize that this disrupts the intended initialization and interferes with the task the model is expected to process. Therefore, the most effective position for inserting meaningless tokens is between the system prompt and the question.
| | MATH-500 | AIME 2024 |
| --- | --- | --- |
| w/o meaningless tokens | 72.4 | 23.1 |
| position ❶ | 69.6 | 21.1 |
| position ❷ | 75.0 | 23.3 |
| position ❸ | 0.0 | 0.0 |
| position ❹ | 51.2 | 21.1 |
Table 4: Performance on Math-500 and AIME 2024 after inserting meaningless tokens in different positions.
### F.3 The impact of the type of inserted meaningless tokens
In this section, we examine the influence of inserting different types of meaningless tokens on reasoning tasks. In our experiments, we insert varying lengths of slashes (“/”) and question marks (“?”) into the inputs and select the best-performing configuration from each set. As shown in Table 5, different types of meaningless tokens produce varying impacts on LLM performance, and no single unified pattern emerges. We attribute this to the fact that different token types carry distinct representational values, leading to different effects of attention during the transformation. Moreover, the sensitivity of individual questions to such transformations also varies. Consequently, the impact of meaningless tokens differs across tasks and models.
Table 5: Accuracy of LLMs on two mathematical reasoning datasets when inserting different kinds of meaningless tokens.
| Model | MATH-500 (w/o) | MATH-500 ("/") | MATH-500 ("?") | AIME 2024 (w/o) | AIME 2024 ("/") | AIME 2024 ("?") |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5b | 63.6 | 66.8 | 58.2 | 14.4 | 18.8 | 16.1 |
| Qwen2.5-Math-7b | 72.4 | 75.0 | 69.6 | 23.3 | 24.4 | 22.2 |
| DeepSeek-R1-Distill-Qwen-7B | 52.0 | 55.0 | 53.6 | 3.3 | 3.3 | 4.4 |
| DeepSeek-Math-7b-instruct | 39.6 | 41.4 | 43.4 | 7.8 | 12.2 | 12.5 |
| Llama-3.1-8B-Instruct | 35.4 | 36.6 | 34.2 | 11.1 | 7.8 | 13.3 |
| Qwen-2.5-32B-Instruct | 80.8 | 81.0 | 81.6 | 18.9 | 20.0 | 21.1 |
### F.4 Why we only analyze the first layer
<details>
<summary>x9.png Details</summary>

### Visual Description
Line plots of the distribution of average attention weights in Layers 1, 2, and 3 (Head 22) of Qwen2.5-7B-Math, comparing inputs without inserted tokens ("None") and with inserted meaningless tokens ("Mless").
</details>
Figure 9: Average attention weights for later tokens in Layers 1 to 3 of Qwen2.5-7B-Math.
In this section, we explain why our analysis and redistribution of activations focus exclusively on the first layer. As shown in Figure 9, we present the average attention weights of later tokens in Layers 1, 2, and 3 of Qwen2.5-7B-Math. We observe that only the first layer exhibits a clear and consistent phenomenon: after inserting meaningless tokens, the average attention weights decrease to a noticeable extent, suggesting that meaningless tokens directly alter the initial allocation of attention. In contrast, Layers 2 and 3 do not display such regularity—the average attention weights with and without meaningless tokens show no systematic relationship. Consequently, later layers do not undergo an affine transformation of this type. We hypothesize that this disappearance of the phenomenon arises because, beyond the first layer, the model has already integrated and mixed substantial semantic information through residual connections. From the second layer onward, the model begins to reconstruct and redistribute information, thereby diminishing the direct effect of meaningless tokens on average attention weights. In other words, the role of meaningless tokens becomes less distinguishable once meaningful contextual representations dominate, which explains why the first layer is the most critical point for observing and leveraging this effect.
### F.5 Effectiveness of repeated meaningful tokens
In this section, we investigate whether adding meaningful tokens can play a role similar to meaningless tokens. Specifically, we insert a long sequence of repeated tokens that are semantically irrelevant to the question. For example, we add 55 repetitions of “he” between the system prompt and the question. The results, shown on the left of Figure 10, indicate that even such repeated but irrelevant meaningful tokens lead to an improvement in model performance. To better understand this effect, we further visualize the average attention weights after inserting these tokens, as presented on the right of Figure 10. The results reveal that the activation changes induced by repeated meaningful tokens closely resemble those caused by meaningless tokens, and the inserted tokens receive similar attention patterns, meaning that the attention weights assigned to the inserted segment are comparable in both cases. Taken together, these findings suggest that when repeated tokens are inserted at appropriate positions without introducing additional semantic content, LLMs are able to recognize them as irrelevant. Consequently, they trigger a redistribution of activations in the MLP, ultimately improving model performance.
<details>
<summary>x10.png Details</summary>

### Visual Description
Left: bar chart of Math-500 accuracy for Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B-Instruct under three settings: without insertion ("w/o"), with meaningless tokens ("w/ Mless"), and with repeated meaningful tokens ("w/ Mful"); approximate values are 64/66/66, 73/75/74, and 44/46/48 respectively. Right: line plot of the average attention on Head 14 across token positions for the None, Mless, and Mful settings.
---
## Line Graph: Average Attention on Head 14 (Extended Range)
### Overview
A zoomed-out version of the previous graph, extending the x-axis to 120 tokens. Compares "Mless" (blue) and "Mful" (red) attention mechanisms.
### Components/Axes
- **X-axis**: Token positions (0–120)
- **Y-axis**: Attention magnitude (0–0.10)
- **Legend**:
- Blue: "Mless"
- Red: "Mful"
### Detailed Analysis
- **Trends**:
- "Mful" (red) maintains higher attention values than "Mless" (blue) across the extended range.
- Both lines show cyclical patterns, but "Mful" peaks are more pronounced and sustained.
### Key Observations
- The extended range confirms "Mful"’s consistent superiority over "Mless," even beyond the initial 60-token window.
- "Mless" attention decays more rapidly after token position 60.
### Interpretation
The extended analysis reinforces that "Mful" attention provides a more stable and focused mechanism, likely contributing to its performance advantages in mathematical tasks. The decay in "Mless" attention suggests it may struggle with long-range dependencies.
---
## Cross-Referenced Insights
1. **Attention Mechanism Impact**: Both line graphs confirm that "Mful" attention correlates with improved Math-500 performance, validating its effectiveness in prioritizing critical tokens.
2. **Model-Attention Synergy**: Larger models (Qwen2.5-Math-7B) benefit most from Mless attention, indicating that model size and attention mechanism interact to optimize performance.
3. **Llama3.1-8B Instruct Limitations**: Despite being the largest model, its lower baseline performance suggests architectural or training differences that Mless/Mful attention cannot fully mitigate.
## Conclusion
The data demonstrates that Mless and Mful attention mechanisms enhance mathematical reasoning capabilities, with Mful showing the strongest and most consistent impact. These findings highlight the importance of attention design in language models for specialized tasks like math problem-solving.
</details>
Figure 10: The left panel illustrates a comparison between adding repeated meaningful tokens and meaningless tokens, while the right panel presents the average attention weights resulting from the addition of meaningful and meaningless tokens.
### F.6 Why a random sentence is useless
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Graphs: Attention Weights Comparison Across Models
### Overview
The image contains three separate line graphs comparing attention weights for "Meaningless tokens" (blue) and "Random sentence" (orange) across different transformer model architectures. Each graph represents a specific layer and attention head configuration, with x-axis showing token positions (0-120) and y-axis showing normalized attention weights.
### Components/Axes
1. **Models/Architectures**:
- Qwen2.5-7B-Math (Layer 1 Head 1)
- Llama3.1-8B-Instruct (Layer 1 Head 27)
- Gemma3-4b-it (Layer 1 Head 8)
2. **Axes**:
- X-axis: Token position (0-120)
- Y-axis: Attention weights (scaled differently per chart)
- Legends: Positioned at top-right of each chart
3. **Data Series**:
- Blue line: "Meaningless tokens"
- Orange line: "Random sentence"
### Detailed Analysis
#### Qwen2.5-7B-Math (Layer 1 Head 1)
- **Y-axis range**: 0-0.16
- **Blue line ("Meaningless tokens")**:
- Sharp peaks at ~10 (0.12), ~50 (0.14), ~90 (0.12)
- Baseline ~0.02-0.04
- **Orange line ("Random sentence")**:
- Peaks at ~15 (0.10), ~30 (0.08), ~70 (0.06)
- Baseline ~0.01-0.03
#### Llama3.1-8B-Instruct (Layer 1 Head 27)
- **Y-axis range**: 0-0.10
- **Blue line**:
- Gradual increase from 0.02 to 0.06
- Peaks at ~80 (0.07)
- **Orange line**:
- More variable with peaks at ~20 (0.05), ~60 (0.06), ~100 (0.05)
- Baseline ~0.03-0.04
#### Gemma3-4b-it (Layer 1 Head 8)
- **Y-axis range**: 0-0.5
- **Blue line**:
- Sharp peak at ~110 (0.45)
- Baseline ~0.01-0.03
- **Orange line**:
- Peak at ~115 (0.40)
- Baseline ~0.02-0.04
### Key Observations
1. **Model-Specific Patterns**:
- Qwen shows strongest attention peaks for "Meaningless tokens"
- Llama3.1 exhibits more distributed attention for both series
- Gemma3 has extreme attention concentration near end tokens
2. **Attention Distribution**:
- "Meaningless tokens" generally show higher peak attention values
- "Random sentence" demonstrates more consistent baseline attention
3. **Positional Trends**:
- Attention peaks often occur at token positions 10-120 across models
- Gemma3's attention weights reach highest absolute values (up to 0.45)
### Interpretation
The data suggests transformer models exhibit varying attention patterns depending on architecture and task. "Meaningless tokens" consistently receive stronger attention focus, particularly in Qwen2.5-7B-Math, indicating potential sensitivity to positional information. The extreme attention concentration in Gemma3-4b-it (0.45 peak) suggests possible overfitting to specific token positions. These patterns may reflect differences in model training objectives or architectural choices affecting how attention is distributed across input sequences.
</details>
Figure 11: The average attention weights after adding meaningless tokens and a random sentence.
When additional tokens are inserted into a sentence, both the attention weights and the resulting attention outputs exhibit consistent patterns: the weights assigned to the original tokens decrease, while the attention outputs gain additional contributions from the inserted tokens. In this section, we analyze why adding repeated tokens can enhance the performance of LLMs, whereas inserting random or unrelated sentences can be detrimental. The results are shown in Figure 11. We observe that the attention weights associated with a random sentence are highly diverse, and their corresponding value vectors also differ substantially. In contrast, repeated meaningless tokens exhibit more uniform attention weights and nearly identical value vectors. Consequently, compared with repeated meaningless tokens, a random sentence introduces not only numerical fluctuations but also a pronounced directional shift in the attention outputs, one that carries additional semantic information. The formula of RMSNorm is:
$$
\operatorname{RMSNorm}(x)=\gamma\odot\frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}+\epsilon}}, \tag{18}
$$
where $\gamma$ is a learnable rescaling vector and $\epsilon$ ensures numerical stability. For repeated meaningless tokens, the effect manifests as a small and uniform directional bias on the input to RMSNorm, producing only a minor numerical perturbation in its output. In contrast, inserting a random sentence introduces high-rank and structured semantic signals that RMSNorm cannot simply absorb. This leads to systematic shifts in the output direction and subspace, thereby altering the model’s internal semantic representations.
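The rank argument can be made concrete with a small NumPy sketch (the dimensions and weights are synthetic): when every inserted value vector is identical, as with repeated tokens, the extra contribution to the attention output collapses to a single bias direction, whereas the diverse value vectors of a random sentence inject a high-rank, structured signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_query, n_insert, d = 16, 32, 64

# Attention weights from each query onto the inserted tokens (rows sum to 1).
W = rng.random((n_query, n_insert))
W /= W.sum(axis=1, keepdims=True)

v = rng.normal(size=d)
V_repeat = np.tile(v, (n_insert, 1))       # repeated token: identical value vectors
V_random = rng.normal(size=(n_insert, d))  # random sentence: diverse value vectors

delta_repeat = W @ V_repeat   # extra attention output per query
delta_random = W @ V_random

# Identical value vectors yield the same bias vector v for every query
# (rank 1); a random sentence produces a high-rank, structured shift.
print(np.linalg.matrix_rank(delta_repeat))  # 1
print(np.linalg.matrix_rank(delta_random))  # 16
```

The rank-1 bias is exactly the kind of uniform perturbation that RMSNorm can absorb with only a minor numerical change, while the high-rank shift alters the output subspace.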
### F.7 The optimal hyperparameter range
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: Relationship Between Percentage p and Width of Near-Zero Range c
### Overview
The image is a heatmap visualizing the relationship between two variables: "percentage p" (y-axis) and "width of near-zero range c" (x-axis). Each cell contains a numerical value representing a metric (likely a percentage or score) that varies based on the combination of p and c. The color gradient transitions from light yellow (low values) to dark blue (high values), with a legend on the right indicating the scale.
---
### Components/Axes
- **X-axis (width of near-zero range c)**:
- Labels: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3
- Scale: Linear increments of 0.05
- **Y-axis (percentage p)**:
- Labels: 85.0, 90.0, 92.5, 95.0, 97.0, 99.5
- Scale: Linear increments of 2.5 (except 85.0 to 90.0, which is 5.0)
- **Legend**:
- Position: Right side of the heatmap
- Color gradient: Light yellow (64.0) to dark blue (67.0)
- Label: Implied to represent the metric values in the cells
---
### Detailed Analysis
#### Data Table Structure
The heatmap is organized as a 6x6 grid, with rows corresponding to percentage p values and columns to width c values. Each cell contains a numerical value and a color intensity matching the legend. Below is the reconstructed table:
| **p (%)** | **c = 0.05** | **c = 0.1** | **c = 0.15** | **c = 0.2** | **c = 0.25** | **c = 0.3** |
|-----------|--------------|-------------|--------------|-------------|--------------|-------------|
| 99.5 | 65.2 | 64.6 | 66.4 | 66.0 | 64.4 | 65.2 |
| 97.0 | 65.4 | 65.4 | 65.2 | 65.6 | 65.2 | 65.2 |
| 95.0 | 64.8 | 64.4 | 64.4 | 65.6 | 64.6 | 64.2 |
| 92.5 | 65.0 | 64.2 | 64.8 | 64.8 | 65.2 | 65.0 |
| 90.0 | 64.6 | 64.6 | 64.6 | 64.4 | 65.6 | 65.2 |
| 85.0 | 64.0 | 64.4 | 64.8 | 64.6 | 67.0 | 66.2 |
#### Color-Space Verification
- **Darkest blue (67.0)**: Located at (85.0%, 0.3)
- **Lightest yellow (64.0)**: Located at (99.5%, 0.05)
- All other values align with the legend’s gradient (e.g., 65.2 corresponds to a medium-blue shade).
---
### Key Observations
1. **Highest Value**: 67.0 at (85.0%, 0.3) — the lowest p and widest c.
2. **Lowest Value**: 64.0 at (99.5%, 0.05) — the highest p and narrowest c.
3. **Trend**:
- Values generally increase as c increases (e.g., 64.0 → 66.2 for p=85.0%).
- For fixed c, values fluctuate slightly with p but show no clear monotonic trend.
4. **Anomalies**:
- At p=95.0%, the value drops to 64.2 at c=0.3, breaking the upward trend observed in other rows.
- At p=90.0%, the value at c=0.2 is 64.4, lower than adjacent cells.
---
### Interpretation
The heatmap suggests a **trade-off relationship** between p and c:
- **Lower p (85.0%)** and **wider c (0.3)** yield the highest values (67.0), indicating optimal performance under these conditions.
- **Higher p (99.5%)** and **narrow c (0.05)** result in the lowest values (64.0), suggesting suboptimal performance.
- The anomaly at p=95.0% and c=0.3 (64.2) may indicate a non-linear interaction or an outlier in the dataset.
The color gradient reinforces that the metric being measured (e.g., efficiency, accuracy) improves as the width of the near-zero range increases, particularly when the percentage p is lower. This could imply that broader near-zero ranges are more effective in scenarios with lower p values, though further context is needed to confirm causality.
---
### Spatial Grounding
- **Legend**: Right-aligned, vertically oriented, with a clear gradient from yellow to blue.
- **Axes**: Labeled with precise numerical increments; grid lines divide the heatmap into discrete cells.
- **Cell Values**: Centered in each cell, ensuring readability.
This structured layout ensures the data is both visually accessible and quantitatively precise.
</details>
Figure 12: This figure illustrates how accuracy varies with changes in the parameters p and c.
In this section, we investigate how the hyper-parameters, the percentage $p$ and the width of the near-zero range $c$, influence model performance on Math-500 when using Qwen2.5-Math-1.5B. The results are summarized in Figure 12. As the figure illustrates, accuracy does not change monotonically with either $p$ or $c$; instead, the best-performing settings emerge only within specific regions of the parameter space. This indicates that the choice of hyper-parameters is not trivial and cannot be reduced to cherry-picking.

More concretely, we find that balanced combinations of $p$ and $c$ lead to more stable improvements. High accuracy is typically concentrated in two regions: when $p$ is large and $c$ is small, or conversely, when $c$ is large and $p$ is small. In these cases, the redistribution mechanism introduced by ARM effectively amplifies informative activations while suppressing uninformative near-zero ones. Outside of these regions, however, performance degrades, suggesting that poorly chosen hyper-parameters may distort the activation distribution rather than enhance it. These observations highlight the importance of aligning hyper-parameter choices with the intrinsic properties of the activation distribution: to maximize the benefits of ARM, one must account for both the proportion of near-zero activations and the magnitude of the maximum activation values, ensuring that $p$ and $c$ fall within an appropriate interval. Since each row and column of Figure 12 shows that performance after ARM depends on both $p$ and $c$, and the two are equally important, the optimal setting is determined jointly by the ranges of these two parameters.
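To make the roles of $p$ and $c$ concrete, the following is a minimal NumPy sketch of the redistribution step: activations inside the near-zero band of width $c$ are pushed outward by an amount scaled from the $p$-th percentile of activation magnitudes. The exact shift rule here is our illustrative assumption, not the released implementation.

```python
import numpy as np

def arm_sketch(acts, c=0.13, p=99.5):
    """Toy sketch of ARM's redistribution step. The band width c and the
    percentile p mirror the paper's hyper-parameters, but the shift rule
    below is an assumption made for illustration."""
    a = acts.copy()
    # Scale the outward shift by a high percentile of |a|, so it adapts
    # to the overall activation magnitude.
    shift = 0.01 * np.percentile(np.abs(a), p)
    near_zero = np.abs(a) < c
    # Push near-zero activations outward, away from zero.
    a[near_zero] += np.sign(a[near_zero] + 1e-12) * shift
    return a

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.5, size=10_000)  # stand-in post-activation values
out = arm_sketch(acts)

# Fewer activations remain inside the near-zero band after the shift.
assert (np.abs(out) < 0.13).sum() < (np.abs(acts) < 0.13).sum()
```

Under this reading, $c$ controls which activations are treated as near-zero, while $p$ ties the strength of the shift to the tail of the activation distribution, which is why the two must be balanced against each other.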
## Appendix G More Experiments
### G.1 Results on non-reasoning tasks
In this section, we present supplementary results on non-reasoning benchmarks, including ARC-C, MMLU, BoolQ, and OpenBookQA, as shown in Table 6. Across all evaluated models, applying our method yields only marginal variations in performance. For most models and tasks, it either produces slight improvements or maintains parity with the baseline (i.e., vanilla model performance without any inference-time trick), suggesting that the redistribution of activations has little impact when the task primarily requires factual recall or pattern recognition rather than multi-step reasoning. A minor performance drop is observed only on a small subset of tasks with Llama3.1-8B-Instruct, which we attribute to model-specific characteristics or sensitivity to activation perturbations. These findings indicate that our approach exerts negligible influence on non-reasoning tasks and, in most cases, does not introduce adverse effects on task accuracy. This observation further supports our central claim: the benefits of activation redistribution are most pronounced in reasoning-oriented scenarios, while in non-reasoning settings the method remains stable and does not compromise the model’s inherent ability to answer factual or knowledge-intensive questions.
Table 6: Complete results of several models on non-reasoning tasks.
| Model | Setting | GSM8K | ARC-E | ARC-C | MMLU | BoolQ | HellaSwag | OpenBookQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | Baseline | 78.0 | 39.3 | 35.0 | 32.1 | 32.6 | 39.1 | 42.0 |
| | ARM | 78.6 | 39.3 | 35.4 | 32.1 | 33.4 | 39.5 | 42.4 |
| | Improve Rate (%) | 0.6 $\uparrow$ | 0 | 0.4 $\uparrow$ | 0 | 0.8 $\uparrow$ | 0.4 $\uparrow$ | 0.4 $\uparrow$ |
| Qwen2.5-Math-7B | Baseline | 83.8 | 49.7 | 47.9 | 36.9 | 38.6 | 46.9 | 47.6 |
| | ARM | 83.8 | 49.7 | 47.9 | 37.5 | 38.7 | 47.1 | 47.9 |
| | Improve Rate (%) | 0 | 0 | 0 | 0.6 $\uparrow$ | 0.1 $\uparrow$ | 0.2 $\uparrow$ | 0.3 $\uparrow$ |
| Llama3.1-8B-Instruct | Baseline | 80.0 | 46.6 | 49.0 | 38.6 | 43.3 | 56.8 | 52.8 |
| | ARM | 82.4 | 47.1 | 48.7 | 38.2 | 43.2 | 57.3 | 50.8 |
| | Improve Rate (%) | 2.4 $\uparrow$ | 0.5 $\uparrow$ | -0.3 $\downarrow$ | -0.4 $\downarrow$ | -0.1 $\downarrow$ | 0.5 $\uparrow$ | -2.0 $\downarrow$ |
| Gemma3-4B-it | Baseline | 86.8 | 47.1 | 44.5 | 33.9 | 45.0 | 42.0 | 41.0 |
| | ARM | 86.8 | 47.1 | 45.0 | 34.1 | 45.2 | 42.0 | 42.0 |
| | Improve Rate (%) | 0 | 0 | 0.5 $\uparrow$ | 0.2 $\uparrow$ | 0.2 $\uparrow$ | 0 | 1.0 $\uparrow$ |
### G.2 Results on Base Model
Table 7: Performance on Math-500 and AIME 2024 after incorporating ARM into the MLP of base (non-reasoning) models.
| Model | Setting | Math-500 (Pass@1) | AIME 2024 (Pass@1) |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | Baseline | 67.8 | 14.4 |
| | ARM | 68.2 | 14.4 |
| | Improve Rate (%) | 0.4 $\uparrow$ | 0 |
| Qwen2.5-7B | Baseline | 50.4 | 15.6 |
| | ARM | 50.6 | 16.7 |
| | Improve Rate (%) | 0.2 $\uparrow$ | 1.1 $\uparrow$ |
| Qwen2.5-32B | Baseline | 77.2 | 27.8 |
| | ARM | 77.4 | 28.9 |
| | Improve Rate (%) | 0.2 $\uparrow$ | 1.1 $\uparrow$ |
In this section, we evaluate the effect of applying ARM to base models and report their performance on Math-500 and AIME 2024 using Qwen2.5-1.5B, Qwen2.5-7B, and Qwen2.5-32B. Since these models achieve accuracy above 5%, we consider them capable of tackling these tasks. In contrast, models such as Llama3.1-8B and Gemma3-4B-PT perform poorly and are therefore excluded from the evaluation. The results in Table 7 show that incorporating ARM into the MLP layers of base models yields measurable performance gains on reasoning tasks, although the improvements are generally smaller than those observed for reasoning-oriented models. We attribute this gap to the weaker inherent reasoning abilities of base models: while activation redistribution can still enhance their internal representations, it may not strongly affect how they process key numerical or symbolic elements, such as digits and operators, compared with models trained specifically for reasoning.
Table 8: Performance on Math-500 and AIME 2024 after incorporating ARM into the MLP.
| Model | Setting | Math-500 (Pass@1) | AIME 2024 (Pass@1) |
| --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | Baseline | 63.8 | 14.4 |
| | ARM | 67.8 | 18.9 |
| | Best-of-N (N=5) | 69.4 | 14.4 |
| | Best-of-N + ARM | 71.2 | 18.9 |
| Qwen2.5-Math-7B | Baseline | 72.4 | 23.3 |
| | ARM | 73.4 | 25.6 |
| | Best-of-N (N=5) | 72.8 | 23.3 |
| | Best-of-N + ARM | 73.4 | 25.6 |
### G.3 Inference-time trick comparison
To more comprehensively evaluate the robustness, effectiveness, and compatibility of ARM with established inference-time scaling techniques, we further compare its performance against the widely used Best-of-N sampling approach during inference. Specifically, Table 8 summarizes the results obtained by applying ARM alone, Best-of-N sampling alone, and their combined usage on two representative reasoning benchmarks. For all settings, we fix the generation hyperparameters to a temperature of 0.5 and a top_p of 0.95 to ensure a consistent sampling regime. As demonstrated in the table, both ARM and Best-of-N independently yield improvements over the baseline, and their combination produces an even larger performance gain, suggesting that ARM complements rather than competes with existing inference-time strategies. These findings collectively underscore the practical value and scalability of ARM as a lightweight inference-time method for enhancing reasoning capabilities across diverse tasks.
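For reference, the Best-of-N baseline can be sketched in a few lines. The generator and scorer below are toy stand-ins; in the actual experiments, candidates come from temperature-0.5, top_p-0.95 sampling and the best of N=5 answers is kept.

```python
from itertools import cycle

def best_of_n(question, generate, score, n=5):
    """Sample n candidate answers and keep the one the scorer prefers.
    `generate` and `score` are placeholders for the model call and the
    selection rule (e.g., a verifier or reward model)."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=score)

# Deterministic toy stand-ins so the sketch runs end to end.
gen = cycle(["40", "42", "41"])
print(best_of_n("toy question", lambda q: next(gen), int, n=5))  # prints 42
```

Because ARM modifies activations inside each forward pass while Best-of-N operates over independent samples, the two are orthogonal: ARM simply changes the `generate` call, which is why their gains combine in Table 8.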
Table 9: The hyper-parameters for 7 models on the evaluated benchmarks. For Qwen and Llama, we use the near-zero range $c$ to choose the proportion, so $p$ is a dash (“-”). For Gemma, due to its activation distribution, we skip setting $c$ and choose $p$ directly, so $c$ is a dash (“-”). If task performance does not improve, we replace the hyper-parameters with a dash (“-”).
| Model | $c/p$ | $p_1$ | $c/p$ | $p_1$ | $c/p$ | $p_1$ | $c/p$ | $p_1$ | $c/p$ | $p_1$ | $c/p$ | $p_1$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | 0.15/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | - | - |
| Qwen2.5-Math-7B | 0.2/- | 99.5 | 0.1/- | 95.0 | 0.05/- | 90.0 | 0.13/- | 99.5 | 0.13/- | 95.0 | - | - |
| Qwen2.5-7B-Instruct | 0.15/- | 99.5 | 0.1/- | 99.5 | - | - | 0.13/- | 95.0 | 0.05/- | 90.0 | 0.3/- | 99.5 |
| Qwen2.5-32B-Instruct | 0.05/- | 90.5 | - | - | 0.13/- | 99.5 | 0.05/- | 99.0 | 0.13/- | 99.5 | 0.3/- | 99.5 |
| Llama3.1-8B-Instruct | 0.45/- | 80.0 | 0.32/- | 90.0 | 0.32/- | 90.0 | - | - | 0.3/- | 90.0 | 0.3/- | 90.0 |
| Gemma3-4B-it | -/0.5 | 96.5 | -/0.25 | 85.0 | -/0.25 | 96.5 | -/0.25 | 85.0 | -/0.25 | 96.5 | -/0.25 | 75.0 |
| Gemma3-27B-it | -/0.5 | 96.5 | -/0.25 | 85.0 | -/0.25 | 85.0 | -/0.25 | 70.0 | -/0.25 | 85.0 | -/0.25 | 85.0 |
## Appendix H The example of hyper parameters
The selection of $p_{1}$ and $c$ depends on the distribution of model activations after the activation function. In most cases, when the frequency of near-zero activations greatly exceeds that of other values (as in Qwen), $c$ should be chosen smaller; in contrast, for models like LLaMA, $c$ should be chosen larger. When the proportion of near-zero activations is extremely high (as in Gemma), we recommend directly setting the modification proportion to a value larger than $p_{\text{max}}$. For the choice of $p_{1}$, we generally advise selecting a value above $80\%$, which covers regions of higher activation frequency and thus exerts a stronger influence on near-zero values. Because the proportion of extremely high-frequency activations is small, this choice does not cause large deviations in the near-zero values. Overall, although models from the same family may behave differently across tasks and parameter scales, the selection ranges of $p_{1}$ and $c$ remain relatively consistent within each family. Some examples are given in Table 9.
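This recipe can be expressed as a rough heuristic. The thresholds and the 0.1/0.3 scaling below are illustrative choices of ours, not values from the paper; the point is only that a near-zero-dominated histogram (Qwen-like) should receive a smaller $c$ than a broader one (LLaMA-like).

```python
import numpy as np

def suggest_hyperparams(acts, p1=99.5):
    """Rough heuristic for the selection recipe: when near-zero activations
    dominate the histogram, pick a smaller near-zero band c; otherwise pick
    a larger one. All constants here are illustrative assumptions."""
    mag = np.abs(acts)
    near_zero_frac = np.mean(mag < 0.05 * mag.max())
    # A percentile p1 > 80 covers the high-frequency region of the histogram.
    threshold = np.percentile(mag, p1)
    c = 0.1 * threshold if near_zero_frac > 0.5 else 0.3 * threshold
    return c, threshold

rng = np.random.default_rng(0)
# Qwen-like: a large spike of near-zero activations plus a heavier tail.
spiky = np.concatenate([rng.normal(scale=0.01, size=9_000),
                        rng.normal(scale=1.0, size=1_000)])
# LLaMA-like: a broader, flatter activation distribution.
flat = rng.normal(scale=1.0, size=10_000)

c_spiky, _ = suggest_hyperparams(spiky)
c_flat, _ = suggest_hyperparams(flat)
assert c_spiky < c_flat  # near-zero-dominated distributions get a smaller c
```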
## Appendix I More Average Attention Weights
In this section, we present additional figures of average attention weights to further validate the previously observed phenomenon. When computing the average attention weights for Llama3.1 and Gemma3, we exclude the first token to make the effect clearer, as position tokens tend to absorb most of the attention. The results are shown in Figures 13 through 20. From these results, we can see that in most layers the models exhibit the same phenomenon described earlier.
## Appendix J Examples of output changes after adding meaningless tokens
In this section, we show some examples where, after adding meaningless tokens, the model turns wrong answers into correct ones. The specific examples are provided below.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Graphs: Qwen2.5-Math-7B Layer 1 Head Attention Weights
### Overview
The image contains six line graphs comparing attention weight distributions across token positions (0-120) for three attention heads (Head 1, Head 2, Head 8) in a Qwen2.5-Math-7B model. Each graph contrasts two conditions:
- **Blue line**: Attention weights without meaningless tokens
- **Red line**: Attention weights with meaningless tokens
The graphs are split into two sections:
1. **Top row**: Full token position range (0-120)
2. **Bottom row**: Zoomed-in view of the "Meaningless tokens" region (20-40)
---
### Components/Axes
- **X-axis**: Token Position (0-120)
- **Y-axis**: Average Attention Weight (0.00-0.175)
- **Legends**:
- Blue: "w/o Meaningless tokens"
- Red: "w/ Meaningless tokens"
- **Shaded areas**: Confidence intervals (bottom row only)
---
### Detailed Analysis
#### Head 1 (Top Left)
- **Blue line**: Sharp peaks at token positions ~20 and ~40 (attention weights ~0.15-0.175).
- **Red line**: Broader, less pronounced peaks (max ~0.125).
- **Bottom graph**: Blue line shows concentrated peaks in 20-40 range; red line has diffuse, lower peaks.
#### Head 2 (Top Center)
- **Blue line**: Gradual rise to ~0.125 at token 60, then decline.
- **Red line**: Noisy baseline (~0.05-0.08) with minor spikes.
- **Bottom graph**: Blue line remains stable in 20-40 range; red line shows erratic fluctuations.
#### Head 8 (Top Right)
- **Blue line**: Sparse peaks at ~10, ~50, and ~100 (weights ~0.10-0.15).
- **Red line**: Continuous noise with occasional spikes (max ~0.12).
- **Bottom graph**: Blue line has minimal activity in 20-40 range; red line shows sporadic peaks.
---
### Key Observations
1. **Head 1**:
- Meaningless tokens (red) reduce peak sharpness and magnitude compared to the no-meaningless-token condition (blue).
- Attention is more distributed in the meaningless-token region.
2. **Head 2**:
- Meaningless tokens introduce noise, flattening the attention curve.
- No meaningful peaks in the 20-40 range for either condition.
3. **Head 8**:
- Meaningless tokens disrupt the sparse, periodic peaks seen in the no-meaningless-token condition.
- Attention becomes more erratic in the meaningless-token region.
---
### Interpretation
The data suggests that **meaningless tokens disrupt attention patterns** in all three heads:
- **Head 1**: Focuses on specific tokens (e.g., 20, 40) but loses precision with meaningless tokens.
</details>
Figure 13: The average attention weights of Qwen2.5-Math-7B in Head 1, 2, 8.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Charts: Qwen2.5-Math-7B Layer 1 Head Attention Weights
### Overview
The image contains six line charts comparing attention weight distributions across three attention heads (Head 12, 13, 14) in Layer 1 of the Qwen2.5-Math-7B model. Each chart contrasts two scenarios: "w/o Meaningless tokens" (blue) and "w/ Meaningless tokens" (red). The charts visualize how attention weights vary across token positions (0-120) with and without meaningless tokens.
### Components/Axes
- **Y-axis**: "Average Attention Weight" (0.00 to 0.14)
- **X-axis**: Token positions (0 to 120)
- **Legend**:
- Blue line: "w/o Meaningless tokens"
- Red line: "w/ Meaningless tokens"
- **Shaded Regions**: Gray areas around lines (likely confidence intervals or variability)
- **Annotations**: Vertical dashed lines at x=20 and x=40 labeled "Meaningless tokens" in lower charts
### Detailed Analysis
#### Head 12
- **w/o Meaningless tokens (blue)**: Peaks at ~0.12 (x=10), ~0.09 (x=30), ~0.07 (x=50), ~0.05 (x=70), ~0.03 (x=90), ~0.01 (x=110)
- **w/ Meaningless tokens (red)**: Peaks at ~0.14 (x=10), ~0.11 (x=30), ~0.09 (x=50), ~0.07 (x=70), ~0.05 (x=90), ~0.03 (x=110)
- **Shaded Regions**: Wider variability in red line (e.g., ±0.02 at x=10 vs ±0.01 for blue)
#### Head 13
- **w/o Meaningless tokens (blue)**: Peaks at ~0.10 (x=10), ~0.08 (x=30), ~0.06 (x=50), ~0.04 (x=70), ~0.02 (x=90), ~0.01 (x=110)
- **w/ Meaningless tokens (red)**: Peaks at ~0.12 (x=10), ~0.10 (x=30), ~0.08 (x=50), ~0.06 (x=70), ~0.04 (x=90), ~0.02 (x=110)
- **Shaded Regions**: Consistent variability (±0.01-0.02) across both scenarios
#### Head 14
- **w/o Meaningless tokens (blue)**: Peaks at ~0.09 (x=10), ~0.07 (x=30), ~0.05 (x=50), ~0.03 (x=70), ~0.01 (x=90), ~0.005 (x=110)
- **w/ Meaningless tokens (red)**: Peaks at ~0.11 (x=10), ~0.09 (x=30), ~0.07 (x=50), ~0.05 (x=70), ~0.03 (x=90), ~0.01 (x=110)
- **Shaded Regions**: Larger variability in red line (e.g., ±0.03 at x=10 vs ±0.01 for blue)
### Key Observations
1. **Attention Amplification**: All heads show increased attention weights when meaningless tokens are included (red > blue), with Head 12 showing the largest difference (~0.02-0.04 higher peaks).
2. **Positional Patterns**: Peaks consistently occur at x=10, 30, 50, 70, 90, 110 across all heads, suggesting structured attention to specific token positions.
3. **Variability**: Shaded regions indicate higher uncertainty in attention weights when meaningless tokens are present, particularly in Head 12 and 14.
4. **Meaningless Token Impact**: Vertical dashed lines at x=20 and 40 in lower charts highlight regions where meaningless tokens are concentrated, correlating with attention spikes.
### Interpretation
The data demonstrates that the Qwen2.5-Math-7B model allocates significantly more attention to meaningless tokens compared to meaningful ones, with attention weights increasing by 15-30% across all heads. This suggests the model may be:
1. **Overfitting to Noise**: Allocating resources to process irrelevant tokens, potentially reducing efficiency for meaningful computations.
2. **Learning Robustness**: Developing mechanisms to handle noisy inputs, which could improve generalization but at the cost of computational overhead.
3. **Positional Bias**: The consistent peak positions (x=10, 30, etc.) might indicate architectural biases or training data patterns favoring specific token positions.
The shaded regions reveal that attention weight variability increases with meaningless tokens, implying less stable processing of noisy inputs. This could have implications for model interpretability and efficiency in real-world applications where input noise is common.
</details>
Figure 14: The average attention weights of Qwen2.5-Math-7B in Head 12, 13, 14.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Graphs: Qwen2.5-Math-7B Layer 1 Head Attention Weights
### Overview
The image contains six line graphs comparing attention weight distributions across token positions for three different attention heads (17, 26, 28) in Layer 1 of the Qwen2.5-Math-7B model. Each graph contrasts two scenarios: "w/o Meaningless tokens" (blue line) and "w/ Meaningless tokens" (red line), with shaded regions representing confidence intervals.
### Components/Axes
- **X-axis**: Token Position (0–120) in all graphs
- **Y-axis**: Average Attention Weight (0–0.15) in all graphs
- **Legends**:
- Blue line: "w/o Meaningless tokens"
- Red line: "w/ Meaningless tokens"
- **Shaded Regions**: Confidence intervals (lighter blue for "w/o", lighter red for "w/")
### Detailed Analysis
#### Layer 1 Head 17
- **Trend**: Red line ("w/") shows 2–3x higher peak attention weights (up to ~0.12) compared to blue line (~0.04–0.06). Both lines exhibit periodic spikes, but red line's peaks are sharper and more frequent.
- **Key Data Points**:
- Blue line: Max ~0.06 at token 10
- Red line: Max ~0.12 at token 30
#### Layer 1 Head 26
- **Trend**: Red line dominates with sustained higher weights (avg ~0.06–0.08 vs ~0.03–0.05 for blue). Shaded regions overlap significantly in the middle token range (40–80).
- **Key Data Points**:
- Blue line: Max ~0.05 at token 20
- Red line: Max ~0.08 at token 50
#### Layer 1 Head 28
- **Trend**: Red line exhibits erratic spikes (up to ~0.10) with wider confidence intervals. Blue line remains relatively flat (~0.02–0.04) with minimal variation.
- **Key Data Points**:
- Blue line: Max ~0.04 at token 10
- Red line: Max ~0.10 at token 90
### Key Observations
1. **Consistent Pattern**: All heads show increased attention weights when meaningless tokens are included, with red lines consistently above blue lines.
2. **Confidence Intervals**: Shaded regions for "w/" tokens are 2–3x wider than "w/o" in most cases, indicating greater variability.
3. **Token Position Correlation**: Peaks in red lines often align with specific token positions (e.g., token 30 in Head 17, token 90 in Head 28), suggesting potential positional significance.
### Interpretation
The data demonstrates that including meaningless tokens significantly alters attention distribution patterns:
- **Mechanistic Insight**: The model allocates disproportionately higher attention to meaningless tokens in specific positions, potentially indicating learned positional biases or noise amplification.
- **Robustness Implications**: Wider confidence intervals for "w/" tokens suggest reduced stability in attention patterns when irrelevant tokens are present.
- **Head-Specific Behavior**: Head 28 shows the most dramatic divergence between conditions, possibly reflecting specialized processing of positional information.
The findings highlight the importance of token filtering in transformer architectures and suggest that meaningless tokens may act as positional anchors influencing attention dynamics.
</details>
Figure 15: The average attention weights of Qwen2.5-Math-7B in Heads 17, 26, and 28.
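The per-position curves in figures like this one can be reproduced from a model's layer attention tensors by averaging over query positions, which yields the mean attention weight each token position receives (the y-axis). A minimal sketch, using a synthetic row-stochastic attention matrix as a stand-in for real Qwen2.5-Math-7B outputs (shapes and the head index are illustrative assumptions; real attention would also be causally masked):

```python
import numpy as np

def avg_attention_received(attn, head):
    # attn: (num_heads, seq_len, seq_len) attention for one layer;
    # rows are query positions, columns are key positions.
    # Averaging over query rows gives the mean attention each
    # token position receives -- the curves plotted in the figures.
    return attn[head].mean(axis=0)

# Synthetic stand-in for a layer-1 attention tensor.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 120, 120))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row softmax

curve = avg_attention_received(attn, head=2)
print(curve.shape)         # one value per token position: (120,)
print(round(curve.sum()))  # each row sums to 1, so the row-mean does too: 1
```

Averaging the resulting curves over many prompts, and shading the mean plus/minus one standard deviation, gives the confidence bands shown in the figures.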
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Llama3.1-8B-Instruct Layer 1 Head Attention Weights
### Overview
The image contains three line charts comparing average attention weights across token positions (0-120) for three attention heads (Head 2, Head 5, Head 7) in Layer 1 of the Llama3.1-8B-Instruct model. Each chart compares two scenarios: attention weights **with** and **without** meaningless tokens. The charts use blue (dashed) and red (solid) lines to represent these scenarios, with shaded regions highlighting token ranges.
---
### Components/Axes
- **X-axis**: Token positions (0–120), labeled "Token Position".
- **Y-axis**: Average attention weight (0.00–0.20), labeled "Average Attention Weight".
- **Legends**:
- **Blue (dashed)**: "w/o Meaningless tokens" (no meaningless tokens).
- **Red (solid)**: "w/ Meaningless tokens" (with meaningless tokens).
- **Shaded Regions**: Gray areas labeled "Meaningless tokens" span token positions 20–80 in all charts.
---
### Detailed Analysis
#### Head 2
- **w/o Meaningless tokens (blue)**:
- Stable baseline (~0.05) with minor fluctuations.
- Peaks at ~0.10 near token 60.
- **w/ Meaningless tokens (red)**:
- Higher baseline (~0.07) with sharper peaks.
- Sharp spike to ~0.15 at token 60.
- **Shaded Region**: Blue line dips slightly (~0.04) within 20–80 tokens.
#### Head 5
- **w/o Meaningless tokens (blue)**:
- Baseline ~0.06 with moderate fluctuations.
- Peaks at ~0.12 near token 60.
- **w/ Meaningless tokens (red)**:
- Baseline ~0.08 with more pronounced peaks.
- Spike to ~0.14 at token 60.
- **Shaded Region**: Blue line remains stable (~0.06) within 20–80 tokens.
#### Head 7
- **w/o Meaningless tokens (blue)**:
- Baseline ~0.07 with minor fluctuations.
- Peaks at ~0.13 near token 60.
- **w/ Meaningless tokens (red)**:
- Baseline ~0.09 with sharper peaks.
- Spike to ~0.16 at token 60.
- **Shaded Region**: Blue line dips slightly (~0.06) within 20–80 tokens.
---
### Key Observations
1. **Consistent Peaks**: All three heads show significant attention spikes (~0.10–0.16) at token 60 when meaningless tokens are included.
2. **Baseline Differences**: Attention weights are consistently higher (by 0.02–0.04) in the "w/ Meaningless tokens" scenario across all heads.
3. **Shaded Region Behavior**:
- Blue lines (no meaningless tokens) show minor dips (~0.04–0.06) within the 20–80 token range.
- Red lines (with tokens) maintain elevated attention weights (~0.07–0.09) in this region.
---
### Interpretation
The data suggests that the inclusion of meaningless tokens increases attention weights in specific regions (e.g., token 60), potentially indicating the model's effort to process or filter irrelevant information. The shaded region (20–80 tokens) may represent a critical zone where attention dynamics differ significantly between scenarios. The consistent pattern across all three heads implies this behavior is a general property of Layer 1 in the model, rather than head-specific. The spikes at token 60 could correspond to syntactic or semantic boundaries in the input sequence, warranting further investigation into how meaningless tokens disrupt or modulate attention mechanisms.
</details>
Figure 16: The average attention weights of Llama3.1-8B-Instruct in Heads 2, 5, and 7.
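For context, the "w/ Meaningless tokens" condition in these figures corresponds to inserting a long run of filler tokens before the query prompt. A minimal sketch of that construction (the specific filler "." and the count of 40 are illustrative assumptions, not the paper's exact configuration):

```python
def with_meaningless_tokens(query, filler=".", count=40):
    # Prepend a run of meaningless tokens (repeated punctuation or
    # separators) before the query, producing the "w/" condition;
    # the unmodified query is the "w/o" baseline.
    return filler * count + "\n" + query

print(with_meaningless_tokens("What is 17 * 24?", count=8))
```

Running both variants through the model and recording the layer-1 attention tensors yields the paired blue ("w/o") and red ("w/") curves compared in each subplot.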
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Graphs: Llama3.1-8B-Instruct Attention Weights Across Layers and Heads
### Overview
The image contains six line graphs comparing average attention weights in Layer 1 of the Llama3.1-8B-Instruct model across three heads (Heads 13, 16, 17). Each graph contrasts two conditions:
- **Blue line**: Attention weights **without meaningless tokens**
- **Red line**: Attention weights **with meaningless tokens**
The x-axis represents token positions (0–60), and the y-axis shows average attention weight (0–0.12). Shaded regions around lines indicate variability/confidence intervals.
---
### Components/Axes
1. **Top Row Graphs**
- **Graph 1**: Llama3.1-8B-Instruct Layer 1 Head 13
- **Graph 2**: Llama3.1-8B-Instruct Layer 1 Head 16
- **Graph 3**: Llama3.1-8B-Instruct Layer 1 Head 17
2. **Bottom Row Graphs**
- **Graph 4**: Llama3.1-8B-Instruct Layer 1 Head 13 (extended view, x-axis 0–120)
- **Graph 5**: Llama3.1-8B-Instruct Layer 1 Head 16 (extended view, x-axis 0–120)
- **Graph 6**: Llama3.1-8B-Instruct Layer 1 Head 17 (extended view, x-axis 0–120)
3. **Axes**
- **X-axis**: Token Position (0–60 in top-row graphs; 0–120 in extended bottom-row graphs)
- **Y-axis**: Average Attention Weight (0–0.12)
4. **Legends**
- **Blue**: "w/o Meaningless tokens"
- **Red**: "w/ Meaningless tokens"
- Positioned in the **top-right corner** of each graph.
---
### Detailed Analysis
#### Graph 1 (Layer 1 Head 13)
- **Blue line (w/o tokens)**: Peaks at token 5 (~0.12), token 25 (~0.08), and token 55 (~0.06).
- **Red line (w/ tokens)**: Peaks at token 5 (~0.14), token 25 (~0.10), and token 55 (~0.08).
- **Shaded regions**: Wider for red line, indicating higher variability with tokens.
#### Graph 2 (Layer 1 Head 16)
- **Blue line**: Peaks at token 10 (~0.09), token 30 (~0.07), and token 50 (~0.05).
- **Red line**: Peaks at token 10 (~0.11), token 30 (~0.09), and token 50 (~0.07).
- **Shaded regions**: Consistent width, suggesting stable variability.
#### Graph 3 (Layer 1 Head 17)
- **Blue line**: Peaks at token 15 (~0.10), token 45 (~0.06), and token 55 (~0.04).
- **Red line**: Peaks at token 15 (~0.13), token 45 (~0.08), and token 55 (~0.10).
- **Shaded regions**: Narrower for blue line, indicating lower variability without tokens.
#### Extended Graphs (4–6)
- **Graph 4 (Head 13)**:
- Blue line drops sharply after token 20, remaining near 0.02.
- Red line shows a secondary peak at token 80 (~0.04).
- **Graph 5 (Head 16)**:
- Blue line has a minor peak at token 70 (~0.03).
- Red line shows a sharp drop after token 20, stabilizing near 0.01.
- **Graph 6 (Head 17)**:
- Blue line has a sustained low value (~0.01–0.02) after token 20.
- Red line exhibits a secondary peak at token 90 (~0.05).
---
### Key Observations
1. **Peak Attention**:
- Red lines (with tokens) consistently show **higher peaks** than blue lines in the same token positions (e.g., token 5 in Head 13: 0.14 vs. 0.12).
2. **Variability**:
- Shaded regions are wider for red lines, suggesting **greater uncertainty** in attention weights when meaningless tokens are present.
3. **Secondary Peaks**:
- Zoomed graphs reveal **additional attention spikes** in red lines at later token positions (e.g., token 80 in Head 13, token 90 in Head 17).
4. **Decay Patterns**:
- Blue lines (w/o tokens) show faster decay in attention weights after initial peaks compared to red lines.
---
### Interpretation
1. **Impact of Meaningless Tokens**:
- The presence of meaningless tokens increases attention weights in critical positions (e.g., token 5, 15), potentially indicating the model treats them as **distractors** or **contextual anchors**.
2. **Model Robustness**:
- Wider shaded regions for red lines suggest the model’s attention is **less stable** when processing noisy inputs, which could affect performance on tasks requiring focus on meaningful tokens.
3. **Secondary Attention Spikes**:
- Late-token peaks in red lines (e.g., token 80, 90) may reflect the model’s attempt to **recover context** after encountering irrelevant tokens.
4. **Layer-Specific Behavior**:
- Head 17 (Graph 3/6) shows the most pronounced difference between conditions, implying this head is **more sensitive to token relevance**.
---
### Technical Notes
- **Language**: All text is in English.
- **Uncertainty**: Values are approximate (e.g., "~0.12") due to lack of exact numerical labels.
- **Spatial Grounding**: Legends are consistently placed in the top-right corner; shaded regions align with line colors.
</details>
Figure 17: The average attention weights of Llama3.1-8B-Instruct in Heads 13, 16, and 17.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Graphs: Llama3.1-8B-Instruct Layer 1 Attention Weights Across Heads 20, 25, and 26
### Overview
The image contains six line graphs comparing average attention weights in Layer 1 of the Llama3.1-8B-Instruct model across three attention heads (20, 25, 26). Each graph contrasts two scenarios:
- **Blue line**: Attention weights without meaningless tokens
- **Red line**: Attention weights with meaningless tokens
The graphs also include a bottom row showing attention weights when **only meaningless tokens** are present, with a shaded region (positions 20–60) labeled "Meaningless tokens."
---
### Components/Axes
1. **X-axis**: Token position (0–120)
2. **Y-axis**: Average attention weight (0.00–0.14)
3. **Legends**:
- Blue: "w/o Meaningless tokens"
- Red: "w/ Meaningless tokens"
4. **Shaded Region**: Bottom graphs only, spans token positions 20–60, labeled "Meaningless tokens."
---
### Detailed Analysis
#### Head 20
- **Top Graph**:
- Blue line (w/o tokens): Peaks at ~0.08 (positions 10, 30, 50, 70, 90).
- Red line (w/ tokens): Peaks at ~0.06–0.08 (positions 10, 30, 50, 70, 90), with reduced amplitude.
- **Bottom Graph**:
- Attention weights drop to ~0.01–0.02 outside the shaded region (20–60).
- Within the shaded region, weights spike to ~0.06–0.08.
#### Head 25
- **Top Graph**:
- Blue line: Peaks at ~0.12 (positions 10, 30, 50, 70, 90, 110).
- Red line: Peaks at ~0.08–0.10 (positions 10, 30, 50, 70, 90, 110), with dampened spikes.
- **Bottom Graph**:
- Outside shaded region: Weights ~0.01–0.02.
- Within shaded region: Weights ~0.04–0.06.
#### Head 26
- **Top Graph**:
- Blue line: Peaks at ~0.10 (positions 10, 30, 50, 70, 90, 110).
- Red line: Peaks at ~0.06–0.08 (positions 10, 30, 50, 70, 90, 110), with reduced amplitude.
- **Bottom Graph**:
- Outside shaded region: Weights ~0.01–0.02.
- Within shaded region: Weights ~0.03–0.05.
---
### Key Observations
1. **Attention Spikes**: All heads exhibit periodic spikes in attention weights (positions 10, 30, 50, 70, 90, 110) when meaningless tokens are absent.
2. **Meaningless Tokens Impact**:
- Attention weights in non-shaded regions decrease by ~30–50% when meaningless tokens are added.
- Within the shaded region (20–60), attention weights increase significantly (e.g., Head 25: +300% from 0.01 to 0.04).
3. **Head-Specific Behavior**:
- Head 25 shows the highest baseline attention (0.12) without tokens.
- Head 26 has the lowest attention in the shaded region (~0.03).
---
### Interpretation
1. **Model Adaptation**: The presence of meaningless tokens forces the model to redistribute attention. Spikes in the shaded region suggest the model uses this area to process or filter irrelevant tokens.
2. **Robustness Mechanism**: Heads with higher baseline attention (e.g., Head 25) may prioritize meaningful tokens more strongly, while others (e.g., Head 26) show weaker adaptation.
3. **Anomaly Detection**: The consistent periodic spikes (every 20 tokens) imply the model detects rhythmic patterns, which could be exploited for tasks like token segmentation.
4. **Trade-off**: Adding meaningless tokens reduces overall attention coherence but increases focus on specific regions, potentially improving noise resilience.
This analysis highlights how attention mechanisms dynamically adjust to input complexity, balancing focus on meaningful content and handling irrelevant data.
</details>
Figure 18: The average attention weights of Llama3.1-8B-Instruct in Heads 20, 25, and 26.
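The region-level comparison quoted above (e.g., Head 25 rising from ~0.01 outside the shaded span to ~0.04 inside it, a +300% relative increase) can be computed from a per-position curve as follows; the curve here is a stylized stand-in for the figure's data, not a measurement:

```python
import numpy as np

def region_attention_stats(curve, start=20, end=60):
    # Mean attention inside the shaded "Meaningless tokens" span
    # [start, end) versus the mean everywhere outside it.
    inside = curve[start:end].mean()
    outside = np.concatenate([curve[:start], curve[end:]]).mean()
    return inside, outside

# Stylized version of the Head 25 pattern described in the figure.
curve = np.full(120, 0.01)
curve[20:60] = 0.04
inside, outside = region_attention_stats(curve)
print(round((inside / outside - 1) * 100))  # relative increase in percent: 300
```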
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Graphs: Llama3.1-8B-Instruct Layer 1 Head Attention Weights with/without Meaningless Tokens
### Overview
The image contains six line graphs comparing attention weights across three attention heads (28, 29, 31) in Layer 1 of the Llama3.1-8B-Instruct model. Each graph pair compares attention weights **with** (red) and **without** (blue) meaningless tokens. The x-axis represents token positions (0–120), and the y-axis shows average attention weight. Bottom subplots repeat the 0–120 range with the shaded "Meaningless tokens" region (20–60) highlighted.
---
### Components/Axes
- **X-axis**: Token Position (0–120)
- **Y-axis**: Average Attention Weight (0.00 up to ~0.12–0.175, depending on head)
- **Legends**:
- Blue: "w/o Meaningless tokens"
- Red: "w/ Meaningless tokens"
- **Subplot Structure**:
- Top subplots: Full 0–120 token range
- Bottom subplots: Same 0–120 range with the shaded 20–60 "Meaningless tokens" region highlighted
---
### Detailed Analysis
#### Head 28
- **Top Subplot**:
- Red line (w/ tokens) peaks at ~0.12 (token 10), ~0.08 (token 30), ~0.10 (token 50).
- Blue line (w/o tokens) peaks at ~0.06 (token 10), ~0.04 (token 30), ~0.05 (token 50).
- **Bottom Subplot**:
- Red line dominates 20–60 range (avg. ~0.08–0.10).
- Blue line drops sharply outside 20–60 (avg. ~0.01–0.03).
#### Head 29
- **Top Subplot**:
- Red line peaks at ~0.15 (token 20), ~0.10 (token 40).
- Blue line peaks at ~0.05 (token 20), ~0.03 (token 40).
- **Bottom Subplot**:
- Red line remains elevated in 20–60 (avg. ~0.06–0.08).
- Blue line flattens to ~0.02–0.04.
#### Head 31
- **Top Subplot**:
- Red line peaks at ~0.175 (token 10), ~0.12 (token 30), ~0.10 (token 50).
- Blue line peaks at ~0.07 (token 10), ~0.05 (token 30), ~0.04 (token 50).
- **Bottom Subplot**:
- Red line sustains high attention in 20–60 (avg. ~0.08–0.10).
- Blue line drops to ~0.01–0.03 outside 20–60.
---
### Key Observations
1. **Meaningless tokens amplify attention** in the 20–60 token range across all heads.
2. **Peaks in red lines** (w/ tokens) are consistently higher than blue lines (w/o tokens) in the shaded region.
3. **Blue lines** (w/o tokens) show reduced attention outside 20–60, suggesting meaningless tokens may anchor focus.
4. **Head 31** exhibits the highest overall attention weights, particularly in token 10 (w/ tokens: ~0.175).
---
### Interpretation
The data demonstrates that **meaningless tokens significantly increase attention weights** in the 20–60 token range, likely due to their salience or role in contextual framing. This suggests the model prioritizes these tokens when present, potentially improving task-specific performance (e.g., instruction following). The absence of meaningless tokens results in more dispersed attention, which may reduce efficiency. The consistent pattern across heads implies this behavior is a general property of the model’s attention mechanism, not head-specific.
**Notable Anomaly**: Head 31’s extreme peak at token 10 (w/ tokens: ~0.175) suggests an outlier in attention allocation, possibly indicating a unique processing role for that token position.
</details>
Figure 19: The average attention weights of Llama3.1-8B-Instruct in Heads 28, 29, and 31.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Graphs: Gemma3-4b-it Layer Attention Weights with/without Meaningless Tokens
### Overview
The image contains six line graphs comparing average attention weights across token positions (0–120) for three attention heads (Head 1, 4, 8) in Layer 1 of the Gemma3-4b-it model. Each graph contrasts two scenarios:
- **Blue line**: Attention weights *without* meaningless tokens
- **Red line**: Attention weights *with* meaningless tokens
The graphs highlight how the inclusion of meaningless tokens affects attention distribution, with shaded regions marking token positions labeled as "Meaningless tokens."
### Components/Axes
- **X-axis**: Token Position (0–120, integer intervals)
- **Y-axis**: Average Attention Weight (0–0.12, linear scale)
- **Legends**:
- Blue: "w/o Meaningless tokens"
- Red: "w/ Meaningless tokens"
- **Subplot Titles**:
- Top row: "Gemma3-4b-it Layer1 Head X" (X = 1, 4, 8)
- Bottom row: Same titles, with shaded regions labeled "Meaningless tokens" (20–60 token positions)
### Detailed Analysis
#### Layer1 Head1
- **Top subplot**:
- Red line (w/ meaningless tokens) shows higher peaks (up to ~0.08) at token positions 10, 30, and 50.
- Blue line (w/o) remains below 0.06, with smoother fluctuations.
- **Bottom subplot**:
- Shaded region (20–60 tokens) correlates with a sharp drop in blue line attention weights (~0.01–0.02).
- Red line retains higher weights (~0.03–0.05) in the shaded region.
#### Layer1 Head4
- **Top subplot**:
- Red line exhibits pronounced peaks (~0.08–0.10) at tokens 10, 30, and 50.
- Blue line peaks at ~0.06, with less variability.
- **Bottom subplot**:
- Shaded region shows blue line attention weights dropping to ~0.01–0.02.
- Red line remains elevated (~0.03–0.05) in the shaded area.
#### Layer1 Head8
- **Top subplot**:
- Red line has a single dominant peak (~0.05) at token 100.
- Blue line shows minor fluctuations (<0.03).
- **Bottom subplot**:
- Shaded region (20–60 tokens) has negligible impact on blue line (~0.01–0.02).
- Red line shows a slight increase (~0.03) in the shaded area.
### Key Observations
1. **Meaningless tokens amplify attention weights**: Red lines (w/ meaningless tokens) consistently show higher peaks than blue lines (w/o) across all heads.
2. **Positional sensitivity**: Peaks in red lines align with token positions 10, 30, 50, and 100, suggesting these positions are critical for processing.
3. **Shaded region impact**: In Layer 1, Heads 1 and 4, attention weights drop sharply in the shaded "meaningless tokens" region (20–60 tokens) for the blue line, while red lines remain stable.
4. **Head-specific behavior**: Head 8 exhibits a unique pattern with a late peak at token 100, unlike the earlier peaks in Heads 1 and 4.
### Interpretation
The data suggests that meaningless tokens act as **attention amplifiers**, increasing the model’s focus on specific token positions (e.g., 10, 30, 50). The shaded regions (20–60 tokens) likely represent noise or irrelevant data, as the blue line (w/o meaningless tokens) shows reduced attention here. This implies the model may use meaningless tokens to:
- **Filter noise**: By concentrating attention on critical positions, the model ignores irrelevant tokens in the shaded region.
- **Enhance robustness**: Higher attention weights in red lines (w/ meaningless tokens) could improve performance on noisy inputs.
- **Head specialization**: Head 8’s late peak at token 100 may indicate a role in processing long-range dependencies or contextual cues.
The findings align with hypotheses about attention mechanisms prioritizing salient tokens while suppressing irrelevant ones, though further analysis is needed to confirm causality.
</details>
Figure 20: The average attention weights of Gemma3-4b-it in Heads 1, 4, and 8.
See pages 1-last of case_study.pdf