# Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning
## Abstract
Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving this phenomenon and, based on these insights, proposes a more principled method that achieves similar performance gains. First, we find that the improvements arise from a redistribution of activations in the LLM’s MLP layers: near-zero activations become less frequent while large-magnitude activations increase. This redistribution enhances the model’s representational capacity by suppressing weak signals and promoting stronger, more informative ones. Building on this insight, we propose the Activation Redistribution Module (ARM), a lightweight inference-time technique that modifies activations directly without altering the input sequence. ARM adaptively identifies near-zero activations after the non-linear function and shifts them outward, implicitly reproducing the beneficial effects of meaningless tokens in a controlled manner. Extensive experiments across diverse benchmarks and model architectures show that ARM consistently improves LLM performance on reasoning tasks while requiring only a few lines of simple code to implement. Our findings deliver both a clear mechanistic explanation for the unexpected benefits of meaningless tokens and a simple yet effective technique that harnesses activation redistribution to further improve LLM performance. The code has been released at ARM-Meaningless-tokens.
## 1 Introduction
Large language models (LLMs) are known to be sensitive to subtle variations in their inputs, which makes it important to understand how tokens influence predictions (Guan et al., 2025; Errica et al., 2024; Zhuo et al., 2024). In this paper, we present a surprising, counterintuitive finding that we name the meaningless-token effect: inserting long sequences of meaningless tokens, such as repeated punctuation or separators, into prompts can consistently improve the performance of LLMs, particularly on reasoning tasks. Contrary to the common intuition that long, irrelevant tokens act as noise and are thus useless or even harmful during inference (Jiang et al., 2024; Guan et al., 2025), our experiments reveal the opposite. When long sequences of meaningless tokens are inserted before query prompts, models that previously struggled with certain problems can produce correct solutions, as illustrated in the left panel of Figure 1 (see more examples in Appendix J). This effect occurs consistently across tasks and models, suggesting a counterintuitive behavior of LLMs that warrants deeper investigation.
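The insertion itself is trivial to script. The sketch below shows the layout of the modified prompt; the helper name, filler choice, and repeat count are illustrative, not our exact configuration (token length, type, and position are studied in Appendix F):

```python
# Illustrative sketch of the modified prompt layout from Figure 1:
# a block of meaningless tokens is inserted between the system prompt
# and the question. `filler` and `repeats` are hypothetical choices.
def build_prompt(system_prompt: str, question: str,
                 filler: str = "/\\", repeats: int = 40) -> str:
    meaningless = filler * repeats
    return f"{system_prompt}\n{meaningless}\n{question}"

prompt = build_prompt(
    "You are an expert mathematician. Solve the following problem carefully.",
    "Find the smallest possible value of a.",
)
```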
<details>
<summary>x1.png Details</summary>

Three panels. Left: inserting a block of meaningless tokens ("/\/\/\/\/\/\/\/\/\/") between the system prompt and the question turns an incorrect answer (5) into a correct one (3) on a polynomial-roots problem. Middle: schematic of how the inserted tokens alter the attention weights and value states, and thus the attention outputs (an illustration, not a concrete measurement). Right: activation distributions with (red) and without (blue) meaningless tokens; the baseline is tightly peaked at zero, while the meaningless-token distribution spreads toward both sides.
</details>
Figure 1: The left panel illustrates how the meaningless-token effect can improve model performance. The middle panel shows the changes occurring in the attention module after introducing meaningless tokens. The right panel depicts the redistribution of activations that results from adding these tokens.
This unexpected result raises fundamental questions about how LLMs process input and which aspects of their internal computation are being affected. Why should tokens that convey no meaning lead to measurable performance gains? Are they simply acting as noise, or do they restructure representations in a systematic way that supports better reasoning? To answer these questions, we move beyond surface-level observations and conduct a detailed investigation of the mechanisms behind this effect. Our analysis shows that the influence of meaningless tokens arises primarily in the first layer, and that their effect on meaningful tokens can be approximated as an affine transformation of the attention outputs. As illustrated in the middle schematic of Figure 1, this transformation shifts the distribution of activations in the MLP: the proportion of near-zero activations decreases, while more activations are pushed outward toward larger positive and negative values. The rightmost plot in Figure 1 visualizes this process. We hypothesize that this redistribution fosters richer exploration, enhancing reasoning performance, and we clarify the mechanism by decomposing the transformation into coefficient and bias terms. Our theoretical analysis shows how each component shapes activation variance and induces the observed distributional shift.
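The qualitative link between this affine shift and the activation histogram can be illustrated with a toy simulation (synthetic Gaussian pre-activations and a SiLU non-linearity; the scale and bias values below are illustrative, not measured from any model): scaling and biasing the pre-activations increases their variance, thinning the near-zero mass of the post-activation distribution and fattening its tails.

```python
import numpy as np

def silu(x):
    # SiLU non-linearity, the gating function used in many LLM MLPs.
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)     # synthetic pre-activations
z_affine = 1.3 * z + 0.1         # illustrative scale + bias, as in Eq. (2)

act, act_affine = silu(z), silu(z_affine)
near_zero = lambda a: np.mean(np.abs(a) < 0.05)  # fraction near zero
large = lambda a: np.mean(np.abs(a) > 1.0)       # fraction with large magnitude

# The affine transform reduces near-zero activations and increases
# large-magnitude ones, matching the right panel of Figure 1.
assert near_zero(act_affine) < near_zero(act)
assert large(act_affine) > large(act)
```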
Building on these insights, we propose ARM (Activation Redistribution Module), a lightweight alternative to explicit meaningless-token insertion. ARM requires only a few lines of code modification and no additional training. It automatically identifies a proportion of near-zero activations after the non-linear function and shifts their values outward, yielding a smoother and less sparse activation distribution. In doing so, ARM reproduces the beneficial effects of meaningless tokens without altering the input sequence and consistently improves LLM performance on reasoning and related tasks. In summary, the key findings and contributions of our work are:
- We uncover a meaningless-token effect in LLMs: inserting meaningless tokens, far from being harmful, systematically improves reasoning in LLMs. This runs counter to the common assumption that such tokens only add noise.
- Through theoretical and empirical analysis, we show that these tokens induce an activation redistribution effect in the first-layer MLP, reducing near-zero activations and increasing variance.
- Building on this understanding, we present ARM, a lightweight inference-time instantiation to demonstrate that the phenomenon can be directly harnessed.
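A minimal sketch of the redistribution step behind ARM (NumPy; `proportion` and `shift` are illustrative hyperparameters, not tuned values, and a real deployment would apply this to the post-nonlinearity activations inside each MLP at inference time):

```python
import numpy as np

def arm_redistribute(act: np.ndarray, proportion: float = 0.2,
                     shift: float = 0.1) -> np.ndarray:
    """Sketch of ARM's redistribution step (illustrative hyperparameters).

    Selects the `proportion` of activations closest to zero and pushes
    each outward by `shift` in the direction of its sign, thinning the
    near-zero mass of the activation distribution.
    """
    out = act.copy()
    flat = out.reshape(-1)                   # view into the copy
    k = max(1, int(proportion * flat.size))
    idx = np.argsort(np.abs(flat))[:k]       # k smallest-magnitude entries
    flat[idx] += shift * np.sign(flat[idx])  # exact zeros are left in place
    return out

acts = np.array([0.01, -0.02, 0.9, -1.5, 0.03, 0.5])
shifted = arm_redistribute(acts, proportion=0.5, shift=0.1)
```

Note that the operation preserves signs and leaves large activations untouched; only the weakest signals are pushed outward.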
## 2 Observation: Inserting Meaningless Tokens Induces an Affine Transformation on Meaningful Token Representations
We observe that meaningless tokens, such as a sufficiently long sequence of slashes (“/”), can enhance the performance of LLMs, particularly on reasoning tasks (varying the token length, type, and position affects performance, as shown in Appendix F). As shown in Table 1, when we insert a fixed-length sequence of meaningless tokens between the system prompt and the question, all evaluated models exhibit performance improvements on Math-500 and AIME2024 to different degrees. This consistent improvement suggests that the inserted meaningless tokens are neither simply ignored nor detrimental; rather, they exert a positive influence, likely through non-trivial interactions with the models’ internal representations. To investigate this phenomenon, we start our analysis from the attention module. The formula of attention is:
Table 1: Performance on mathematical reasoning datasets with and without meaningless tokens across different models. “w/o” denotes the absence of meaningless tokens, while “w/” denotes their presence. We test each model five times to get the average result.
| Model | Math-500 w/o | Math-500 w/ | AIME2024 w/o | AIME2024 w/ |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | 63.9 | 65.9 | 14.4 | 17.5 |
| Qwen2.5-Math-7B | 72.3 | 74.6 | 23.1 | 23.3 |
| DS-R1-Qwen-7B | 52.7 | 53.1 | 3.2 | 4.4 |
| DS-Math-7B-instruct | 39.5 | 42.1 | 7.8 | 12.3 |
| Llama-3.1-8B-Instruct | 41.8 | 42.1 | 7.9 | 9.9 |
| Qwen-2.5-32B-Instruct | 81.3 | 81.7 | 17.6 | 22.8 |
<details>
<summary>x2.png Details</summary>

A 2×3 grid of line charts plotting average attention weight against token position for Qwen2.5-7B-Math (Layer 1, Head 22), Llama3.1-8B-Instruct (Layer 1, Head 27), and Gemma3-4b-it (Layer 1, Head 3). Top row: attention received by the original prompt tokens with (red) versus without (blue) meaningless tokens. Bottom row: the extended sequence, where a shaded meaningless-token region (roughly positions 0–70) receives flat, near-zero attention, followed by spiky attention on the subsequent meaningful tokens.
</details>
Figure 2: The x-axis shows token indices. After inserting meaningless tokens, subsequent tokens assign lower average attention weights to the original prompt overall, while the meaningless tokens themselves receive uniformly near-zero weights. We show additional average attention weights in Appendix I and layer-wise analyses in Section F.4.
$Attention(Q,K,V)=softmax\left(\frac{QK^⊤}{\sqrt{d_k}}\right)V$ , where $Q$ , $K$ , $V$ are the query, key, and value vectors respectively, and $d_k$ is the dimensionality of the keys/queries. From this equation, adding extra tokens introduces additional terms into the softmax normalization, enlarging its denominator. Although the new tokens typically receive small weights, their presence redistributes probability mass and reduces the relative share of attention allocated to the original tokens. To probe the underlying cause, we directly compare the input’s attention weights with and without meaningless tokens in the first layer, keeping token indices aligned. For every token we compute the mean of its column below the diagonal of the attention matrix, which measures the extent to which each token receives attention from all downstream tokens (Bogdan et al., 2025). When a string of meaningless tokens is present, the model assigns only small weights to each of them, intuitively indicating that the model pays little attention to them (see Figure 2, bottom row). The top row of Figure 2 presents a direct comparison of the attention to meaningful tokens without (blue) or with (red) meaningless tokens; meaningless-token indices are removed from the visualization to allow a direct comparison. Among meaningful tokens, the average attention decreases in the meaningless-token condition, driven especially by reduced high-attention spikes. The attention weights of the original prompt after inserting meaningless tokens are $W^\prime=λ·W_{attn}$ , where $W_{attn}=softmax\left(\frac{QK^⊤}{\sqrt{d_k}}\right)$ are the attention weights after softmax and $λ<1$ is the factor by which the original prompt’s attention weights shrink after the meaningless tokens are added. The attention output for each token then combines not only the weighted values of the original tokens but also the attention weights and values of the meaningless tokens.
Thus, the attention output can be expressed as:
$$
Attn\_Output_{new}=W_j^{\prime}V_j+W_iV_i, \tag{1}
$$
<details>
<summary>x3.png Details</summary>

A scatter plot of attention-output vectors projected into two dimensions. Blue '×' markers show outputs without meaningless tokens; red '×' markers show outputs with them. Blue arrows trace the displacement of individual points, all oriented in a consistent direction, and a dotted blue versus solid red ellipse outline the two sets: the red set is shifted and more dispersed than the blue one.
</details>
Figure 3: After adding meaningless tokens, each token vector is affinely transformed: blue points show the original vectors, and red points show them after the addition. Arrows indicate the direction of change.
where Attn_Output corresponds to the output of the attention mechanism for each token in the original prompt, $W_j^\prime$ and $V_j$ are the attention weights and value vectors of the original prompt, and $W_i$ and $V_i$ are the attention weights and value vectors of the meaningless tokens. Because the meaningless tokens are repeated in long sequences and contribute no semantic information, their values are identical and their attention weights are similarly small. Therefore, as shown in Equation 1, the accumulated term $W_iV_i$ primarily shifts the final attention output along an approximately uniform direction, without introducing diverse semantic components. In this formula, $W_jV_j$ is the original attention output, and we denote $W_iV_i$ by $Σ_σ$ . As a result, the attention output of meaningful tokens after adding meaningless tokens can be seen as an affine transformation expressed as:
$$
Attn\_Output_{new}=λ·Attn\_Output+Σ_σ, \tag{2}
$$
where Attn_Output is $W_jV_j$ . Under this equation, the introduction of meaningless tokens turns the attention output of meaningful tokens into an affine function, consisting of a scaled original term ( $λ·Attn\_Output$ ) and an additional bias ( $Σ_σ$ ). Figure 3 illustrates this transformation. After the attention module, the affine-transformed output passes through RMSNorm and serves as the input to the MLP. In the next section, we examine in detail how this transformation propagates through the subsequent MLP layers and shapes the model’s overall activation distribution.
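Equation 2 can be checked numerically with a single-query toy attention head (a self-contained sketch; the dimensions and the single repeated meaningless key/value are illustrative). Because the appended tokens share one key, the softmax mass they absorb scales every original weight by a common factor $λ$ and adds one accumulated bias term:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 8, 6, 40   # head dim, meaningful tokens, meaningless tokens

q = rng.normal(size=d)           # one query position
K = rng.normal(size=(n, d))      # keys of the meaningful tokens
V = rng.normal(size=(n, d))      # values of the meaningful tokens
k_m = 0.1 * rng.normal(size=d)   # the single repeated meaningless key
v_m = 0.1 * rng.normal(size=d)   # ... and its value

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Attention over the meaningful tokens only.
scores = K @ q / np.sqrt(d)
w = softmax(scores)
out = w @ V

# Attention after appending m identical meaningless tokens.
scores_ext = np.concatenate([scores, np.full(m, k_m @ q / np.sqrt(d))])
w_ext = softmax(scores_ext)
out_new = w_ext[:n] @ V + w_ext[n:].sum() * v_m

# Equation 2: Attn_Output_new = λ · Attn_Output + Σ_σ
lam = w_ext[:n].sum()            # λ < 1: attention mass kept by the prompt
sigma = w_ext[n:].sum() * v_m    # Σ_σ: accumulated meaningless contribution
assert np.allclose(out_new, lam * out + sigma)
```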
## 3 Analysis: Why the Affine Transformation Improves Reasoning Performance
Having established in the previous sections that the meaningless-token effect induces scaling and bias terms that produce an affine transformation of the attention output, we next examine how this transformation propagates through the subsequent MLP modules and affects reasoning. In Equation 2, we decompose the transformation into two primary components: the scaling factor $λ$ , which controls the magnitude of activations, and the bias factor $Σ_σ$ , a bounded zero-mean term reflecting the variation in attention outputs before and after meaningless-token insertion, which introduces structured shifts in the activation distribution. Together, these two factors determine how the transformed attention representations shape the dynamics of the MLP layers.
### 3.1 The Affine Transformation Influences the Output of the Gate Layer
**Key Takeaway:** We demonstrate that applying an affine transformation, through both scaling and bias factors, systematically increases the variance of the gate layer’s output.
In this part, we show that these two factors increase the variance of the gate projection layer in the MLP. As discussed above, because these tokens have low attention weights and nearly identical values, they shift the RMSNorm input almost entirely along a single direction by a small margin; consequently, RMSNorm largely absorbs this change, producing only a minor numerical adjustment without adding semantic information. The two factors act through different mechanisms. For the scaling factor: before entering the MLP, the attention output undergoes the output projection and a residual connection, which can be written as $x(λ)=res+λ·U·A$ , where $A$ is the attention output and $U$ the projection weights. Treating $λ$ as a functional variable, the RMSNorm output becomes $y(λ)=RMSNorm(x(λ))$ . For the $j$ -th gate dimension, $z_j(λ)=w_j^⊤y(λ)$ ; writing $g_j(λ)=∂z_j/∂λ$ for its sensitivity to $λ$ , a small variation $Δλ$ changes the variance of this dimension as
$$
Var[z_j(λ+Δλ)]=Var[z_j(λ)]+2Cov(z_j(λ),g_j(λ))Δλ+Var[g_j(λ)]Δλ^2, \tag{3}
$$
The third term in Equation 3 remains strictly positive for all admissible parameters. Moreover, as $Δλ$ increases, this term grows monotonically and asymptotically dominates the second term, thereby guaranteeing a strictly increasing overall variance; we analyze the admissible range of $Δλ$ in Appendix E. In the case of the bias factor, we model the perturbation as bounded, zero-mean stochastic noise that is statistically independent of the original attention output across all dimensions; it contributes an additional variance component and interacts non-trivially with the subsequent RMSNorm operation. Formally, after noise injection, the RMSNorm input can be written as $x=x_0+WΣ_σ$, where $W$ is the linear map applied to the input preceding RMSNorm. After normalization, the covariance of the output can be expressed as:
$$
Cov(y)=J_qCov(x)J_q^{⊤}+o(\|x-x_0\|^2) \tag{4}
$$
where $x_0$ is the mean expansion point and $J_q$ is the Jacobian matrix of the RMSNorm mapping. Since the variance of the added perturbation is very small, the higher-order terms can be disregarded. In this case, the bias factor perturbs the input of RMSNorm and increases the output covariance $Cov(y)$. The input to the activation function can then be written as $z=W_{gate}\,y$. By the properties of covariance, the variance of the $j$-th dimension is given by:
$$
Var[z_j]≈ e_j^⊤W_{gate}\big[J_qCov(x)J_q^{⊤}\big]W_{gate}^{⊤}e_j, \tag{5}
$$
Since the projection of the perturbation onto the tangent space is almost never zero in the high-dimensional spaces of LLMs, the resulting variance increase is strictly positive. From this, we deduce that both factors increase the variance of the gate-layer output. In general, the scaling factor increases variance by amplifying inter-sample differences, whereas the bias factor increases variance by enlarging the covariance structure across dimensions.
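The scaling-factor side of this argument can be checked numerically. The sketch below is our illustration, not code from the paper; the dimensions, the random projection `U`, and the fixed shared residual `res` are simplifying assumptions. It amplifies the attention output by increasing $λ$ in $x(λ)=\mathrm{res}+λUA$ and measures the variance of one gate dimension $z_j(λ)=w_j^⊤\,\mathrm{RMS}(x(λ))$ across samples:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm without a learned gain, applied over the last axis.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d, n = 256, 4096
res = rng.normal(size=d)                   # shared residual direction
A = rng.normal(size=(n, d))                # per-sample attention outputs
U = rng.normal(size=(d, d)) / np.sqrt(d)   # output projection
w = rng.normal(size=d)                     # one row w_j of the gate matrix

def gate_var(lam):
    x = res + lam * (A @ U)                # x(lam) = res + lam * U * A
    z = rmsnorm(x) @ w                     # z_j(lam) = w_j^T RMS(x(lam))
    return z.var()

for lam in (0.5, 1.0, 1.5):
    print(f"lambda={lam}: Var[z_j]={gate_var(lam):.4f}")
```

On this toy instance the variance grows monotonically with $λ$, consistent with the quadratic term in Equation 3 dominating; the effect saturates once $λUA$ dwarfs the fixed residual.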
### 3.2 Variance change leads to activation redistribution
**Key Takeaway.** An increase in the input variance of the activation function broadens and reshapes the output activation distribution by raising both its mean and its variance.
As the variance of the gate-layer outputs grows under perturbations, the subsequent activation function further reshapes these signals by compressing values near zero; this motivates redistributing near-zero activations. For each sample in the hidden state, the second-order Taylor expansion of the activation function $φ$ gives:
$$
φ(μ+σ)=φ(μ)+φ'(μ)σ+\frac{1}{2}φ''(μ)σ^2+o(|σ|^3), \tag{6}
$$
where $σ$ can represent both $Δ k$ in scaling factor and $Σ_σ$ in bias factor. We denote the input to the activation function as $z=μ+σ$ . For the $j$ -th dimension of the hidden state, the expectation and variance of the activation output can be expressed as:
$$
E[φ(z_j)]=E[φ(μ_j)]+E[φ'(μ_j)σ]+\frac{1}{2}E[φ''(μ_j)σ^2]+o(E|σ|^3), \tag{7}
$$
$$
Var[φ(z_j)]=φ'(μ_j)^2Var_j+o(Var_j^2). \tag{8}
$$
From the above equations, we infer that distributional changes in the input map to variations in the expectation and variance of the output. On a single dimension, activations shift in both directions: from Equation 6, higher-order terms are negligible, and the first derivative of GeLU/SiLU near zero is positive; since perturbations take both signs, extrapolated activations also fluctuate around zero. From Equation 7, $E[σ^2]=Var_j$. For the bias factor, the zero-mean perturbation removes the first-order term; for the scaling factor, expanding at the population mean gives $E[φ'(z_j)g_j]=0$, again canceling the first-order term. The second derivative near zero is strictly positive, so the second-order term raises the expectation. From Equation 8, $Var_j$ increases, and so does the variance of the activation histogram, as the function is nearly linear near zero. In summary, the scaling and bias factors jointly enlarge the activation variance, expressed as:
$$
Var_j≈E\big[Var_j^{(Σ_σ)}\big]+Var\big(g_j^{λ}\big). \tag{9}
$$
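Equations 7–9 predict that feeding a higher-variance input into the activation function raises both the mean and the variance of its output. A small numerical check (our illustration; the SiLU definition and the Gaussian inputs are assumptions made for the sketch):

```python
import numpy as np

def silu(z):
    # SiLU: z * sigmoid(z); its second derivative is positive near zero.
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
mu = rng.normal(scale=0.5, size=200_000)   # baseline gate pre-activations

for s in (0.0, 0.3, 0.6):                  # growing zero-mean perturbation
    out = silu(mu + rng.normal(scale=s, size=mu.shape))
    print(f"sigma={s:.1f}  mean={out.mean():.4f}  var={out.var():.4f}")
```

Both statistics grow with the perturbation scale: the zero-mean noise cancels the first-order term, while the positive second derivative near zero lifts the mean (Equation 7) and the squared first-derivative term inflates the variance (Equation 8).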
Figure 4: Histograms of activation frequencies after the MLP activation function; the sub-figures compare four metrics before and after the transformation.
The first term represents the expected variance of the $j$-th hidden state under the influence of the bias factor; since the bias factor varies across individual cases, taking the expectation is necessary to capture its overall impact. The second term corresponds to the variance induced by the scaling factor, which inherently reflects the aggregate change. Combining them, the overall variance of the outputs of the nonlinear activation function increases, the mean shifts upward, and the activation distribution becomes broader, manifested as heavier tails and a thinner center. More details of the above analysis and the related proofs are in Appendix E. Moreover, we presume that this redistribution benefits reasoning tasks because reasoning-critical tokens (digits, operators, conjunctions) have a higher fraction of near-zero activations; elevating their activation levels strengthens their representations and improves reasoning performance (see Section 6 for details).
### 3.3 Verification of activation redistribution
To verify whether the activation redistribution pattern in Section 3.2 indeed occurs in LLMs, Figure 4 illustrates the activation distribution after the first-layer MLP, explicitly comparing states before and after the transformation defined in Equation 2. We also comprehensively assess the transformation of activation states using several quantitative indicators, including:
- Relative Sparsity: Defined as the proportion of activations after the transformation whose values fall below the pre-transformation threshold.
- L1 Norm: The sum of the absolute activation values; smaller values indicate higher sparsity.
- L2 Norm: A measure of the overall magnitude of activations.
- Gini Coefficient: An indicator of the smoothness of the histogram distribution, where smaller absolute values correspond to smoother distributions.
From Figure 4, we observe that after the transformation, the frequency of near-zero activations decreases while the frequency of high-magnitude activations increases, changing both the sparsity and the smoothness of the activation distribution. Specifically, the relative sparsity consistently decreases across all three models while the L1 and L2 norms increase, consistent with the phenomenon described above.
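The four indicators can be reproduced with a few lines of NumPy. The helper below is our sketch (the function name, the $ε$ near-zero threshold, and the synthetic data are assumptions); the Gini coefficient is computed on absolute activation values:

```python
import numpy as np

def activation_metrics(acts, eps=0.01):
    # Summary statistics for an activation histogram.
    a = np.abs(np.ravel(acts))
    n = a.size
    sparsity = float(np.mean(a < eps))      # fraction of near-zero activations
    l1 = float(a.sum())
    l2 = float(np.sqrt((a ** 2).sum()))
    s = np.sort(a)                          # Gini coefficient of |activations|
    gini = float((2 * np.arange(1, n + 1) - n - 1) @ s / (n * s.sum()))
    return {"sparsity": sparsity, "l1": l1, "l2": l2, "gini": gini}

rng = np.random.default_rng(0)
before = rng.normal(scale=0.5, size=10_000)
# Mimic the transformation: push near-zero values outward, keeping signs.
after = before + np.where(np.abs(before) < 0.05,
                          np.sign(before) * 0.1, 0.0)
print(activation_metrics(before))
print(activation_metrics(after))
```

As in Figure 4, the shifted distribution shows lower sparsity and larger L1/L2 norms.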
## 4 Method: Activation Redistribution Module
```python
def forward(self, x, layer_idx):  # in first layer
    activation = self.act_fn(self.gate_proj(x))
    # Our function
    activation_alter = self.arm(activation.clone())
    down_proj = self.down_proj(activation_alter * self.up_proj(x))
    return down_proj
```
Figure 5: The upper panel illustrates the first-layer LLM architecture with ARM, while the lower panel presents the corresponding ARM code in the MLP module.
Inspired by the finding that meaningless tokens can shift meaningful activations and boost LLM performance, we propose ARM, a simple method replacing explicit meaningless tokens with an implicit mechanism that adjusts the MLP activation distribution after the activation function. Our approach has two steps: first, adaptively identify a proportion of near-zero activations based on the model and input; then, extrapolate them outward to redistribute the activation pattern. The top half of Figure 5 shows the first-layer MLP with ARM, where selected activations around zero are shifted outward, reducing their frequency and increasing larger-magnitude activations. The bottom half of Figure 5 presents the ARM-specific code, a lightweight function inserted into the first-layer MLP without affecting inference speed. As shown in Appendix D, ARM's time complexity is negligible within the MLP context. The significance of the ARM method is twofold. Firstly, it provides further evidence supporting our theoretical analysis in Section 3: by directly replacing explicit meaningless-token insertion with implicit activation redistribution, ARM yields a similar improvement in reasoning across models and benchmarks, thus strengthening our theoretical framework. Secondly, we introduce ARM as a lightweight inference-time trick for boosting reasoning, which is not only robustly effective on its own (see experiments in Section 5) but also compatible with existing inference-time scaling methods (see Appendix G.3).
### 4.1 Selecting an Appropriate Change Proportion
Our method first selects a proportion of activations to be modified. However, different models exhibit varying sensitivities to meaningless tokens; to address this, we propose a dynamic strategy that adjusts the fraction of near-zero activations to be altered during inference. To determine this proportion, we measure the dispersion of activations around zero. Specifically, we define a neighborhood $ε$ based on the activation distribution to identify which activations are considered "close to zero". We adopt the Median Absolute Deviation (MAD) as our dispersion metric, since MAD is robust to outliers and better captures the core distribution. The threshold is given by $ε=κ\cdot MAD\cdot c$, where $κ$ is a consistency constant and $c$ is a hyperparameter controlling the width of the near-zero range. Next, we compute the fraction $p$ of activations falling within $[-ε,ε]$, which represents the proportion of activations we consider near zero. The fraction we change is then $\mathrm{fraction}=\mathrm{clip}(p,p_{min},p_{max})$, where $p_{min}$ and $p_{max}$ serve as bounds to prevent the fraction from becoming either too small or excessively large. In our experiments, we set $p_{min}=0.02$ and $p_{max}=0.25$.
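A minimal sketch of this selection step (our code, not the released implementation; $κ=1.4826$ is the usual Gaussian consistency constant for MAD, and the default $c$ is an assumption):

```python
import numpy as np

def select_fraction(acts, c=1.0, kappa=1.4826, p_min=0.02, p_max=0.25):
    # eps = kappa * MAD * c defines the near-zero band; clip the fraction.
    a = np.ravel(acts)
    mad = np.median(np.abs(a - np.median(a)))
    eps = kappa * mad * c
    p = np.mean(np.abs(a) <= eps)           # fraction inside [-eps, eps]
    return float(np.clip(p, p_min, p_max))

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.3, size=8192)
print(select_fraction(acts))                # wide band: clipped to p_max
print(select_fraction(acts, c=0.01))        # narrow band: clipped to p_min
```

The clipping matters in practice: for roughly Gaussian activations, $c=1$ puts about 68% of values inside the band, so $p_{max}$ caps the fraction actually modified.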
### 4.2 Redistribution of Activation Values
After selecting the elements, we preserve each element's sign and adjust only its magnitude. Specifically, we add a positive or negative value depending on the element's sign. To constrain the modified values within a reasonable range, the range is defined as follows:
$$
R=\begin{cases}[0,\,Q_{p_1}(\mathrm{Activations})],&\mathrm{sign}=+1,\\[6.0pt]
[\min(\mathrm{Activations}),\,0],&\mathrm{sign}=-1.\end{cases} \tag{10}
$$
where $R$ is the range of modified values. We set the lower bound to the minimum activation value when $\mathrm{sign}=-1$, since activation functions such as SiLU and GeLU typically attain their smallest values on the negative side. For the upper bound when $\mathrm{sign}=+1$, we select $Q_{p_1}(\mathrm{Activations})$, the value at the $p_1$-th percentile of the activation distribution; $p_1$ is a hyperparameter whose value depends on the distribution of activations and on $c$. Finally, we draw a random value from $R$ and add it to the chosen activation to modify its value. In this way, we adaptively adjust an appropriate proportion of activations, enriching the distribution with more effective values. We show how to choose the hyperparameters in Appendix H.
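Putting the two steps together, the redistribution can be sketched as follows (our illustration of the procedure, not the released code; the function name, the percentile default, and uniform sampling inside $R$ are assumptions, and $Q_{p_1}$ is assumed positive, as it is for typical post-SiLU activations):

```python
import numpy as np

def arm_redistribute(acts, fraction=0.1, p1=90, rng=None):
    # Shift the `fraction` of activations closest to zero outward,
    # preserving each element's sign (ranges as in Equation 10).
    if rng is None:
        rng = np.random.default_rng()
    a = np.ravel(np.asarray(acts, dtype=float)).copy()
    k = max(1, int(fraction * a.size))
    idx = np.argsort(np.abs(a))[:k]            # near-zero activations
    hi = np.percentile(a, p1)                  # upper bound Q_{p1}
    lo = a.min()                               # lower bound min(Activations)
    pos = a[idx] >= 0                          # zeros treated as positive
    a[idx[pos]] += rng.uniform(0.0, hi, pos.sum())      # R = [0, Q_{p1}]
    a[idx[~pos]] += rng.uniform(lo, 0.0, (~pos).sum())  # R = [min, 0]
    return a.reshape(np.shape(acts))

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.5, size=4096)
out = arm_redistribute(acts, fraction=0.1, rng=rng)
print(np.mean(np.abs(acts) < 0.02), np.mean(np.abs(out) < 0.02))
```

After the shift, the frequency of near-zero values drops while signs are preserved, mirroring the activation histograms in Figure 4.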
Table 2: After adding ARM to the first-layer MLP, we report reasoning-task performance (pass@1) for seven models, using a dash ('–') for accuracies below 5% to indicate incapability.
| Model | Setting | GPQA Diamond | Math-500 | AIME 2024 | AIME 2025 | LiveCodeBench | HumanEval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5 Math-1.5B | Baseline | 27.3 | 63.8 | 14.4 | 6.7 | – | 6.1 |
| | ARM | 28.8 | 67.0 | 18.9 | 10.0 | – | 8.5 |
| | Improve Rate (%) | 1.5 $↑$ | 3.2 $↑$ | 4.5 $↑$ | 3.3 $↑$ | – | 2.4 $↑$ |
| Qwen2.5 Math-7B | Baseline | 30.3 | 72.4 | 23.3 | 10.0 | – | 15.2 |
| | ARM | 34.9 | 73.4 | 25.6 | 13.3 | – | 17.7 |
| | Improve Rate (%) | 4.6 $↑$ | 1.0 $↑$ | 2.3 $↑$ | 3.3 $↑$ | – | 2.5 $↑$ |
| Qwen2.5 7B-Instruct | Baseline | 28.3 | 61.4 | 20.0 | 10.0 | 29.7 | 43.9 |
| | ARM | 29.8 | 62.4 | 20.0 | 23.3 | 31.9 | 47.6 |
| | Improve Rate (%) | 1.5 $↑$ | 1.0 $↑$ | 0 | 13.3 $↑$ | 2.2 $↑$ | 3.7 $↑$ |
| Qwen2.5 32B-Instruct | Baseline | 35.4 | 82.6 | 16.7 | 20.0 | 49.5 | 50.0 |
| | ARM | 35.9 | 82.6 | 18.8 | 26.7 | 49.5 | 51.2 |
| | Improve Rate (%) | 0.5 $↑$ | 0 | 2.1 $↑$ | 6.7 $↑$ | 0 | 1.2 $↑$ |
| Llama3.1 8B-Instruct | Baseline | 28.3 | 43.0 | 11.1 | – | 11.9 | 45.7 |
| | ARM | 31.3 | 45.8 | 13.3 | – | 17.0 | 47.6 |
| | Improve Rate (%) | 3.0 $↑$ | 2.8 $↑$ | 2.2 $↑$ | – | 5.1 $↑$ | 1.9 $↑$ |
| Gemma3 4b-it | Baseline | 34.3 | 72.6 | 13.3 | 20.0 | 20.2 | 17.1 |
| | ARM | 35.9 | 74.0 | 17.8 | 23.3 | 20.6 | 20.7 |
| | Improve Rate (%) | 1.5 $↑$ | 1.4 $↑$ | 4.5 $↑$ | 3.3 $↑$ | 0.4 $↑$ | 3.6 $↑$ |
| Gemma3 27b-it | Baseline | 33.3 | 85.4 | 25.6 | 26.7 | 31.9 | 9.1 |
| | ARM | 33.8 | 86.2 | 31.1 | 30.0 | 34.2 | 11.6 |
| | Improve Rate (%) | 0.5 $↑$ | 0.8 $↑$ | 4.4 $↑$ | 3.3 $↑$ | 2.3 $↑$ | 2.5 $↑$ |
## 5 Experiments
We evaluate our method on reasoning and non-reasoning tasks using seven models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-Instruct-7B, Qwen2.5-Instruct-32B (qwe, 2025), Llama3.1-8B-Instruct (gra, 2024), Gemma3-4b-it, and Gemma3-27b-it (gem, 2025). All models use default generation parameters. For reasoning tasks, we cover three skill areas: (1) General: GPQA (Rein et al., 2024), a challenging expert-authored multiple-choice dataset; (2) Math & Text Reasoning: MATH-500 (Lightman et al., 2023), AIME’24 (AIME, 2024), and AIME’25 (AIME, 2025); (3) Agent & Coding: LiveCodeBench (Jain et al., 2024) and HumanEval (Chen et al., 2021). For non-reasoning tasks, we use GSM8K (Cobbe et al., 2021), ARC-E (Clark et al., 2018), ARC-C (Clark et al., 2018), MMLU (Hendrycks et al., 2021), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), and OpenBookQA (Mihaylov et al., 2018).
### 5.1 Experiment Results Analysis
For reasoning tasks, the results in Table 2 show pass@1 accuracy across multiple benchmarks. Our method consistently improves performance across most models and datasets, with the effect more pronounced in smaller models (e.g., Qwen2.5-Math-7B shows larger gains than Qwen2.5-32B-Instruct). On challenging benchmarks, however, improvements are limited when models lack sufficient capacity or when baseline accuracy is near saturation. For non-reasoning tasks (see Table 3(b)), applying ARM to the first-layer MLP yields little change. We attribute this to their largely factual nature: the models already possess the necessary knowledge and response formats, so minimal reasoning is required. By contrast, for reasoning tasks, altering early activations helps reorganize knowledge, strengthens intermediate representations, and facilitates more effective and consistent reasoning.
### 5.2 Comparison of Meaningless tokens and ARM
In Table 3(a), we provide a direct comparison between our proposed ARM method and the strategy of inserting a suitable number of meaningless tokens. The results demonstrate that both approaches improve model performance and neither requires post-training, making them lightweight interventions with robust gains. However, since ARM directly exploits the fundamental principle driving the meaningless-token effect, it provides more stable results. While the meaningless-token effect is pervasive, our experiments show that it depends heavily on the specific choice of token length and placement, and thus may be unstable or difficult to generalize across tasks. ARM provides a more principled, model-internal mechanism that directly reshapes the activation distribution within the MLP, yielding more consistent gains without relying on heuristic token engineering. In sum, while inserting a meaningless-token string at the prompt level might seem like a promising prompt-tuning adjustment, it comes with an instability that ARM eliminates. This contrast highlights the trade-off between ease of use and robustness, and further underscores the value of ARM as a systematic method for enhancing the reasoning ability of large language models.
Figure 6: The first two figures show pass@3 on Math-500 and AIME2024 for three models with and without ARM, and the last shows their 2-gram diversity under both conditions.
Table 3: Table (a) compares the performance of meaningless tokens and ARM, and Table (b) reports ARM’s results on non-reasoning tasks.
(a) Pass@1 on Math-500 and AIME2024 with meaningless tokens (Mless) or ARM.
| Model | Setting | Math-500 | AIME2024 |
| --- | --- | --- | --- |
| Qwen2.5-Math-7B | Baseline | 72.4 | 23.3 |
| Qwen2.5-Math-7B | Mless | 75.0 | 24.4 |
| Qwen2.5-Math-7B | ARM | 73.4 | 25.6 |
| Llama3.1-8B-Instruct | Baseline | 43.0 | 11.1 |
| Llama3.1-8B-Instruct | Mless | 44.9 | 13.3 |
| Llama3.1-8B-Instruct | ARM | 45.8 | 13.3 |
(b) Performance of models with ARM on non-reasoning tasks. Additional results are in Appendix G.
| Model | Setting | GSM8K | ARC-E | HellaSwag |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | Baseline | 78.0 | 39.3 | 39.1 |
| Qwen2.5-Math-1.5B | ARM | 78.6 | 39.3 | 39.5 |
| Qwen2.5-Math-1.5B | Improve Rate (%) | 0.6 $↑$ | 0.0 | 0.4 $↑$ |
| Llama3.1-8B-Instruct | Baseline | 80.0 | 46.6 | 56.8 |
| Llama3.1-8B-Instruct | ARM | 82.4 | 47.1 | 57.3 |
| Llama3.1-8B-Instruct | Improve Rate (%) | 2.4 $↑$ | 0.5 $↑$ | 0.5 $↑$ |
### 5.3 Exploration capabilities after ARM
As discussed earlier, we hypothesize that redistributing activations enables the model to explore the reasoning space more effectively. To test this hypothesis, we evaluate pass@3 on the Math-500 and AIME2024 benchmarks as well as 2-gram diversity. As shown in Figure 6, applying activation redistribution consistently yields higher pass@3 scores than the baselines on both tasks, and the 2-gram diversity under ARM also exceeds that without ARM. These findings indicate that activation redistribution not only improves the likelihood of arriving at a correct solution across multiple samples but also promotes more diverse reasoning paths. This dual effect suggests that ARM enhances both the effectiveness and the breadth of the model’s internal reasoning, reinforcing our hypothesis that carefully manipulating internal activations can expand a model’s reasoning capacity without additional training or parameter growth.
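The 2-gram diversity metric used above can be sketched as follows. We assume the common "distinct-2" definition, i.e., the number of unique bigrams divided by the total number of bigrams across the generated samples, with simple whitespace tokenization; the paper's exact tokenization and normalization may differ.

```python
def bigram_diversity(texts: list[str]) -> float:
    """Distinct-2 score: unique 2-grams / total 2-grams across all samples.

    NOTE: assumed definition; the paper may tokenize or normalize differently.
    """
    bigrams = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization as a stand-in
        bigrams.extend(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)
```

For example, `bigram_diversity(["a b c", "a b d"])` counts four bigrams, three of them unique, giving 0.75; a higher score indicates less repetition across reasoning paths.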
## 6 Discussion: Why Activation Redistribution Enhances LLM Reasoning Performance
<details>
<summary>x7.png Details</summary>

### Visual Description
## Grouped Bar Chart: Mean Ratio Comparison Between Two Language Models
### Overview
The image is a grouped bar chart comparing the "Mean ratio" of four different token categories across two large language models: **Qwen2.5-7B-Math** and **Llama3.1-8B-Instruct**. The chart visually demonstrates that the Llama model exhibits substantially higher mean ratios across all measured categories compared to the Qwen model.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** "Mean ratio"
* **Scale:** Linear scale from 0.00 to 0.14, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14).
* **X-Axis:**
* **Categories (Models):** Two primary groups labeled "Qwen2.5-7B-Math" (left group) and "Llama3.1-8B-Instruct" (right group).
* **Legend:**
* **Position:** Top-left corner of the chart area.
* **Categories & Colors:**
1. `digit` - Teal color (approximate hex: #7fcdbb)
2. `operator` - Light green color (approximate hex: #c7e9b4)
3. `conjunction` - Light blue color (approximate hex: #a1d9f4)
4. `other` - Lavender/light purple color (approximate hex: #d0d1e6)
### Detailed Analysis
The chart presents the mean ratio for four token types for each model. Values are approximate based on visual inspection against the y-axis.
**For Qwen2.5-7B-Math (Left Group):**
* **Trend:** All four bars are relatively low and close in height, all below the 0.08 mark.
* **Data Points (Approximate):**
* `digit` (Teal): ~0.068
* `operator` (Light Green): ~0.054
* `conjunction` (Light Blue): ~0.054
* `other` (Lavender): ~0.049
**For Llama3.1-8B-Instruct (Right Group):**
* **Trend:** All four bars are significantly taller than their counterparts in the Qwen group. The `operator` bar is the tallest, followed by `conjunction`, then `digit`, and finally `other`.
* **Data Points (Approximate):**
* `digit` (Teal): ~0.102
* `operator` (Light Green): ~0.139
* `conjunction` (Light Blue): ~0.111
* `other` (Lavender): ~0.094
### Key Observations
1. **Model Disparity:** The most prominent observation is the substantial difference in magnitude between the two models. Every token category for Llama3.1-8B-Instruct has a mean ratio roughly 1.5 to 2.5 times higher than the corresponding category for Qwen2.5-7B-Math.
2. **Category Ranking:** The internal ranking of categories differs between models.
* For **Qwen**, `digit` is the highest, followed by a tie between `operator` and `conjunction`, with `other` being the lowest.
* For **Llama**, `operator` is the highest, followed by `conjunction`, then `digit`, and `other` is again the lowest.
3. **Operator Emphasis:** The `operator` category shows the most dramatic increase between models, jumping from one of the lower values in Qwen to the highest value in Llama.
### Interpretation
This chart likely visualizes a metric related to the internal token usage or attention patterns of these two language models, possibly during mathematical reasoning tasks (given the "Math" in Qwen's name and the token categories like "digit" and "operator").
* **What the data suggests:** The significantly higher "Mean ratio" for Llama3.1-8B-Instruct across all categories could indicate several possibilities: a different tokenization strategy, a higher density or frequency of these specific token types in its outputs or internal representations, or a different architectural approach to processing mathematical language. The fact that `operator` tokens are most prominent in Llama might suggest it places a stronger relative emphasis on procedural or operational steps in its reasoning compared to Qwen.
* **Relationship between elements:** The direct side-by-side comparison of the same four categories for two different models allows for a clear, controlled analysis of how model architecture or training affects this specific metric. The legend is essential for decoding which bar corresponds to which linguistic component.
* **Notable anomalies:** The reversal in the ranking of `digit` and `operator` between the two models is a key finding. It suggests a fundamental difference in how these models prioritize or represent core components of mathematical language. The consistently lowest value for `other` in both models indicates that the three specified categories (digit, operator, conjunction) are the primary drivers of the measured "Mean ratio."
</details>
Figure 7: Percentage of near-zero activations across the four token types in the Math-500 dataset.
We provide one possible explanation for why redistributing near-zero activations can improve the reasoning performance of LLMs. We categorize all tokens in Math-500 into four classes: digits, conjunctions, operators, and other tokens. For each class, we compute the average proportion of activations falling within the near-zero range, which reflects how many dimensions of the hidden representation remain nearly inactive. The results are presented in Figure 7. As shown, the other (normal) tokens exhibit the lowest near-zero proportion, while digits, operators, and conjunctions show substantially higher proportions; that is, among the frequent near-zero activations after the activation function, a disproportionate share comes from these three token types. This suggests that although these tokens are crucial for reasoning, the model activates their information insufficiently. Our observation is consistent with the findings of Huan et al. (2025), who highlight the increasing importance of conjunctions after reinforcement learning, and aligns with the recognized role of digits and operators in reasoning tasks such as mathematics and coding. Consequently, redistributing activations around zero strengthens the representation of under-activated yet semantically important tokens, improving reasoning performance.
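The per-class statistic described above can be sketched as follows, assuming we already have post-activation values for each token and a class label per token. The near-zero band $[-\tau, \tau]$ and its width are our assumptions; the paper's exact band may differ.

```python
import numpy as np

def near_zero_proportion(acts: np.ndarray, labels: list[str],
                         tau: float = 0.01) -> dict:
    """Average fraction of activation dimensions within [-tau, tau] per token class.

    acts:   (num_tokens, d_ff) values after the MLP non-linearity
    labels: class per token, e.g. "digit", "operator", "conjunction", "other"
    tau:    near-zero threshold (assumed value)
    """
    # fraction of near-zero dimensions for each token
    per_token = (np.abs(acts) <= tau).mean(axis=1)
    out = {}
    for cls in set(labels):
        mask = np.array([l == cls for l in labels])
        out[cls] = float(per_token[mask].mean())
    return out
```

Comparing the resulting per-class averages (as in Figure 7) reveals which token types contribute most of the near-zero activations.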
## 7 Related Work
Recent studies observe that symbols in an LLM’s input can affect its internal mechanisms. Sun et al. (2024) show large activations for separators, periods, and newlines, suggesting that these tokens carry model biases. Razzhigaev et al. (2025) find that commas are essential for contextual memory, while Chauhan et al. (2025) and Min et al. (2024) highlight punctuation as attention sinks, memory aids, and semantic cues. Moreover, Chadimová et al. (2024) show that replacing words with meaningless tokens can reduce cognitive biases, whereas Li et al. (2024) report that such “glitch tokens” may also cause misunderstandings, refusals, or irrelevant outputs. Our work explains the puzzling downstream benefit that inserting a string of meaningless tokens brings to reasoning performance, and shows how deeper investigation of the underlying mechanism can lead to improved inference solutions. We provide an extended discussion of related work in Appendix B.
## 8 Conclusion
In this paper, we report a meaningless-token effect: inserting long sequences of meaningless tokens improves model performance, particularly on reasoning tasks. Our analysis suggests that this effect stems from the fact that meaningless tokens induce an affine transformation on meaningful tokens, thereby redistributing their activations and enabling key information to be more effectively utilized. Building on this insight, we introduce ARM, a lightweight and training-free method for activation redistribution, which strengthens our analysis and serves as a practical approach for consistently improving LLM performance on reasoning tasks.
## Ethics Statement
All datasets used in this work are publicly available and contain no sensitive information. Our method enhances LLM reasoning without introducing new data collection or human interaction. While stronger reasoning ability may be misused, we emphasize that this work is intended for beneficial research and responsible applications.
## Reproducibility Statement
We will release our code and data once the paper is published. The appendix includes detailed experimental setups and hyperparameters so that others can reproduce our results. We also encourage the community to follow good research practices when using our code and data, to help maintain the reliability and transparency of future work.
## References
- Llama Team (2024) Llama Team, AI @ Meta. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- Gemma Team (2025) Gemma Team, Google DeepMind. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786.
- Qwen Team (2025) Qwen Team. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- AIME (2024) AIME. Aime problems and solutions, 2024. URL https://aime24.aimedicine.info/.
- AIME (2025) AIME. Aime problems and solutions, 2025. URL https://artofproblemsolving.com/wiki/index.php/AIMEProblemsandSolutions.
- Bogdan et al. (2025) Paul C Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter? arXiv preprint arXiv:2506.19143, 2025.
- Chadimová et al. (2024) Milena Chadimová, Eduard Jurášek, and Tomáš Kliegr. Meaningless is better: hashing bias-inducing words in llm prompts improves performance in logical reasoning and statistical learning. arXiv preprint arXiv:2411.17304, 2024.
- Chauhan et al. (2025) Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, and Nandi Schoots. Punctuation and predicates in language models. arXiv preprint arXiv:2508.14067, 2025.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Dhanraj & Eliasmith (2025) Varun Dhanraj and Chris Eliasmith. Improving rule-based reasoning in llms via neurosymbolic representations. arXiv e-prints, pp. arXiv–2502, 2025.
- Errica et al. (2024) Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering. arXiv preprint arXiv:2406.12334, 2024.
- Guan et al. (2025) Bryan Guan, Tanya Roosta, Peyman Passban, and Mehdi Rezagholizadeh. The order effect: Investigating prompt sensitivity to input order in llms. arXiv preprint arXiv:2502.04134, 2025.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- Højer et al. (2025) Bertram Højer, Oliver Jarvis, and Stefan Heinrich. Improving reasoning performance in large language models via representation engineering. arXiv preprint arXiv:2504.19483, 2025.
- Huan et al. (2025) Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Jiang et al. (2024) Ming Jiang, Tingting Huang, Biao Guo, Yao Lu, and Feng Zhang. Enhancing robustness in large language models: Prompting for mitigating the impact of irrelevant information. In International Conference on Neural Information Processing, pp. 207–222. Springer, 2024.
- Kaul et al. (2024) Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. arXiv preprint arXiv:2410.17174, 2024.
- Kawasaki et al. (2024) Amelia Kawasaki, Andrew Davis, and Houssam Abbas. Defending large language models against attacks with residual stream activation analysis. arXiv preprint arXiv:2406.03230, 2024.
- Li et al. (2024) Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection. Proceedings of the ACM on Software Engineering, 1(FSE):2075–2097, 2024.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Liu et al. (2024) Weize Liu, Yinlong Xu, Hongxia Xu, Jintai Chen, Xuming Hu, and Jian Wu. Unraveling babel: Exploring multilingual activation patterns of llms and their applications. arXiv preprint arXiv:2402.16367, 2024.
- London & Kanade (2025) Charles London and Varun Kanade. Pause tokens strictly increase the expressivity of constant-depth transformers. arXiv preprint arXiv:2505.21024, 2025.
- Luo et al. (2025) Yifan Luo, Zhennan Zhou, and Bin Dong. Inversescope: Scalable activation inversion for interpreting large language models. arXiv preprint arXiv:2506.07406, 2025.
- Luo et al. (2024) Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, and Maosong Sun. Sparsing law: Towards large language models with greater activation sparsity. arXiv preprint arXiv:2411.02335, 2024.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
- Min et al. (2024) Junghyun Min, Minho Lee, Woochul Lee, and Yeonsoo Lee. Punctuation restoration improves structure understanding without supervision. arXiv preprint arXiv:2402.08382, 2024.
- Owen et al. (2025) Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, and Fabian Güra. A refined analysis of massive activations in llms. arXiv preprint arXiv:2503.22329, 2025.
- Pfau et al. (2024) Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.
- Pham & Nguyen (2024) Van-Cuong Pham and Thien Huu Nguyen. Householder pseudo-rotation: A novel approach to activation editing in llms with direction-magnitude perspective. arXiv preprint arXiv:2409.10053, 2024.
- Rai & Yao (2024) Daking Rai and Ziyu Yao. An investigation of neuron activation as a unified lens to explain chain-of-thought eliciting arithmetic reasoning of llms. arXiv preprint arXiv:2406.12288, 2024.
- Razzhigaev et al. (2025) Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, and Andrey Kuznetsov. Llm-microscope: Uncovering the hidden role of punctuation in context memory of transformers. arXiv preprint arXiv:2502.15007, 2025.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Sheng et al. (2025) Yu Sheng, Linjing Li, and Daniel Dajun Zeng. Learning theorem rationale for improving the mathematical reasoning capability of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 15151–15159, 2025.
- Shi et al. (2024) Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Hang Gao, Fan Yang, Ruixiang Tang, and Yongfeng Zhang. Robustness-aware automatic prompt optimization. arXiv preprint arXiv:2412.18196, 2024.
- Sun et al. (2024) Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
- Turner et al. (2023) Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
- Voita et al. (2023) Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827, 2023.
- Wang et al. (2025) Yudong Wang, Damai Dai, Zhe Yang, Jingyuan Ma, and Zhifang Sui. Exploring activation patterns of parameters in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 25416–25424, 2025.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Zhao et al. (2025) Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. Analyzing the rapid generalization of sft via the perspective of attention head activation patterns. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16980–16992, 2025.
- Zhou et al. (2024) Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024.
- Zhuo et al. (2024) Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. Prosa: Assessing and understanding the prompt sensitivity of llms. arXiv preprint arXiv:2410.12405, 2024.

## Contents
- 1 Introduction
- 2 Observation: Inserting Meaningless Tokens Induces an Affine Transformation on Meaningful Token Representations
- 3 Analysis: Why Affine Transformation Improve Reasoning Performance
  - 3.1 Affine Transformation influence the output of gate layer
  - 3.2 Variance change leads to activation redistribution
  - 3.3 Verification of activation redistribution
- 4 Method: Activation Redistribution Module
  - 4.1 Select Appropriate Change proportion
  - 4.2 Redistribution of Activation Values
- 5 Experiments
  - 5.1 Experiment Results Analysis
  - 5.2 Comparison of Meaningless tokens and ARM
  - 5.3 Exploration capabilities after ARM
- 6 Discussion: Why Activation Redistribution Enhances LLM Reasoning Performance
- 7 Related Work
- 8 Conclusion
- A Disclosure of LLM Usage
- B Related Work
  - B.1 Meaningless Tokens in LLMs
  - B.2 Activations Analysis in LLMs
- C Limitations
- D Time Complexity
- E Proof
  - E.1 Scaling factor causes variance rise
  - E.2 Bias factor causes variance rise
- F More Analysis
  - F.1 The impact of inserting length of meaningless tokens
  - F.2 The impact of inserting position of meaningless tokens
  - F.3 The impact of inserting type of meaningless tokens
  - F.4 Why we only analyze first layer
  - F.5 Repeat Meaningful tokens’ effectiveness
  - F.6 Why random sentence is useless
  - F.7 The optimal hyperparameter range
- G More Experiments
  - G.1 Results on non-reasoning tasks
  - G.2 Results on Base Model
  - G.3 Inference Time trick comparison
- H The example of hyper parameters
- I More Average activation weights
- J Examples of model’s output change of adding meaningless tokens
## Appendix A Disclosure of LLM Usage
This paper used LLMs to assist with grammar checking.
## Appendix B Related Work
### B.1 Meaningless Tokens in LLMs
Recent studies have shown that seemingly meaningless tokens, such as punctuation marks, play a non-trivial role in information propagation and reasoning within large language models (LLMs). For example, Sun et al. (2024) report that LLMs exhibit large activations in response to separators, periods, or newline characters, suggesting that these tokens can serve as carriers of model biases. Similarly, Razzhigaev et al. (2025) demonstrate that tokens such as commas act as crucial elements in maintaining contextual memory: removing them significantly degrades performance on context-understanding tasks. Chauhan et al. (2025) further argue that punctuation may function as attention sinks or assist the memory mechanism, while Min et al. (2024) highlight its value in semantic construction, enabling models to better capture contextual structure. In addition, Chadimová et al. (2024) show that substituting certain words with meaningless tokens can mitigate cognitive biases in LLMs. Conversely, Li et al. (2024) illustrate that meaningless “glitch tokens” can induce misunderstandings, refusals, or irrelevant generations. However, these works primarily examine the effects of individual meaningless tokens, without considering the broader impact of longer meaningless token sequences.
More recently, several studies have explored the role of long meaningless token sequences and reported their surprising positive influence on LLM performance. For instance, Zhou et al. (2024) find that appending meaningless tokens to the end of prompts can trigger or defend against jailbreak behaviors. Similarly, Shi et al. (2024) show that adding long meaningless sequences after a sentence can improve model performance on certain tasks. Pfau et al. (2024) and London & Kanade (2025) report that substituting meaningful tokens with filler-like tokens (e.g., ‘…’) in the training data preserves the model’s ability to solve questions, suggesting that even without meaningful tokens the model can perform implicit computation. Meanwhile, there are also some methods to improve the reasoning performance of LLMs (Dhanraj & Eliasmith, 2025; Højer et al., 2025; Sheng et al., 2025). Despite these empirical findings and methods, there is still a lack of systematic analysis explaining why meaningless tokens, especially in longer sequences, can play such a counterintuitive yet beneficial role in shaping LLM reasoning behavior.
### B.2 Activations Analysis in LLMs
Activation analysis is a popular method for explaining the internal mechanisms of LLMs (Wang et al., 2025; Kawasaki et al., 2024; Pham & Nguyen, 2024; Rai & Yao, 2024). Owen et al. (2025) supplement Sun et al. (2024) by analyzing the activations after the MLP to study how massive values influence bias and large attention. Wang et al. (2025) test hidden states across all layers to examine the importance of parameters in different layers. Zhao et al. (2025) use activations to determine whether an attention head is activated after training. Kaul et al. (2024) analyze attention activations and find that almost all activations focus on the first tokens; they also analyze high activations in the FFN. Luo et al. (2024) systematically study the magnitude law and influencing factors of activation sparsity in decoder-only Transformer architectures, showing that different activation functions (ReLU vs. SiLU) lead to drastically different sparsity trends during training. In Liu et al. (2024), activation refers to the output behavior of the expert selector: instead of a single neuron activating, the analysis investigates which expert module each token is routed to. Turner et al. (2023) propose steering middle-layer activations to improve model outputs. Voita et al. (2023) use the OPT model to analyze FFN neurons. Luo et al. (2025) use activations to understand the semantic information in LLMs. However, most papers analyze activations via activation scores, hidden states, or broader definitions of activation. Few works directly examine the activations immediately after the non-linear activation functions in the MLP.
## Appendix C Limitations
Different meaningless tokens lead to varying performance outcomes. We know only that this difference arises from their differing degrees of transformation; why different tokens induce different degrees remains unclear. Meanwhile, we assume that LLMs can identify meaningless tokens in the first layer. Therefore, our analysis focuses only on their impact on meaningful tokens and how this interaction influences model performance, ignoring the meaningless tokens themselves; future work can explicitly consider them. We restrict our analysis to the first layer, as it is the only layer where the attention scores exhibit a clear phenomenon (see Figure 2). Future work may examine whether similar effects arise in deeper layers.
## Appendix D Time Complexity
In this section, we analyze the time complexity of our method within the MLP. In the first layer’s MLP, let the batch size be $B$, the sequence length $S$, the feed-forward dimension $D_f$, and the model dimension $D_{model}$. The MLP comprises the gate projection, up projection, and down projection. The time complexity of each projection is $O(2BSD_fD_{model})$, so the total for the MLP is:
$$
T_{\mathrm{MLP}}=O(BSD_fD_{model}), \tag{11}
$$
For the ARM module, the operations are: computing the MAD, comparing against the threshold, computing the proportion $p$, and selecting the elements to be changed. Each of these operations is $O(BSD_f)$, so the time complexity of ARM is:
$$
T_{\mathrm{ARM}}=O(BSD_f), \tag{12}
$$
The ratio between the time complexity of ARM and that of the MLP is on the order of $\frac{1}{D_{model}}$. When $D_{model}=4096$, this ratio is approximately $\frac{1}{2\times 4096}\approx 1.2\times 10^{-4}$, i.e., on the order of one ten-thousandth. Therefore, the time complexity of ARM is negligible within the MLP layer.
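The ARM operations enumerated above (computing the MAD, thresholding, and shifting the selected near-zero entries) can be sketched as follows; every step is elementwise or a per-row reduction, hence $O(BSD_f)$. The MAD-based threshold rule (`k * mad`) and the fixed shift magnitude `delta` are our assumptions standing in for the adaptive proportion $p$ the paper selects.

```python
import numpy as np

def arm_shift(acts: np.ndarray, k: float = 0.5, delta: float = 0.05) -> np.ndarray:
    """Shift near-zero post-activation values outward; O(B*S*D_f) per call.

    acts:  (batch, seq, d_ff) activations after the non-linearity
    k:     multiplier on the MAD used as the near-zero threshold (assumed rule)
    delta: outward shift magnitude (assumed; the paper adapts a proportion p)
    """
    med = np.median(acts, axis=-1, keepdims=True)
    mad = np.median(np.abs(acts - med), axis=-1, keepdims=True)  # median absolute deviation
    near_zero = np.abs(acts) <= k * mad                          # select near-zero entries
    # push selected entries away from zero, preserving sign (exact zeros go positive)
    shifted = acts + np.where(acts >= 0, delta, -delta)
    return np.where(near_zero, shifted, acts)
```

In a real deployment this transform would run inside the first layer's MLP (e.g., via a forward hook after the activation function), leaving all other entries untouched.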
## Appendix E Proof
### E.1 Scaling factor causes variance rise
**Lemma E.1**
*In LLMs, RMSNorm uses $\varepsilon>0$; hence $J_q(x_0)$ is bounded and $\|x_0\|\geq\varepsilon$.*
For every $λ$ , we have:
$$
x(\lambda)=r+\lambda UA,\quad y(\lambda)=\mathrm{RMS}\big(x(\lambda)\big),\quad z_j(\lambda)=w_j^{\top}y(\lambda), \tag{13}
$$
For every $Δλ$ , we have:
$$
z_j(\lambda+\Delta\lambda)\approx z_j(\lambda)+g_j(\lambda)\,\Delta\lambda,\quad g_j(\lambda)=w_j^{\top}J_q\big(x(\lambda)\big)UA, \tag{14}
$$
For $\mathrm{Var}_j$ we have the following derivation. Using the first-order expansion in Equation 14,
$$
\Delta \mathrm{Var}_j \triangleq \mathrm{Var}\big[z_j(\lambda+\Delta\lambda)\big]-\mathrm{Var}\big[z_j(\lambda)\big] = 2\,\mathrm{Cov}(z_j,g_j)\,\Delta\lambda + \mathrm{Var}(g_j)\,(\Delta\lambda)^2,
$$
so that
$$
\Delta \mathrm{Var}_j \ge -2\bigl|\mathrm{Cov}(z_j,g_j)\bigr|\,|\Delta\lambda| + \mathrm{Var}(g_j)\,(\Delta\lambda)^2,
$$
which is positive whenever
$$
|\Delta\lambda| > \frac{2\bigl|\mathrm{Cov}(z_j,g_j)\bigr|}{\mathrm{Var}(g_j)}.
$$
Meanwhile, accounting for higher-order terms, we also need:
$$
\Delta \mathrm{Var}_j \ge -2\bigl|\mathrm{Cov}(z_j,g_j)\bigr|\,|\Delta\lambda| + A\,(\Delta\lambda)^2 - \frac{K}{6}\,|\Delta\lambda|^3. \tag{15}
$$
where $K$ is an upper bound on $\mathrm{Var}[z_j(\lambda)]$; thus we obtain the range:
$$
\frac{2\bigl|\mathrm{Cov}(z_j,g_j)\bigr|}{\mathrm{Var}(g_j)} \le |\Delta\lambda| \le \frac{3\,\mathrm{Var}(g_j)}{K}. \tag{16}
$$
For every $|\Delta\lambda|$ in this range, we have $\Delta\mathrm{Var}_j>0$. In particular, as $|\Delta\lambda|$ grows, the quadratic term dominates, so $\Delta\mathrm{Var}_j$ increases monotonically and eventually becomes positive.
### E.2 Bias factor causes variance rise
**Lemma E.2**
*The bias we add is drawn from a uniform distribution, sampled independently each time, and does not depend on the specific value of the attention output.*
**Lemma E.3**
*In the high-dimensional setting of LLMs, the bias has a nonzero tangential component and $w_j^{\top}J_q(x_0)W\neq 0$.*
According to the above lemmas, we have:
$$
\mathrm{Var}[z_j]\approx e_j^{\top}W_{\mathrm{gate}}\,J_q(x_0)\,W\,\Sigma_\sigma\,W^{\top}J_q(x_0)^{\top}W_{\mathrm{gate}}^{\top}\,e_j. \tag{17}
$$
Thus, we have $\Delta\mathrm{Var}_j>0$.
## Appendix F More Analysis
### F.1 The impact of inserting length of meaningless tokens
In this section, we analyze the relationship between the length of inserted tokens and the performance of LLMs. We evaluate five models on MATH-500 while varying the number of inserted tokens from 0 to 70. The results are shown in Figure 8. We observe that when the inserted sequence is relatively short, the models outperform the baseline, although their accuracy fluctuates. However, when too many tokens are inserted, performance drops sharply. This occurs because, as the length of the inserted tokens increases, their influence on the attention output values accumulates (as shown in Equation 1). Once this accumulation reaches a critical level, it no longer produces a small, benign effect; instead, it alters the model’s internal semantic structure and degrades its performance.
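The manipulation studied in this experiment can be sketched as a one-line prompt transform. The particular token (`.`) and the newline separator below are illustrative assumptions; the paper evaluates several meaningless-token types and lengths.

```python
def prepend_meaningless(query: str, token: str = ".", n: int = 40) -> str:
    """Prepend n copies of a meaningless token before the query prompt."""
    return token * n + "\n" + query
```

For instance, `prepend_meaningless("Solve x+1=2.", n=25)` yields 25 dots, a newline, then the query; Figure 8 suggests moderate insertion lengths help while overly long insertions degrade performance.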
*(Figure 8 placeholder: grouped bar chart of MATH-500 accuracy for Qwen-2.5-MATH-7B, GPT-4o-mini, Qwen-2.5-MATH-1.5B, DS-R1-Distill-Qwen-7B, and DS-math-7b-instruct with 0, 10, 25, 40, 55, and 70 inserted characters; accuracy stays at or slightly above the 0-character baseline up to 55 characters and drops sharply at 70 for every model.)*
Figure 8: The relationship between the length of inserted tokens and model performance.
### F.2 The impact of the insertion position of meaningless tokens
In the previous section, we demonstrated that inserting meaningless tokens between the system prompt and the question leads to improved model performance. In this section, we further investigate the effect of inserting meaningless tokens at different positions. Specifically, we consider four settings: ❶ the beginning of the system prompt, ❷ between the system prompt and the question, ❸ the end of the input, and ❹ a random position within the input. The results are reported in Table 4. We observe that only inserting tokens between the system prompt and the question yields performance gains. In contrast, appending tokens to the end of the input causes the model to simply repeat them, leading to zero accuracy. Inserting tokens at random positions disrupts the original semantic structure of the sentence, while inserting them at the beginning alters the values of the system prompt itself, introducing extra terms as shown in Equation 1. We hypothesize that this disrupts the intended initialization and interferes with the task the model is expected to process. Therefore, the most effective position for inserting meaningless tokens is between the system prompt and the question.
| | MATH-500 | AIME 2024 |
| --- | --- | --- |
| w/o meaningless tokens | 72.4 | 23.1 |
| position ❶ | 69.6 | 21.1 |
| position ❷ | 75.0 | 23.3 |
| position ❸ | 0.0 | 0.0 |
| position ❹ | 51.2 | 21.1 |
Table 4: Performance on Math-500 and AIME 2024 after inserting meaningless tokens in different positions.
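For concreteness, the four insertion positions can be sketched as a small helper (the function name and the 55-character filler are our own illustrative choices, not the paper's code):

```python
import random

def build_prompt(system_prompt, question, filler="/" * 55, position=2):
    """Insert `filler` at one of the four positions studied in Table 4."""
    if position == 1:       # position 1: beginning of the system prompt
        return filler + system_prompt + question
    if position == 2:       # position 2: between system prompt and question
        return system_prompt + filler + question
    if position == 3:       # position 3: end of the input
        return system_prompt + question + filler
    if position == 4:       # position 4: a random position within the input
        text = system_prompt + question
        i = random.randrange(len(text) + 1)
        return text[:i] + filler + text[i:]
    raise ValueError(position)

prompt = build_prompt("You are a helpful assistant. ", "What is 2+2?")
assert "/" * 55 in prompt and prompt.endswith("What is 2+2?")
```

Only position 2 leaves both the system prompt's and the question's internal structure untouched, which matches the observation that it is the only setting yielding gains.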
### F.3 The impact of the type of inserted meaningless tokens
In this section, we examine the influence of inserting different types of meaningless tokens on reasoning tasks. In our experiments, we insert varying lengths of slashes (“/”) and question marks (“?”) into the inputs and select the best-performing configuration from each set. As shown in Table 5, different types of meaningless tokens affect LLM performance differently, and no single unified pattern emerges. We attribute this to the fact that different token types carry distinct value representations, so their effect on the attention outputs during the transformation also differs. Moreover, the sensitivity of individual questions to such transformations varies. Consequently, the impact of meaningless tokens differs across tasks and models.
Table 5: Accuracy of LLMs on two mathematical reasoning datasets when inserting different kinds of meaningless tokens.
| Model | MATH-500 (w/o) | MATH-500 (“/”) | MATH-500 (“?”) | AIME 2024 (w/o) | AIME 2024 (“/”) | AIME 2024 (“?”) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5b | 63.6 | 66.8 | 58.2 | 14.4 | 18.8 | 16.1 |
| Qwen2.5-Math-7b | 72.4 | 75.0 | 69.6 | 23.3 | 24.4 | 22.2 |
| DeepSeek-R1-Distill-Qwen-7B | 52.0 | 55.0 | 53.6 | 3.3 | 3.3 | 4.4 |
| DeepSeek-Math-7b-instruct | 39.6 | 41.4 | 43.4 | 7.8 | 12.2 | 12.5 |
| Llama-3.1-8B-Instruct | 35.4 | 36.6 | 34.2 | 11.1 | 7.8 | 13.3 |
| Qwen-2.5-32B-Instruct | 80.8 | 81.0 | 81.6 | 18.9 | 20.0 | 21.1 |
### F.4 Why we only analyze the first layer
*(Figure 9 placeholder: average attention weights over token index 0–60 for Head 22 in Layers 1–3 of Qwen2.5-7B-Math, comparing inference without (“None”) and with (“Mless”) meaningless tokens; only Layer 1 shows a consistent attenuation of the weights under “Mless”, while Layers 2 and 3 show no systematic pattern.)*
Figure 9: Average attention weights for later tokens in Layers 1 to 3 of Qwen2.5-7B-Math.
In this section, we explain why our analysis and redistribution of activations focus exclusively on the first layer. As shown in Figure 9, we present the average attention weights of later tokens in Layers 1, 2, and 3 of Qwen2.5-7B-Math. We observe that only the first layer exhibits a clear and consistent phenomenon: after inserting meaningless tokens, the average attention weights decrease to a noticeable extent, suggesting that meaningless tokens directly alter the initial allocation of attention. In contrast, Layers 2 and 3 do not display such regularity—the average attention weights with and without meaningless tokens show no systematic relationship. Consequently, later layers do not undergo an affine transformation of this type. We hypothesize that this disappearance of the phenomenon arises because, beyond the first layer, the model has already integrated and mixed substantial semantic information through residual connections. From the second layer onward, the model begins to reconstruct and redistribute information, thereby diminishing the direct effect of meaningless tokens on average attention weights. In other words, the role of meaningless tokens becomes less distinguishable once meaningful contextual representations dominate, which explains why the first layer is the most critical point for observing and leveraging this effect.
### F.5 The effectiveness of repeated meaningful tokens
In this section, we investigate whether adding meaningful tokens can play a role similar to meaningless tokens. Specifically, we insert a long sequence of repeated tokens that are semantically irrelevant to the question. For example, we add 55 repetitions of “he” between the system prompt and the question. The results, shown on the left of Figure 10, indicate that even such repeated but irrelevant meaningful tokens improve model performance. To better understand this effect, we further visualize the average attention weights after inserting these tokens, as presented on the right of Figure 10. The results reveal that the activation changes induced by repeated meaningful tokens closely resemble those caused by meaningless tokens, and the inserted tokens receive similar attention patterns, i.e., the attention weights assigned to the inserted segments are similar. Taken together, these findings suggest that when repeated tokens are inserted at appropriate positions without introducing additional semantic content, LLMs are able to recognize them as irrelevant. Consequently, they trigger a redistribution of activations in the MLP, ultimately improving model performance.
*(Figure 10 placeholder: left, MATH-500 accuracy of Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B Instruct without insertions (w/o), with meaningless tokens (w/ Mless), and with repeated meaningful tokens (w/ Mful), where both insertion types improve on the baseline; right, average attention on Head 14, where the Mless and Mful curves closely track each other.)*
Figure 10: The left panel illustrates a comparison between adding repeated meaningful tokens and meaningless tokens, while the right panel presents the average attention weights resulting from the addition of meaningful and meaningless tokens.
### F.6 Why a random sentence is useless
*(Figure 11 placeholder: Layer-1 attention weights over token positions 0–120 for Qwen2.5-7B-Math Head 1, Llama3.1-8B-Instruct Head 27, and Gemma3-4b-it Head 8, comparing the weights assigned to repeated meaningless tokens and to a random sentence.)*
Figure 11: The average attention weights when adding meaningless tokens versus a random sentence.
When additional tokens are inserted into a sentence, both the attention weights and the resulting attention outputs exhibit consistent patterns: the weights assigned to the original tokens decrease, while the attention outputs gain additional values from the inserted tokens. In this section, we analyze why adding repeated tokens can enhance the performance of LLMs, whereas inserting random or unrelated sentences can have a detrimental effect. The results are shown in Figure 11. We observe that the attention weights associated with the random sentence are highly diverse, and their corresponding value vectors also differ substantially. In contrast, the repeated meaningless tokens exhibit more uniform attention weights and nearly identical value vectors. Consequently, compared with repeated meaningless tokens, a random sentence introduces not only numerical fluctuations but also a pronounced directional shift in the attention outputs—one that carries additional semantic information. The formula of RMSNorm is:
$$
\mathrm{RMSNorm}(x)=\gamma\odot\frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_i^2+\varepsilon}}, \tag{18}
$$
where $γ$ is a learnable rescaling vector and $ε$ ensures numerical stability. For repeated meaningless tokens, the effect manifests as a small and uniform directional bias on the input to RMSNorm, producing only a minor numerical perturbation in its output. In contrast, inserting a random sentence introduces high-rank and structured semantic signals that RMSNorm cannot simply absorb. This leads to systematic shifts in the output direction and subspace, thereby altering the model’s internal semantic representations.
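The contrast can be illustrated with a toy perturbation-rank computation (the dimensions and the Dirichlet attention weights are our own assumptions, not the paper's code): identical value vectors make the shift in the attention outputs rank-1 across query positions, whereas a random sentence's distinct value vectors produce a high-rank, structured shift that RMSNorm cannot absorb as a mere rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, d = 32, 16, 64          # query positions, inserted tokens, head dim

# attention mass each query assigns to the inserted block (20% in total)
w = rng.dirichlet(np.ones(m), size=T) * 0.2

v_repeat = np.tile(rng.normal(size=d), (m, 1))   # identical value vectors
v_random = rng.normal(size=(m, d))               # a random sentence

shift_repeat = w @ v_repeat   # perturbation added to each attention output
shift_random = w @ v_random

rank = lambda M: np.linalg.matrix_rank(M, tol=1e-8)
assert rank(shift_repeat) == 1   # a single, uniform direction
assert rank(shift_random) == m   # high-rank, structured semantic signal
```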
### F.7 The optimal hyperparameter range
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: Performance vs. Near-Zero Range Width and Percentage Threshold
### Overview
The image is a heatmap chart visualizing a numerical metric (likely a performance score or accuracy percentage) as a function of two parameters: the "width of near-zero range c" (x-axis) and a "percentage p" (y-axis). The chart uses a color gradient from light yellow (lower values) to dark blue (higher values) to represent the metric's magnitude. All text is in English.
### Components/Axes
* **Chart Type:** 2D Heatmap / Grid Plot.
* **X-Axis (Horizontal):**
* **Label:** `width of near-zero range c`
* **Scale/Markers:** Categorical with values: `0.05`, `0.1`, `0.15`, `0.2`, `0.25`, `0.3`.
* **Y-Axis (Vertical):**
* **Label:** `percentage p`
* **Scale/Markers:** Categorical with values: `85.0`, `90.0`, `92.5`, `95.0`, `97.0`, `99.5`.
* **Legend / Color Bar:**
* **Placement:** Vertically aligned on the right side of the chart.
* **Scale:** Continuous gradient from light yellow to dark blue.
* **Labeled Ticks:** `64.0` (bottom, light yellow), `64.5`, `65.0`, `65.5`, `66.0`, `66.5`, `67.0` (top, dark blue).
* **Data Grid:** A 6x6 grid of cells, each containing a numerical value and colored according to the legend.
### Detailed Analysis
The following table reconstructs the data from the heatmap. Values are read directly from the cells. The color description is based on visual matching to the legend.
| percentage p \ width c | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **99.5** | 65.2 (Light Green-Blue) | 64.6 (Light Yellow-Green) | **66.4 (Dark Blue)** | **66.0 (Medium Blue)** | 64.4 (Light Yellow) | 65.2 (Light Green-Blue) |
| **97.0** | 65.4 (Light Green-Blue) | 65.4 (Light Green-Blue) | 65.2 (Light Green-Blue) | 65.6 (Medium Green-Blue) | 65.2 (Light Green-Blue) | 65.2 (Light Green-Blue) |
| **95.0** | 64.8 (Light Yellow-Green) | 64.4 (Light Yellow) | 64.4 (Light Yellow) | 65.6 (Medium Green-Blue) | 64.6 (Light Yellow-Green) | 64.2 (Light Yellow) |
| **92.5** | 65.0 (Light Green-Blue) | 64.2 (Light Yellow) | 64.8 (Light Yellow-Green) | 64.8 (Light Yellow-Green) | 65.2 (Light Green-Blue) | 65.0 (Light Green-Blue) |
| **90.0** | 64.6 (Light Yellow-Green) | 64.6 (Light Yellow-Green) | 64.6 (Light Yellow-Green) | 64.4 (Light Yellow) | 65.6 (Medium Green-Blue) | 65.2 (Light Green-Blue) |
| **85.0** | 64.0 (Light Yellow) | 64.4 (Light Yellow) | 64.8 (Light Yellow-Green) | 64.6 (Light Yellow-Green) | **67.0 (Darkest Blue)** | **66.2 (Dark Blue)** |
**Trend Verification:**
* **Horizontal Trend (Fixed p):** There is no single consistent trend across all rows. For example, at `p=99.5`, the value peaks sharply at `c=0.15` and `c=0.2` before dropping. At `p=85.0`, the value increases dramatically at the highest `c` values (`0.25`, `0.3`).
* **Vertical Trend (Fixed c):** Similarly, trends vary by column. At `c=0.25`, the value is relatively stable (~65.2) for higher `p` but jumps to 67.0 at `p=85.0`. At `c=0.15`, the value is highest at the top (`p=99.5`).
### Key Observations
1. **Global Maximum:** The highest value in the grid is **67.0**, located at the intersection of `p=85.0` and `c=0.25` (bottom row, second column from the right). This cell is colored the darkest blue.
2. **Global Minimum:** The lowest value is **64.0**, located at `p=85.0` and `c=0.05` (bottom-left corner). This cell is the lightest yellow.
3. **High-Value Clusters:**
* A cluster of high values (66.4, 66.0) appears in the top row (`p=99.5`) for mid-range `c` values (0.15, 0.2).
* Another high-value cluster (67.0, 66.2) appears in the bottom row (`p=85.0`) for high `c` values (0.25, 0.3).
4. **Low-Value Region:** The lower-left quadrant of the chart (lower `p` and lower `c`) generally contains the lowest values, ranging from 64.0 to 64.8.
5. **Stability:** The row for `p=97.0` shows remarkably stable values, all between 65.2 and 65.6, indicating the metric is least sensitive to changes in `c` at this percentage threshold.
### Interpretation
This heatmap likely illustrates the performance of a system or algorithm where two hyperparameters are being tuned: a percentage threshold (`p`) and the width of a "near-zero range" (`c`). The metric being measured (values ~64-67) could be accuracy, F1-score, or a similar performance indicator.
The data suggests a complex, non-linear interaction between the parameters:
* **Optimal Performance is Context-Dependent:** There is no single "best" setting. Peak performance (67.0) is achieved with a low percentage threshold (`p=85.0`) and a relatively high near-zero range width (`c=0.25`). However, a nearly as high score (66.4) is achieved with a very high threshold (`p=99.5`) and a moderate width (`c=0.15`). This implies the optimal configuration depends on other constraints or priorities in the system.
* **Sensitivity Varies:** The system's performance is highly sensitive to parameter changes in some regions (e.g., the bottom-right corner shows a steep increase) but very stable in others (e.g., the entire `p=97.0` row).
* **Potential Trade-off:** The two high-performance clusters (top-center and bottom-right) might represent two different operating regimes or strategies that yield similar results. Choosing between them could involve trade-offs not visible in this chart, such as computational cost or robustness.
**Uncertainty Note:** The values are transcribed from the image with high confidence, but minor reading errors (±0.1) are possible due to color perception and cell alignment. The exact meaning of the parameters `p` and `c` and the measured metric is not provided in the image.
</details>
Figure 12: Accuracy on Math-500 as a function of the hyper-parameters p and c.
In this section, we investigate how the hyper-parameters, namely the percentage $p$ and the width of the near-zero range $c$, influence model performance on Math-500 when using Qwen2.5-Math-1.5B. The results are summarized in Figure 12. As the figure illustrates, accuracy does not change monotonically with either $p$ or $c$; instead, the best-performing settings emerge only within specific regions of the parameter space. This indicates that the choice of hyper-parameters is not trivial and cannot be reduced to cherry-picking. More concretely, we find that balanced combinations of $p$ and $c$ lead to more stable improvements. High accuracy is typically concentrated in two regions: when $p$ is large and $c$ is small, or conversely, when $c$ is large and $p$ is small. In these cases, the redistribution mechanism introduced by ARM effectively amplifies informative activations while suppressing uninformative near-zero activations. Outside these regions, performance degrades, suggesting that poorly chosen hyper-parameters may distort the activation distribution rather than enhance it.

These observations highlight the importance of aligning hyper-parameter choices with the intrinsic properties of the activation distribution. To maximize the benefits of ARM, one must account for both the proportion of near-zero activations and the magnitude of the maximum activation values, thereby ensuring that $p$ and $c$ fall within an appropriate interval. Since the two parameters are equally important, as each row and column of Figure 12 shows, optimal performance is determined jointly by their ranges; within this regime, ARM consistently improves reasoning performance rather than introducing instability.
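For concreteness, the redistribution step that these hyper-parameters control can be sketched as follows. This is a minimal reconstruction from the description in the paper, not the released implementation: the shift amount `delta` and the hook placement are illustrative assumptions.

```python
import numpy as np

def arm_shift(act, c=0.15, delta=None):
    """Sketch of the Activation Redistribution Module (ARM).

    Activations whose magnitude falls inside the near-zero range (-c, c)
    are shifted outward, away from zero, preserving sign. `delta` is an
    assumed default (push values just outside the range); exact zeros are
    left untouched since sign(0) == 0.
    """
    act = np.asarray(act, dtype=float)
    if delta is None:
        delta = c
    near_zero = np.abs(act) < c
    shifted = act + np.sign(act) * delta
    return np.where(near_zero, shifted, act)

# Near-zero entries 0.01, -0.05, 0.14 are pushed outward; 0.5, -1.2 are kept.
print(arm_shift([0.01, -0.05, 0.5, -1.2, 0.14]))
```

In practice this would be applied after the MLP's non-linear activation at inference time, e.g. via a forward hook on the activation function.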
## Appendix G More Experiments
### G.1 Results on non-reasoning tasks
In this section, we present supplementary results on non-reasoning benchmarks, including ARC-C, MMLU, BoolQ, and OpenBookQA, as shown in Table 6. Across all evaluated models, the application of our method yields only marginal variations in performance. For most models and tasks, it either produces slight improvements or maintains parity with the baseline (i.e., vanilla model performance without any inference-time trick), suggesting that the redistribution of activations has little impact when the task primarily requires factual recall or pattern recognition rather than multi-step reasoning. A minor performance drop is observed only on a small subset of tasks with Llama-3.1-8B-Instruct, which we attribute to model-specific characteristics or sensitivity to activation perturbations. These findings indicate that our approach exerts negligible influence on non-reasoning tasks and, in most cases, does not introduce adverse effects on task accuracy. This observation further supports our central claim: the benefits of activation redistribution are most pronounced in reasoning-oriented scenarios, while in non-reasoning settings the method remains stable and does not compromise the model's inherent ability to answer factual or knowledge-intensive questions.
Table 6: Complete results of several models on non-reasoning tasks.
| Model | Setting | GSM8K | ARC-E | ARC-C | MMLU | BoolQ | HellaSwag | OpenBookQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5 Math-1.5B | Baseline | 78.0 | 39.3 | 35.0 | 32.1 | 32.6 | 39.1 | 42.0 |
| | ARM | 78.6 | 39.3 | 35.4 | 32.1 | 33.4 | 39.5 | 42.4 |
| | Improve Rate (%) | 0.6 $↑$ | 0 | 0.4 $↑$ | 0 | 0.8 $↑$ | 0.4 $↑$ | 0.4 $↑$ |
| Qwen2.5 Math-7B | Baseline | 83.8 | 49.7 | 47.9 | 36.9 | 38.6 | 46.9 | 47.6 |
| | ARM | 83.8 | 49.7 | 47.0 | 37.5 | 38.7 | 47.1 | 47.9 |
| | Improve Rate (%) | 0 | 0 | 0 | 0.6 $↑$ | 0.1 $↑$ | 0.2 $↑$ | 0.3 $↑$ |
| Llama3.1 8B-Instruct | Baseline | 80.0 | 46.6 | 49.0 | 38.6 | 43.3 | 56.8 | 52.8 |
| | ARM | 82.4 | 47.1 | 48.7 | 38.2 | 43.2 | 57.3 | 50.8 |
| | Improve Rate (%) | 2.4 $↑$ | 0.5 $↑$ | -0.3 $↓$ | -0.4 $↓$ | -0.1 $↓$ | 0.5 $↑$ | -2.0 $↓$ |
| Gemma3 4b-it | Baseline | 86.8 | 47.1 | 44.5 | 33.9 | 45.0 | 42.0 | 41.0 |
| | ARM | 86.8 | 47.1 | 45.0 | 34.1 | 45.2 | 42.0 | 42.0 |
| | Improve Rate (%) | 0 | 0 | 0.5 $↑$ | 0.2 $↑$ | 0.2 $↑$ | 0 | 1.0 $↑$ |
### G.2 Results on Base Model
Table 7: Performance on Math-500 and AIME 2024 after incorporating ARM into the MLP of base (non-reasoning) models.
| Model | Setting | Math-500 (Pass@1) | AIME 2024 (Pass@1) |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | Baseline | 67.8 | 14.4 |
| | ARM | 68.2 | 14.4 |
| | Improve Rate (%) | 0.4 $↑$ | 0 |
| Qwen2.5-7B | Baseline | 50.4 | 15.6 |
| | ARM | 50.6 | 16.7 |
| | Improve Rate (%) | 0.2 $↑$ | 1.1 $↑$ |
| Qwen2.5-32B | Baseline | 77.2 | 27.8 |
| | ARM | 77.4 | 28.9 |
| | Improve Rate (%) | 0.2 $↑$ | 1.1 $↑$ |
In this section, we evaluate the effect of applying ARM to base models and report their performance on Math-500 and AIME 2024 using Qwen2.5-1.5B, Qwen2.5-7B, and Qwen2.5-32B. Since these models achieve accuracy above 5%, we consider them capable of tackling these tasks; in contrast, models such as Llama3.1-8B and Gemma3-4B-PT perform poorly and are therefore excluded from the evaluation. The results in Table 7 show that incorporating ARM into the MLP layers of base models yields measurable performance gains on reasoning tasks, although the improvements are generally smaller than those observed for reasoning-oriented models. We attribute this gap to the weaker inherent reasoning abilities of base models: while activation redistribution can still enhance their internal representations, it may not strongly affect how they process key numerical or symbolic elements, such as digits and operators, compared with models trained specifically for reasoning.
Table 8: Performance on Math-500 and AIME 2024 after incorporating ARM into the MLP.
| Model | Setting | Math-500 (Pass@1) | AIME 2024 (Pass@1) |
| --- | --- | --- | --- |
| Qwen2.5 Math-1.5B | Baseline | 63.8 | 14.4 |
| | ARM | 67.8 | 18.9 |
| | Best-of-N (N=5) | 69.4 | 14.4 |
| | Best-of-N + ARM | 71.2 | 18.9 |
| Qwen2.5 Math-7B | Baseline | 72.4 | 23.3 |
| | ARM | 73.4 | 25.6 |
| | Best-of-N (N=5) | 72.8 | 23.3 |
| | Best-of-N + ARM | 73.4 | 25.6 |
### G.3 Inference-time trick comparison
To more comprehensively evaluate the robustness, effectiveness, and compatibility of ARM with established inference-time scaling techniques, we further compare its performance against the widely used Best-of-N sampling approach during inference. Specifically, Table 8 summarizes the results obtained by applying ARM alone, Best-of-N sampling alone, and their combined usage on two representative reasoning benchmarks. For all settings, we fix the generation hyperparameters to a temperature of 0.5 and a top_p of 0.95 to ensure a consistent sampling regime. As demonstrated in the table, both ARM and Best-of-N independently yield improvements over the baseline, and their combination produces an even larger performance gain, suggesting that ARM complements rather than competes with existing inference-time strategies. These findings collectively underscore the practical value and scalability of ARM as a lightweight inference-time method for enhancing reasoning capabilities across diverse tasks.
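The Best-of-N protocol compared above can be sketched as follows. This is a toy illustration with stand-in `generate` and `score` functions; in our experiments the sampler draws full solutions at temperature 0.5 and top_p 0.95, and candidates are scored by a verifier.

```python
import itertools

def best_of_n(generate, score, n=5):
    """Best-of-N sampling: draw n candidates and keep the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Stand-in sampler cycling through (answer, verifier_score) pairs:
answers = [("42", 0.9), ("41", 0.2), ("43", 0.4)]
pool = itertools.cycle(answers)
best = best_of_n(lambda: next(pool), score=lambda a: a[1], n=5)
print(best)  # ('42', 0.9)
```

Because ARM modifies activations inside each forward pass while Best-of-N operates over whole sampled candidates, the two are orthogonal: ARM can simply be left enabled while the N candidates are drawn.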
Table 9: The hyper-parameters for 7 models on the evaluated benchmarks. For Qwen and Llama, we use the near-zero range $c$ to determine the proportion, so $p$ is a dash ("-"). For Gemma, due to its activation distribution, we skip setting $c$ and choose $p$ directly, so $c$ is a dash ("-"). If task performance does not improve on a benchmark, both hyper-parameters are replaced with a dash ("-").
| Model | $c$/$p$ | $p_1$ | $c$/$p$ | $p_1$ | $c$/$p$ | $p_1$ | $c$/$p$ | $p_1$ | $c$/$p$ | $p_1$ | $c$/$p$ | $p_1$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B | 0.15/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | 0.13/- | 99.5 | - | - |
| Qwen2.5-Math-7B | 0.2/- | 99.5 | 0.1/- | 95.0 | 0.05/- | 90.0 | 0.13/- | 99.5 | 0.13/- | 95.0 | - | - |
| Qwen2.5-7B-Instruct | 0.15/- | 99.5 | 0.1/- | 99.5 | - | - | 0.13/- | 95.0 | 0.05/- | 90 | 0.3/- | 99.5 |
| Qwen2.5-32B-Instruct | 0.05/- | 90.5 | - | - | 0.13/- | 99.5 | 0.05/- | 99.0 | 0.13/- | 99.5 | 0.3/- | 99.5 |
| Llama3.1-8B-Instruct | 0.45/- | 80.0 | 0.32/- | 90.0 | 0.32/- | 90.0 | - | - | 0.3/- | 90.0 | 0.3/- | 90.0 |
| Gemma3-4B-it | -/0.5 | 96.5 | -/0.25 | 85.0 | -/0.25 | 96.5 | -/0.25 | 85.0 | -/0.25 | 96.5 | -/0.25 | 75.0 |
| Gemma3-27B-it | -/0.5 | 96.5 | -/0.25 | 85.0 | -/0.25 | 85.0 | -/0.25 | 70.0 | -/0.25 | 85.0 | -/0.25 | 85.0 |
## Appendix H Examples of hyper-parameter choices
The selection of $p_1$ and $c$ depends on the distribution of model activations after the activation function. In most cases, when the frequency of near-zero activations greatly exceeds that of other values (as in Qwen), $c$ should be chosen smaller; in contrast, for models like LLaMA, $c$ should be chosen larger. When the proportion of near-zero activations is extremely high (as in Gemma), we recommend directly setting the modification proportion to a value larger than $p_{\max}$. For the choice of $p_1$, we generally advise selecting more than 80%, which covers regions with higher activation frequencies and thus exerts a stronger influence on near-zero values. Because the proportion of extremely high-frequency activations is small, this choice will not cause large deviations in the near-zero values. Overall, although models from the same family may behave differently across tasks and parameter scales, the selection ranges of $p_1$ and $c$ remain relatively consistent within each family. Some examples are given in Table 9.
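This guideline can be operationalized by inspecting the empirical activation histogram. The sketch below is our illustration of the advice, not released code: `pick_c`, the 80% target, and the candidate grid are assumptions.

```python
import numpy as np

def near_zero_fraction(acts, c):
    """Fraction of post-activation values falling in the near-zero range (-c, c)."""
    return float(np.mean(np.abs(np.asarray(acts)) < c))

def pick_c(acts, target=0.80, grid=(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)):
    """Smallest width c on the grid whose near-zero range already covers at
    least `target` of the activations; None if no grid value suffices.
    Distributions with a sharp spike at zero (as in Qwen) thus get a small c,
    while flatter distributions (as in LLaMA) get a larger one.
    """
    for c in grid:
        if near_zero_fraction(acts, c) >= target:
            return c
    return None

# Spiky distribution: 80% of activations sit right at zero, so the smallest
# grid value already reaches the target.
spiky = [0.01] * 8 + [0.5, -0.7]
print(pick_c(spiky))  # 0.05
```

When no grid value reaches the target, the Gemma-style fallback applies: set the modification proportion directly instead of a width $c$.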
## Appendix I More average attention weights
In this section, we present additional figures of average attention weights to further validate the previously observed phenomenon. When computing the average attention weights for Llama3.1 and Gemma3, we exclude the first token to make the effect clearer, as position tokens tend to absorb most of the attention. The results are shown in Figures 13 to 20. From these results, we can see that in most layers the models exhibit the same phenomenon described earlier.
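The averaging used for these figures can be sketched as follows (a simplified version under our assumptions; in practice `attn` is a per-head attention matrix extracted from the model):

```python
import numpy as np

def avg_attention_per_key(attn, skip_first=True):
    """Average attention weight received by each key position, averaged over
    query positions. With skip_first=True the first key position is dropped,
    since position tokens tend to absorb most of the attention."""
    attn = np.asarray(attn, dtype=float)  # shape: (num_queries, num_keys)
    avg = attn.mean(axis=0)
    return avg[1:] if skip_first else avg

# Two queries over three keys; key 0 dominates and is excluded.
attn = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])
print(avg_attention_per_key(attn))  # average weights for keys 1 and 2
```

The same routine, applied with `skip_first=False`, reproduces the Qwen-style curves where the first token is retained.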
## Appendix J Examples of output changes after adding meaningless tokens
In this section, we show examples where adding meaningless tokens turns the model's wrong answers into correct ones.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Attention Weight Analysis: Qwen2.5-Math-7B Model
### Overview
The image displays a 2x3 grid of six line charts analyzing the "Average Attention Weight" across token positions for different attention heads in the Qwen2.5-Math-7B model. The analysis compares model behavior with and without the inclusion of "Meaningless tokens" in the input sequence.
### Components/Axes
* **Titles:** Each of the six subplots has a title specifying the model and attention head:
* Top Row (Left to Right): `Qwen2.5-Math-7B Layer 1 Head 1`, `Qwen2.5-Math-7B Layer 1 Head 2`, `Qwen2.5-Math-7B Layer 1 Head 8`
* Bottom Row (Left to Right): `Qwen2.5-Math-7B Layer 1 Head 1`, `Qwen2.5-Math-7B Layer 1 Head 2`, `Qwen2.5-Math-7B Layer 1 Head 8`
* **Y-Axis:** All six charts share the same y-axis label: `Average Attention Weight`. The scale varies per chart.
* **X-Axis:** The x-axis represents token position index. The top row charts range from 0 to 60. The bottom row charts range from 0 to 120.
* **Legends:**
* **Top Row Charts:** Each contains a legend in the top-right corner with two entries:
* `w/o Meaningless tokens` (Blue line)
* `w/ Meaningless tokens` (Red line)
* **Bottom Row Charts:** Each contains a legend in the top-right corner with one entry:
* `w/ Meaningless tokens` (Blue line)
* **Annotations:** The bottom row charts contain a shaded gray region labeled `Meaningless tokens`, indicating the span of token positions occupied by these tokens. Vertical dashed lines mark the start and end of this region.
### Detailed Analysis
**Top Row: Comparison of Attention With/Without Meaningless Tokens**
1. **Layer 1 Head 1 (Top-Left):**
* **Trend (w/o, Blue):** Shows several sharp, high-magnitude peaks. The highest peak is at approximately token position 15, reaching an average attention weight of ~0.175. Other major peaks occur near positions 25 and 30.
* **Trend (w/, Red):** The attention pattern is significantly more diffuse and lower in magnitude. The sharp peaks are replaced by broader, lower humps. The highest point is around position 25, reaching only ~0.10.
* **Interpretation:** The inclusion of meaningless tokens dramatically smooths and redistributes the attention for this head, eliminating its sharp, focused peaks.
2. **Layer 1 Head 2 (Top-Center):**
* **Trend (w/o, Blue):** Attention is relatively low and stable for the first ~20 tokens, then shows a gradual, noisy increase, peaking around position 50 at ~0.08.
* **Trend (w/, Red):** Follows a similar overall shape to the blue line but with consistently higher magnitude, especially in the latter half. It peaks around position 50 at ~0.12.
* **Interpretation:** For this head, meaningless tokens amplify the existing attention pattern, particularly for later tokens in the sequence, without fundamentally changing its shape.
3. **Layer 1 Head 8 (Top-Right):**
* **Trend (w/o, Blue):** Attention is very low and flat for the first ~40 tokens, then exhibits a few moderate peaks between positions 40-60, the highest being ~0.08.
* **Trend (w/, Red):** Shows a dramatically different pattern. Attention is elevated across the entire sequence, with a pronounced, jagged increase starting around position 30 and culminating in a very high peak of ~0.20 near position 60.
* **Interpretation:** This head's behavior is most radically altered. Meaningless tokens cause it to become highly active, especially towards the end of the sequence, suggesting it may be attending to the structure or presence of these tokens themselves.
**Bottom Row: Attention Pattern with Meaningless Tokens (Extended Sequence)**
These charts show the `w/ Meaningless tokens` condition (blue line) over a longer sequence (0-120), with the `Meaningless tokens` region highlighted.
1. **Layer 1 Head 1 (Bottom-Left):**
* **Pattern:** High attention at the very start (position 0). Attention drops within the `Meaningless tokens` region (approx. positions 15-70), showing a low, decaying trend. After the meaningless tokens end, attention spikes sharply again around position 75 and shows several subsequent peaks.
* **Key Data Points:** Initial peak ~0.08. Post-meaningless token peak ~0.06.
2. **Layer 1 Head 2 (Bottom-Center):**
* **Pattern:** Similar to Head 1 but with lower overall magnitude. A peak at the start (~0.04), a low plateau during the `Meaningless tokens` region, and a resurgence of noisy, moderate attention after position 70.
* **Key Data Points:** Initial peak ~0.04. Post-meaningless token activity fluctuates between 0.01-0.02.
3. **Layer 1 Head 8 (Bottom-Right):**
* **Pattern:** Distinct from the other two heads. Shows high, volatile attention at the start. Within the `Meaningless tokens` region, attention is moderate and relatively stable. After the region ends (position ~70), attention becomes extremely volatile with very high peaks.
* **Key Data Points:** Initial peaks ~0.025. Post-meaningless token peaks reach up to ~0.025, with significant variance.
### Key Observations
1. **Differential Impact:** The effect of meaningless tokens is not uniform across attention heads. Head 1 is smoothed, Head 2 is amplified, and Head 8 is fundamentally reconfigured.
2. **Temporal Focus:** In the extended sequence (bottom row), all heads show a pattern of high initial attention, a suppressed or stable period during the meaningless token span, and a resurgence of activity afterward. This suggests the model may "reset" or change processing mode after a block of non-informative tokens.
3. **Head 8 Anomaly:** Head 8 (Layer 1) exhibits the most extreme behavior, with the highest recorded attention weight (~0.20) occurring in the presence of meaningless tokens, indicating a potential specialization or sensitivity to this type of input.
### Interpretation
This visualization provides a technical investigation into how a large language model's internal attention mechanism reacts to the insertion of "Meaningless tokens." The data suggests these tokens are not simply ignored.
* **Mechanism Disruption:** The tokens actively alter attention distributions. For some heads (Head 1), they act as a "smoothing" agent, breaking up sharp focus. For others (Head 8), they act as a strong attractor or catalyst for high attention.
* **Processing Phases:** The bottom-row charts imply a potential three-phase processing sequence for inputs containing such tokens: 1) Initial engagement, 2) A distinct processing phase for the meaningless block (characterized by lower or stable attention), and 3) A return to (or heightened) engagement with subsequent meaningful content.
* **Model Robustness & Vulnerability:** The findings are relevant for understanding model robustness. If meaningless tokens can so drastically rewire attention patterns, they could potentially be used to manipulate model behavior or, conversely, could be a vector for adversarial attacks. The model appears to dedicate significant computational resources (high attention) to processing these tokens, which may represent an inefficiency.
**Language:** All text in the image is in English.
</details>
Figure 13: The average attention weights of Qwen2.5-Math-7B in Head 1, 2, 8.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Multi-Panel Line Chart: Attention Weight Analysis for Qwen2.5-Math-7B
### Overview
The image displays a 2x3 grid of six line charts analyzing the average attention weight distribution across token positions in specific transformer heads of the Qwen2.5-Math-7B model. The charts compare model behavior with and without the presence of "Meaningless tokens" in the input sequence.
### Components/Axes
* **Overall Structure:** Six individual plots arranged in two rows and three columns.
* **Titles:** Each plot has a title indicating the model and specific layer/head:
* Top Row (Left to Right): `Qwen2.5-Math-7B Layer 1 Head 12`, `Qwen2.5-Math-7B Layer 1 Head 13`, `Qwen2.5-Math-7B Layer 1 Head 14`.
* Bottom Row (Left to Right): Same titles as the corresponding plot above them.
* **Axes:**
* **X-axis (All plots):** Labeled with numbers representing token position. The top row plots range from 0 to 60. The bottom row plots range from 0 to 120.
* **Y-axis (All plots):** Labeled "Average Attention Weight". The scale varies per plot (e.g., 0.00 to 0.12, 0.00 to 0.14, 0.00 to 0.10).
* **Legends:**
* **Top Row Plots:** Each contains a legend with two entries:
* A blue line labeled `w/o Meaningless tokens`.
* A red line labeled `w/ Meaningless tokens`.
* **Bottom Row Plots:** Each contains a legend with one entry:
* A blue line labeled `w/ Meaningless tokens`.
* **Spatial Grounding:** Legends are consistently placed in the top-left corner of each plot area.
* **Additional Annotations (Bottom Row):** Each bottom-row plot contains a shaded gray rectangular region spanning approximately token positions 20 to 70. This region is labeled with the text "Meaningless tokens" centered within it.
### Detailed Analysis
**Top Row (Comparison: With vs. Without Meaningless Tokens, Sequence Length ~60)**
* **Trend Verification:** In all three heads (12, 13, 14), the red line (`w/ Meaningless tokens`) exhibits significantly higher and sharper peaks compared to the blue line (`w/o Meaningless tokens`). The blue line shows a more subdued, lower-amplitude pattern.
* **Data Points (Approximate):**
* **Head 12:** The red line has major peaks near positions ~15 (0.08), ~35 (0.09), and ~50 (0.12). The blue line fluctuates mostly below 0.04.
* **Head 13:** The red line shows prominent peaks near positions ~15 (0.09), ~35 (0.09), and ~50 (0.14). The blue line remains below 0.06.
* **Head 14:** The red line has sharp peaks near positions ~15 (0.09), ~35 (0.10), and ~50 (0.10). The blue line is generally below 0.06.
**Bottom Row (Focus: Effect of Meaningless Tokens, Sequence Length ~120)**
* **Trend Verification:** Each plot shows a single blue line (`w/ Meaningless tokens`). The trend is characterized by high attention weights at the very beginning (positions 0-10) and end (positions 70+) of the sequence, with a pronounced, sustained drop to near-zero values within the shaded "Meaningless tokens" region (positions ~20-70).
* **Data Points (Approximate):**
* **Head 12:** Initial peak >0.10 at position ~5. Attention drops to ~0.01 within the shaded region. Resumes spiky pattern after position 70, with peaks ~0.06.
* **Head 13:** Initial peak ~0.07 at position ~5. Attention is very low (<0.01) in the shaded region. Post-region peaks reach ~0.04.
* **Head 14:** Initial peak >0.10 at position ~5. Attention is minimal in the shaded region. Post-region peaks reach ~0.08.
### Key Observations
1. **Amplified Attention:** The presence of meaningless tokens dramatically amplifies the attention weights in specific positions (the sharp peaks in the top row), suggesting the model focuses intensely on certain tokens when meaningless content is present.
2. **Attention Collapse:** The bottom row clearly demonstrates that within a contiguous block of meaningless tokens, the average attention weight for these heads collapses to near zero. The model effectively "ignores" this segment.
3. **Head Similarity:** The three analyzed heads (Layer 1, Heads 12, 13, 14) exhibit qualitatively similar behavior, indicating this might be a common pattern in this layer of the model.
4. **Boundary Effects:** High attention is observed at the very start and immediately after the meaningless token block, suggesting the model pays special attention to the boundaries or transitions into and out of meaningless content.
### Interpretation
This data provides a technical insight into the attention mechanism of a large language model (Qwen2.5-Math-7B). The charts suggest that "meaningless tokens" (which could be filler, padding, or nonsensical text) are not processed uniformly.
* **Selective Processing:** The sharp peaks indicate the model actively selects specific tokens for processing when meaningless context is present, while the near-zero attention within the meaningless block reveals a learned behavior of suppressing irrelevant information.
* **Functional Implication:** This behavior is likely efficient. By collapsing attention on meaningless segments, the model conserves its computational focus for meaningful parts of the input. The amplified peaks outside this region might represent the model "compensating" or re-orienting its attention after processing a block of noise.
* **Anomaly/Notable Pattern:** The most striking pattern is the stark contrast between the high-variance, peaky attention outside the meaningless block and the flat, near-zero attention inside it. This is not a gradual decline but a sharp, functional on/off switch for attention in these specific heads. This could be a deliberate architectural feature or an emergent behavior for handling noisy input.
</details>
Figure 14: The average attention weights of Qwen2.5-Math-7B in Head 12, 13, 14.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Charts: Attention Weight Analysis for Qwen2.5-Math-7B
### Overview
The image displays a 2x3 grid of six line charts analyzing the "Average Attention Weight" across token positions for different attention heads in the Qwen2.5-Math-7B model. The analysis compares model behavior with and without the inclusion of "Meaningless tokens." The top row (3 charts) compares the two conditions directly over a token position range of 0-60. The bottom row (3 charts) isolates and details the attention pattern specifically on the "Meaningless tokens" themselves over a wider range (0-120).
### Components/Axes
* **Chart Titles:** Each chart is titled with the model name and specific layer/head:
* Top Row (Left to Right): `Qwen2.5-Math-7B Layer 1 Head 17`, `Qwen2.5-Math-7B Layer 1 Head 26`, `Qwen2.5-Math-7B Layer 1 Head 28`.
* Bottom Row (Left to Right): Same titles as above, corresponding to the same heads.
* **Y-Axis:** Labeled `Average Attention Weight`. The scale varies:
* Top Row Charts: Range from 0.00 to approximately 0.10, 0.10, and 0.10 respectively.
* Bottom Row Charts: Range from 0.000 to approximately 0.05, 0.035, and 0.035 respectively.
* **X-Axis:** Represents token position index.
* Top Row Charts: Range from 0 to 60.
* Bottom Row Charts: Range from 0 to 120.
* **Legend (Top Row Charts):** Located in the top-left corner of each chart.
* Blue Line: `w/o Meaningless tokens`
* Red Line: `w/ Meaningless tokens`
* **Legend/Annotation (Bottom Row Charts):** A label `w/ Meaningless tokens` with a blue line sample is placed in the top-right area. The region containing the meaningless tokens is shaded in light blue and labeled `Meaningless tokens` within the plot area.
### Detailed Analysis
**Top Row Charts (Comparison over positions 0-60):**
1. **Head 17 (Top-Left):**
* **Trend:** Both lines show high volatility with multiple sharp peaks. The red line (`w/ Meaningless tokens`) generally exhibits higher peaks than the blue line.
* **Key Points:** The red line has its highest peak (~0.16) near position 55. The blue line's highest peak (~0.08) is around position 30. Both lines show a notable dip between positions 10-20.
2. **Head 26 (Top-Center):**
* **Trend:** Similar volatile pattern. The red line's peaks are consistently more pronounced than the blue line's.
* **Key Points:** The red line's maximum peak (~0.09) occurs near position 50. The blue line's highest point (~0.06) is around position 55. A significant valley is present for both lines between positions 10-20.
3. **Head 28 (Top-Right):**
* **Trend:** Follows the same pattern of volatility, with the red line showing greater amplitude in its fluctuations.
* **Key Points:** The red line peaks at ~0.10 near position 50. The blue line's highest peak (~0.07) is also near position 50. A deep trough is visible for both between positions 10-20.
**Bottom Row Charts (Focus on Meaningless Tokens, positions 0-120):**
* **Common Structure:** All three charts show a distinct pattern. There is an initial spike in attention at the very beginning (position 0-5). This is followed by a relatively flat, low-attention plateau from roughly position 20 to 70. After position 70, the attention weight becomes highly volatile again with multiple sharp peaks until the end of the sequence.
* **Head 17 (Bottom-Left):** The initial spike reaches ~0.045. The volatile region after position 70 has peaks up to ~0.03.
* **Head 26 (Bottom-Center):** The initial spike is ~0.030. Peaks in the later volatile region reach ~0.020.
* **Head 28 (Bottom-Right):** The initial spike is ~0.030. Later peaks reach ~0.015.
### Key Observations
1. **Amplified Attention with Meaningless Tokens:** In the direct comparison (top row), the inclusion of meaningless tokens (red line) consistently leads to higher maximum attention weights and more extreme fluctuations compared to the condition without them (blue line).
2. **Consistent Attention Dip:** All three heads in the top row show a characteristic dip in attention weight for both conditions between token positions 10 and 20.
3. **Distinct Pattern for Meaningless Tokens:** The bottom row reveals a unique, three-phase attention pattern specifically for meaningless tokens: an initial high-attention burst, a long period of low/flat attention, and a final phase of high volatility.
4. **Head Similarity:** While the absolute scales differ, the qualitative patterns (volatility, dip location, three-phase structure) are remarkably consistent across Heads 17, 26, and 28.
### Interpretation
The data suggests that "Meaningless tokens" are not ignored by the model's attention mechanism; instead, they actively alter its behavior. Their presence (red lines, top row) appears to "excite" or destabilize the attention patterns, leading to higher and more variable attention weights across the board.
The unique three-phase pattern in the bottom row indicates a specific processing strategy for these tokens. The initial spike suggests the model pays strong immediate attention to them upon encounter. The subsequent flat plateau implies a period of sustained, low-level processing or perhaps a "holding pattern." The final volatile phase could indicate that these tokens become relevant again later in the sequence, possibly interacting with other content.
The consistent dip between positions 10-20 in the top row charts is a notable structural feature of the model's attention in this layer, independent of the meaningless token condition. This might correspond to a specific syntactic or semantic boundary in the input sequences used for this analysis.
Overall, the charts demonstrate that meaningless tokens are a significant variable in the model's internal computations, influencing global attention dynamics and exhibiting their own distinct temporal processing signature.
</details>
Figure 15: The average attention weights of Qwen2.5-Math-7B in Head 17, 26, 28.
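The per-head curves shown in these figures are, in essence, the attention weight each key position receives averaged over all query positions. A minimal sketch of that averaging on a synthetic attention tensor (a real run would instead take one layer's attention from `model(input_ids, output_attentions=True)`; the head count and sequence length here are illustrative):

```python
import torch

def avg_attention_received(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, seq_len, seq_len), each query row sums to 1.
    # Returns (num_heads, seq_len): the mean weight each key position
    # receives, averaged over all query positions.
    return attn.mean(dim=1)

# Synthetic stand-in for one layer's attention matrix.
torch.manual_seed(0)
num_heads, seq_len = 32, 60
attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)
per_head = avg_attention_received(attn)   # shape (32, 60)
curve = per_head[17]                      # e.g. a Head 17 curve as plotted
```

Because every query row sums to 1, each averaged per-head curve also sums to 1 over key positions, which is why the y-axis scales in the panels stay small for long sequences.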
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Charts: Attention Weight Analysis for Llama3.1-8B-Instruct
### Overview
The image displays a 2x3 grid of six line charts analyzing "Average Attention Weight" across token positions for different attention heads in the first layer of the Llama3.1-8B-Instruct model. The top row compares model behavior with and without "meaningless tokens," while the bottom row focuses exclusively on the condition with meaningless tokens, highlighting the specific segment where they appear.
### Components/Axes
* **Titles:** Each column is titled by model, layer, and head:
* Left Column: `Llama3.1-8B-Instruct Layer 1 Head 2`
* Middle Column: `Llama3.1-8B-Instruct Layer 1 Head 5`
* Right Column: `Llama3.1-8B-Instruct Layer 1 Head 7`
* **Y-Axis (All Charts):** Labeled `Average Attention Weight`. The scale varies per chart.
* **X-Axis (All Charts):** Represents token position index.
* Top Row Charts: Range from 0 to 60.
* Bottom Row Charts: Range from 0 to 120.
* **Legends:**
* **Top Row Charts:** Contain a legend in the top-left corner.
* Blue Line: `w/o Meaningless tokens`
* Red Line: `w/ Meaningless tokens`
* **Bottom Row Charts:** Contain a legend in the top-right corner.
* Blue Line: `w/ Meaningless tokens`
* **Annotations:**
* **Bottom Row Charts:** A shaded gray region spans approximately from token position 20 to 70. This region is labeled with the text `Meaningless tokens` centered within it.
### Detailed Analysis
**Top Row (Comparison of Conditions):**
* **Head 2 (Left):**
* Y-axis scale: 0.00 to 0.20.
* **Trend (w/o Meaningless - Blue):** Shows high variability with several sharp peaks. Notable peaks occur near token positions ~10 (≈0.07), ~30 (≈0.08), and a major peak at ~60 (≈0.18).
* **Trend (w/ Meaningless - Red):** Generally lower and smoother than the blue line. It has smaller, more frequent oscillations. Its highest point is near token position 60 (≈0.16), slightly below the blue peak.
* **Head 5 (Middle):**
* Y-axis scale: 0.00 to 0.16.
* **Trend (w/o Meaningless - Blue):** Exhibits a pattern of regular, sharp peaks of similar height (≈0.08-0.10) across the sequence, with a final, higher peak near position 60 (≈0.15).
* **Trend (w/ Meaningless - Red):** Follows a similar rhythmic pattern to the blue line but with consistently lower amplitude. Peaks are dampened, generally staying below 0.08.
* **Head 7 (Right):**
* Y-axis scale: 0.000 to 0.175.
* **Trend (w/o Meaningless - Blue):** Shows a series of peaks that increase in magnitude towards the end of the sequence, culminating in a very high peak near position 60 (≈0.17).
* **Trend (w/ Meaningless - Red):** Again, mirrors the blue line's pattern but with reduced intensity. Its final peak near position 60 is notably lower (≈0.12).
**Bottom Row (Focus on Meaningless Token Segment):**
* **Head 2 (Left):**
* Y-axis scale: 0.000 to 0.025.
* **Trend:** Attention weight is highest at the very beginning (positions 0-10, peaking near 0.020). It drops sharply upon entering the `Meaningless tokens` region (positions ~20-70), remaining very low and flat (≈0.008-0.010). After the region (positions >70), attention becomes highly variable again with multiple sharp peaks.
* **Head 5 (Middle):**
* Y-axis scale: 0.00 to 0.05.
* **Trend:** Similar pattern. High initial attention (peak ≈0.045). A steep decline occurs at the start of the `Meaningless tokens` region, followed by a low, gradually decaying plateau. Post-region attention (>70) resumes a pattern of sharp, regular peaks.
* **Head 7 (Right):**
* Y-axis scale: 0.00 to 0.05.
* **Trend:** Consistent with the other heads. High initial peak (≈0.045), a drop into the `Meaningless tokens` region where attention is minimal and decaying, followed by a return to high-variance, peaked attention after position 70.
### Key Observations
1. **Consistent Dampening Effect:** Across all three heads (2, 5, 7), the presence of meaningless tokens (`w/ Meaningless tokens` condition, red line in top row) consistently reduces the magnitude and variability of attention weights compared to the condition without them (`w/o Meaningless tokens`, blue line).
2. **Pattern Preservation:** While amplitude is reduced, the fundamental rhythmic pattern or timing of attention peaks is largely preserved between the two conditions in the top row charts.
3. **Attention Suppression in Meaningless Segment:** The bottom row charts explicitly show that attention weights drop to a low, stable baseline specifically within the segment containing meaningless tokens (positions ~20-70).
4. **Variable Y-Axis Scales:** The vertical scales differ between charts (e.g., Head 2 top: 0.20, Head 5 top: 0.16, Head 2 bottom: 0.025). This must be considered when comparing absolute values across different panels.
5. **Legend Inconsistency:** The color coding for the `w/ Meaningless tokens` condition is **red** in the top-row comparison charts but **blue** in the bottom-row detail charts. This is a critical labeling inconsistency to note.
### Interpretation
This visualization provides a technical investigation into how a large language model's attention mechanism processes sequences containing semantically void or "meaningless" tokens.
* **Core Finding:** The data strongly suggests that meaningless tokens act as an attention sink or dampener. When present, they cause the model to allocate less focused attention (lower average weights) across the entire sequence, not just locally to the meaningless tokens themselves. This is evidenced by the globally lower red lines in the top row.
* **Mechanism Insight:** The bottom row reveals the local mechanism: attention is actively suppressed *during* the meaningless token segment. The model appears to "ignore" this segment, resulting in a low, flat attention profile. The subsequent return to high-variance attention after the segment indicates the model re-engages with meaningful content.
* **Implication for Model Behavior:** This behavior could be a model's learned strategy for maintaining efficiency or robustness. By reducing attention to non-informative tokens, the model may preserve its capacity for processing meaningful information. However, the global dampening effect also suggests that the presence of such tokens might slightly degrade the model's ability to form strong, selective attention patterns on the meaningful parts of the input.
* **Cross-Head Consistency:** The effect is observed across multiple attention heads (2, 5, 7) in Layer 1, indicating it is not an isolated phenomenon but a more general response characteristic of this model layer to meaningless input.
</details>
Figure 16: The average attention weights of Llama3.1-8B-Instruct in Head 2, 5, 7.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line and Area Charts: Llama3.1-8B-Instruct Layer 1 Attention Head Analysis
### Overview
The image displays a 2x3 grid of six charts analyzing the average attention weight distribution across token positions for three specific attention heads (13, 16, and 17) in Layer 1 of the Llama3.1-8B-Instruct model. The top row compares model behavior with and without meaningless tokens, while the bottom row focuses solely on the scenario with meaningless tokens, highlighting a specific segment of the sequence.
### Components/Axes
* **Chart Titles (Top Row, Left to Right):**
* `Llama3.1-8B-Instruct Layer 1 Head 13`
* `Llama3.1-8B-Instruct Layer 1 Head 16`
* `Llama3.1-8B-Instruct Layer 1 Head 17`
* **Chart Titles (Bottom Row, Left to Right):** Same as above, corresponding to the same heads.
* **Y-Axis Label (All Charts):** `Average Attention Weight`
* **X-Axis Label (All Charts):** `Token Position`
* **Legends:**
* **Top Row Charts:** Located in the top-right corner of each plot area.
* Blue line: `w/o Meaningless tokens`
* Red line: `w/ Meaningless tokens`
* **Bottom Row Charts:** Located in the top-right corner of each plot area.
* Blue filled area: `w/ Meaningless tokens`
* **Axis Scales:**
* **Top Row X-Axis:** 0 to 60, with major ticks every 10 units.
* **Bottom Row X-Axis:** 0 to 120, with major ticks every 20 units.
* **Y-Axis Scales (Vary by Head):**
* Head 13 (Top & Bottom): 0.00 to 0.12
* Head 16 (Top): 0.00 to 0.10
* Head 16 (Bottom): 0.00 to 0.08
* Head 17 (Top): 0.00 to 0.25
* Head 17 (Bottom): 0.00 to 0.08
* **Annotations (Bottom Row Charts):**
* A gray shaded region spans from approximately token position 20 to 70.
* Vertical dashed lines mark the boundaries of this region at positions ~20 and ~70.
* Text label `Meaningless tokens` is centered within this shaded region.
### Detailed Analysis
**Top Row (Comparison: w/o vs. w/ Meaningless Tokens):**
* **Head 13:** The blue line (`w/o`) shows high volatility with sharp peaks exceeding 0.12 at positions ~5, ~15, and ~25. The red line (`w/`) follows a similar pattern but with generally lower peak magnitudes (max ~0.08) and appears slightly smoothed.
* **Head 16:** The blue line (`w/o`) is highly variable, with multiple peaks between 0.04 and 0.08. The red line (`w/`) is significantly dampened, staying mostly below 0.04, with a much flatter profile.
* **Head 17:** The blue line (`w/o`) exhibits extreme spikes, with the highest peak reaching ~0.25 near position 55. The red line (`w/`) is dramatically flattened, remaining below 0.05 for the entire sequence, indicating a massive reduction in attention weight variance.
**Bottom Row (Focus on w/ Meaningless Tokens):**
* **General Pattern (All Heads):** Attention weight is concentrated at the very beginning (positions 0-10) and the very end (positions 70-120) of the sequence. The central region, corresponding to the `Meaningless tokens` segment (positions 20-70), shows near-zero attention weight.
* **Head 13:** Initial peak ~0.12 at position ~2. A secondary, smaller cluster of peaks appears after position 70, maxing around 0.04.
* **Head 16:** Initial peak ~0.08 at position ~2. Very low activity (<0.01) in the middle. A small resurgence of peaks (max ~0.02) occurs after position 70.
* **Head 17:** Initial peak ~0.08 at position ~2. The middle section is flat. A distinct cluster of peaks appears after position 70, with the highest reaching ~0.05.
### Key Observations
1. **Dampening Effect:** The presence of meaningless tokens (`w/`) consistently reduces the magnitude and volatility of attention weights across all three heads compared to the `w/o` condition. This effect is most pronounced in Head 17.
2. **Attention Avoidance:** The bottom charts clearly demonstrate that the model's attention mechanism actively ignores the segment labeled as `Meaningless tokens`. Attention weight plummets to near zero within this region (positions 20-70).
3. **Boundary Focus:** Attention is heavily focused on the tokens immediately before and after the meaningless segment, suggesting the model uses these boundary tokens as anchors.
4. **Head-Specific Behavior:** While the general pattern holds, the specific attention profiles differ. Head 17 shows the most extreme reaction to meaningless tokens, going from the highest peaks (`w/o`) to one of the flattest profiles (`w/`).
### Interpretation
This data provides a technical visualization of how a large language model (Llama3.1-8B-Instruct) processes filler or meaningless content. The findings suggest:
* **Efficiency Mechanism:** The model appears to have an implicit mechanism to conserve its "attention budget." By dramatically reducing attention weights on meaningless tokens, it likely allocates more computational resources to semantically meaningful parts of the input.
* **Contextual Anchoring:** The strong attention at the boundaries of the meaningless segment indicates the model uses the surrounding meaningful context to frame its understanding, effectively "bridging" over the irrelevant information.
* **Architectural Insight:** The variation between heads (13, 16, 17) shows that not all attention heads are equal; some (like Head 17) may be more specialized in filtering noise or handling syntactic boundaries. This supports the concept of functional specialization within transformer attention layers.
* **Practical Implication:** For prompt engineering or model analysis, this indicates that inserting filler text may not simply "distract" the model in a uniform way but could trigger a specific, structured attention-reduction pattern, potentially altering the model's processing pathway in predictable ways.
</details>
Figure 17: The average attention weights of Llama3.1-8B-Instruct in Head 13, 16, 17.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Charts: Llama3.1-8B-Instruct Attention Weight Analysis
### Overview
The image displays a 2x3 grid of six line charts analyzing the "Average Attention Weight" across token positions for different attention heads in the Llama3.1-8B-Instruct model. The top row compares attention distributions with and without the inclusion of "Meaningless tokens." The bottom row isolates and highlights the attention pattern specifically within the region containing these "Meaningless tokens."
### Components/Axes
* **Titles:** Each of the six subplots has a title specifying the model layer and attention head:
* Top Left: `Llama3.1-8B-Instruct Layer 1 Head 20`
* Top Center: `Llama3.1-8B-Instruct Layer 1 Head 25`
* Top Right: `Llama3.1-8B-Instruct Layer 1 Head 26`
* Bottom Left: `Llama3.1-8B-Instruct Layer 1 Head 20` (implied by position)
* Bottom Center: `Llama3.1-8B-Instruct Layer 1 Head 25` (implied by position)
* Bottom Right: `Llama3.1-8B-Instruct Layer 1 Head 26` (implied by position)
* **Y-Axis:** All charts share the same label: `Average Attention Weight`. The scale varies per chart.
* **X-Axis:** Represents token position index. The top row charts range from 0 to 60. The bottom row charts range from 0 to 120.
* **Legend (Top Row Charts):** Located in the top-left corner of each top-row chart.
* Blue Line: `w/o Meaningless tokens`
* Red Line: `w/ Meaningless tokens`
* **Legend (Bottom Row Charts):** Located in the top-right corner of each bottom-row chart.
* Blue Line: `w/ Meaningless tokens`
* **Annotations (Bottom Row Charts):** A shaded gray rectangular region is labeled `Meaningless tokens`. This region spans approximately from token position 18 to 72 on the x-axis.
### Detailed Analysis
**Top Row: Comparison of Attention Distributions**
* **Layer 1 Head 20 (Top Left):**
* **Trend:** Both lines show a highly volatile, spiky pattern. The red line (`w/ Meaningless tokens`) generally exhibits higher peaks than the blue line (`w/o Meaningless tokens`).
* **Data Points (Approximate):** The y-axis ranges from 0.00 to 0.08. Major peaks for the red line occur near token positions 5 (~0.08), 12 (~0.07), 22 (~0.065), and 55 (~0.05). The blue line's peaks are consistently lower, with its highest near position 12 (~0.06).
* **Layer 1 Head 25 (Top Center):**
* **Trend:** Similar volatile pattern. The red line shows significantly higher attention weights, especially in the latter half of the sequence (positions 40-60).
* **Data Points (Approximate):** The y-axis ranges from 0.00 to 0.12. The red line has very high peaks near positions 50 (~0.12) and 58 (~0.11). The blue line's highest peak is near position 50 (~0.08).
* **Layer 1 Head 26 (Top Right):**
* **Trend:** Both lines are spiky. The red line's peaks are more pronounced and frequent compared to the blue line.
* **Data Points (Approximate):** The y-axis ranges from 0.00 to 0.14. The red line has its highest peak near position 50 (~0.14). Other major red peaks are near positions 5 (~0.10), 22 (~0.09), and 35 (~0.08). The blue line's maximum is near position 50 (~0.10).
**Bottom Row: Focus on Meaningless Token Region**
* **Layer 1 Head 20 (Bottom Left):**
* **Trend:** The attention weight is high at the very beginning (positions 0-10), drops sharply at the start of the "Meaningless tokens" region (~pos 18), remains low and relatively flat within the region, and then shows a small, volatile recovery after the region ends (~pos 72).
* **Data Points (Approximate):** Y-axis: 0.00 to 0.07. Initial peak >0.07. Within the shaded region (pos ~18-72), weight hovers near 0.00-0.01. Post-region peaks reach ~0.02.
* **Layer 1 Head 25 (Bottom Center):**
* **Trend:** Similar pattern: high initial attention, a drop into the meaningless region, low/flat attention within it, and a volatile recovery afterward.
* **Data Points (Approximate):** Y-axis: 0.00 to 0.04. Initial peak >0.04. Within the shaded region, weight is very low (~0.00-0.005). Post-region peaks reach ~0.02.
* **Layer 1 Head 26 (Bottom Right):**
* **Trend:** Follows the same pattern as the other two heads in this row.
* **Data Points (Approximate):** Y-axis: 0.00 to 0.05. Initial peak >0.05. Within the shaded region, weight is minimal (~0.00-0.01). Post-region peaks reach ~0.03.
### Key Observations
1. **Impact of Meaningless Tokens:** Including meaningless tokens (red lines, top row) consistently increases the magnitude of attention weight peaks compared to excluding them (blue lines, top row).
2. **Attention Collapse:** The bottom row charts clearly demonstrate that for these specific heads in Layer 1, attention weight collapses to near-zero within the span of meaningless tokens.
3. **Post-Collapse Recovery:** After the meaningless token region ends, attention patterns resume with volatility, but the peak weights are generally lower than the initial pre-region peaks.
4. **Consistency Across Heads:** The qualitative behavior (high initial attention, collapse in meaningless region, partial recovery) is consistent across the three analyzed attention heads (20, 25, 26) in Layer 1.
### Interpretation
The data suggests that "Meaningless tokens" (which could be padding, separators, or other non-semantic tokens) have a profound and disruptive effect on the early-layer attention mechanism of this model.
* **Mechanism:** The collapse of attention within the meaningless token region indicates that these heads effectively "ignore" or "skip over" this span of tokens. The model does not allocate significant computational focus to them.
* **Purpose:** This behavior is likely a learned, efficient processing strategy. By suppressing attention to non-informative tokens, the model can preserve its attention capacity for meaningful content before and after this region.
* **Comparison:** The top-row charts show that when the *entire sequence* (including meaningless tokens) is considered, the attention distribution is more "stretched" and peaks are higher. This is because the total attention mass (which sums to 1) is distributed over more tokens, but the model compensates by assigning extremely low weights to the meaningless block, thereby increasing the relative weight on meaningful tokens.
* **Anomaly/Notable Point:** The sharp, high attention peaks at the very beginning of the sequence (positions 0-10) in the bottom-row charts are striking. This may indicate a strong "recency" or "initial token" bias in these heads, where the model pays disproportionate attention to the first few tokens of the input, possibly for establishing context or as a default attention pattern.
</details>
Figure 18: The average attention weights of Llama3.1-8B-Instruct in Head 20, 25, 26.
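The normalization argument in the interpretation above (attention rows sum to 1, so the model can "compensate" by assigning near-zero weight to the meaningless block) can be checked with a toy softmax computation. All logit values below are illustrative, not taken from the model:

```python
import torch

# Each attention row is a softmax, so it sums to 1. Appending a block of
# tokens redistributes weight only as far as the logits assigned to that
# block allow: strongly suppressed logits absorb almost no mass.
logits_meaningful = torch.tensor([2.0, 1.0, 3.0])
logits_block = torch.full((50,), -4.0)   # hypothetical suppressed filler

w_short = torch.softmax(logits_meaningful, dim=0)
w_long = torch.softmax(torch.cat([logits_meaningful, logits_block]), dim=0)

mass_on_block = w_long[3:].sum()   # total weight the 50 filler tokens absorb
```

Even with 50 extra tokens, the filler block absorbs only a few percent of the attention mass, while the relative ordering of the meaningful weights is unchanged.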
<details>
<summary>x19.png Details</summary>

### Visual Description
## Multi-Panel Line Chart: Llama3.1-8B-Instruct Attention Weight Analysis
### Overview
The image displays a 2x3 grid of six line charts analyzing the average attention weight distributions across token positions for different attention heads in the Llama3.1-8B-Instruct model. The top row compares model behavior with and without the presence of "meaningless tokens," while the bottom row isolates the effect of those meaningless tokens over a longer sequence length.
### Components/Axes
* **Chart Titles (Top Row, Left to Right):**
1. `Llama3.1-8B-Instruct Layer 1 Head 28`
2. `Llama3.1-8B-Instruct Layer 1 Head 29`
3. `Llama3.1-8B-Instruct Layer 1 Head 31`
* **Chart Titles (Bottom Row, Left to Right):**
1. (Implied: Layer 1 Head 28, w/ Meaningless tokens)
2. (Implied: Layer 1 Head 29, w/ Meaningless tokens)
3. (Implied: Layer 1 Head 31, w/ Meaningless tokens)
* **Y-Axis Label (All Charts):** `Average Attention Weight`
* **X-Axis Label (All Charts):** Token position (implied, numbered 0, 10, 20, etc.).
* **Legend (Top Row Charts):**
* Blue Line: `w/o Meaningless tokens`
* Red Line: `w/ Meaningless tokens`
* **Legend (Bottom Row Charts):** Single entry: `w/ Meaningless tokens` (Blue line).
* **Annotations (Bottom Row Charts):** A shaded gray vertical region labeled `Meaningless tokens` spans approximately token positions 16 to 72.
### Detailed Analysis
**Top Row: Comparison of Conditions (Sequence Length ~60 tokens)**
* **Head 28 (Top-Left):**
* **Trend (w/o, Blue):** Shows a series of sharp, high-magnitude peaks (up to ~0.11) interspersed with lower baseline activity. Peaks are irregularly spaced.
* **Trend (w/, Red):** Follows a similar pattern of peaks but with consistently lower amplitude than the blue line. The highest red peak is approximately 0.08.
* **Key Data Points:** Blue peaks near positions 5, 15, 25, 35, 45, 55. Red peaks are co-located but attenuated.
* **Head 29 (Top-Center):**
* **Trend (w/o, Blue):** Exhibits one dominant, very high peak (reaching ~0.30) around position 20, with much lower activity elsewhere.
* **Trend (w/, Red):** The dominant peak at position 20 is drastically reduced (to ~0.10). Other minor peaks are also present but suppressed.
* **Key Data Points:** Primary blue peak at ~pos 20 (0.30). Corresponding red peak at same position (~0.10).
* **Head 31 (Top-Right):**
* **Trend (w/o, Blue):** Displays multiple high, sharp peaks (up to ~0.17) at somewhat regular intervals.
* **Trend (w/, Red):** The peaks are present at the same positions but are significantly reduced in height (highest ~0.08). The pattern appears more "smoothed."
* **Key Data Points:** Major blue peaks near positions 10, 20, 30, 40, 50. Red peaks are co-located but lower.
**Bottom Row: Effect of Meaningless Tokens (Sequence Length ~120 tokens)**
* **General Pattern (All Three Bottom Charts):** The blue line (`w/ Meaningless tokens`) shows a distinct three-phase pattern:
1. **Pre-Region (Tokens 0-16):** High, spiky attention weights.
2. **Meaningless Token Region (Tokens ~16-72, Shaded):** Attention weights drop to a very low, near-zero baseline with minimal fluctuation.
3. **Post-Region (Tokens 72-120):** Attention weights immediately return to a high, spiky pattern similar to the pre-region.
* **Head 28 (Bottom-Left):** Pre-region peaks reach ~0.045. Post-region peaks are similar in magnitude.
* **Head 29 (Bottom-Center):** Pre-region peaks are lower (~0.04). A very prominent spike occurs just after the meaningless region, around position 75, reaching ~0.08.
* **Head 31 (Bottom-Right):** Pre-region peaks are sharp (~0.04). Post-region activity is high and sustained, with multiple peaks between 0.03 and 0.05.
### Key Observations
1. **Suppression Effect:** The presence of meaningless tokens (`w/`) universally suppresses the magnitude of attention peaks compared to their absence (`w/o`), as seen in all top-row charts.
2. **Pattern Preservation:** While attenuated, the *locations* of attention peaks are largely preserved between the two conditions. The model attends to similar token positions regardless, but with less intensity when meaningless tokens are present.
3. **Attention Sink Behavior:** The bottom row provides strong evidence for the "attention sink" phenomenon. The model allocates almost no attention weight to the meaningless token segment (the shaded region), effectively ignoring it. Attention is focused on the meaningful tokens before and after this segment.
4. **Head Specialization:** Different heads show different attention patterns. Head 29 has a single dominant focus point, while Heads 28 and 31 have more distributed attention across multiple tokens.
### Interpretation
This data visualizes a key mechanism in large language model inference. The "meaningless tokens" (likely a padding or separator sequence) act as an attention sink. The model learns to bypass them, dedicating its attentional capacity almost exclusively to the semantically meaningful parts of the input (the text before and after the sink).
The top-row comparison suggests that the *potential* attention pattern (blue lines) is more peaked and intense. When forced to process meaningless tokens (red lines), the model's attention is "diluted" or regularized, leading to lower peak weights but a similar focus distribution. This could imply that meaningless tokens introduce a form of noise that the model must work to filter out, slightly reducing the efficiency or sharpness of its attention mechanism for the core task.
The stark contrast in the bottom charts is particularly telling. The near-zero attention within the shaded region is not a failure but an optimized strategy. It allows the model to maintain a long context window (120+ tokens) without wasting computational resources on irrelevant information, effectively "resetting" its attention after the sink. This behavior is crucial for efficient processing of documents or conversations with structural separators.
</details>
Figure 19: The average attention weights of Llama3.1-8B-Instruct in Head 28, 29, 31.
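The "attention collapse" described in these panels can be quantified by comparing the mean weight inside the annotated span against the rest of the sequence. A sketch on a synthetic curve (the span boundaries 16-72 follow the figure annotation; the curve values are made up for illustration):

```python
import torch

def span_attention_mass(curve: torch.Tensor, start: int, end: int):
    # curve: (seq_len,) average attention per key position (one head).
    # Returns the mean weight inside vs. outside the [start, end) span.
    inside = curve[start:end].mean().item()
    outside = torch.cat([curve[:start], curve[end:]]).mean().item()
    return inside, outside

# Synthetic curve mimicking the reported three-phase shape:
# spiky edges with a near-zero plateau over the meaningless-token span.
torch.manual_seed(0)
curve = torch.rand(120) * 0.04   # spiky baseline
curve[16:72] = 0.001             # suppressed plateau over the block
inside, outside = span_attention_mass(curve, 16, 72)
```

A large outside/inside ratio is a simple numeric signature of the sink behavior the figures show qualitatively.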
<details>
<summary>x20.png Details</summary>

### Visual Description
## Multi-Panel Line Chart: Attention Weight Analysis for Gemma3-4b-it Layer 1 Heads
### Overview
The image displays a 2x3 grid of six line charts analyzing the "Average Attention Weight" across token positions for three different attention heads (Head 1, Head 4, Head 8) in Layer 1 of the "Gemma3-4b-it" model. The charts compare two conditions: with and without "Meaningless tokens." The top row shows a token position range of 0-60, while the bottom row extends the range to 0-120 and includes a shaded region explicitly marking the "Meaningless tokens."
### Components/Axes
* **Chart Type:** Multi-panel line chart with filled areas under the lines.
* **Titles:** Each of the six subplots has a title indicating the model layer and head:
* Top Row (Left to Right): "Gemma3-4b-it Layer1 Head 1", "Gemma3-4b-it Layer1 Head 4", "Gemma3-4b-it Layer1 Head 8"
* Bottom Row (Left to Right): Same titles as above, corresponding to the heads in the top row.
* **Y-Axis:** Labeled "Average Attention Weight" for all charts. The scale varies per subplot:
* Top Row: Head 1 (0.00-0.08), Head 4 (0.00-0.10), Head 8 (0.0-0.6)
* Bottom Row: Head 1 (0.00-0.06), Head 4 (0.00-0.05), Head 8 (0.00-0.07)
* **X-Axis:** Represents token position index. The scale differs between rows:
* Top Row: 0 to 60, with major ticks every 10 units.
* Bottom Row: 0 to 120, with major ticks every 20 units.
* **Legend:** Present in the top-right corner of each subplot in the top row and each subplot in the bottom row.
* **Blue Line/Area:** "w/o Meaningless tokens" (without)
* **Red Line/Area:** "w/ Meaningless tokens" (with)
* **Special Annotation (Bottom Row):** A gray shaded region spanning from x=0 to approximately x=20 in each bottom chart is labeled "Meaningless tokens" in black text.
### Detailed Analysis
**Top Row (Token Positions 0-60):**
* **Head 1:** The blue line ("w/o") shows a very sharp, prominent peak reaching ~0.08 around position 15. The red line ("w/") is generally lower and more stable, with smaller peaks, never exceeding ~0.04.
* **Head 4:** Both lines show significant volatility. The red line ("w/") exhibits several sharp peaks, with the highest reaching ~0.10 near position 55. The blue line ("w/o") has a major peak around position 45 (~0.08).
* **Head 8:** The y-axis scale is much larger (0.6 max). The red line ("w/") dominates, showing a dramatic, exponential-like increase starting around position 40 and peaking at ~0.55 near position 60. The blue line ("w/o") remains very low (<0.1) throughout.
**Bottom Row (Token Positions 0-120, with "Meaningless tokens" region):**
* **Head 1:** Within the shaded "Meaningless tokens" region (0-20), the blue line shows high attention, peaking at ~0.06. After position 20, the line drops to a very low baseline (<0.01).
* **Head 4:** Similar pattern: high, volatile attention within the meaningless-token region (0-20), peaking at ~0.05, followed by a low baseline with minor fluctuations after position 20.
* **Head 8:** High attention in the meaningless-token region (0-20), with multiple peaks up to ~0.07. After position 20, the line drops to a low baseline but shows a notable, isolated spike to ~0.04 around position 70.
* **Labeling Note:** Only a single line is plotted in each bottom-row chart. Although it is drawn in blue (the color the shared legend assigns to `w/o Meaningless tokens`), the shaded annotation indicates the plotted sequence does contain the meaningless-token block, so the bottom-row color coding is inconsistent with the top row.
### Key Observations
1. **Divergent Behavior by Head:** The three heads exhibit fundamentally different attention patterns. Head 8 (top row) shows an extreme, late-sequence focus when meaningless tokens are present, while Head 1 shows an early, sharp focus when they are absent.
2. **Impact of "Meaningless Tokens":** The bottom row charts suggest that when the sequence contains a block of "Meaningless tokens" at the beginning (positions 0-20), the model's attention (blue line) is heavily concentrated on that block. After the block ends, attention to subsequent tokens drops dramatically.
3. **Scale Discrepancy:** The attention weight magnitudes vary greatly between heads. Head 8 operates on a scale an order of magnitude larger than Head 1 or 4 in the top row comparison.
4. **Volatility:** Head 4 displays the most volatile attention patterns in both conditions in the top row.
### Interpretation
This visualization investigates how the presence of semantically "meaningless" tokens (e.g., padding, special control tokens, or filler text) affects the internal attention mechanisms of a large language model (Gemma3-4b-it).
The data suggests that meaningless tokens act as a strong **attention sink**. In the bottom charts, the model's attention is disproportionately drawn to this initial block, potentially at the expense of attending to later, meaningful content. This is evidenced by the sharp drop in attention weight after the meaningless token region ends.
The comparison in the top row (which likely shows a sequence without a dedicated meaningless block) reveals that different attention heads specialize in different patterns. Head 8, in particular, may be responsible for capturing long-range dependencies or sequence-end phenomena, as its attention skyrockets at the end of the 60-token window when meaningless tokens are part of the context ("w/").
**Practical Implication:** This analysis is crucial for model efficiency and performance. If meaningless tokens consume a large portion of the attention budget, it could degrade the model's ability to process the actual meaningful context. Techniques like "attention sink" removal or token pruning might be informed by such visualizations to improve inference speed and focus. The stark differences between heads also highlight the specialized, non-uniform nature of attention in transformer models.
</details>
Figure 20: The average attention weights of Gemma3-4b-it in Head 1, 4, 8.
The full case study is provided in the accompanying file case_study.pdf.