## Kimi Linear: An Expressive, Efficient Attention Architecture

Technical Report

## Kimi Team

https://github.com/MoonshotAI/Kimi-Linear
## ABSTRACT
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet [111] with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule.
We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6× decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including on tasks with longer input and output lengths.
To support further research, we open-source the KDA kernel and vLLM implementations,¹ and release the pre-trained and instruction-tuned model checkpoints.²
Figure 1: (a) Performance vs. acceleration. Under strictly fair comparisons with 1.4T training tokens, on MMLU-Pro (4k context length, red stars), Kimi Linear leads in performance (51.0) at similar speed. On RULER (128k context length, blue circles), it is Pareto-optimal, achieving top performance (84.3) and 3.98× acceleration. (b) Time per output token (TPOT) vs. decoding length. Kimi Linear (blue line) maintains a low TPOT, matching GDN-H and outperforming MLA at long sequences. This enables larger batches, yielding a 6.3× faster TPOT (1.84 ms vs. 11.48 ms) than MLA at 1M tokens.
<details>
<summary>Image 2 Details</summary>

### Visual Description
Two-panel figure. (a) Scatter plot of performance vs. decoding acceleration for MLA, GDN-H, and Kimi Linear on MMLU-Pro (4k context, red stars) and RULER (128k context, blue circles); Kimi Linear attains the best score in both settings. (b) Line chart of time per output token (TPOT, ms) vs. decoding length from 4K to 1M tokens (log-scale x-axis); all curves stay flat through 256K, after which MLA rises to roughly 11.5 ms at 1M while Kimi Linear remains lowest, the gap annotated with a 6.3× arrow.
</details>
¹ https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/kda
² https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
## 1 Introduction
As large language models (LLMs) evolve into increasingly capable agents [50], the computational demands of inference, particularly in long-horizon and reinforcement learning (RL) settings, are becoming a central bottleneck. This shift toward RL test-time scaling [95, 33, 80, 74, 53], where models must process extended trajectories, tool-use interactions, and complex decision spaces at inference time, exposes fundamental inefficiencies in standard attention mechanisms. In particular, the quadratic time complexity and linearly growing key-value (KV) cache of softmax attention introduce substantial computational and memory overheads, hindering throughput, context-length scaling, and real-time interactivity.
Linear attention [48] offers a principled approach to reducing computational complexity but has historically underperformed softmax attention in language modeling, even for short sequences, due to limited expressivity. Recent advances have significantly narrowed this gap, primarily through two innovations: gating or decay mechanisms [92, 16, 114] and the delta rule [84, 112, 111, 71]. Together, these developments have pushed linear attention closer to softmax-level quality on moderate-length sequences. Nevertheless, purely linear structures remain fundamentally constrained by their finite-state capacity, making long-sequence modeling and in-context retrieval theoretically challenging [104, 4, 45].
Hybrid architectures that combine softmax and linear attention, using a few global-attention layers alongside predominantly faster linear layers, have thus emerged as a practical compromise between quality and efficiency [57, 100, 66, 12, 32, 81]. However, previous hybrid models often operated at limited scale or lacked comprehensive evaluation across diverse benchmarks. The core challenge remains: to develop an attention architecture that matches or surpasses full attention in quality while achieving substantial efficiency gains in both speed and memory, an essential step toward enabling the next generation of agentic, decoding-heavy LLMs.
In this work, we present Kimi Linear, a hybrid linear attention architecture designed to meet the efficiency demands of agentic intelligence and test-time scaling without compromising quality. At its core lies Kimi Delta Attention (KDA), a hardware-efficient linear attention module that extends Gated DeltaNet [111] with a finer-grained gating mechanism. While GDN, similar to Mamba2 [16], employs a coarse head-wise forget gate, KDA introduces a channel-wise variant in which each feature dimension maintains an independent forgetting rate, akin to Gated Linear Attention (GLA) [114]. This fine-grained design enables more precise regulation of the finite-state RNN memory, unlocking the potential of RNN-style models within hybrid architectures.
Crucially, KDA parameterizes its transition dynamics with a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) matrices [30, 71], enabling a bespoke chunkwise-parallel algorithm that substantially reduces computation relative to general DPLR formulations while remaining consistent with the classical delta rule.
Kimi Linear interleaves KDA with periodic full attention layers in a uniform 3:1 ratio. This hybrid structure reduces memory and KV-cache usage by up to 75% during long-sequence generation while preserving global information flow via the full attention layers. Through matched-scale pretraining and evaluation, we show that Kimi Linear consistently matches or outperforms strong full-attention baselines across short-context, long-context, and RL-style post-training tasks, while achieving up to 6× higher decoding throughput at 1M context length.
To facilitate further research, we release open-source KDA kernels with vLLM integration, as well as pre-trained and instruction-tuned checkpoints. These components are drop-in compatible with existing full-attention pipelines, requiring no modification to caching or scheduling interfaces, thereby facilitating research on hybrid architectures.
## Contributions
- Kimi Delta Attention (KDA): a linear attention mechanism that refines the gated delta rule with improved recurrent memory management and hardware efficiency.
- The Kimi Linear architecture: a hybrid design adopting a 3:1 KDA-to-global attention ratio, reducing memory footprint while surpassing full-attention quality.
- Fair empirical validation at scale: through 1.4T token training runs, Kimi Linear outperforms full attention and other baselines in short/long context and RL-style evaluations, with full release of kernels, vLLM integration, and checkpoints.
## 2 Preliminary
In this section, we introduce the technical background related to our proposed Kimi Delta Attention.
## 2.1 Notation
In this paper, $\square_t \in \mathbb{R}^{d_k}$ or $\mathbb{R}^{d_v}$, where $\square \in \{q, k, v, o, u, w\}$, denotes the corresponding column vector at step $t$, and $S_t \in \mathbb{R}^{d_k \times d_v}$ represents the matrix-form memory state. $M$ and $M^-$ denote lower-triangular masks with and without diagonal elements, respectively; for convenience, we also write them as $\mathrm{Tril}$ and $\mathrm{StrictTril}$.
Chunk-wise Formulation. Suppose the sequence is split into $L/C$ chunks, each of length $C$. For $\square \in \{Q, K, V, O, U, W\}$, $\square_{[t]} \in \mathbb{R}^{C \times d}$ stacks the vectors within the $t$-th chunk, and $\square^r_{[t]} = \square_{tC+r}$ is the $r$-th element of the chunk, where $t \in [0, L/C)$ and $r \in [1, C]$. State matrices are re-indexed such that $S^i_{[t]} = S_{tC+i}$. Additionally, $S_{[t]} := S^0_{[t]} = S^C_{[t-1]}$, i.e., the initial state of a chunk is the last state of the previous chunk.
Decay Formulation. We define the cumulative decay $\gamma^{i \to j}_{[t]} := \prod_{k=i}^{j} \alpha^k_{[t]}$, and abbreviate $\gamma^{1 \to r}_{[t]}$ as $\gamma^r_{[t]}$. Additionally, $A_{[t]} \in \mathbb{R}^{C \times C}$ is the matrix with elements $\gamma^i_{[t]} / \gamma^j_{[t]}$. $\mathrm{Diag}(\alpha_t)$ denotes the fine-grained decay, $\mathrm{Diag}(\gamma^{i \to j}_{[t]}) := \prod_{k=i}^{j} \mathrm{Diag}(\alpha^k_{[t]})$, and $\Gamma^{i \to j}_{[t]} \in \mathbb{R}^{C \times d_k}$ is the matrix stacking $\gamma^i_{[t]}$ through $\gamma^j_{[t]}$.
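As a quick concreteness check, the cumulative decays above reduce to element-wise cumulative products. The following is a minimal numpy sketch; the chunk size, channel count, and random gates are toy assumptions, not the paper's configuration:

```python
import numpy as np

# Toy illustration of the decay notation: one chunk of length C with d_k channels.
C, d_k = 4, 3
rng = np.random.default_rng(0)
alpha = rng.uniform(0.9, 1.0, size=(C, d_k))  # per-step, per-channel decay alpha^r

# gamma^r := gamma^{1->r}: row r holds the elementwise product alpha^1 * ... * alpha^r.
gamma = np.cumprod(alpha, axis=0)

def gamma_range(i: int, j: int) -> np.ndarray:
    """gamma^{i->j} = prod_{k=i}^{j} alpha^k, with 1-based inclusive indices."""
    return np.prod(alpha[i - 1:j], axis=0)

# Consistency with the definitions: gamma^{1->r} equals gamma^r,
# and gamma^{i->j} equals gamma^j / gamma^{i-1} for i > 1.
assert np.allclose(gamma_range(1, 3), gamma[2])
assert np.allclose(gamma_range(2, 4), gamma[3] / gamma[0])
```

The ratio identity in the last line is what makes entries of the form $\gamma^i / \gamma^j$ cheap to form from a single cumulative product.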
## 2.2 Linear Attention and the Gated Delta Rule
Linear Attention as Online Learning. Linear attention [48] maintains a matrix-valued recurrent state that accumulates key-value associations:
$$S _ { t } = S _ { t - 1 } + k _ { t } v _ { t } ^ { \top } , \quad o _ { t } = S _ { t } ^ { \top } q _ { t } .$$
From the fast-weight perspective [84, 85], S t serves as an associative memory storing transient mappings from keys to values. This update can be viewed as performing gradient descent on the unbounded correlation objective
$$\mathcal { L } _ { t } ( S ) = - \langle S ^ { \top } k _ { t } , v _ { t } \rangle ,$$
which continually reinforces recent key-value pairs without any forgetting. However, such an objective provides no criterion for which memories to erase, and the accumulated state grows unbounded, leading to interference over long contexts.
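To make the recurrence concrete, here is a minimal numpy sketch of ungated linear attention; the dimensions and random inputs are toy assumptions, not the paper's kernel:

```python
import numpy as np

# Ungated linear attention: S_t = S_{t-1} + k_t v_t^T, o_t = S_t^T q_t.
L, d_k, d_v = 5, 4, 3
rng = np.random.default_rng(1)
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
v = rng.standard_normal((L, d_v))

S = np.zeros((d_k, d_v))  # matrix-valued associative memory
outs = []
for t in range(L):
    S = S + np.outer(k[t], v[t])  # write: accumulate the key-value association
    outs.append(S.T @ q[t])       # read: query the memory
O = np.stack(outs)

# With no forgetting, the final state is simply the sum of outer products, K^T V,
# which is exactly why interference accumulates over long contexts.
assert np.allclose(S, k.T @ v)
```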
DeltaNet: Online Gradient Descent on Reconstruction Loss. DeltaNet [84] reinterprets this recurrence as online gradient descent on a reconstruction objective:
$$\begin{array} { r } { \mathcal { L } _ { t } ( S ) = \frac { 1 } { 2 } \| S ^ { \top } k _ { t } - v _ { t } \| ^ { 2 } . } \end{array}$$
Taking a gradient step with learning rate β t gives
$$\begin{array} { r } { S _ { t } = S _ { t - 1 } - \beta _ { t } \nabla _ { s } \mathcal { L } _ { t } ( S _ { t - 1 } ) = ( I - \beta _ { t } k _ { t } k _ { t } ^ { \top } ) S _ { t - 1 } + \beta _ { t } k _ { t } v _ { t } ^ { \top } . } \end{array}$$
This rule, the classical delta rule, treats $S$ as a learnable associative memory that continually corrects itself toward the mapping $k_t \mapsto v_t$. The rank-1 update structure, equivalent to a generalized Householder transformation, supports hardware-efficient chunkwise parallelization [11, 112].
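The equivalence between the delta rule and an explicit gradient step can be checked numerically; below is a small sketch with toy shapes (all sizes and values assumed for illustration):

```python
import numpy as np

# One delta-rule update equals one gradient step on the reconstruction loss
# L(S) = 0.5 * ||S^T k_t - v_t||^2.
d_k, d_v = 4, 3
rng = np.random.default_rng(2)
S_prev = rng.standard_normal((d_k, d_v))
k_t = rng.standard_normal(d_k)
v_t = rng.standard_normal(d_v)
beta_t = 0.5  # learning rate of the online update

# Explicit gradient step: grad_S L = k_t (S^T k_t - v_t)^T.
grad = np.outer(k_t, S_prev.T @ k_t - v_t)
S_grad = S_prev - beta_t * grad

# Closed-form delta rule: (I - beta k k^T) S + beta k v^T.
S_delta = (np.eye(d_k) - beta_t * np.outer(k_t, k_t)) @ S_prev \
          + beta_t * np.outer(k_t, v_t)

assert np.allclose(S_grad, S_delta)
```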
Gated DeltaNet as Weight Decay. Although DeltaNet stabilizes learning, it still retains outdated associations indefinitely. Gated DeltaNet (GDN) [111] introduces a scalar forget gate $\alpha_t \in [0, 1]$, yielding
$$S _ { t } = \alpha _ { t } ( I - \beta _ { t } k _ { t } k _ { t } ^ { \top } ) S _ { t - 1 } + \beta _ { t } k _ { t } v _ { t } ^ { \top } .$$
Here, $\alpha_t$ acts as a form of weight decay on the fast weights [8], implementing a forgetting mechanism analogous to data-dependent $L_2$ regularization. This simple yet effective modification provides a principled way to control memory lifespan and mitigate interference, improving both stability and long-context generalization while preserving DeltaNet's parallelizable structure.
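The gate composes cleanly with the delta-rule write: decaying the state first and then applying the ungated delta rule gives the same result, which is the weight-decay reading above. A toy numeric check (all shapes and values assumed):

```python
import numpy as np

# GDN step: S_t = alpha_t (I - beta_t k_t k_t^T) S_{t-1} + beta_t k_t v_t^T,
# with a *scalar* forget gate alpha_t.
d_k, d_v = 4, 3
rng = np.random.default_rng(5)
S_prev = rng.standard_normal((d_k, d_v))
k_t = rng.standard_normal(d_k)
k_t /= np.linalg.norm(k_t)        # unit-norm key keeps the update well-behaved
v_t = rng.standard_normal(d_v)
alpha_t, beta_t = 0.95, 0.5

P = np.eye(d_k) - beta_t * np.outer(k_t, k_t)
S_gdn = alpha_t * (P @ S_prev) + beta_t * np.outer(k_t, v_t)

# Weight-decay view: first shrink the fast weights (S <- alpha S),
# then apply the plain delta rule to the decayed state.
S_alt = P @ (alpha_t * S_prev) + beta_t * np.outer(k_t, v_t)
assert np.allclose(S_gdn, S_alt)
```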
From this perspective, we observe that GDN can be interpreted as a form of multiplicative positional encoding in which the transition matrix is data-dependent and learnable, relaxing the orthogonality constraint of RoPE [115].³
³ When the state transition matrix preserves orthogonality, absolute positional encodings can also be applied independently to $q$ and $k$, converting them into relative positional encodings during the attention computation [87].
## 3 Kimi Delta Attention: Improving Delta Rule with Fine-grained Gating
We propose Kimi Delta Attention (KDA), a new gated linear attention variant that refines GDN's scalar decay with a channel-wise diagonal gate $\mathrm{Diag}(\alpha_t)$, enabling finer control over memory decay and positional awareness (as discussed in §6.1). We begin by introducing the chunkwise parallelization of KDA, showing how a series of rank-1 matrix transformations can be compressed into a dense representation while maintaining stability under diagonal gating. We then highlight the efficiency gains of KDA over the standard DPLR (Diagonal-Plus-Low-Rank) formulation [30, 71].
$$S_t = \left(I - \beta_t k_t k_t^{\top}\right) \mathrm{Diag}(\alpha_t)\, S_{t-1} + \beta_t k_t v_t^{\top} \in \mathbb{R}^{d_k \times d_v}; \quad o_t = S_t^{\top} q_t \in \mathbb{R}^{d_v} \quad (1)$$
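A minimal numpy sketch of this per-token recurrence follows; the sizes and random inputs are toy assumptions, and the released implementation uses the chunkwise kernel of §3.1 rather than a sequential loop:

```python
import numpy as np

# KDA recurrence: S_t = (I - beta_t k_t k_t^T) Diag(alpha_t) S_{t-1}
#                       + beta_t k_t v_t^T,   o_t = S_t^T q_t.
L, d_k, d_v = 6, 4, 3
rng = np.random.default_rng(3)
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
k /= np.linalg.norm(k, axis=-1, keepdims=True)   # L2-normalized keys
v = rng.standard_normal((L, d_v))
alpha = rng.uniform(0.9, 1.0, size=(L, d_k))     # channel-wise forget gates
beta = rng.uniform(0.0, 1.0, size=L)             # per-token write strengths

S = np.zeros((d_k, d_v))
outs = []
for t in range(L):
    decayed = alpha[t][:, None] * S                       # Diag(alpha_t) S_{t-1}
    S = decayed - beta[t] * np.outer(k[t], k[t] @ decayed) \
        + beta[t] * np.outer(k[t], v[t])                  # gated delta-rule write
    outs.append(S.T @ q[t])                               # o_t = S_t^T q_t
O = np.stack(outs)
assert O.shape == (L, d_v)
```

The only change relative to GDN is the channel-wise `alpha[t]`, which lets each key dimension forget at its own rate.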
<details>
<summary>Image 3 Details</summary>

### Visual Description
Conceptual heatmap diagram of the KDA state update: colored matrix and vector blocks, linked by subtraction, multiplication, and addition symbols, illustrate how the rank-1 delta-rule correction composes with the diagonal decay to produce the new state. Colors encode values only qualitatively; no numerical scale is given.
</details>
## 3.1 Hardware-Efficient Chunkwise Algorithm
By partially unrolling the recurrence in Eq. 1 into a chunk-wise formulation, we have:
$$S^r_{[t]} = \underbrace{\left(\prod_{i=1}^{r}\left(I - \beta^i_{[t]} k^i_{[t]} k^{i\top}_{[t]}\right) \mathrm{Diag}(\alpha^i_{[t]})\right)}_{:= P^r_{[t]}} S^0_{[t]} + \underbrace{\sum_{i=1}^{r}\left(\prod_{j=i+1}^{r}\left(I - \beta^j_{[t]} k^j_{[t]} k^{j\top}_{[t]}\right) \mathrm{Diag}(\alpha^j_{[t]})\right) \beta^i_{[t]} k^i_{[t]} v^{i\top}_{[t]}}_{:= H^r_{[t]}} \quad (2)$$
The WY representation is typically employed to pack a series of rank-1 updates into a single compact representation [11]. We follow the formulation of $P$ in Comba [40] to avoid the need for an additional matrix inversion in subsequent computations.
$$P^r_{[t]} = \mathrm{Diag}(\gamma^r_{[t]}) - \sum_{i=1}^{r} \mathrm{Diag}(\gamma^{i \to r}_{[t]}) k^i_{[t]} w^{i\top}_{[t]}, \qquad H^r_{[t]} = \sum_{i=1}^{r} \mathrm{Diag}(\gamma^{i \to r}_{[t]}) k^i_{[t]} u^{i\top}_{[t]} \quad (3)$$
where the auxiliary vectors $w_t \in \mathbb{R}^{d_k}$ and $u_t \in \mathbb{R}^{d_v}$ are computed via the following recurrences:
$$w^r_{[t]} = \beta^r_{[t]} \left(\mathrm{Diag}(\gamma^r_{[t]}) k^r_{[t]} - \sum_{i=1}^{r-1} w^i_{[t]} \left(k^{i\top}_{[t]} \mathrm{Diag}(\gamma^{i \to r}_{[t]}) k^r_{[t]}\right)\right) \quad (4)$$
$$u^r_{[t]} = \beta^r_{[t]} \left(v^r_{[t]} - \sum_{i=1}^{r-1} u^i_{[t]} \left(k^{i\top}_{[t]} \mathrm{Diag}(\gamma^{i \to r}_{[t]}) k^r_{[t]}\right)\right) \quad (5)$$
UT transform. We apply the UT transform [46] to reduce non-matmul FLOPs, which is crucial for achieving high hardware utilization during training.
$$M_{[t]} = \left(I + \mathrm{StrictTril}\left(\mathrm{Diag}(\beta_{[t]}) \left(\Gamma^{1 \to C}_{[t]} \odot K_{[t]}\right) \left(\frac{K_{[t]}}{\Gamma^{1 \to C}_{[t]}}\right)^{\top}\right)\right)^{-1} \mathrm{Diag}(\beta_{[t]}) \quad (6)$$
$$\begin{array} { r l } { W _ { [ t ] } = M _ { [ t ] } \left ( \Gamma ^ { 1 \rightarrow C } _ { [ t ] } \odot K _ { [ t ] } \right ) , } & { U _ { [ t ] } = M _ { [ t ] } V _ { [ t ] } } & { ( 7 ) } \end{array}$$
The inverse of a lower-triangular matrix can be computed efficiently in an iterative row-wise manner via forward substitution, as in Gaussian elimination [28].
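As an illustration of that forward-substitution step, the following toy sketch inverts a unit lower-triangular matrix row by row (sizes and values assumed; not the fused kernel):

```python
import numpy as np

# Invert T = I + StrictTril(A) by forward substitution: since T is unit
# lower-triangular, row r of T^{-1} depends only on rows < r, already computed.
rng = np.random.default_rng(4)
C = 5
A = np.tril(rng.standard_normal((C, C)), k=-1)  # strictly lower-triangular part
T = np.eye(C) + A

Tinv = np.eye(C)
for r in range(1, C):
    # From T @ Tinv = I with T[r, r] = 1:
    #   Tinv[r, :r] = -T[r, :r] @ Tinv[:r, :r].
    Tinv[r, :r] = -T[r, :r] @ Tinv[:r, :r]

assert np.allclose(Tinv @ T, np.eye(C))
assert np.allclose(Tinv, np.linalg.inv(T))
```

Because each row is a small matrix-vector product against already-finished rows, the same scheme maps naturally onto blockwise computation inside a chunk.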
Equivalently, in matrix form, we can update the state chunk-wise:
$$S_{[t+1]} = \mathrm{Diag}(\gamma^C_{[t]}) S_{[t]} + \left(\Gamma^{i \to C}_{[t]} \odot K_{[t]}\right)^{\top} \left(U_{[t]} - W_{[t]} S_{[t]}\right) \in \mathbb{R}^{d_k \times d_v} \quad (8)$$
<details>
<summary>Image 4 Details</summary>

### Visual Description
Heatmap diagram of the chunkwise decomposition: a matrix is shown as the sum of two terms, each formed from element-wise (⊙) products of decay and key/value blocks. Colors encode magnitudes only qualitatively; no numerical scale is given.
</details>
During the output stage, we adopt an inter-block recurrent and intra-block parallel strategy to maximize matrix multiplication throughput, thereby fully utilizing the computational potential of Tensor Cores.
$$O_{[t]} = \underbrace{\left(\Gamma^{1 \to C}_{[t]} \odot Q_{[t]}\right) S_{[t]}}_{\text{inter-chunk}} + \underbrace{\mathrm{Tril}\left(\left(\Gamma^{1 \to C}_{[t]} \odot Q_{[t]}\right) \left(\frac{K_{[t]}}{\Gamma^{1 \to C}_{[t]}}\right)^{\top}\right)}_{\text{intra-chunk}} \underbrace{\left(U_{[t]} - W_{[t]} S_{[t]}\right)}_{\text{pseudo-value term}} \in \mathbb{R}^{C \times d_v} \quad (9)$$
## 3.2 Efficiency Analysis
In terms of representational capacity, KDA aligns with the generalized DPLR formulation, i.e., $S_t = (D - a_t b_t^{\top}) S_{t-1} + k_t v_t^{\top}$; both exhibit fine-grained decay behavior. However, such fine-grained decay introduces numerical precision issues during division operations (e.g., the intra-chunk computation in Eq. 9). To address this, prior work such as GLA [114] performs computations in the logarithmic domain and introduces secondary chunking in full precision. This approach, however, prevents full utilization of half-precision matrix multiplications and significantly reduces operator speed. By binding both variables $a$ and $b$ to $k$, KDA effectively alleviates this bottleneck, reducing the number of second-level chunk matrix computations from four to two and eliminating three additional matrix multiplications. As a result, KDA's operator efficiency roughly doubles compared to the DPLR formulation. A detailed analysis is provided in §6.2.
## 4 The Kimi Linear Model Architecture
The main backbone of our model architecture follows Moonlight [62]. In addition to fine-grained gating, we also leverage several components to further improve the expressiveness of Kimi Linear. The overall Kimi Linear architecture is shown in Figure 3.
Neural Parameterization. Let $x_t \in \mathbb{R}^d$ be the $t$-th token's input representation; the inputs to KDA for each head $h$ are computed as follows:
$$\begin{aligned} q_t^h, k_t^h &= \mathrm{L2Norm}(\mathrm{Swish}(\mathrm{ShortConv}(W_{q/k}^h x_t))) \in \mathbb{R}^{d_k} \\ v_t^h &= \mathrm{Swish}(\mathrm{ShortConv}(W_v^h x_t)) \in \mathbb{R}^{d_v} \\ \alpha_t^h &= f(W_\alpha^{\uparrow} W_\alpha^{\downarrow} x_t) \in [0, 1]^{d_k} \\ \beta_t^h &= \mathrm{Sigmoid}(W_\beta^h x_t) \in [0, 1] \end{aligned}$$
where $d_k, d_v$ denote the key and value head dimensions, both set to 128 in all experiments. For $q, k, v$, we apply a ShortConv followed by a Swish activation, following [111]. The $q$ and $k$ representations are further normalized with L2Norm to ensure eigenvalue stability, as suggested by [112]. The per-channel decay $\alpha_t^h$ is parameterized via a low-rank projection ($W_\alpha^{\downarrow}$ and $W_\alpha^{\uparrow}$, with rank equal to the head dimension) and a decay function $f(\cdot)$ similar to those
Figure 2: Execution time of kernels for varying input lengths, with a uniform batch size of 1 and 16 heads.
<details>
<summary>Image 5 Details</summary>

### Visual Description
Line chart of kernel execution time (ms) vs. input length (2K to 64K) for DPLR (dashed) and KDA, ours (solid). KDA is consistently faster and the gap widens with length, reaching roughly 31 ms vs. 57 ms at 64K, i.e., close to a 2× speedup.
</details>
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Diagram: Mixture of Experts (MoE) Architecture
### Overview
The image depicts a diagram illustrating the architecture of a Mixture of Experts (MoE) model, likely within a larger neural network. The diagram shows two main blocks repeated N times, with a routing mechanism connecting them. The blocks contain layers labeled "MoE", "Norm", "MLA" (Multi-Layer Attention), and "KDA" (Kimi Delta Attention). The diagram also details the internal structure of the MoE layer, showing how inputs are routed to different experts.
### Components/Axes
The diagram consists of four main sections arranged in a 2x2 grid. Each section represents a component or a detailed view of a component within the MoE architecture. The key components and labels are:
* **MoE:** Mixture of Experts layer.
* **Norm:** Normalization layer.
* **MLA:** Multi-Layer Attention.
* **KDA:** Kimi Delta Attention.
* **Router:** The routing mechanism within the MoE layer.
* **Linear:** Linear transformation layer.
* **Conv:** Convolutional layer.
* **L2:** Layer labeled "L2".
* **1x:** Indicates the first block is executed once.
* **Nx:** Indicates the second block is executed N times.
* **Ns:** Indicates the number of experts.
* **Shared Expert:** Represented by a light gray box.
* **Routed Expert:** Represented by a dark gray box.
* Circles with lines connecting them: Represent connections and data flow.
* Plus signs within circles: Represent addition operations.
### Detailed Analysis or Content Details
**Top-Left Block (1x):**
This block represents the initial processing stage. It consists of:
1. A "MoE" layer (purple).
2. A "Norm" layer (yellow).
3. An "MLA" layer (red).
4. A "Norm" layer (yellow).
The output of this block is fed into the top-right block.
**Top-Right Block (MoE Detail):**
This block details the internal workings of the "MoE" layer.
1. An input is split and fed into Ns experts (labeled 1 to Ns).
2. A "Router" component directs the input to the experts.
3. The outputs of the experts are combined.
4. The legend indicates that light gray boxes represent "Shared Experts" and dark gray boxes represent "Routed Experts".
5. The outputs of the experts are multiplied with a value (represented by the circle with an 'x' inside).
**Bottom-Left Block (Nx):**
This block represents the repeated processing stage, executed N times. It consists of:
1. A "MoE" layer (purple).
2. A "Norm" layer (yellow).
3. A "KDA" layer (orange).
4. A "Norm" layer (yellow).
The output of this block is fed into the bottom-right block.
**Bottom-Right Block (KDA Detail):**
This block details the internal workings of the "KDA" layer.
1. An input is passed through two "Conv" layers (light blue).
2. Each "Conv" layer is followed by a "Linear" layer.
3. The outputs of the "Linear" layers are combined.
4. The combined output is passed through a "Kimi Delta Attention" layer (yellow).
5. The output of the "Kimi Delta Attention" layer is passed through a "Norm" layer (yellow).
6. Finally, a "Linear" layer transforms the output.
### Key Observations
* The MoE layer is a central component, appearing in both the initial and repeated processing stages.
* The diagram highlights the routing mechanism within the MoE layer, indicating that different inputs are directed to different experts.
* The KDA layer appears to be a specialized attention mechanism.
* The diagram shows a clear flow of data from left to right, with the repeated blocks suggesting an iterative process.
* The use of "Norm" layers throughout the architecture suggests a focus on stabilizing training and improving performance.
### Interpretation
The diagram illustrates a sophisticated neural network architecture leveraging the Mixture of Experts paradigm. The MoE layers allow the model to specialize in different aspects of the input data, potentially leading to increased capacity and improved performance. The routing mechanism dynamically assigns inputs to the most appropriate experts, enabling efficient learning and generalization. The KDA layer suggests a novel attention mechanism tailored to the specific needs of the model. The repeated blocks (Nx) indicate that the model processes the input data iteratively, refining its representation over multiple stages. The diagram suggests a model designed for complex tasks requiring high capacity and adaptability. The use of normalization layers throughout the architecture indicates a focus on stable training and robust performance. The diagram does not provide any quantitative data or performance metrics, but it clearly outlines the architectural principles and key components of the MoE model. The diagram is a conceptual illustration of the architecture, and does not contain any numerical data.
</details>
Figure 3: Illustration of our Kimi Linear model architecture, which consists of a stack of blocks containing a token mixing layer followed by a MoE channel-mixing layer. Specifically, we interleave N KDA layers with one MLA layer for token mixing, where N is set to 3 in our implementation.
used in GDN and Mamba [111, 16]. Before the output projection through $W_o \in \mathbb{R}^{d \times d}$, we apply a head-wise RMSNorm [122] and a data-dependent gating mechanism [79], parameterized as:
$$o_t = W_o \left( \mathrm{Sigmoid}\left( W_g^{\uparrow} W_g^{\downarrow} x_t \right) \odot \mathrm{RMSNorm}\left( \mathrm{KDA}\left( q_t, k_t, v_t, \alpha_t, \beta_t \right) \right) \right)$$
Here, the output gate adopts a low-rank parameterization similar to that of the forget gate. This keeps the parameter count comparable for fair comparison, while matching the performance of full-rank gating and alleviating the attention-sink phenomenon [79]. The choice of nonlinear activation function is further discussed in §5.2.
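As a concrete illustration, the per-head parameterization above can be sketched in NumPy. This is a simplified sketch, not the released kernel: the ShortConv is omitted, and a plain Sigmoid stands in for the unspecified decay function $f(\cdot)$, so the names and shapes here are illustrative assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kda_inputs(x_t, Wq, Wk, Wv, Wa_up, Wa_down, Wb):
    """Per-head KDA input parameterization (ShortConv omitted for brevity)."""
    q = l2norm(swish(Wq @ x_t))               # unit-norm query in R^{d_k}
    k = l2norm(swish(Wk @ x_t))               # unit-norm key in R^{d_k}
    v = swish(Wv @ x_t)                       # value in R^{d_v}
    alpha = sigmoid(Wa_up @ (Wa_down @ x_t))  # per-channel decay in (0,1)^{d_k}
    beta = sigmoid(Wb @ x_t)                  # scalar learning rate in (0,1)
    return q, k, v, alpha, beta
```

The low-rank decay projection (`Wa_down` then `Wa_up`) mirrors the $W_\alpha^{\downarrow}, W_\alpha^{\uparrow}$ factorization with rank equal to the head dimension.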
Hybrid model architecture Long-context retrieval remains the primary bottleneck for pure linear attention; we therefore hybridize KDA with a small number of full global-attention (MLA) layers [19]. For Kimi Linear, we chose a layerwise approach (alternating entire layers) over a headwise one (mixing heads within layers) for its superior infrastructure simplicity and training stability. Empirically, a uniform 3:1 ratio, i.e., three KDA layers for every full MLA layer, provided the best quality-throughput trade-off. We discuss other hybridization strategies in § 7.2.
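The layerwise 3:1 interleaving can be sketched as follows; the exact placement of the MLA layer within each block of four is our assumption for illustration.

```python
def build_token_mixers(num_layers, kda_per_mla=3):
    """Layerwise hybrid: repeat kda_per_mla KDA layers, then one full MLA layer."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]
```

With the default 3:1 ratio, one quarter of the token-mixing layers are full attention, which is what caps the KV-cache growth.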
No Position Encoding (NoPE) for MLA Layers. In Kimi Linear, we apply NoPE to all full attention (MLA) layers. This design delegates the entire responsibility for encoding positional information and recency bias (see § 6.1) to the KDA layers. KDA is thus established as the primary position-aware operator, fulfilling a role analogous to, or arguably stronger than, auxiliary components like short convolutions [3] or SWA [76]. Our findings align with prior results [110, 7, 19], which similarly demonstrate that complementing global NoPE attention with a dedicated position-aware mechanism yields competitive long-context performance.
Figure 4: Results on synthetic tasks: palindrome, multi-query associative recall (MQAR), and state tracking (stack).
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Charts: Performance Comparison of KDA, GDN, and Mamba2 Models
### Overview
This image presents six charts comparing the performance of three models – KDA, GDN, and Mamba2 – across different datasets and training regimes. The charts display accuracy as a function of either sequence length or training steps. Each chart focuses on a specific dataset: Palindrome, MQAR, and Stack, with each dataset being evaluated under two different conditions (sequence length vs. training steps).
### Components/Axes
Each chart shares the following components:
* **X-axis:** Represents the independent variable, either "Sequence length" (values: 256, 512, 1024, 2048) or "Training steps" (values: 5K, 10K, 15K, 20K). The units are explicitly stated.
* **Y-axis:** Represents "Accuracy (%)", ranging from 0 to 100.
* **Legend:** Located in the top-left corner of each chart, identifying the three models:
* KDA (represented by a solid blue line with a circle marker)
* GDN (represented by a dashed green line with a star marker)
* Mamba2 (represented by a solid orange line with a diamond marker)
* **Chart Titles:** Each chart is labeled with a letter (a, b, c) and the dataset name (Palindrome, MQAR, Stack).
### Detailed Analysis or Content Details
**Chart (a): Palindrome - Accuracy vs. Sequence Length**
* **KDA (Blue):** The line is nearly flat, maintaining an accuracy of approximately 98% across all sequence lengths.
* **GDN (Green):** The line slopes downward. Accuracy starts at approximately 80% at a sequence length of 256, decreases to around 50% at 1024, and drops to approximately 25% at 2048.
* **Mamba2 (Orange):** The line slopes downward. Accuracy starts at approximately 75% at a sequence length of 256, decreases to around 50% at 1024, and drops to approximately 25% at 2048.
**Chart (b): MQAR - Accuracy vs. Sequence Length**
* **KDA (Blue):** The line slopes downward. Accuracy starts at approximately 90% at a sequence length of 256, decreases to around 60% at 1024, and drops to approximately 30% at 2048.
* **GDN (Green):** The line slopes downward sharply. Accuracy starts at approximately 75% at a sequence length of 256, decreases to around 25% at 1024, and drops to approximately 0% at 2048.
* **Mamba2 (Orange):** The line is relatively flat, maintaining an accuracy of approximately 75% across all sequence lengths.
**Chart (c): Stack - Accuracy vs. Sequence Length**
* **KDA (Blue):** The line is nearly flat, maintaining an accuracy of approximately 98% across all sequence lengths.
* **GDN (Green):** The line slopes downward. Accuracy starts at approximately 80% at a sequence length of 256, decreases to around 50% at 1024, and drops to approximately 25% at 2048.
* **Mamba2 (Orange):** The line slopes downward. Accuracy starts at approximately 75% at a sequence length of 256, decreases to around 50% at 1024, and drops to approximately 25% at 2048.
**Chart (d): Palindrome - Accuracy vs. Training Steps**
* **KDA (Blue):** The line slopes upward sharply. Accuracy starts at approximately 25% at 5K training steps, increases to around 75% at 10K, and reaches approximately 98% at 20K.
* **GDN (Green):** The line slopes upward. Accuracy starts at approximately 0% at 5K training steps, increases to around 25% at 10K, and reaches approximately 75% at 20K.
* **Mamba2 (Orange):** The line is relatively flat, maintaining an accuracy of approximately 75% across all training steps.
**Chart (e): MQAR - Accuracy vs. Training Steps**
* **KDA (Blue):** The line is nearly flat, maintaining an accuracy of approximately 98% across all training steps.
* **GDN (Green):** The line slopes upward. Accuracy starts at approximately 0% at 5K training steps, increases to around 25% at 10K, and reaches approximately 75% at 20K.
* **Mamba2 (Orange):** The line slopes upward. Accuracy starts at approximately 25% at 5K training steps, increases to around 50% at 10K, and reaches approximately 75% at 20K.
**Chart (f): Stack - Accuracy vs. Training Steps**
* **KDA (Blue):** The line is nearly flat, maintaining an accuracy of approximately 98% across all training steps.
* **GDN (Green):** The line slopes upward. Accuracy starts at approximately 0% at 5K training steps, increases to around 25% at 10K, and reaches approximately 75% at 20K.
* **Mamba2 (Orange):** The line slopes upward. Accuracy starts at approximately 25% at 5K training steps, increases to around 50% at 10K, and reaches approximately 75% at 20K.
### Key Observations
* KDA consistently achieves the highest accuracy, particularly on the Palindrome and Stack datasets, and is relatively insensitive to changes in sequence length or training steps.
* GDN generally exhibits the lowest accuracy, and its performance degrades significantly with increasing sequence length.
* Mamba2 shows moderate performance, generally falling between KDA and GDN. Its performance improves with increasing training steps but is less affected by sequence length.
* The performance of GDN and Mamba2 is more sensitive to training steps than KDA.
### Interpretation
The data suggests that KDA is the most robust and effective model across these datasets and conditions. Its high accuracy and stability indicate a strong ability to generalize and maintain performance regardless of input size or training duration. GDN appears to struggle with longer sequences, indicating a potential limitation in its ability to handle long-range dependencies. Mamba2 offers a compromise between KDA and GDN, demonstrating reasonable performance but lacking the consistency of KDA.
The contrasting results between sequence length and training steps highlight different aspects of model performance. Sequence length tests the model's ability to process information within a fixed training budget, while training steps assess its ability to learn from more data. KDA's consistent performance suggests it efficiently utilizes both sequence information and training data. The differences in performance between the models likely stem from their underlying architectures and their capacity to capture and represent complex relationships within the data. The datasets themselves (Palindrome, MQAR, Stack) likely have varying degrees of complexity and long-range dependencies, which contribute to the observed performance differences.
</details>
We note that NoPE offers practical advantages, particularly for MLA. First, NoPE enables MLA layers to be converted into highly efficient pure Multi-Query Attention (MQA) during inference. Second, it simplifies long-context training, since it obviates the need for RoPE parameter adjustments such as frequency-base tuning or methods like YaRN [72].
## 5 Experiments
## 5.1 Synthetic tests
We start by evaluating KDA against competing linear attention methods on three synthetic tasks that serve as probes of long-context capability. Across all experiments, we adopt a consistent model configuration of 2 layers with 2 attention heads, each with a head dimension of 128. For each task, we train the model for at most 20,000 steps with a grid search over learning rates in $\{5 \times 10^{-5}, 1 \times 10^{-4}, 5 \times 10^{-4}, 1 \times 10^{-3}\}$, and report the best-performing training accuracy curves. Specifically, we compare two scenarios: (1) peak accuracy on each task as the training length increases from 256 to 2,048 tokens; and (2) the convergence speed of KDA, GDN, and Mamba2 at a fixed context length of 1,024 tokens.
Palindrome Palindrome requires the model to reproduce a given sequence of random tokens in reverse order. As illustrated in Table 5.1, given an input like 'O G R S U N E', the model must generate its exact reversal. Such copying tasks are known to be difficult for linear attention models [45], as they struggle to precisely retrieve the entire history from a compressed, fixed-size memory state.
$$\begin{array}{rcl} \text{Input} & O & G & R & S & U & N & E & \text{<SP>} \\ \text{Output} & \phi & \phi & \phi & \phi & \phi & \phi & \phi & E & N & U & S & R & G & O \end{array}$$
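A minimal generator for this copy-in-reverse task might look like the following; the separator token `<SP>` and vocabulary are illustrative.

```python
import random

def make_palindrome_example(vocab, length, rng):
    """Sample a random token sequence; the target is its exact reversal."""
    seq = [rng.choice(vocab) for _ in range(length)]
    return seq + ["<SP>"], list(reversed(seq))
```

The model sees the input up to `<SP>` and must emit the reversed sequence token by token.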
Multi Query Associative Recall (MQAR) MQAR assesses the model's ability to retrieve values associated with multiple queries that appear at various positions within the context. For instance, as shown in Table 5.1, the model is asked to recall 0 for the query B and 5 for G. This task is known to be highly correlated with language modeling performance [5].
$$\begin{array}{rcl} \text{Input} & A & 1 & C & 3 & B & 0 & M & 8 & G & 5 & E & 4 & \text{<SP>} \\ \text{Output} & \phi & \phi & \phi & \phi & \phi & \phi & \phi & \phi & \phi & \phi & \phi & \phi & 0 & 5 \end{array}$$
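A toy MQAR data generator can be sketched as below; key/value alphabets, the separator, and query placement after `<SP>` are illustrative choices.

```python
import random

def make_mqar_example(num_pairs, num_queries, rng):
    """Key-value context followed by queries; targets are the queried values."""
    keys = rng.sample([chr(ord("A") + i) for i in range(26)], num_pairs)
    kv = {k: str(rng.randrange(10)) for k in keys}
    context = [tok for k in keys for tok in (k, kv[k])]  # K V K V ...
    queries = rng.sample(keys, num_queries)
    return context + ["<SP>"] + queries, [kv[q] for q in queries]
```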
Table 1: Ablation study on the hybrid ratio of KDA to MLA attention and other key components. We list the training and validation perplexities (lower is better) for comparison. The best-performing model, used in our final experiments, is highlighted in gray.
|                        | Training PPL ( ↓ ) | Validation PPL ( ↓ ) |
|------------------------|--------------------|----------------------|
| Hybrid ratio (KDA:MLA) |                    |                      |
| 3:1                    | 9.23               | 5.65                 |
| 0:1                    | 9.45               | 5.77                 |
| 1:1                    | 9.29               | 5.66                 |
| 7:1                    | 9.23               | 5.70                 |
| 15:1                   | 9.34               | 5.82                 |
| w/o output gate        | 9.25               | 5.67                 |
| w/ Swish output gate   | 9.43               | 5.81                 |
| w/o convolution layer  | 9.29               | 5.70                 |
Stack We assess the state tracking capabilities [27] of each candidate by simulating the standard LIFO (Last In First Out) stack operations. Our setup involves 64 independent stacks, each identified by a unique ID. The model processes a sequence of two operations: 1) PUSH: an action like ' <push> 1 G' adds the element G to stack 1; 2) POP: an action like ' <pop> 0 E' requires the model to predict the element E most recently pushed onto stack 0. The objective is to accurately track the states of all stacks and predict the correct element upon each pop request.
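The stack task above can be simulated directly; the triple-token operation format mirrors the description, while the sampling probabilities are illustrative.

```python
import random

def make_stack_example(num_stacks, num_ops, vocab, rng):
    """Emit <push>/<pop> triples over independent LIFO stacks; pop targets
    are the most recently pushed elements, as the task requires."""
    stacks = {i: [] for i in range(num_stacks)}
    tokens, targets = [], []
    for _ in range(num_ops):
        sid = rng.randrange(num_stacks)
        if stacks[sid] and rng.random() < 0.5:
            elem = stacks[sid].pop()            # pop: model must predict elem
            tokens += ["<pop>", str(sid), elem]
            targets.append(elem)
        else:
            elem = rng.choice(vocab)            # push: no prediction needed
            stacks[sid].append(elem)
            tokens += ["<push>", str(sid), elem]
    return tokens, targets
```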
Figure 4 shows the final results. Across all tasks, KDA consistently achieves the highest accuracy as the sequence length increases from 256 to 2,048 tokens. In particular, on the Palindrome and recall-intensive MQAR tasks, KDA converges significantly faster than GDN. This confirms the benefits of our fine-grained decay, which enables the model to selectively forget irrelevant information while preserving crucial memories more precisely. We also observe that Mamba2 [16], a typical linear attention that uses only multiplicative decay and lacks a delta rule, fails on all tasks in our model settings.
## 5.2 Ablation on Key Components of Kimi Linear
We conducted a series of ablation studies, directly comparing variants against the smallest scaling-law model (16 heads, 16 layers). All models were trained with the same FLOPs budget and hyperparameters for a fair comparison. We report the training and validation perplexities (PPLs) in Table 1. The validation PPL is computed on a high-quality dataset whose distribution differs significantly from the pre-training corpus, emphasizing generalization under distribution shift; this explains the gap between training and validation perplexities.
Output gate We compare our default Sigmoid output gate against two variants: no gating and Swish gating. The results show that removing the gate degrades performance. Moreover, the Swish gate adopted by [111] performs substantially worse than Sigmoid. Our observation is consistent with [79], who also conclude that Sigmoid gating offers superior performance; we therefore adopt Sigmoid across all of our experiments, including GDN-H.
Convolution Layer Lightweight depthwise convolutions with a small kernel size (e.g., 4) are effective at capturing local token dependencies [3] and are widely adopted in recent architectures [16, 5, 112]. We validate their efficacy in Table 1, which shows that convolutional layers continue to play a non-negligible role in hybrid models.
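To make the role of such a short convolution concrete, here is a minimal causal depthwise convolution in NumPy (per-channel taps, kernel size 4); this is a sketch of the general technique, not the kernel used in the model.

```python
import numpy as np

def short_conv(x, w):
    """Causal depthwise conv: x is (T, d), w is (K, d). Row w[-1] weights the
    current token; earlier rows weight progressively older tokens."""
    K, d = w.shape
    T = x.shape[0]
    pad = np.vstack([np.zeros((K - 1, d)), x])  # left-pad: no future leakage
    return np.stack([np.sum(pad[t:t + K] * w, axis=0) for t in range(T)])
```

Each channel mixes only its own last `K` positions, so the cost is linear in sequence length and tiny compared to attention.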
Hybrid ratio We performed an ablation study to determine the optimal ratio of KDA linear attention layers to full MLA attention layers. Among the configurations tested, the 3:1 ratio (3 KDA layers for every MLA layer) yielded the best results, achieving the lowest training and validation losses. Other ratios showed clear trade-offs: a higher ratio (e.g., 7:1) produced a comparable training loss but significantly worse validation performance, while a lower ratio (e.g., 1:1) maintained a similar validation loss at the cost of increased inference overhead. The pure full-attention baseline (0:1) performed poorly. Thus, the 3:1 configuration offers the most effective balance between model performance and computational efficiency.
NoPE vs. RoPE As shown in Table 5, Kimi Linear consistently excels on long-context evaluations, whereas Kimi Linear (RoPE) attains similar scores only on short-context tasks. We posit that this divergence arises from how positional bias is distributed across depth. In Kimi Linear (RoPE), the global attention layers carry a strong, explicit relative positional signal, while the linear attention layers (e.g., GDN) contribute a weaker, implicit positional inductive bias. This mismatch leads the global layers to overemphasize short-range order, which benefits short contexts but makes the model less flexible when adapting mid-training to extended contexts. By contrast, Kimi Linear induces a more
Table 2: Model configurations and hyperparameters for scaling law experiments.
| # Act. Params. † | Head | Layer | Hidden | Tokens | lr | batch size ‡ |
|------------------|------|-------|--------|--------|----------------|----------------|
| 653M | 16 | 16 | 1216 | 38.8B | 2.006 × 10⁻³ | 336 |
| 878M | 18 | 18 | 1376 | 59.8B | 1.790 × 10⁻³ | 432 |
| 1.1B | 20 | 20 | 1536 | 85.2B | 1.617 × 10⁻³ | 512 |
| 1.4B | 22 | 22 | 1632 | 102.5B | 1.486 × 10⁻³ | 576 |
| 1.7B | 24 | 24 | 1776 | 128.0B | 1.371 × 10⁻³ | 640 |
Figure 5: The fitted scaling law curves for MLA and Kimi Linear.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Chart: Loss vs. PFLOP/s-days
### Overview
The image presents a scatter plot illustrating the relationship between Loss and PFLOP/s-days for two models: MLA and Kimi Linear. Both models demonstrate a decreasing loss as PFLOP/s-days increase, suggesting improved performance with increased computational resources. The plot uses a logarithmic scale for the x-axis (PFLOP/s-days).
### Components/Axes
* **X-axis:** PFLOP/s-days, labeled at the bottom. The scale is logarithmic, ranging from approximately 1 to 100 (10<sup>1</sup> to 10<sup>2</sup>).
* **Y-axis:** Loss, labeled on the left. The scale ranges from approximately 2.0 to 2.3.
* **Data Series 1:** MLA, represented by a dashed blue line with star markers. The equation for this line is given as: MLA: 2.3092 x C<sup>-0.0536</sup>.
* **Data Series 2:** Kimi Linear, represented by a dashed red line with diamond markers. The equation for this line is given as: Kimi Linear: 2.2879 x C<sup>-0.0527</sup>.
* **Legend:** Located in the top-right corner, clearly identifying each data series with its corresponding color and line style.
* **Annotation:** An arrow points to a data point near (approximately 10 PFLOP/s-days, 2.1 Loss) with the value "1.16x" written next to it.
### Detailed Analysis
**MLA (Blue, Stars):**
The MLA line slopes downward, indicating that as PFLOP/s-days increase, the loss decreases.
* Approximate data points (reading from the plot):
* (1 PFLOP/s-days, 2.28)
* (5 PFLOP/s-days, 2.18)
* (10 PFLOP/s-days, 2.12)
* (50 PFLOP/s-days, 2.03)
* (100 PFLOP/s-days, 2.00)
**Kimi Linear (Red, Diamonds):**
The Kimi Linear line also slopes downward, showing a similar trend to MLA.
* Approximate data points (reading from the plot):
* (1 PFLOP/s-days, 2.26)
* (5 PFLOP/s-days, 2.16)
* (10 PFLOP/s-days, 2.10)
* (50 PFLOP/s-days, 2.03)
* (100 PFLOP/s-days, 2.00)
The annotation "1.16x" appears to indicate a relative change or ratio at a specific point on the graph, but its exact meaning is unclear without further context.
### Key Observations
* Both MLA and Kimi Linear exhibit a negative correlation between Loss and PFLOP/s-days.
* The slopes of the two lines are very similar, suggesting that both models respond to increased computational resources in a comparable manner.
* The Kimi Linear model consistently shows slightly higher loss values than the MLA model across the observed range of PFLOP/s-days.
* The logarithmic scale on the x-axis indicates diminishing returns – the reduction in loss becomes smaller as PFLOP/s-days increase.
### Interpretation
The chart demonstrates the scaling behavior of two machine learning models (MLA and Kimi Linear). The decreasing loss with increasing PFLOP/s-days suggests that both models benefit from more computational power. The equations provided (MLA: 2.3092 x C<sup>-0.0536</sup> and Kimi Linear: 2.2879 x C<sup>-0.0527</sup>) formalize this relationship, indicating a power-law decay in loss as computational resources (C, representing PFLOP/s-days) increase. The slight difference in the coefficients (2.3092 vs. 2.2879) and exponents (-0.0536 vs. -0.0527) suggests that MLA may converge slightly faster or achieve a lower asymptotic loss than Kimi Linear, but the difference is relatively small. The annotation "1.16x" could represent a factor by which the loss decreases for a given increase in PFLOP/s-days, but its precise meaning requires additional information. The logarithmic scale highlights the concept of diminishing returns, where the benefit of additional computational resources decreases as the amount of resources increases.
</details>
balanced positional bias across layers, which improves robustness and extrapolation at long ranges, leading to stronger long-context performance. Accordingly, Kimi Linear achieves the best average score across the long-context benchmarks in Table 5, confirming the benefits claimed above.
## 5.3 Scaling Law of Kimi Linear
We conducted scaling law experiments on a series of MoE models following the Moonlight [62] architecture. In all experiments, we activated 8 out of 64 experts and utilized the Muon optimizer [62]. Details and hyperparameters are listed in Table 2.
For MLA, following the Chinchilla scaling-law methodology [37], we trained five language models of different sizes, carefully tuning their hyperparameters through grid search to ensure optimal performance for each model. For KDA, we kept the best hybrid ratio of 3:1, as ablated in Table 1, and otherwise adhered strictly to the MLA training configuration. As shown in Figure 5, Kimi Linear achieves ∼1.16× computational efficiency over the MLA baseline under compute-optimal training. We expect that careful hyperparameter tuning would yield even better scaling curves for KDA.
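The ∼1.16× figure can be sanity-checked from the fitted power laws in Figure 5 (loss $= a \cdot C^{b}$, with $C$ in PFLOP/s-days): equate the two losses and compare the compute each curve needs to reach it. Because the fitted exponents differ slightly, the ratio drifts a little with the target loss; the value below uses a loss of 2.10, a point within the fitted range.

```python
def compute_to_reach(a, b, loss):
    """Invert the power law loss = a * C**b for the required compute C."""
    return (loss / a) ** (1.0 / b)

# Fitted curves from Figure 5.
c_mla = compute_to_reach(2.3092, -0.0536, 2.10)   # MLA:         2.3092 * C^-0.0536
c_kimi = compute_to_reach(2.2879, -0.0527, 2.10)  # Kimi Linear: 2.2879 * C^-0.0527
ratio = c_mla / c_kimi  # Kimi Linear needs ~1.16x less compute at this loss
```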
## 5.4 Experimental Setup
Kimi Linear and baselines settings We evaluate our Kimi Linear model against a full-attention MLA baseline and a hybrid Gated DeltaNet (GDN-H) baseline, all of which share the same architecture, parameter count, and training setup for fair comparisons. The model configuration is largely aligned with Moonlight [62], with the key distinction that MoE sparsity is increased to 32. Each model activates 8 out of 256 experts, including one shared expert, resulting in 48 billion total parameters and 3 billion active parameters per forward pass. The first layer is implemented as a dense layer without MoE, ensuring stable training. To evaluate the effectiveness of NoPE in Kimi Linear, we also introduce a hybrid KDA baseline using RoPE with the same model configuration, referred to as Kimi Linear (RoPE).
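The expert-activation pattern (top-k routed experts plus one always-on shared expert) can be sketched as below; the softmax gate normalization and router form are our illustrative assumptions, not the released implementation.

```python
import numpy as np

def moe_forward(x, router_W, experts, shared_expert, k):
    """Route x to the top-k experts by router logits, softmax-normalize their
    gates, and always add the shared expert's output."""
    logits = router_W @ x
    topk = np.argsort(logits)[-k:]                 # indices of the k routed experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                           # normalize over selected experts
    out = shared_expert(x)                         # shared expert is always active
    for g, idx in zip(gates, topk):
        out = out + g * experts[idx](x)
    return out
```

In Kimi Linear's configuration this would correspond to 7 routed experts plus 1 shared expert out of 256 total, so only ~3B of the 48B parameters are active per forward pass.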
Evaluation Benchmarks Our evaluation encompasses five primary categories of benchmarks, each designed to assess distinct capabilities of the model:
- Language Understanding and Reasoning : Hellaswag [121], ARC-Challenge [14], Winogrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].
- Code Generation : LiveCodeBench v6 4 [44], EvalPlus [60].
- Math & Reasoning : AIME 2025, MATH 500, HMMT 2025, PolyMath-en.
- Long-context : MRCR 5 , RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13] and LongBench v2 [6].
- Chinese Language Understanding and Reasoning : C-Eval [43], and CMMLU [55].
Evaluation Configurations All models are evaluated using temperature 1.0. For benchmarks with high variance, we report Avg@k scores. For base models, we employ perplexity-based evaluation for MMLU, MMLU-Redux, GPQA-Diamond, and C-Eval; otherwise, generation-based evaluation is adopted. To mitigate the high variance inherent in GPQA-Diamond, we report the mean score across eight independent runs. All evaluations are conducted using our internal framework derived from LM-Harness-Evaluation [10], ensuring consistent settings across all models.
## 5.4.1 Pre-training recipe
All models are pretrained using a 4,096-token context window, the MuonClip optimizer, and a WSD learning-rate schedule, processing a shared total of 1.4 trillion tokens sampled from the K2 pretraining corpus [50]. The learning rate is set to 1.1 × 10⁻³, and the global batch size is fixed at 32 million tokens. All models also adopt the same annealing schedule and long-context activation phase established for Kimi K2 [50].
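A WSD (warmup-stable-decay) schedule of the kind referenced above can be sketched as follows; the warmup and decay fractions are illustrative, as the report does not state them.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay."""
    warmup = max(1, int(total_steps * warmup_frac))
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup        # linear warmup
    if step < decay_start:
        return peak_lr                              # stable plateau
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * frac      # linear decay to min_lr
```

The long stable plateau is what makes WSD convenient for continued pretraining: the decay phase can be re-run from any plateau checkpoint.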
Our final released Kimi Linear checkpoint is pretrained using the same procedure, but with an expanded total of 5.7 trillion tokens to match the pretraining tokens of Moonlight. In addition, the final checkpoint supports a context length of up to 1 million tokens. We compare the performance of Kimi Linear@5.7T and Moonlight in Appendix D.
## 5.4.2 Post-training recipe
SFT recipe The SFT dataset extends the Kimi K2 [50] SFT data by incorporating additional reasoning tasks, creating a large-scale instruction-tuning dataset that spans diverse domains with a heavy emphasis on math and coding. We employ a multi-stage SFT approach, initially training the model on a broad range of diverse SFT data for general instruction-following, followed by scheduled targeted training on reasoning-intensive data to enhance the model's reasoning capabilities.
RL recipe For the RL training prompt set, we primarily integrate three data sources: mathematics, code, and STEM. The main purpose of this enhancement is to boost the model's reasoning ability. Before conducting RL, we pre-selected data that matches a moderate difficulty level for the starting checkpoint.
A known risk of RL training is the potential degeneration of general capabilities. To mitigate this, we incorporate the PTX loss [70] during RL, following the practice of K2 [50]. This involves concurrent SFT on a high-quality, distributionally diverse dataset during the RL process. Our PTX dataset spans both reasoning and general-purpose tasks. All data mentioned above are subsets derived from the training recipe of the K2 model [50].
For the RL algorithm, we use the same algorithm as in K1.5 [95], while introducing several advanced techniques. We noticed that the precision mismatch between training and inference engines can lead to unstable RL learning; we therefore introduce truncated importance sampling, a method that effectively mitigates the policy mismatch between rollout and training [116]. We also dynamically adjust the KL penalty and the mini-batch size (i.e., the number of updates per iteration) to keep RL training stable and avoid entropy collapse [15].
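Truncated importance sampling, as referenced above, clips the per-token probability ratio between the training policy and the rollout (inference-engine) policy from above; the cap value in this sketch is a placeholder, not a reported hyperparameter.

```python
import numpy as np

def truncated_is_weights(logp_train, logp_rollout, cap=2.0):
    """Importance ratios exp(logp_train - logp_rollout), truncated at `cap`
    to bound the variance introduced by train/inference policy mismatch."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_rollout))
    return np.minimum(ratio, cap)
```

Scaling the policy-gradient loss by these weights corrects for the mismatch in expectation while the truncation keeps updates bounded when the two engines disagree sharply.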
## 5.5 Main results
## 5.5.1 Kimi Linear@1.4T results
Pretrain results We compare our Kimi Linear model against two baselines (MLA and hybrid GDN-H), all trained on the same 1.4T-token pretraining corpus, in Table 3. The evaluation focuses on three areas: general knowledge, reasoning (math and code), and Chinese tasks. Kimi Linear consistently outperforms both baselines across almost all categories.
4 Questions from 2024.8 to 2025.5
5 https://huggingface.co/datasets/openai/mrcr
- General Knowledge: Kimi Linear scores highest on key benchmarks such as BBH, MMLU, and HellaSwag.
- Reasoning: It leads in math (GSM8K) and most code tasks (CRUXEval). However, it scores slightly lower on EvalPlus compared to GDN-H.
- Chinese Tasks: Kimi Linear achieves the top scores on CEval and CMMLU.
In summary, Kimi Linear demonstrates the strongest performance, positioning it as a strong alternative to full-attention architectures in short-context pretraining.
Table 3: Performance comparison of Kimi Linear with the full-attention MLA baseline and the hybrid GDN baseline, all after the same pretraining recipe. Kimi Linear consistently outperforms both MLA and GDN-H on short-context pretrain evaluations. Best per-column results are bolded.

| Type | Base | MLA | GDN-H | Kimi Linear |
|-----------|----------------|----------|----------|-------------|
| | Trained Tokens | 1.4T | 1.4T | 1.4T |
| General | HellaSwag | 81.7 | 82.2 | **82.9** |
| General | ARC-challenge | 64.6 | 66.5 | **67.3** |
| General | Winogrande | 78.1 | 77.9 | **78.6** |
| General | BBH | 71.6 | 70.6 | **72.9** |
| General | MMLU | 71.6 | 72.2 | **73.8** |
| General | MMLU-Pro | 47.2 | 47.9 | **51.0** |
| General | TriviaQA | 68.9 | 70.1 | **71.7** |
| Reasoning | GSM8K | 83.7 | 81.7 | **83.9** |
| Reasoning | MATH | **54.7** | 54.1 | **54.7** |
| Reasoning | EvalPlus | 59.5 | **63.1** | 60.2 |
| Reasoning | CRUXEval-I-cot | 51.6 | 56.0 | **56.6** |
| Reasoning | CRUXEval-O-cot | 61.5 | 58.1 | **62.0** |
| Chinese | CEval | 79.3 | 79.1 | **79.5** |
| Chinese | CMMLU | 79.5 | 80.7 | **80.8** |
Table 4: Performance comparison of Kimi Linear with the full-attention MLA baseline and the hybrid GDN baseline, all using the same SFT recipe after pretraining. Kimi Linear consistently outperforms both MLA and GDN-H on short-context instruction-tuned benchmarks. Best per-column results are bolded.

| Instruct | MLA | GDN-H | Kimi Linear |
|---------------------------|----------|----------|-------------|
| Trained Tokens | 1.4T | 1.4T | 1.4T |
| BBH | 68.2 | 68.5 | **69.4** |
| MMLU | 75.7 | 75.6 | **77.0** |
| MMLU-Pro | 65.7 | 64.8 | **67.4** |
| MMLU-Redux | 79.2 | 78.7 | **80.3** |
| GPQA-Diamond (Avg@8) | 57.1 | 58.6 | **62.1** |
| LiveBench (Pass@1) | 45.7 | **46.4** | 45.2 |
| AIME 2025 (Avg@64) | 20.6 | 21.1 | **21.3** |
| MATH500 (Acc.) | 80.8 | **83.0** | 81.2 |
| HMMT 2025 (Avg@32) | 11.3 | 11.3 | **12.5** |
| PolyMath-en (Avg@4) | 41.3 | 41.5 | **43.6** |
| LiveCodeBench v6 (Pass@1) | 25.1 | 25.4 | **26.0** |
| EvalPlus | **62.6** | 62.5 | 61.0 |
SFT results Kimi Linear demonstrates strong performance across both general and math & code tasks after undergoing the same supervised fine-tuning (SFT) recipe, consistently outperforming MLA and GDN-H. In general tasks, Kimi Linear leads across the board, achieving the top scores on the MMLU benchmarks, BBH, and GPQA-Diamond. In math & code tasks, it surpasses both baselines on difficult benchmarks such as AIME 2025, HMMT 2025, PolyMath-en, and LiveCodeBench. Despite minor exceptions such as MATH500 and EvalPlus, Kimi Linear shows robust gains across tasks, confirming its advantage over both GDN-H and MLA.
Table 5: Comparisons of Kimi Linear with MLA, GDN-H, and Kimi Linear (RoPE) across long-context benchmarks. The last column reports the overall average ( ↑ ). All models are trained on 1.4T tokens. Best per-column results are bolded.

| | RULER | MRCR | HELMET-ICL | LongBench V2 | Frames | RepoQA | Long Code Arena (Lib) | Long Code Arena (Commit) | Avg. |
|--------------------|----------|----------|------------|--------------|----------|----------|-----------------------|--------------------------|----------|
| MLA | 81.3 | 22.6 | 88.0 | **36.1** | **60.5** | 63.0 | 32.8 | **33.2** | 52.2 |
| GDN-H | 80.5 | 23.9 | 85.5 | 32.6 | 58.7 | 63.0 | 34.7 | 30.5 | 51.2 |
| Kimi Linear (RoPE) | 78.8 | 22.0 | 88.0 | 35.4 | 59.9 | 66.5 | 31.3 | 32.5 | 51.8 |
| Kimi Linear | **84.3** | **29.6** | **90.0** | 35.0 | 58.8 | **68.5** | **37.1** | 32.7 | **54.5** |
Figure 6: The training and test accuracy curves for Kimi Linear@1.4T and MLA@1.4T during Math RL training. Kimi Linear consistently outperforms the full attention baseline by a sizable margin during the whole RL process.
Long Context Performance Evaluation We evaluate the long-context performance of Kimi Linear against three baselines (MLA, GDN-H, and Kimi Linear (RoPE)) across several benchmarks at 128k context length (see Table 5). The results highlight Kimi Linear's clear superiority in these long-context tasks: it consistently outperforms MLA and GDN-H, achieving the highest scores on RULER (84.3) and RepoQA (68.5) by a significant margin. This pattern holds across most other tasks, with the exceptions of LongBench V2 and Frames. Overall, Kimi Linear achieves the highest average score (54.5), further reinforcing its effectiveness as a leading attention architecture in long-context scenarios.
RL results To compare the RL convergence properties of Kimi Linear and MLA, we conduct RLVR using the in-house mathematics training set from [50], and evaluate on mathematics test sets (e.g., AIME 2025, MATH500), while keeping the algorithm and all hyperparameters identical to ensure a fair comparison of performance.
As shown in Figure 6, Kimi Linear demonstrates better efficiency compared to MLA. On the training set, even though both models start at similar points, the growth rate of training accuracy for Kimi Linear is significantly higher than that of MLA, and the gap gradually widens. On the test set, similar phenomena are observed. For example, on MATH500 and AIME2025, Kimi Linear achieves faster and better improvement compared to MLA. Overall, in reasoning-intensive long-form generation under RL, we empirically observe that Kimi Linear performs significantly better than MLA.
Summary of overall findings During the pretraining and SFT stages, a clear performance hierarchy was established: Kimi Linear outperformed GDN-H, which in turn outperformed MLA. However, this hierarchy shifted in long-context evaluations. While Kimi Linear maintained its top position, GDN-H's performance declined, placing it behind MLA. Furthermore, in the RL stage, Kimi Linear also demonstrated superior performance over MLA. Overall, Kimi Linear consistently ranked as the top performer across all stages, establishing itself as a superior alternative to full attention architectures.
## 5.6 Efficiency Comparison
Prefilling & Decoding speed We compare the prefilling and decoding times of full attention MLA [19], GDN-H, and Kimi Linear in Figure 7a and Figure 7b. Note that all models are based on the Kimi Linear 48B setting, with the same number of layers and attention heads. We observe that: 1) despite incorporating a more fine-grained decay mechanism, Kimi Linear introduces negligible latency overhead compared to GDN-H during prefilling. As shown in Figure 7a, their
Figure 7: (a) The prefilling time of MLA (full attention), hybrid GDN-H and our Kimi Linear. (b) The time per output token (TPOT) for MLA, GDN-H and Kimi Linear during decoding. (We use batch size = 1 here for tests.)
performance curves are virtually indistinguishable, confirming that our method maintains high efficiency. The hybrid Kimi Linear model demonstrates a clear efficiency advantage over the MLA baseline as sequence length increases. While its performance is comparable to MLA at shorter lengths (4k-16k), it becomes significantly faster from 128k onwards. This efficiency gap widens dramatically at scale, with Kimi Linear outperforming MLA by a factor of 2.3 for 512k sequences and 2.9 for 1M sequences. As shown in Figure 1b, Kimi Linear fully demonstrates its advantages during the decoding phase. For decoding at 1M context length, Kimi Linear is 6× faster than full attention.
## 6 Discussions
## 6.1 Kimi Delta Attention as learnable position embeddings
The standard attention in transformers is by design agnostic to the sequence order of its inputs [99], thus necessitating explicit positional encodings [75, 86]. Among various methods, RoPE [88] has emerged as the de facto standard in modern LLMs due to its effectiveness [98, 1, 19]. The mechanism of multiplicative positional encodings like RoPE can be analyzed through a generalized attention formulation:
$$s_{t,i} = q_t^\top \left( \prod_{j=i+1}^{t} R_j \right) k_i \quad (11)$$
where the position relationship between the $t$-th query $q_t$ and the $i$-th key $k_i$ is reflected by the cumulative matrix products. RoPE defines the transformation matrix $R_j$ as a block-diagonal matrix composed of $d_k/2$ 2D rotation matrices $R_j^k = \begin{pmatrix} \cos(j\theta_k) & -\sin(j\theta_k) \\ \sin(j\theta_k) & \cos(j\theta_k) \end{pmatrix}$ with per-2-dimension angular frequency $\theta_k$. Due to the properties of rotation matrices, i.e., $R_{t-i} = R_t^\top R_i$, the absolute positional information $R_t$ and $R_i$ can be applied separately to $q_t$ and $k_i$, which are then transformed into relative positional information $t-i$ encoded as $\prod_{j=i+1}^{t} R_j = \begin{pmatrix} \cos((t-i)\theta_k) & -\sin((t-i)\theta_k) \\ \sin((t-i)\theta_k) & \cos((t-i)\theta_k) \end{pmatrix}$.
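The two rotation-matrix identities this relies on (composition of per-step rotations into a single relative rotation, and orthogonality) can be checked numerically with a small pure-Python sketch on one 2D block; the angle and positions below are arbitrary illustrative values:

```python
import math

def rot(theta):
    # One 2D RoPE block: a rotation by angle theta
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

theta_k, t, i = 0.1, 9, 4

# Composition: the cumulative product of (t - i) per-step rotations equals
# a single rotation by (t - i) * theta_k, i.e. a purely relative encoding.
P = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(t - i):
    P = matmul(rot(theta_k), P)
rel = rot((t - i) * theta_k)

# Orthogonality: R^T inverts a rotation, which is why absolute encodings
# applied separately to q and k reduce to a relative one in the dot product.
identity = matmul(transpose(rot(t * theta_k)), rot(t * theta_k))
```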
Consequently, we show that linear attentions with the gated delta rule can be expressed in a comparable formulation in Eq. 12. Similar forms for other attention variants are summarized in Table 6.
$$o_t = \sum_{i=1}^{t} \left( q_t^\top \left( \prod_{j=i+1}^{t} A_j \left( I - \beta_j k_j k_j^\top \right) \right) k_i \right) v_i \quad (12)$$
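The equivalence behind Eq. 12 can be made concrete with a toy, pure-Python sketch: it runs the gated-delta-rule recurrence and the unrolled cumulative-product form side by side and checks they produce the same output (scalar decay $\alpha_j$ stands in for $A_j$; dimensions, seeds, and names are illustrative):

```python
import random

random.seed(0)
d, T = 3, 4  # toy head dimension and sequence length

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def transition(alpha, beta, k):
    # M_t = alpha_t * (I - beta_t k_t k_t^T): decay plus delta-rule erasure
    M = [[-beta * k[i] * k[j] for j in range(d)] for i in range(d)]
    for i in range(d):
        M[i][i] += 1.0
    return [[alpha * M[i][j] for j in range(d)] for i in range(d)]

ks = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
vs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
q = [random.gauss(0, 1) for _ in range(d)]
alphas = [random.uniform(0.8, 1.0) for _ in range(T)]
betas = [random.uniform(0.1, 0.9) for _ in range(T)]

# Recurrent form: S_t = M_t S_{t-1} + beta_t k_t v_t^T, then o_T = S_T^T q
S = [[0.0] * d for _ in range(d)]
for t in range(T):
    S = matmul(transition(alphas[t], betas[t], ks[t]), S)
    for i in range(d):
        for j in range(d):
            S[i][j] += betas[t] * ks[t][i] * vs[t][j]
o_rec = [sum(S[i][j] * q[i] for i in range(d)) for j in range(d)]

# Unrolled form (Eq. 12): o_T = sum_i beta_i (q^T M_T ... M_{i+1} k_i) v_i
o_sum = [0.0] * d
for i in range(T):
    P = [[float(a == b) for b in range(d)] for a in range(d)]
    for j in range(i + 1, T):
        P = matmul(transition(alphas[j], betas[j], ks[j]), P)
    Pk = [sum(P[a][b] * ks[i][b] for b in range(d)) for a in range(d)]
    coef = betas[i] * sum(q[a] * Pk[a] for a in range(d))
    for x in range(d):
        o_sum[x] += coef * vs[i][x]
```

The cumulative product of transition matrices plays exactly the role that the product of rotation matrices plays in Eq. 11, which is what licenses the positional-encoding reading below.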
From this perspective, GDN can be interpreted as a form of multiplicative positional encoding whose transition matrix is data-dependent, thereby relaxing the orthogonality constraint imposed by RoPE and potentially becoming more powerful [115]. 6 This offers a potential solution to the known extrapolation issues of RoPE, whose fixed frequencies can cause overfitting to context lengths seen during training [108, 72]. Some recent works adopt workarounds like partial RoPE
6 When preserving orthogonality, absolute positional encodings can be applied independently to q and k , which are then automatically transformed into relative positional encodings during the attention computation [87].
Table 6: An overview of attention mechanisms in their mathematically equivalent recurrent ( $o_t$ ) and parallel ( $O$ ) forms. We omit the normalization term and $\beta_t$ for a more concise presentation. The function $\phi$ refers to the infinite-dimensional feature map corresponding to the exponential kernel, i.e., $\phi(q)^\top \phi(k) = \exp(q^\top k)$.

| | Recurrent form | Parallel form |
|-------------------|----------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| SA [99] | $\sum_{j=1}^{t} \exp(q_t^\top k_j)\, v_j$ | $(\exp(QK^\top) \odot M)\, V$ |
| SA + RoPE [88] | $\sum_{j=1}^{t} \exp\big(q_t^\top (\prod_{s=j+1}^{t} R_s)\, k_j\big)\, v_j$ | $(\exp(R(Q)\, R(K)^\top) \odot M)\, V$ |
| LA [101] | $\sum_{j=1}^{t} (q_t^\top k_j)\, v_j$ | $(QK^\top \odot M)\, V$ |
| Mamba2 [16] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} \alpha_s)\, k_j\big)\, v_j$ | $(QK^\top \odot A \odot M)\, V$ |
| GLA [114] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} \mathrm{Diag}(\alpha_s))\, k_j\big)\, v_j$ | $((Q \odot \Gamma)(K/\Gamma)^\top \odot M)\, V$ |
| DeltaNet [84] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} (I - k_s k_s^\top))\, k_j\big)\, v_j$ | $(QK^\top \odot M)(I + KK^\top \odot M^-)^{-1} V$ |
| FoX [58] | $\sum_{j=1}^{t} \exp(q_t^\top k_j)\, (\prod_{s=j+1}^{t} \alpha_s)\, v_j$ | $(\exp(QK^\top) \odot A \odot M)\, V$ |
| DeltaFormer [125] | $\sum_{j=1}^{t} \big(\phi(q_t)^\top (\prod_{s=j+1}^{t} (I - \phi(k_s)\phi(w_s)^\top))\, \phi(k_j)\big)\, v_j$ | $(\exp(QK^\top) \odot M)(I + \exp(WK^\top) \odot M^-)^{-1} V$ |
| PaTH-FoX [115] | $\sum_{j=1}^{t} \exp\big(q_t^\top (\prod_{s=j+1}^{t} (I - w_s w_s^\top))\, k_j\big)\, (\prod_{s=j+1}^{t} \alpha_s)\, v_j$ | $(\exp((QK^\top \odot M)(I + WW^\top \odot M^-)^{-1}) \odot A \odot M)\, V$ |
| GDN [111] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} \alpha_s (I - k_s k_s^\top))\, k_j\big)\, v_j$ | $(QK^\top \odot A \odot M)(I + KK^\top \odot A \odot M^-)^{-1} V$ |
| Comba [40] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} (\alpha_s - k_s k_s^\top))\, k_j\big)\, v_j$ | $(QK^\top \odot A \odot M)(I + KK^\top \odot A_{i-1/j} \odot M^-)^{-1} V$ |
| RWKV7 [71] | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} (\mathrm{Diag}(\alpha_s) - (b_s \odot \hat{k}_s)\hat{k}_s^\top))\, k_j\big)\, v_j$ | $((Q \odot \Gamma)(K/\Gamma)^\top \odot M)(I + (\hat{K} \odot \Gamma)(\tilde{K} \odot B/\Gamma)^\top \odot M^-)^{-1} V$ |
| KDA (ours) | $\sum_{j=1}^{t} \big(q_t^\top (\prod_{s=j+1}^{t} \mathrm{Diag}(\alpha_s)(I - k_s k_s^\top))\, k_j\big)\, v_j$ | $((Q \odot \Gamma)(K/\Gamma)^\top \odot M)(I + (K \odot \Gamma)(K/\Gamma)^\top \odot M^-)^{-1} V$ |
[7] or even forgo explicit positional encodings entirely (NoPE) [49, 76, 19]. Given that GDN plays a role analogous to RoPE, we adopt NoPE for the global full-attention (MLA) layers in our model, allowing positional information to be captured dynamically by our proposed KDA module.
Moreover, a key strength of RoPE is its fine-grained positional encoding, achieved by assigning different rotation frequencies to each pair of dimensions, which functions analogously to a Nonuniform Fourier Transform [7, 41] along the feature dimension. Standard GDN, however, employs a per-head scalar decay and lacks this per-dimensional diversity, which motivates us to propose KDA with a learnable channel-wise gate.
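The difference in granularity shows up in a toy retention curve: a per-head scalar gate forgets every channel at the same rate, while a channel-wise gate lets different channels keep information over different timescales. The decay values below are illustrative, not learned parameters:

```python
T = 64  # tokens elapsed since a memory was written

# Per-head scalar decay (GDN-style): one retention factor for all channels.
alpha_head = 0.95
retention_head = alpha_head ** T

# Channel-wise decay (KDA-style): each channel has its own gate value, so a
# single state simultaneously spans short, medium, and long timescales.
alpha_channels = [0.5, 0.95, 0.999]
retention_channels = [a ** T for a in alpha_channels]
```

After 64 steps the fast channel has essentially erased its contents while the slow channel retains most of its signal, a spread a single scalar gate cannot represent.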
## 6.2 Relation to DPLR
(Gated) DeltaNet can be generalized to a more expressive Diagonal-Plus-Low-Rank (DPLR) structure, defined as D -a t b ⊤ t . This structure was also explored in models such as S4 [30], which employed a static DPLR formulation as the state transition matrix. During computation, this matrix is typically jointly diagonalized into the complex plane, thereby restricting its expressiveness to diagonal transformations [64].
While the DPLR structure introduces richer model interactions and can potentially enhance recall through its key-value update rule, it also suffers from a notable limitation: high computational cost and poor parallelizability. These drawbacks make DPLR inherently slower in large-scale or real-time scenarios, where maintaining parameter efficiency becomes a crucial design challenge.
To address this issue, KDA introduces a constrained variant of DPLR, where Eq. 1 can be rewritten as $S_t = \left( \mathrm{Diag}(\alpha_t) - \beta_t k_t k_t^\top \mathrm{Diag}(\alpha_t) \right) S_{t-1} + \beta_t k_t v_t^\top$, with the correspondence between the two given by:
$$S_t = (D - a_t b_t^\top) S_{t-1} + k_t v_t^\top, \quad \text{s.t.} \quad D = \mathrm{Diag}(\alpha_t), \; a_t = \beta_t k_t, \; b_t = k_t \odot \alpha_t.$$
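A quick numerical check of this correspondence: substituting $a_t = \beta_t k_t$ and $b_t = k_t \odot \alpha_t$ into $D - a_t b_t^\top$ entrywise recovers $(I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)$ (toy dimensions and random values, for illustration only):

```python
import random

random.seed(1)
d = 4
alpha = [random.uniform(0.5, 1.0) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]
beta = 0.7

# KDA transition: (I - beta k k^T) Diag(alpha); entry (i, j) is
# (delta_ij - beta k_i k_j) * alpha_j
kda = [[(float(i == j) - beta * k[i] * k[j]) * alpha[j] for j in range(d)]
       for i in range(d)]

# DPLR form: D - a b^T with D = Diag(alpha), a = beta k, b = k ⊙ alpha
a = [beta * x for x in k]
b = [k[j] * alpha[j] for j in range(d)]
dplr = [[(alpha[i] if i == j else 0.0) - a[i] * b[j] for j in range(d)]
        for i in range(d)]
```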
Furthermore, by sharing α t , we can factor it out as in Eq. 1, enabling a fine-grained multiplicative decay over S t in a manner similar to GLA [114], followed by a Householder-style transformation like DeltaNet [84, 112] for efficient
(a) PyTorch-style pseudo code for chunkwise DPLR.
state updating. We provide a side-by-side comparison of the chunkwise PyTorch-style pseudocode implementations for DPLR and KDA in Listing 8a and Listing 8b. The key improvements are highlighted below:
- Listing 8a Lines 13-16 vs. Listing 8b Lines 14-15: the reciprocal of the cumulative decay term $1/\Gamma$ in chunkwise form (Eq. 9) can introduce numerical instability. While this can be resolved by secondary chunking [113], it incurs additional computation and I/O overhead. By fixing $a = b = k$ in the DPLR formulation, KDA removes the need for two secondary chunking steps, substantially reducing redundant operations and improving overall efficiency.
- Listing 8a Lines 25-27, 31-32 vs. Listing 8b Lines 26, 29: KDA further eliminates roughly three matrix multiplications during inter-chunk and output computation, leading to significant kernel-level acceleration.
We further benchmark the kernel speed in Fig. 2, showing that KDA achieves nearly 2× the speed of DPLR for sequence lengths up to 64k.
## 6.3 Complexity Analysis
Training FLOPs We maintain a similar number of parameters in Kimi Linear as in the full-attention MLA. The linear projection computation remains identical to that of the global attention layer; the key distinction lies in the FLOPs associated with the attention computation itself. For simplicity, we focus on non-variable-length scenarios. Based on the implementation of the gated delta rule kernel, the theoretical FLOPs for a single attention head with head dimension $d_h$ and a fixed chunk size $C = 64$ in the gated delta rule [102] (per sequence of length $T$) are as follows:
$$\mathrm{FLOPs}_{\mathrm{KDA}}(T; C, d_h) = 6 T d_h^2 + 3 T C d_h + T C^2. \quad (13)$$
For full (global) attention, the dominant term per head is
$$\mathrm{FLOPs}_{\mathrm{Attn}}(T; d_h) = 2 T^2 d_h. \quad (14)$$
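Plugging in the paper's chunk size $C = 64$ and, for illustration, a head dimension $d_h = 128$ (matching the per-head state size quoted in the inference discussion), a short script locates the per-head crossover point between Eq. 13 and Eq. 14:

```python
def flops_kda(T, C=64, dh=128):
    # Eq. 13: linear in sequence length T
    return 6 * T * dh**2 + 3 * T * C * dh + T * C**2

def flops_attn(T, dh=128):
    # Eq. 14: quadratic in sequence length T
    return 2 * T**2 * dh

# Setting Eq. 13 equal to Eq. 14 and dividing by T gives the analytic
# crossover: T* = (6 dh^2 + 3 C dh + C^2) / (2 dh)
crossover = (6 * 128**2 + 3 * 64 * 128 + 64**2) / (2 * 128)
```

With these values the crossover falls at a few hundred tokens per head, beyond which KDA's linear term is strictly cheaper than quadratic attention.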
Inference strategy and cost The inference strategy in Kimi Linear employs a hybrid approach to optimize both computational and I/O efficiency. During the prefill phase, the model utilizes a FLOP-intensive chunk kernel (see
```
mask = torch.triu(torch.ones(BT, BT), diagonal=1)
for i in range(0, NT):
    q_i, k_i, v_i, u_i, g_i, w_i = (x[:, :, i] for x in (q, k, v, u, gc, w))
    o[:, :, i] = (q_i * g_i.exp()) @ S + Aqk @ (u_i - w_i @ S)
    decay = (g_i[:, :, -1:] - g_i).exp()
    S = S * g_i[:, :, -1, :, None].exp()
    S += (k_i * decay).transpose(-1, -2) @ v_i
return o, S
```
(b) PyTorch-style pseudo code for chunkwise KDA.
Table 7: An overview of different attention mechanisms through the lens of state-update rules and their learning objectives under the TTT framework [90]. We omit all normalizer terms and activation/kernel functions for brevity.

| | Objective $\mathcal{L}$ | Update rule $S_t = S_{t-1} - \nabla_{S_{t-1}} \mathcal{L}$ |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| LA [48] | $-\langle S_{t-1}^\top k_t, v_t \rangle$ | $S_t = S_{t-1} + k_t v_t^\top$ |
| RetNet [92] | $-\beta_t \langle S_{t-1}^\top k_t, v_t \rangle + \frac{1}{2} \big\| \sqrt{1-\alpha}\, S_{t-1} \big\|_F^2$ | $S_t = \alpha S_{t-1} + \beta_t k_t v_t^\top$ |
| Mamba2 [16] | $-\beta_t \langle S_{t-1}^\top k_t, v_t \rangle + \frac{1}{2} \big\| \sqrt{1-\alpha_t}\, S_{t-1} \big\|_F^2$ | $S_t = \alpha_t S_{t-1} + \beta_t k_t v_t^\top$ |
| GLA [114] | $-\langle S_{t-1}^\top k_t, v_t \rangle + \frac{1}{2} \big\| \sqrt{\mathrm{Diag}(1-\alpha_t)}\, S_{t-1} \big\|_F^2$ | $S_t = \mathrm{Diag}(\alpha_t) S_{t-1} + k_t v_t^\top$ |
| HGRN2 [77] | $-\langle S_{t-1}^\top (1-\alpha_t), v_t \rangle + \frac{1}{2} \big\| \sqrt{\mathrm{Diag}(1-\alpha_t)}\, S_{t-1} \big\|_F^2$ | $S_t = \mathrm{Diag}(\alpha_t) S_{t-1} + (1-\alpha_t) v_t^\top$ |
| Longhorn [59] | $\frac{1}{2} \big\| S_{t-1}^\top k_t - v_t \big\|^2_{\mathrm{Diag}(\beta_t)}$ | $S_t = \big(I - \frac{\beta_t}{1 + \beta_t k_t^\top k_t} k_t k_t^\top\big) S_{t-1} + \beta_t k_t v_t^\top$ |
| Comba [40] | $\frac{\beta_t}{2} \big\| S_{t-1}^\top k_t - v_t \big\|^2 + \frac{1}{2} \big\| \sqrt{1-\alpha_t}\, S_{t-1} \big\|_F^2$ | $S_t = (\alpha_t - \beta_t k_t \hat{k}_t^\top) S_{t-1} + \beta_t k_t v_t^\top$ |
| RWKV7 [71] | $\frac{1}{2} \big\| S_{t-1}^\top \tilde{k}_t - v_t \big\|^2 + \frac{1}{2} \big\| \sqrt{\mathrm{Diag}(1-\alpha_t)}\, S_{t-1} \big\|_F^2$ | $S_t = \big(\mathrm{Diag}(\alpha_t) - (b_t \odot \hat{k}_t)\hat{k}_t^\top\big) S_{t-1} + k_t v_t^\top$ |
| GDN [111] | $\frac{\beta_t}{2} \big\| \tilde{S}_{t-1}^\top k_t - v_t \big\|^2$ | $S_t = (I - \beta_t k_t k_t^\top)\, \alpha_t S_{t-1} + \beta_t k_t v_t^\top$ |
| KDA (ours) | $\frac{\beta_t}{2} \big\| \tilde{S}_{t-1}^\top k_t - v_t \big\|^2$ | $S_t = (I - \beta_t k_t k_t^\top)\, \mathrm{Diag}(\alpha_t) S_{t-1} + \beta_t k_t v_t^\top$ |
For GDN and KDA, the update can be viewed as a stochastic gradient descent (SGD) step on the decayed state $\tilde{S}$, that is, $S_t = \tilde{S}_{t-1} - \nabla_{\tilde{S}_{t-1}} \mathcal{L}$, where $\tilde{S}_{t-1}$ is decayed by a scalar or a fine-grained gate, respectively.
§ 3.1), while switching to the more efficient recurrent kernel (Eq. 2) for autoregressive generation. A key advantage of KDA is its ability to maintain a fixed-size state ( $d_k \times d_v$ per head, with $d_k = d_v = 128$ ) regardless of sequence length. For our hybrid model, as sequence length increases, the I/O-bound decoding time approaches the maximum hybrid efficiency ratio of 3:1 relative to full attention. This trend is reflected in Fig. 7b, where Kimi Linear achieves a 2.3× speedup at a 1M-token context. Additionally, by eliminating the need for a large, linearly scaling KV cache, Kimi Linear can reallocate memory to support larger batch sizes, enhancing overall throughput. In long-context scenarios (up to 1M tokens), this memory efficiency yields a theoretical decoding speedup of up to 6.3× (see Fig. 1b).
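To see where the memory headroom comes from, a back-of-the-envelope comparison of cache sizes in a 3:1 hybrid versus an all-MLA stack can be sketched as follows. All concrete sizes (layer count, latent dimension, head count, fp16 storage) are illustrative assumptions, not the released model's exact configuration:

```python
def mla_kv_cache_bytes(T, n_full_layers, latent_dim=576, dtype_bytes=2):
    # Full-attention layers cache one compressed latent per token per layer
    # (latent_dim is an assumed size): grows linearly with context length T.
    return T * n_full_layers * latent_dim * dtype_bytes

def kda_state_bytes(n_linear_layers, n_heads=32, dk=128, dv=128, dtype_bytes=2):
    # KDA layers keep a fixed d_k x d_v state per head, independent of T.
    return n_linear_layers * n_heads * dk * dv * dtype_bytes

# Hypothetical 48-layer model at 1M tokens: 3:1 hybrid (36 KDA + 12 MLA)
# versus all 48 layers using full attention.
hybrid = mla_kv_cache_bytes(1 << 20, 12) + kda_state_bytes(36)
full = mla_kv_cache_bytes(1 << 20, 48)
savings = 1 - hybrid / full
```

Under these assumptions the fixed KDA states are tens of megabytes in total, so the hybrid's cache is dominated by the remaining quarter of MLA layers, giving roughly the up-to-75% KV-cache reduction reported for the 3:1 layer ratio.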
## 7 Related Works
## 7.1 Efficient Subquadratic Attention
The quadratic time complexity of the standard self-attention mechanism [99] remains a fundamental bottleneck for processing long contexts in Transformer-based models. This limitation has become increasingly critical as large language models (LLMs) are now expected to handle million-token sequences for tasks such as agentic tool use and repository-level code analysis [19, 50]. To overcome this challenge, a substantial body of research has explored more efficient attention mechanisms [91, 89], which can broadly be categorized into two main directions: (1) linear attention, and (2) sparse attention.
Linear Attention reformulates the quadratic attention map into kernelized feature interactions, replacing the softmax with a positive feature map so that attention can be computed through two associative matrix products [48]. This eliminates the explicit O ( T 2 ) similarity matrix and enables linear-time computation with respect to sequence length. Subsequent work strengthens the vanilla linear attention significantly through more refined memory control, shifting from data-independent 'decay' [92, 78] to more adaptive, data-dependent mechanisms [29, 93], and refining the decay granularity from coarse headwise [16] to precise, channel-wise decay. GLA generalizes these approaches with diagonal, channel-wise gates that balance expressiveness and efficiency while retaining chunk-wise parallelism [113, 114]. Table 7 summarizes the corresponding update rules. Collectively, these methods cast attention as a compact recurrent memory updated with parallel prefix-scan operators and fused matrix multiplies, aligning well with modern accelerators [42].
A complementary view connects linear attention to fast-weight memory [84]: the state is a low-capacity associative table updated online by Hebbian-like rules [69], while slow weights amortize when to store, update, or forget [68].
In Table 7, we provide a summary of the existing efficient token mixing methods, comparing them from the perspectives of state update mechanisms and optimization objectives.
From this perspective, gating and decay serve as learnable criteria that mitigate interference and stabilize optimization [90]. Despite these advances, linear attention still lags full attention on exact copying and fine-grained selection in extreme long-context retrieval. This motivates hybrid designs (interleaving linear and full attention) and more structured updates. In particular, the gated delta rule used by GDN/KDA introduces rank-1 corrective updates to the fast-weight state, improving targeted retention while remaining parallelizable at the operator level [112].
Linear Attention with Gating Mechanism The vanilla Linear Attention [48] is known to lack the selection mechanism inherent in softmax attention [99], falling short in expressiveness. To address this, Gated Linear Attention models have emerged as memory-efficient and parallelizable alternatives [113, 114, 29]. Instead of storing an ever-expanding KV cache, these models employ a fixed-size matrix-valued state and learnable gates to selectively retain and forget information. This design achieves expressive power comparable to softmax attention [65, 125, 64] while maintaining constant time and memory complexity at inference. The general recurrent formulation of such models for the memory update $S_t \in \mathbb{R}^{d_k \times d_v}$ can be expressed as:
$$S _ { t } = A _ { t } S _ { t - 1 } + k _ { t } v _ { t } ^ { \top } , \quad o _ { t } = S _ { t } ^ { \top } q _ { t } .$$
The primary distinction among gated linear attention mechanisms lies in the parameterization of the forget gate $A_t$, as summarized in Table 7. For instance, RetNet [92] uses a data-independent scalar decay $\alpha$, and Mamba2 [16] employs a data-dependent scalar $\alpha_t$. GLA [114] instead uses a fine-grained diagonal matrix $\mathrm{Diag}(\alpha_t) \in \mathbb{R}^{d_k \times d_k}$, offering an effective trade-off between efficiency and performance. Other variants are listed in Table 7.
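The role of the gate $A_t$ can be made concrete with a NumPy sketch of the recurrence above under the three parameterizations just described. The gate values are random stand-ins; in the actual models they are learned (and, except for RetNet, input-dependent) functions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, dk, dv = 32, 8, 4
q, k = rng.normal(size=(2, T, dk))
v = rng.normal(size=(T, dv))

def run(gates):
    """Run S_t = A_t * S_{t-1} + k_t v_t^T with a diagonal gate per step.

    gates[t] is the diagonal of A_t (shape (dk,)); a scalar gate is the
    special case where all channels share one value.
    """
    S = np.zeros((dk, dv))
    out = np.empty((T, dv))
    for t in range(T):
        S = gates[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = S.T @ q[t]                 # o_t = S_t^T q_t
    return out

# RetNet-style: one data-independent scalar decay, shared by all channels
retnet = run(np.full((T, dk), 0.97))
# Mamba2-style: a data-dependent scalar per step (random stand-in here)
mamba2 = run(np.repeat(rng.uniform(0.9, 1.0, size=(T, 1)), dk, axis=1))
# GLA-style: data-dependent channel-wise (diagonal) gate
gla = run(rng.uniform(0.9, 1.0, size=(T, dk)))
```

The three variants differ only in how `gates` is produced, which is exactly the axis along which Table 7 organizes existing methods.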
Sparse Attention A separate body of work reduces the quadratic complexity of standard attention by exploiting its inherent sparsity, approximating the full attention score by performing the computation on a strategically selected subset of tokens. The central challenge lies in identifying this subset effectively without degrading model performance. Early methods often utilized efficient, training-free static patterns, such as sliding and dilated windows [20, 31, 107], or fixed patterns [120, 35], but their rigid structure often compromises model accuracy. More advanced methods determine the important positions based on the context, such as clustering [51, 106] and lightweight routing mechanisms [25, 73, 2, 9], but this dynamic selection process introduces a computational overhead that can prevent them from achieving their full theoretical speedup without dedicated kernel acceleration [21]. Some models further introduce training-free sparsification during the inference stage [107, 109].
Recent approaches to sparse attention have begun to prioritize hardware co-design, as exemplified by NSA [119, 96] and MoBA [63], which both move from token-level to chunk-level selection. In NSA, each query dynamically selects chunks based on scores produced by an MLP. The method's efficiency relies on its use of Grouped-Query Attention (GQA) [98] with a large head count (typically a multiple of 16), a configuration specifically designed to accelerate computation through highly parallelized tensor-matrix multiplications. Similarly, MoBA performs top-k chunk selection, but leverages log-sum-exp (LSE) scores computed efficiently via flash-attention kernels [17]. In contrast to NSA and MoBA, the recently proposed DeepSeek-V3.2-Exp Attention (DSA) [18] revives token-level sparsity, maintaining efficiency through a learnable full-attention indexer implemented with low-precision fp8 and a small head dimension for token selection.
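A simplified sketch of chunk-level selection in this spirit: each chunk is scored by the query's dot product with its mean-pooled keys, and attention is computed only within the top-k chunks. This is a naive dense illustration, not the fused kernels or the exact LSE-based gating of NSA/MoBA; the query is taken at the final position so that every chunk is causally valid:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, chunk, topk = 64, 16, 8, 2
q = rng.normal(size=d)            # one query at position T-1
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# 1) chunk-level scores: query affinity to the mean-pooled keys per chunk
n_chunks = T // chunk
pooled = K.reshape(n_chunks, chunk, d).mean(axis=1)   # (n_chunks, d)
scores = pooled @ q                                   # (n_chunks,)

# 2) keep only the top-k chunks
keep = np.argsort(scores)[-topk:]

# 3) attend over tokens in the selected chunks only
idx = np.concatenate([np.arange(c * chunk, (c + 1) * chunk)
                      for c in sorted(keep)])
logits = K[idx] @ q / np.sqrt(d)
w = np.exp(logits - logits.max())
w /= w.sum()
out = w @ V[idx]                  # (d,) sparse-attention output
```

Only `topk * chunk` of the `T` tokens enter the softmax, which is where the speedup comes from; the selection step itself (step 1) is what dedicated kernels must keep cheap.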
Discussion Linear attention and sparse attention represent two distinct pathways toward efficient long-context modeling. Sparse attention tends to retrieve fine-grained historical information more effectively, but this advantage comes at the cost of storing the entire KV cache for token selection, making it less efficient than linear attention models that maintain a constant-size state. Moreover, sparse attention performs only information selection, so its theoretical expressive upper bound remains that of full attention. In contrast, linear attention, grounded in the principle of 'compression as intelligence', enables generalization with a fixed-size state and, when combined with the delta learning rule, can achieve theoretically stronger expressive capacity. Although linear attention has traditionally been criticized for weak retrieval ability, this limitation can be mitigated through state expansion [23, 34, 117, 39] or related techniques. Nevertheless, despite these advantages, linear attention remains limited by current hardware implementations and the absence of optimized inference infrastructure. Our work overcomes these limitations with Kimi Linear, a powerful model integrated with vLLM for efficient inference. Our proposed KDA delivers competitive performance compared to the full-attention baseline (Table 3) and achieves over a 2 × decoding speedup at the one-million-token context (Figure 7b).

Despite their distinct approaches to efficient long-context modeling, linear attention and sparse attention are not mutually exclusive. Future work could explore hybrid models that integrate the strengths of both, leveraging the compression and generalization capabilities of linear attention with the fine-grained retrieval advantages of sparse attention to further enhance model performance and efficiency.
## 7.2 Hybrid Model
Despite their efficiency, pure linear attention models still struggle with precise memory retrieval and exact copying [45, 104]. This deficiency hinders their adoption in industrial-scale LLMs where robust long-context recall (e.g., beyond 1M tokens) and reliable tool-use over extensive code repositories are critical [50]. Recent work shows that linear attention and full attention can effectively complement each other, leading to various hybrid designs.
Intra-layer hybrid One category of hybrid architectures is the intra-layer hybrid, which adaptively fuses the outputs of different mechanisms within each layer. A common implementation fuses outputs from heterogeneous heads within each layer, such as combining standard attention with state space models (SSMs) [22, 56]. In contrast, sequence-level approaches apply distinct mechanisms to different parts of the input. For example, some use linear attention for past context and SWA for recent tokens [123, 54, 67], while NHA [24] compresses the history with GSA [124] and combines it with local sliding window context to emulate a standard attention operation.
Inter-layer Hybrid A key drawback of the intra-layer hybrid is the increased system complexity and inference overhead. The heterogeneous mechanisms require separate computational paths, complicating optimizations like distributed parallelism. To mitigate this challenge, inter-layer hybrids have become a more widely adopted and practical strategy in LLMs [66, 57, 97]. This approach involves stacking distinct layer types, such as full attention and a linear alternative, in a predefined ratio. Building on this paradigm, we implement a simple yet effective strategy: interleaving linear and full attention layers at a fixed 3:1 ratio (see § 5.2 for ablations). This regular, repeating structure simplifies KV cache management and integrates seamlessly with standard optimizations. For the linear component of our hybrid, we deviate from the common practice of using Mamba2 [16]. Instead, we employ KDA, as we found it yields superior overall performance, particularly in retrieval and copying abilities.
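A minimal sketch of such a fixed-ratio layer stack; the exact position of the full-attention layer within each group of four is an illustrative choice, and the `layer_types` helper is hypothetical, not part of any released code:

```python
def layer_types(n_layers: int, ratio: int = 3) -> list:
    """Interleave `ratio` linear-attention (KDA) layers per full-attention
    (MLA) layer, yielding the 3:1 pattern described above."""
    return ["MLA" if (i + 1) % (ratio + 1) == 0 else "KDA"
            for i in range(n_layers)]

stack = layer_types(8)
# -> ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```

The regularity of the pattern is the point: only every fourth layer holds a growing KV cache, which is what drives the up-to-75% KV memory reduction and simplifies cache management.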
Discussion Recent work indicates that hybrid models can be sensitive to adjustments in the RoPE base frequency, a vulnerability that complicates context window extension [126]. This sensitivity can hinder the model's ability to extrapolate to longer sequences. To address this challenge, recent models have trended towards solutions that incorporate No Position Embeddings (NoPE). Falcon-H [126], for example, uses an unconventionally high base frequency (e.g., b ≈ 10^11) to push its positional encoding to a near-NoPE state. Architecturally, SwanGPT [76] interleaves RoPE-based layers with NoPE-based full attention layers. Aligning with this direction, we found that hybridizing our KDA layers with NoPE full attention is also a highly effective strategy, facilitating straightforward context window extension.
## Conclusion
We introduce Kimi Linear, a hybrid linear attention architecture designed to meet the efficiency demands of agentic intelligence and test-time scaling without sacrificing quality. Central to Kimi Linear is Kimi Delta Attention (KDA), an advanced linear attention module whose channel-wise gating mechanism enhances memory control and strengthens the RNN-style layers of hybrid architectures. By interleaving KDA with global attention in a 3:1 ratio, Kimi Linear reduces memory usage by up to 75%, while achieving up to 6.3 × higher decoding throughput and outperforming full-attention baselines. Our approach provides a scalable, efficient solution for large language models, with open-source KDA kernels and pre-trained checkpoints facilitating further research.
## References
- [1] Sandhini Agarwal et al. 'gpt-oss-120b & gpt-oss-20b model card'. In: arXiv preprint arXiv:2508.10925 (2025).
- [2] Joshua Ainslie et al. 'Colt5: Faster long-range transformers with conditional computation'. In: arXiv preprint arXiv:2303.09752 (2023).
- [3] Zeyuan Allen-Zhu. 'Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers'. In: SSRN Electronic Journal (May 2025). Available at SSRN: https://ssrn.com/abstract=5240330 or http://dx.doi.org/10.2139/ssrn.5240330 . DOI: 10.2139/ssrn.5240330 .
- [4] Simran Arora et al. 'Simple linear attention language models balance the recall-throughput tradeoff'. In: Forty-first International Conference on Machine Learning . 2024. URL: https://openreview.net/forum?id=e93ffDcpH3 .
- [5] Simran Arora et al. Zoology: Measuring and Improving Recall in Efficient Language Models . 2023. arXiv: 2312.04927 [cs.CL] .
- [6] Yushi Bai et al. 'Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks'. In: arXiv preprint arXiv:2412.15204 (2024).
- [7] Federico Barbero et al. 'Round and Round We Go! What makes Rotary Positional Encodings useful?' In: Proceedings of ICLR . 2025. URL: https://openreview.net/forum?id=GtvuNrk58a .
- [8] Ali Behrouz et al. 'Atlas: Learning to optimally memorize the context at test time'. In: arXiv preprint arXiv:2505.23735 (2025).
- [9] Amanda Bertsch et al. 'Unlimiformer: Long-range transformers with unlimited length input'. In: Advances in NeurIPS 36 (2023), pp. 35522-35543.
- [10] Stella Biderman et al. 'Lessons from the trenches on reproducible evaluation of language models'. In: arXiv preprint arXiv:2405.14782 (2024).
- [11] Christian Bischof and Charles Van Loan. 'The WY Representation for Products of Householder Matrices'. In: SIAM Journal on Scientific and Statistical Computing (1987), s2-s13. URL: https://doi.org/10.1137/ 0908009 .
- [12] Aaron Blakeman et al. 'Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models'. In: arXiv preprint arXiv:2504.03624 (2025).
- [13] Egor Bogomolov et al. 'Long code arena: a set of benchmarks for long-context code models'. In: arXiv preprint arXiv:2406.11612 (2024).
- [14] Peter Clark et al. 'Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge'. In: arXiv:1803.05457v1 (2018).
- [15] Ganqu Cui et al. 'The entropy mechanism of reinforcement learning for reasoning language models'. In: arXiv preprint arXiv:2505.22617 (2025).
- [16] Tri Dao and Albert Gu. 'Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality'. In: CoRR abs/2405.21060 (2024). DOI: 10.48550/ARXIV.2405.21060 . arXiv: 2405.21060 . URL: https://doi.org/10.48550/arXiv.2405.21060 .
- [17] Tri Dao et al. 'FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness'. In: Advances in NeurIPS . 2022, pp. 16344-16359. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf .
- [18] DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention . 2025.
- [19] DeepSeek-AI et al. DeepSeek-V3 Technical Report . 2025. arXiv: 2412.19437 [cs.CL] . URL: https:// arxiv.org/abs/2412.19437 .
- [20] Jiayu Ding et al. LongNet: Scaling Transformers to 1,000,000,000 Tokens . 2023. arXiv: 2307.02486 [cs.CL] . URL: https://arxiv.org/abs/2307.02486 .
- [21] Juechu Dong et al. Flex Attention: A Programming Model for Generating Optimized Attention Kernels . 2024. arXiv: 2412.05496 [cs.LG] . URL: https://arxiv.org/abs/2412.05496 .
- [22] Xin Dong et al. Hymba: A Hybrid-head Architecture for Small Language Models . 2024. arXiv: 2411.13676 [cs.CL] . URL: https://arxiv.org/abs/2411.13676 .
- [23] Jusen Du et al. 'Mom: Linear sequence modeling with mixture-of-memories'. In: arXiv preprint arXiv:2502.13685 (2025).
- [24] Jusen Du et al. 'Native Hybrid Attention for Efficient Sequence Modeling'. In: arXiv preprint arXiv:2510.07019 (2025).
- [25] Tianyu Fu et al. 'Moa: Mixture of sparse attention for automatic large language model compression'. In: arXiv preprint arXiv:2406.14909 (2024).
- [26] Aryo Pradipta Gema et al. 'Are we done with mmlu?' In: arXiv preprint arXiv:2406.04127 (2024).
- [27] Riccardo Grazzi et al. 'Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues'. In: Proceedings of ICLR . 2025. URL: https://openreview.net/forum?id=UvTo3tVBk2 .
- [28] Joseph F. Grcar. 'How ordinary elimination became Gaussian elimination'. In: Historia Mathematica 38.2 (May 2011), pp. 163-218. ISSN: 0315-0860. DOI: 10.1016/j.hm.2010.06.003 . URL: http://dx.doi. org/10.1016/j.hm.2010.06.003 .
- [29] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces . 2023. arXiv: 2312.00752 [cs.LG] .
- [30] Albert Gu, Karan Goel, and Christopher Ré. Efficiently Modeling Long Sequences with Structured State Spaces . 2022. arXiv: 2111.00396 [cs.LG] .
- [31] Xiangming Gu et al. When Attention Sink Emerges in Language Models: An Empirical View . 2025. arXiv: 2410.10781 [cs.CL] . URL: https://arxiv.org/abs/2410.10781 .
- [32] Yuxian Gu et al. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search . 2025. arXiv: 2508.15884 [cs.CL] . URL: https://arxiv.org/abs/2508.15884 .
- [33] Daya Guo et al. 'DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning'. In: Nature 645.8081 (2025), pp. 633-638.
- [34] Han Guo et al. 'Log-linear attention'. In: arXiv preprint arXiv:2506.04761 (2025).
- [35] Qipeng Guo et al. 'Star-transformer'. In: arXiv preprint arXiv:1902.09113 (2019).
- [36] Dan Hendrycks et al. Measuring Massive Multitask Language Understanding . 2021. arXiv: 2009.03300 [cs.CY] . URL: https://arxiv.org/abs/2009.03300 .
- [37] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models . 2022. arXiv: 2203.15556 [cs.CL] . URL: https://arxiv.org/abs/2203.15556 .
- [38] Cheng-Ping Hsieh et al. 'RULER: What's the Real Context Size of Your Long-Context Language Models?' In: arXiv preprint arXiv:2404.06654 (2024).
- [39] Jiaxi Hu et al. 'Attractor memory for long-term time series forecasting: A chaos perspective'. In: Advances in NeurIPS 37 (2024), pp. 20786-20818.
- [40] Jiaxi Hu et al. 'Comba: Improving Nonlinear RNNs with Closed-loop Control'. In: arXiv preprint arXiv:2506.02475 (2025).
- [41] Ermo Hua et al. 'Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization'. In: arXiv preprint arXiv:2412.17739 (2024).
- [42] Weizhe Hua et al. 'Transformer Quality in Linear Time'. In: Proceedings of ICML . Ed. by Kamalika Chaudhuri et al. PMLR, 2022, pp. 9099-9117. URL: https://proceedings.mlr.press/v162/hua22a.html .
- [43] Yuzhen Huang et al. 'C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models'. In: Advances in NeurIPS 36 (2023), pp. 62991-63010.
- [44] Naman Jain et al. 'Livecodebench: Holistic and contamination free evaluation of large language models for code'. In: arXiv preprint arXiv:2403.07974 (2024).
- [45] Samy Jelassi et al. Repeat After Me: Transformers are Better than State Space Models at Copying . 2024. arXiv: 2402.01032 [cs.LG] .
- [46] Thierry Joffrain et al. 'Accumulating Householder transformations, revisited'. In: (2006), pp. 169-179. URL: https://doi.org/10.1145/1141885.1141886 .
- [47] Mandar Joshi et al. 'Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension'. In: arXiv preprint arXiv:1705.03551 (2017).
- [48] Angelos Katharopoulos et al. 'Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention'. In: Proceedings of ICML . Ed. by Hal Daumé III and Aarti Singh. PMLR, 2020, pp. 5156-5165. URL: https://proceedings.mlr.press/v119/katharopoulos20a.html .
- [49] Amirhossein Kazemnejad et al. 'The impact of positional encoding on length generalization in transformers'. In: Advances in NeurIPS 36 (2023), pp. 24892-24928.
- [50] Team Kimi et al. 'Kimi k2: Open agentic intelligence'. In: arXiv preprint arXiv:2507.20534 (2025).
- [51] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 'Reformer: The efficient transformer'. In: arXiv preprint arXiv:2001.04451 (2020).
- [52] Satyapriya Krishna et al. 'Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation'. In: arXiv preprint arXiv:2409.12941 (2024).
- [53] Hanyu Lai et al. 'A Survey of Post-Training Scaling in Large Language Models'. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . 2025, pp. 2771-2791.
- [54] Disen Lan et al. 'Liger: Linearizing Large Language Models to Gated Recurrent Structures'. In: arXiv preprint arXiv:2503.01496 (2025).
- [55] Haonan Li et al. 'CMMLU: Measuring massive multitask language understanding in Chinese'. In: Findings of the Association for Computational Linguistics: ACL 2024 . Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 11260-11285. DOI: 10.18653/v1/2024.findings-acl.671 . URL: https://aclanthology.org/2024.findings-acl.671/ .
- [56] Yixing Li et al. 'Transmamba: Flexibly switching between transformer and mamba'. In: arXiv preprint arXiv:2503.24067 (2025).
- [57] Opher Lieber et al. Jamba: A Hybrid Transformer-Mamba Language Model . 2024. arXiv: 2403.19887 [cs.CL] .
- [58] Zhixuan Lin et al. 'Forgetting transformer: Softmax attention with a forget gate'. In: arXiv preprint arXiv:2503.02130 (2025).
- [59] Bo Liu et al. 'Longhorn: State Space Models are Amortized Online Learners'. In: ArXiv abs/2407.14207 (2024). URL: https://api.semanticscholar.org/CorpusID:271310065 .
- [60] Jiawei Liu et al. 'Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation'. In: Thirty-seventh Conference on NeurIPS . 2023. URL: https://openreview. net/forum?id=1qvx610Cu7 .
- [61] Jiawei Liu et al. 'Repoqa: Evaluating long context code understanding'. In: arXiv preprint arXiv:2406.06025 (2024).
- [62] Jingyuan Liu et al. Muon is Scalable for LLM Training . 2025. arXiv: 2502.16982 [cs.LG] . URL: https: //arxiv.org/abs/2502.16982 .
- [63] Enzhe Lu et al. MoBA: Mixture of Block Attention for Long-Context LLMs . 2025. arXiv: 2502.13189 [cs.LG] . URL: https://arxiv.org/abs/2502.13189 .
- [64] William Merrill, Jackson Petty, and Ashish Sabharwal. 'The illusion of state in state-space models'. In: arXiv preprint arXiv:2404.08819 (2024).
- [65] William Merrill and Ashish Sabharwal. 'The Parallelism Tradeoff: Limitations of Log-Precision Transformers'. In: Transactions of the Association for Computational Linguistics 11 (2023), pp. 531-545. DOI: 10.1162/tacl_a_00562 . URL: https://aclanthology.org/2023.tacl-1.31/ .
- [66] MiniMax et al. MiniMax-01: Scaling Foundation Models with Lightning Attention . 2025. arXiv: 2501.08313 [cs.CL] .
- [67] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention . 2024. arXiv: 2404.07143 [cs.CL] .
- [68] Tsendsuren Munkhdalai and Adam Trischler. Metalearning with Hebbian Fast Weights . 2018. arXiv: 1807.05076 [cs.NE] . URL: https://arxiv.org/abs/1807.05076 .
- [69] Tsendsuren Munkhdalai et al. 'Metalearned Neural Memory'. In: ArXiv abs/1907.09720 (2019). URL: https://api.semanticscholar.org/CorpusID:198179407 .
- [70] Long Ouyang et al. 'Training language models to follow instructions with human feedback'. In: Advances in NeurIPS 35 (2022), pp. 27730-27744.
- [71] Bo Peng et al. RWKV-7 "Goose" with Expressive Dynamic State Evolution . 2025. arXiv: 2503.14456 [cs.CL] .
- [72] Bowen Peng et al. 'Yarn: Efficient context window extension of large language models'. In: arXiv preprint arXiv:2309.00071 (2023).
- [73] Piotr Piękos, Róbert Csordás, and Jürgen Schmidhuber. 'Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing'. In: arXiv preprint arXiv:2505.00315 (2025).
- [74] Aske Plaat et al. 'Reasoning with large language models, a survey'. In: CoRR (2024).
- [75] Ofir Press, Noah Smith, and Mike Lewis. 'Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation'. In: Proceedings of ICLR . 2022. URL: https://openreview.net/forum?id=R8sQPpGCv0 .
- [76] Krishna C. Puvvada et al. SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling . 2025. arXiv: 2504.08719 [cs.CL] .
- [77] Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion . 2024. arXiv: 2404.07904 [cs.CL] .
- [78] Zhen Qin et al. TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer . 2024. arXiv: 2307.14995 [cs.CL] .
- [79] Zihan Qiu et al. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free . 2025. arXiv: 2505.06708 [cs.CL] .
- [80] Xiaoye Qu et al. 'A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond'. In: arXiv preprint arXiv:2503.21614 (2025).
- [81] Qwen Team. Qwen3-Next: Towards Ultimate Training & Inference Efficiency . Accessed: 2025-10-27. Sept. 2025.
- [82] David Rein et al. 'Gpqa: A graduate-level google-proof q&a benchmark'. In: First Conference on Language Modeling . 2024.
- [83] Keisuke Sakaguchi et al. WinoGrande: An Adversarial Winograd Schema Challenge at Scale . 2019. arXiv: 1907.10641 [cs.CL] . URL: https://arxiv.org/abs/1907.10641 .
- [84] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. 'Linear Transformers Are Secretly Fast Weight Programmers'. In: Proceedings of ICML . Ed. by Marina Meila and Tong Zhang. PMLR, 2021, pp. 9355-9366. URL: https://proceedings.mlr.press/v139/schlag21a.html .
- [85] Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning Associative Inference Using Fast Weight Memory . 2021. arXiv: 2011.07831 [cs.LG] . URL: https://arxiv.org/abs/2011.07831 .
- [86] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations . 2018. arXiv: 1803.02155 [cs.CL] .
- [87] Jianlin Su. Linear Attention: A Brief History of Imitation, Innovation, and Feedback . June 2025. URL: https://kexue.fm/archives/11033 .
- [88] Jianlin Su et al. 'Roformer: Enhanced transformer with rotary position embedding'. In: Neurocomputing 568 (2024), p. 127063.
- [89] Weigao Sun et al. Speed Always Wins: A Survey on Efficient Architectures for Large Language Models . 2025. arXiv: 2508.09834 [cs.CL] . URL: https://arxiv.org/abs/2508.09834 .
- [90] Yu Sun et al. 'Learning to (Learn at Test Time): RNNs with Expressive Hidden States'. In: ArXiv abs/2407.04620 (2024). URL: https://api.semanticscholar.org/CorpusID:271039606 .
- [91] Yutao Sun et al. 'Efficient attention mechanisms for large language models: A survey'. In: arXiv preprint arXiv:2507.19595 (2025).
- [92] Yutao Sun et al. Retentive Network: A Successor to Transformer for Large Language Models . 2023. arXiv: 2307.08621 [cs.CL] .
- [93] Yutao Sun et al. You Only Cache Once: Decoder-Decoder Architectures for Language Models . 2024. arXiv: 2405.05254 [cs.CL] . URL: https://arxiv.org/abs/2405.05254 .
- [94] Mirac Suzgun et al. 'Challenging big-bench tasks and whether chain-of-thought can solve them'. In: arXiv preprint arXiv:2210.09261 (2022).
- [95] Kimi Team et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs . 2025. arXiv: 2501.12599 [cs.AI] . URL: https://arxiv.org/abs/2501.12599 .
- [96] MiniCPM Team et al. MiniCPM4: Ultra-Efficient LLMs on End Devices . 2025. arXiv: 2506.07900 [cs.CL] . URL: https://arxiv.org/abs/2506.07900 .
- [97] Tencent Hunyuan Team et al. 'Hunyuan-turbos: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought'. In: arXiv preprint arXiv:2505.15431 (2025).
- [98] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models . 2023. arXiv: 2302.13971 [cs.CL] .
- [99] Ashish Vaswani et al. 'Attention is All you Need'. In: Advances in NeurIPS . Ed. by I. Guyon et al. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf .
- [100] Roger Waleffe et al. An Empirical Study of Mamba-based Language Models . 2024. arXiv: 2406.07887 [cs.LG] . URL: https://arxiv.org/abs/2406.07887 .
- [101] Sinong Wang et al. Linformer: Self-Attention with Linear Complexity . 2020. arXiv: 2006.04768 [cs.LG] .
- [102] Yaoyu Wang. Understanding DeltaNet from the Perspective of Inference Frameworks . May 2025. URL: https://yywangcs.notion.site/DeltaNet-1fefc9f5d80580a496f8eb406a496f09 .
- [103] Yubo Wang et al. 'Mmlu-pro: A more robust and challenging multi-task language understanding benchmark'. In: Advances in NeurIPS 37 (2024), pp. 95266-95290.
- [104] Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. 'Rnns are not transformers (yet): The key bottleneck on in-context retrieval'. In: arXiv preprint arXiv:2402.18510 (2024).
- [105] Colin White et al. 'Livebench: A challenging, contamination-free llm benchmark'. In: arXiv preprint arXiv:2406.19314 4 (2024).
- [106] Yuhuai Wu et al. 'Memorizing transformers'. In: arXiv preprint arXiv:2203.08913 (2022).
- [107] Guangxuan Xiao et al. 'Efficient streaming language models with attention sinks'. In: arXiv preprint arXiv:2309.17453 (2023).
- [108] Wenhan Xiong et al. Effective Long-Context Scaling of Foundation Models . 2023. arXiv: 2309.16039 [cs.CL] . URL: https://arxiv.org/abs/2309.16039 .
- [109] Ruyi Xu et al. 'Xattention: Block sparse attention with antidiagonal scoring'. In: arXiv preprint arXiv:2503.16428 (2025).
- [110] Bowen Yang et al. Rope to Nope and Back Again: A New Hybrid Attention Strategy . 2025. arXiv: 2501.18795 [cs.CL] . URL: https://arxiv.org/abs/2501.18795 .
- [111] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 'Gated Delta Networks: Improving Mamba2 with Delta Rule'. In: Proceedings of ICLR . 2025. URL: https://openreview.net/forum?id=r8H7xhYPwz .
- [112] Songlin Yang and Bailin Wang. 'Parallelizing Linear Transformers with the Delta Rule over Sequence Length'. In: ArXiv abs/2406.06484 (2024). URL: https://api.semanticscholar.org/CorpusID:270371554 .
- [113] Songlin Yang and Yu Zhang. FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism . 2024. URL: https://github.com/fla-org/flash-linear-attention .
- [114] Songlin Yang et al. 'Gated Linear Attention Transformers with Hardware-Efficient Training'. In: Proceedings of ICML . PMLR, 2024.
- [115] Songlin Yang et al. 'PaTH Attention: Position Encoding via Accumulating Householder Transformations'. In: arXiv preprint arXiv:2505.16381 (2025).
- [116] Feng Yao et al. Your Efficient RL Framework Secretly Brings You Off-Policy RL Training . Aug. 2025. URL: https://fengyao.notion.site/off-policy-rl .
- [117] Morris Yau et al. 'Sequential-Parallel Duality in Prefix Scannable Models'. In: arXiv preprint arXiv:2506.10918 (2025).
- [118] Howard Yen et al. 'HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly'. In: International Conference on Learning Representations (ICLR) . 2025.
- [119] Jingyang Yuan et al. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention . 2025. arXiv: 2502.11089 [cs.CL] . URL: https://arxiv.org/abs/2502.11089 .
- [120] Manzil Zaheer et al. 'Big bird: Transformers for longer sequences'. In: Advances in NeurIPS 33 (2020), pp. 17283-17297.
- [121] Rowan Zellers et al. 'HellaSwag: Can a Machine Really Finish Your Sentence?' In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics . 2019.
- [122] Biao Zhang and Rico Sennrich. 'Root mean square layer normalization'. In: Advances in NeurIPS 32 (2019).
- [123] Michael Zhang et al. 'Lolcats: On low-rank linearizing of large language models'. In: arXiv preprint arXiv:2410.10254 (2024).
- [124] Yu Zhang et al. Gated Slot Attention for Efficient Linear-Time Sequence Modeling . 2024. arXiv: 2409.07146 [cs.CL] .
- [125] Shu Zhong et al. 'Understanding Transformer from the Perspective of Associative Memory'. In: arXiv preprint arXiv:2505.19488 (2025).
- [126] Jingwei Zuo et al. Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance . 2025. arXiv: 2507.22448 [cs.CL] . URL: https://arxiv.org/abs/2507.22448 .
## A Contributions
The authors are listed in order of the significance of their contributions, with those in project leadership roles appearing last. The project was developed at Moonshot AI with several external collaborators, who are marked with #. Names marked with an asterisk (*) indicate people who are no longer part of our team.
| Yu Zhang 1 | Junjie Yan |
|-----------------|---------------|
| Zongyu Lin ∗ | Zhejun Jiang |
| Xingcheng Yao | Weixiao Huang |
| Jiaxi Hu 2 | Bohong Yin |
| Fanqing Meng | Jiacheng You |
| Chengyin Liu | Chu Wei |
| Xin Men | Zhengtao Wang |
| Songlin Yang #3 | Chao Hong |
| Zhiyuan Li | Yutian Chen |
| Wentao Li | Guanduo Chen |
| Enzhe Lu | Yucheng Wang |
| Weizhou Liu | Huabin Zheng |
| Yanru Chen | Feng Wang |
| Weixin Xu | Yibo Liu |
| Longhui Yu | Mengnan Dong |
| Yejie Wang | Zheng Zhang |
| Yu Fan | Siyuan Pan |
| Longguang Zhong | Wenhao Wu |
| Enming Yuan | Yuhao Wu |
| Dehao Zhang | Longyu Guan |
| Yizhi Zhang | Jiawen Tao |
| T.Y. Liu | Guohong Fu #1 |
| Haiming Wang | Xinran Xu |
| Shengjun Fang | Yuzhi Wang |
| Weiran He | Guokun Lai |
| Shaowei Liu | Yuxin Wu |
| Yiwei Li | Xinyu Zhou |
| Jianlin Su | Zhilin Yang |
| Jiezhong Qiu 4 | Yulun Du |
| Bo Pang | |
## B Derivations for Chunkwise Parallelism of KDA
We first recall the recurrent form of KDA:
$$S_{[t]}^{r} = \underbrace{\left(\prod_{i=1}^{r}\left(I - \beta_{[t]}^{i} k_{[t]}^{i} k_{[t]}^{i\top}\right)\mathrm{Diag}(\alpha_{[t]}^{i})\right)}_{:=P_{[t]}^{r}} \cdot S_{[t]}^{0} + \underbrace{\sum_{i=1}^{r}\left(\prod_{j=i+1}^{r}\left(I - \beta_{[t]}^{j} k_{[t]}^{j} k_{[t]}^{j\top}\right)\mathrm{Diag}(\alpha_{[t]}^{j})\right) \cdot \beta_{[t]}^{i} k_{[t]}^{i} v_{[t]}^{i\top}}_{:=H_{[t]}^{r}}$$
Our goal is to transform $P_{[t]}^{r}$ and $H_{[t]}^{r}$ into matrix forms suitable for parallel computation.

We show that $P_{[t]}^{r}$, which involves a cumulative product of generalized Householder matrices, can be computed efficiently using the classic WY representation.

Proposition 1. The matrix $P_{[t]}^{r}$ can be expressed as:
$$P_{[t]}^{r} = \mathrm{Diag}(\gamma_{[t]}^{r}) - \sum_{i=1}^{r} \mathrm{Diag}(\gamma_{[t]}^{i \rightarrow r})\, k_{[t]}^{i} w_{[t]}^{i\top}$$
where the auxiliary vector $w_{[t]}^{r} \in \mathbb{R}^{d_k}$ is computed via the following recurrence relation:
$$w_{[t]}^{r} = \beta_{[t]}^{r}\left(\mathrm{Diag}(\gamma_{[t]}^{r})\, k_{[t]}^{r} - \sum_{i=1}^{r-1} w_{[t]}^{i}\left(k_{[t]}^{i\top}\, \mathrm{Diag}\left(\gamma_{[t]}^{i \rightarrow r}\right) k_{[t]}^{r}\right)\right) \quad (17)$$
Proof. We proceed by induction on $r$.

Base Case: For $r = 1$, $P_{[t]}^{1} = \left(I - \beta_{[t]}^{1} k_{[t]}^{1} k_{[t]}^{1\top}\right)\mathrm{Diag}(\alpha_{[t]}^{1}) = \mathrm{Diag}(\gamma_{[t]}^{1}) - k_{[t]}^{1} w_{[t]}^{1\top}$ with $\gamma_{[t]}^{1} = \alpha_{[t]}^{1}$ and $w_{[t]}^{1} = \beta_{[t]}^{1}\, \mathrm{Diag}(\gamma_{[t]}^{1})\, k_{[t]}^{1}$, which matches the claim since $\gamma_{[t]}^{1 \rightarrow 1} = \mathbf{1}$.

Inductive Step: Assume the proposition holds for $r-1$, i.e., $P_{[t]}^{r-1} = \mathrm{Diag}(\gamma_{[t]}^{r-1}) - \sum_{i=1}^{r-1} \mathrm{Diag}(\gamma_{[t]}^{i \rightarrow r-1})\, k_{[t]}^{i} w_{[t]}^{i\top}$. We now derive:
$$\begin{aligned} P_{[t]}^{r} &= \left(I - \beta_{[t]}^{r} k_{[t]}^{r} k_{[t]}^{r\top}\right)\mathrm{Diag}(\alpha_{[t]}^{r})\, P_{[t]}^{r-1} \\ &= \mathrm{Diag}(\gamma_{[t]}^{r}) - \sum_{i=1}^{r-1} \mathrm{Diag}(\gamma_{[t]}^{i \rightarrow r})\, k_{[t]}^{i} w_{[t]}^{i\top} - \beta_{[t]}^{r} k_{[t]}^{r}\left(k_{[t]}^{r\top}\, \mathrm{Diag}(\gamma_{[t]}^{r}) - \sum_{i=1}^{r-1}\left(k_{[t]}^{r\top}\, \mathrm{Diag}(\gamma_{[t]}^{i \rightarrow r})\, k_{[t]}^{i}\right) w_{[t]}^{i\top}\right) \\ &= \mathrm{Diag}(\gamma_{[t]}^{r}) - \sum_{i=1}^{r} \mathrm{Diag}(\gamma_{[t]}^{i \rightarrow r})\, k_{[t]}^{i} w_{[t]}^{i\top}, \end{aligned}$$

where $\gamma_{[t]}^{r} = \alpha_{[t]}^{r} \odot \gamma_{[t]}^{r-1}$ and $\gamma_{[t]}^{i \rightarrow r} = \alpha_{[t]}^{r} \odot \gamma_{[t]}^{i \rightarrow r-1}$, and the last line applies the recurrence (17) for $w_{[t]}^{r}$ together with $\gamma_{[t]}^{r \rightarrow r} = \mathbf{1}$.
The inductive step holds, which completes the proof. ■

Similar to $P_{[t]}^{r}$, $H_{[t]}^{r}$ can also be expressed in a parallelizable form.
Proposition 2. The matrix $H_{[t]}^{r}$ can be expressed as:
$$H_{[t]}^{r} = \sum_{i=1}^{r} \mathrm{Diag}\left(\gamma_{[t]}^{i \rightarrow r}\right) k_{[t]}^{i} u_{[t]}^{i\top}$$
where the auxiliary vector $u_{[t]}^{r} \in \mathbb{R}^{d_v}$ is computed via the following recurrence relation:
$$\pm b { u } ^ { r } _ { [ t ] } = \beta ^ { r } _ { [ t ] } \left ( v ^ { r } _ { [ t ] } - \sum _ { i = 1 } ^ { r - 1 } u ^ { i } _ { [ t ] } \left ( k ^ { i \top } _ { [ t ] } \text {diag} \left ( \gamma ^ { i \to r } _ { [ t ] } \right ) k ^ { r } _ { [ t ] } \right ) \right )$$
Proof. We again use induction on $r$.

Base Case ($r = 1$): $H_{[t]}^{1} = \beta_{[t]}^{1} k_{[t]}^{1} v_{[t]}^{1\top} = \mathrm{Diag}(\gamma_{[t]}^{1\rightarrow 1}) k_{[t]}^{1} u_{[t]}^{1\top}$ with $u_{[t]}^{1} = \beta_{[t]}^{1} v_{[t]}^{1}$ and $\gamma_{[t]}^{1\rightarrow 1} = \mathbf{1}$.

Inductive Step: Assume the proposition holds for $r-1$. Using $\mathrm{Diag}(\alpha_{[t]}^{r}) \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r-1}) = \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r})$, we derive:
$$\begin{aligned}
H_{[t]}^{r} &= \left( I - \beta_{[t]}^{r} k_{[t]}^{r} k_{[t]}^{r\top} \right) \mathrm{Diag}(\alpha_{[t]}^{r})\, H_{[t]}^{r-1} + \beta_{[t]}^{r} k_{[t]}^{r} v_{[t]}^{r\top} \\
&= \left( I - \beta_{[t]}^{r} k_{[t]}^{r} k_{[t]}^{r\top} \right) \sum_{i=1}^{r-1} \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r}) k_{[t]}^{i} u_{[t]}^{i\top} + \beta_{[t]}^{r} k_{[t]}^{r} v_{[t]}^{r\top} \\
&= \sum_{i=1}^{r-1} \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r}) k_{[t]}^{i} u_{[t]}^{i\top} + k_{[t]}^{r} \underbrace{\beta_{[t]}^{r} \left( v_{[t]}^{r} - \sum_{i=1}^{r-1} u_{[t]}^{i} \left( k_{[t]}^{i\top} \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r}) k_{[t]}^{r} \right) \right)^{\top}}_{= u_{[t]}^{r\top}} \\
&= \sum_{i=1}^{r} \mathrm{Diag}(\gamma_{[t]}^{i\rightarrow r}) k_{[t]}^{i} u_{[t]}^{i\top}.
\end{aligned}$$
The inductive step holds.
■
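Proposition 2 admits the same kind of numerical sanity check. The NumPy snippet below (ours, not the paper's kernel) evolves $H$ token by token with the recurrence $H^{t} = (I - \beta^{t} k^{t} k^{t\top})\,\mathrm{Diag}(\alpha^{t})\, H^{t-1} + \beta^{t} k^{t} v^{t\top}$ and compares against the parallelizable form, again using $\gamma^{i\rightarrow r} = \gamma^{r} / \gamma^{i}$ elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_k, d_v = 4, 6, 3
k = rng.normal(size=(r, d_k))
v = rng.normal(size=(r, d_v))
beta = rng.uniform(0.1, 1.0, size=r)
alpha = rng.uniform(0.5, 1.0, size=(r, d_k))

# Direct recurrence: H^t = (I - b_t k_t k_t^T) Diag(a_t) H^{t-1} + b_t k_t v_t^T
H_direct = np.zeros((d_k, d_v))
for t in range(r):
    H_direct = (np.eye(d_k) - beta[t] * np.outer(k[t], k[t])) @ np.diag(alpha[t]) @ H_direct \
               + beta[t] * np.outer(k[t], v[t])

# Parallelizable form of Proposition 2, with u from its recurrence
gamma = np.cumprod(alpha, axis=0)          # gamma^t (elementwise cumulative decay)
u = np.zeros((r, d_v))
for t in range(r):
    acc = np.zeros(d_v)
    for i in range(t):
        acc += u[i] * ((k[i] * (gamma[t] / gamma[i])) @ k[t])
    u[t] = beta[t] * (v[t] - acc)

H_wy = np.zeros((d_k, d_v))
for i in range(r):
    H_wy += np.outer((gamma[-1] / gamma[i]) * k[i], u[i])

assert np.allclose(H_direct, H_wy)
```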
## C Pseudo Code for chunkwise KDA
Listing 1: Pseudo PyTorch-style code snippet for KDA chunked form.
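The original listing did not survive extraction. As a stand-in, the sketch below (NumPy rather than PyTorch for self-containedness; function names and shapes are ours, and the nested Python loops replace the chunk-parallel matrix operations a real kernel would use) assembles Propositions 1 and 2 into a single per-chunk state update $S' = P^{C} S + H^{C}$ and verifies it against the token-by-token recurrence.

```python
import numpy as np

def kda_chunk_update(S, k, v, beta, alpha):
    """One-chunk KDA state update via the WY forms of Propositions 1-2.

    S: (d_k, d_v) incoming state; k: (C, d_k); v: (C, d_v);
    beta: (C,); alpha: (C, d_k) per-channel decay in (0, 1].
    """
    C, d_k = k.shape
    d_v = v.shape[1]
    gamma = np.cumprod(alpha, axis=0)               # gamma^t = prod_{i<=t} alpha^i
    w = np.zeros((C, d_k))
    u = np.zeros((C, d_v))
    for t in range(C):                              # recurrences for w (Eq. 17) and u
        aw, au = np.zeros(d_k), np.zeros(d_v)
        for i in range(t):
            s = (k[i] * (gamma[t] / gamma[i])) @ k[t]   # k_i^T Diag(gamma^{i->t}) k_t
            aw += w[i] * s
            au += u[i] * s
        w[t] = beta[t] * (gamma[t] * k[t] - aw)
        u[t] = beta[t] * (v[t] - au)
    # Assemble P^C and H^C, then apply the chunk transition.
    P = np.diag(gamma[-1])
    H = np.zeros((d_k, d_v))
    for i in range(C):
        ki = (gamma[-1] / gamma[i]) * k[i]          # Diag(gamma^{i->C}) k_i
        P -= np.outer(ki, w[i])
        H += np.outer(ki, u[i])
    return P @ S + H

def kda_stepwise(S, k, v, beta, alpha):
    """Reference token-by-token recurrence for the same chunk."""
    d_k = k.shape[1]
    for t in range(len(k)):
        S = (np.eye(d_k) - beta[t] * np.outer(k[t], k[t])) @ np.diag(alpha[t]) @ S \
            + beta[t] * np.outer(k[t], v[t])
    return S

rng = np.random.default_rng(1)
C, d_k, d_v = 8, 5, 3
k = rng.normal(size=(C, d_k))
v = rng.normal(size=(C, d_v))
beta = rng.uniform(0.1, 1.0, size=C)
alpha = rng.uniform(0.5, 1.0, size=(C, d_k))
S0 = rng.normal(size=(d_k, d_v))
assert np.allclose(kda_chunk_update(S0, k, v, beta, alpha),
                   kda_stepwise(S0, k, v, beta, alpha))
```

The production kernel batches the inner loops into (masked) triangular matrix products per chunk; this sketch only mirrors the math.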
## D Kimi Linear@5.7T results
Following Moonlight, we also trained Kimi Linear on an extended 5.7T-token dataset to demonstrate its effectiveness. With 3× sparsity and the new attention architecture design, Kimi Linear consistently outperforms Moonlight across nearly all benchmarks, underscoring the efficacy of the new architecture. The results are shown in Table 8 for the base model and Table 9 for the instruction-tuned model. Moonlight-Instruct was not evaluated ('-') on tasks exceeding its 8K context limit.
Kimi Linear@5.7T obtains a score of 94.8 on RULER at 1M context length. This long context performance reinforces that Kimi Linear is a promising alternative to full-attention architectures, delivering comparable or superior results while potentially offering more efficient resource utilization.
Table 8: Performance of Kimi-Linear-Base and Moonlight-Base across diverse tasks.
| Benchmark | #Shots | Kimi-Linear-Base | Moonlight-Base |
|----------------------|----------|--------------------|------------------|
| Architecture | - | MoE | MoE |
| # Activated Params | - | 3B | 3B |
| # Total Params | - | 48B | 16B |
| Trained Tokens | - | 5.7T | 5.7T |
| TriviaQA | 5-shot | 75.2 | 66.2 |
| SimpleQA | 5-shot | 10.1 | 5.6 |
| MMLU-Pro | 5-shot | 54.8 | 42.4 |
| MMLU-redux | 5-shot | 79.7 | 73.8 |
| WinoGrande | 5-shot | 81.5 | 74.6 |
| GPQA-Diamond (avg@8) | 5-shot | 40.4 | 35.2 |
| MATH | 4-shot | 58.5 | 45.3 |
| GSM8k | 8-shot | 86.3 | 77.2 |
| GSM8k-platinum | 8-shot | 89.6 | 79.4 |
| CMATH | 6-shot | 85.5 | 79.6 |
| CRUXEval-I-cot | 0-shot | 61.0 | 45.9 |
| CRUXEval-O-cot | 0-shot | 67.0 | 46.6 |
| LiveCodeBench (v6) | 1-shot | 20.0 | 14.3 |
| EvalPlus | - | 64.9 | 50.3 |
| C-Eval | 5-shot | 83.3 | 77.6 |
| CSimpleQA | 5-shot | 53.5 | 34.7 |
Table 9: Performance of Kimi-Linear-Instruct and Moonlight-Instruct across diverse tasks.
| Benchmark | Kimi-Linear-Instruct | Moonlight-Instruct |
|---------------------------|------------------------|----------------------|
| Architecture | MoE | MoE |
| # Activated Params | 3B | 3B |
| # Total Params | 48B | 16B |
| Trained Tokens | 5.7T | 5.7T |
| RULER@128k | 95.4 | - |
| RULER@1M | 94.8 | - |
| GPQA-Diamond (Avg@8) | 71.7 | 24.7 |
| MMLU-Redux (EM) | 86.9 | 66.9 |
| MMLU-Pro (EM) | 72.7 | 43.8 |
| FaithJudge (1-Hallu.) | 64.2 | 56.0 |
| AIME 2025 (Avg@64) | 58.6 | - |
| MATH500 (Acc.) | 94.6 | 58.0 |
| HMMT2025 (Avg@32) | 44.5 | - |
| LiveCodeBench v6 (Pass@1) | 45.7 | 11.9 |
| OJBench (Pass@1) | 14.2 | - |
| HumanEval+ (Pass@1) | 70.9 | 46.3 |
| MBPP+ (Pass@1) | 72.4 | 56.3 |