# MoBA: Mixture of Block Attention for Long-Context LLMs
> Co-corresponding authors: Mingxing Zhang (zhang_mingxing@mail.tsinghua.edu.cn), Xinyu Zhou (zhouxinyu@moonshot.cn), Jiezhong Qiu (jiezhongqiu@outlook.com)
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic growth in computational complexity inherent in traditional attention mechanisms presents prohibitive overhead. Existing approaches either impose strongly biased structures, such as task-specific sink or window attention, or radically modify the attention mechanism into linear approximations, whose performance on complex reasoning tasks remains inadequately explored.
In this work, we propose a solution that adheres to the “less structure” principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
1 Introduction
The pursuit of artificial general intelligence (AGI) has driven the development of large language models (LLMs) to unprecedented scales, with the promise of handling complex tasks that mimic human cognition. A pivotal capability for achieving AGI is the ability to process, understand, and generate long sequences, which is essential for a wide range of applications, from historical data analysis to complex reasoning and decision-making processes. This growing demand for extended context processing can be seen not only in the popularity of long input prompt understanding, as showcased by models like Kimi \parencite kimi, Claude \parencite claude and Gemini \parencite reid2024gemini, but also in recent explorations of long chain-of-thought (CoT) output capabilities in Kimi k1.5 \parencite team2025kimi, DeepSeek-R1 \parencite guo2025deepseek, and OpenAI o1/o3 \parencite guan2024deliberative.
However, extending the sequence length in LLMs is non-trivial due to the quadratic growth in computational complexity associated with the vanilla attention mechanism \parencite waswani2017attention. This challenge has spurred a wave of research aimed at improving efficiency without sacrificing performance. One prominent direction capitalizes on the inherent sparsity of attention scores. This sparsity arises both mathematically — from the softmax operation, where various sparse attention patterns have been studied \parencite jiang2024minference — and biologically \parencite watson2025human, where sparse connectivity is observed in brain regions related to memory storage.
Existing approaches often leverage predefined structural constraints, such as sink-based \parencite xiao2023efficient or sliding window attention \parencite beltagy2020longformer, to exploit this sparsity. While these methods can be effective, they tend to be highly task-specific, potentially hindering the model’s overall generalizability. Alternatively, a range of dynamic sparse attention mechanisms, exemplified by Quest \parencite tang2024quest, MInference \parencite jiang2024minference, and RetrievalAttention \parencite liu2024retrievalattention, select subsets of tokens at inference time. Although such methods can reduce computation for long sequences, they do not substantially alleviate the intensive training costs of long-context models, making it challenging to scale LLMs efficiently to contexts on the order of millions of tokens. Another promising alternative has recently emerged in the form of linear attention models, such as Mamba \parencite dao2024transformers, RWKV \parencite peng2023rwkv, peng2024eagle, and RetNet \parencite sun2023retentive. These approaches replace canonical softmax-based attention with linear approximations, thereby reducing the computational overhead for long-sequence processing. However, due to the substantial differences between linear and conventional attention, adapting existing Transformer models typically incurs high conversion costs \parencite mercat2024linearizing, wang2024mamba, bick2025transformers, zhang2024lolcats or requires training entirely new models from scratch \parencite li2025minimax. More importantly, evidence of their effectiveness in complex reasoning tasks remains limited.
Consequently, a critical research question arises: How can we design a robust and adaptable attention architecture that retains the original Transformer framework while adhering to a “less structure” principle, allowing the model to determine where to attend without relying on predefined biases? Ideally, such an architecture would transition seamlessly between full and sparse attention modes, thus maximizing compatibility with existing pre-trained models and enabling both efficient inference and accelerated training without compromising performance.
Thus, we introduce Mixture of Block Attention (MoBA), a novel architecture that builds upon the principles of Mixture of Experts (MoE) \parencite shazeer2017outrageously and applies them to the attention mechanism of the Transformer model. MoE has been used primarily in the feedforward network (FFN) layers of Transformers \parencite lepikhin2020gshard,fedus2022switch, zoph2022st, but MoBA pioneers its application to long-context attention, allowing dynamic selection of historically relevant blocks of keys and values for each query token. This approach not only enhances the efficiency of LLMs but also enables them to handle longer and more complex prompts without a proportional increase in resource consumption. MoBA addresses the computational inefficiency of traditional attention mechanisms by partitioning the context into blocks and employing a gating mechanism to selectively route query tokens to the most relevant blocks. This block sparse attention significantly reduces computational costs, paving the way for more efficient processing of long sequences. The model’s ability to dynamically select the most informative blocks of keys leads to improved performance and efficiency, which is particularly beneficial for tasks involving extensive contextual information.
In this paper, we detail the architecture of MoBA: first its block partitioning and routing strategy, and then its computational efficiency compared to traditional attention mechanisms. We further present experimental results that demonstrate MoBA’s superior performance on tasks requiring the processing of long sequences. Our work contributes a novel approach to efficient attention computation, pushing the boundaries of what is achievable with LLMs in handling complex and lengthy inputs.
2 Method
Figure 1: Illustration of mixture of block attention (MoBA). (a) A running example of MoBA; (b) Integration of MoBA into Flash Attention.
In this work, we introduce a novel architecture, termed Mixture of Block Attention (MoBA), which extends the capabilities of the Transformer model by dynamically selecting historical segments (blocks) for attention computation. MoBA is inspired by techniques of Mixture of Experts (MoE) and sparse attention. The former technique has been predominantly applied to the feedforward network (FFN) layers within the Transformer architecture, while the latter has been widely adopted in scaling Transformers to handle long contexts. Our method is innovative in applying the MoE principle to the attention mechanism itself, allowing for more efficient and effective processing of long sequences.
2.1 Preliminaries: Standard Attention in Transformer
We first revisit the standard attention in Transformers. For simplicity, we consider the case where a single query token ${\bm{q}}∈\mathbb{R}^{1× d}$ attends to $N$ key and value tokens, denoted ${\bm{K}},{\bm{V}}∈\mathbb{R}^{N× d}$, respectively. The standard attention is computed as:
$$
\mathrm{Attn}({\bm{q}},{\bm{K}},{\bm{V}})=\mathrm{Softmax}\left({\bm{q}}{\bm{K}}^{\top}\right){\bm{V}}, \tag{1}
$$
where $d$ denotes the dimension of a single attention head. We focus on the single-head scenario for clarity. The extension to multi-head attention involves concatenating the outputs from multiple such single-head attention operations.
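For concreteness, Equation 1 can be sketched in a few lines of NumPy. The helper names here are illustrative, and the $1/\sqrt{d}$ scaling is omitted to match the equation as written:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, K, V):
    """Standard single-head attention of Equation 1: Softmax(q K^T) V.
    q: (1, d); K, V: (N, d)."""
    return softmax(q @ K.T) @ V

# a single query attending over N = 4 key/value tokens
rng = np.random.default_rng(0)
d, N = 8, 4
q = rng.normal(size=(1, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
out = attn(q, K, V)   # shape (1, d)
```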
2.2 MoBA Architecture
Different from standard attention, where each query token attends to the entire context, MoBA enables each query token to attend only to a subset of keys and values:
$$
\mathrm{MoBA}({\bm{q}},{\bm{K}},{\bm{V}})=\mathrm{Softmax}\left({\bm{q}}{{\bm{K}}[I]}^{\top}\right){\bm{V}}[I], \tag{2}
$$
where $I\subseteq[N]$ is the set of selected keys and values.
The key innovation in MoBA is the block partitioning and selection strategy. We divide the full context of length $N$ into $n$ blocks, where each block contains a contiguous subset of tokens. Without loss of generality, we assume that the context length $N$ is divisible by the number of blocks $n$ . We further denote $B=\frac{N}{n}$ to be the block size and
$$
I_{i}=\left[(i-1)\times B+1,i\times B\right] \tag{3}
$$
to be the range of the $i$ -th block. By applying the top- $k$ gating mechanism from MoE, we enable each query to selectively focus on a subset of tokens from different blocks, rather than the entire context:
$$
I=\bigcup_{g_{i}>0}I_{i}. \tag{4}
$$
The model employs a gating mechanism, as $g_{i}$ in Equation 4, to select the most relevant blocks for each query token. The MoBA gate first computes the affinity score $s_{i}$ measuring the relevance between query ${\bm{q}}$ and the $i$ -th block, and applies a top- $k$ gating among all blocks. More formally, the gate value for the $i$ -th block $g_{i}$ is computed by
$$
g_{i}=\begin{cases}1&s_{i}\in\mathrm{Topk}\left(\{s_{j}|j\in[n]\},k\right)\\
0&\text{otherwise}\end{cases}, \tag{5}
$$
where $\mathrm{Topk}(·,k)$ denotes the set containing the $k$ highest affinity scores among all blocks. In this work, the score $s_{i}$ is computed by the inner product between ${\bm{q}}$ and the mean pooling of ${\bm{K}}[I_{i}]$ along the sequence dimension:
$$
s_{i}=\langle{\bm{q}},\mathrm{mean\_pool}({\bm{K}}[I_{i}])\rangle \tag{6}
$$
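Concretely, the gating of Equations 3–6 amounts to a mean-pool, an inner product, and a top-$k$ selection. A minimal NumPy sketch follows; `moba_gate` is a hypothetical helper name, and causality is deferred to the discussion below:

```python
import numpy as np

def moba_gate(q, K, B, k):
    """Top-k block gating: score block i by <q, mean_pool(K[I_i])> (Eq. 6),
    then set g_i = 1 for the k highest-scoring blocks (Eq. 5).
    q: (d,); K: (N, d) with N divisible by B."""
    N, d = K.shape
    n = N // B
    pooled = K.reshape(n, B, d).mean(axis=1)   # mean_pool over each block I_i
    s = pooled @ q                             # affinity scores s_i
    g = np.zeros(n, dtype=int)
    g[np.argsort(s)[-k:]] = 1                  # top-k gate values g_i
    return g, s

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 4))       # N = 16 tokens, d = 4, so n = 4 blocks of B = 4
q = rng.normal(size=4)
g, s = moba_gate(q, K, B=4, k=2)   # exactly two blocks selected
```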
A Running Example. We provide a running example of MoBA at Figure 1a, where we have two query tokens and four KV blocks. The router (gating network) dynamically selects the top two blocks for each query to attend. As shown in Figure 1a, the first query is assigned to the first and second blocks, while the second query is assigned to the third and fourth blocks.
It is important to maintain causality in autoregressive language models, as they generate text by next-token prediction based on previous tokens. This sequential generation process ensures that a token cannot influence tokens that come before it, thus preserving the causal relationship. MoBA preserves causality through two specific designs:
Causality: No Attention to Future Blocks. MoBA ensures that a query token cannot be routed to any future blocks. By limiting the attention scope to current and past blocks, MoBA adheres to the autoregressive nature of language modeling. More formally, denoting $\mathrm{pos}({\bm{q}})$ as the position index of the query ${\bm{q}}$ , we set $s_{i}=-∞$ and $g_{i}=0$ for any blocks $i$ such that $\mathrm{pos}({\bm{q}})<i× B$ .
Current Block Attention and Causal Masking. We define the “current block” as the block that contains the query token itself. Routing to the current block could also violate causality, since mean pooling across the entire block can inadvertently include information from future tokens. To address this, we enforce that each token is routed to its current block, and we apply a causal mask during the current block attention. This strategy not only avoids any leakage of information from subsequent tokens but also encourages attention to the local context. More formally, we set $g_{i}=1$ for the block $i$ where the position of the query token $\mathrm{pos}({\bm{q}})$ is within the interval $I_{i}$ . From the perspective of Mixture-of-Experts (MoE), the current block attention in MoBA is akin to the role of shared experts in modern MoE architectures \parencite dai2024deepseekmoe, yang2024qwen2, where static routing rules are combined with dynamic expert selection.
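Both causality rules can be folded directly into the gate. The sketch below (with the hypothetical helper `causal_gate`; positions are 1-indexed as in the text) masks current and future blocks before the top-$k$ and then forces the current block on:

```python
import numpy as np

def causal_gate(pos, s, B, k):
    """Causal gating for a query at position pos: set s_i = -inf for any
    block i with pos < i * B, run top-k over the remaining (past) blocks,
    and always route to the current block. s holds the n affinity scores."""
    n = len(s)
    s = s.astype(float)
    s[pos < np.arange(1, n + 1) * B] = -np.inf   # mask current and future blocks
    g = np.zeros(n, dtype=int)
    past = np.flatnonzero(np.isfinite(s))
    if past.size:
        g[past[np.argsort(s[past])[-k:]]] = 1    # top-k among past blocks
    g[(pos - 1) // B] = 1                        # current block is always selected
    return g

# query at position 9, block size 4: block 3 is current (forced on),
# block 4 is future (masked), and top-1 picks the better of blocks 1-2
g = causal_gate(pos=9, s=np.array([0.5, 2.0, 1.0, 3.0]), B=4, k=1)
```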
Next, we discuss some additional key design choices of MoBA, such as its block segmentation strategy and the hybrid of MoBA and full attention.
Fine-Grained Block Segmentation. The positive impact of fine-grained expert segmentation on model performance has been well-documented in the Mixture-of-Experts (MoE) literature \parencite dai2024deepseekmoe, yang2024qwen2. In this work, we explore the potential advantage of applying a similar fine-grained segmentation technique to MoBA. Although inspired by MoE, MoBA performs segmentation along the context-length dimension rather than the FFN intermediate hidden dimension. Our investigation therefore aims to determine whether MoBA also benefits from partitioning the context into finer-grained blocks. More experimental results can be found in Section 3.1.
Hybrid of MoBA and Full Attention. MoBA is designed as a substitute for full attention, maintaining exactly the same number of parameters. This property enables smooth transitions between full attention and MoBA: at the initialization stage, each attention layer can select either full attention or MoBA, and this choice can be altered dynamically during training if necessary. A similar idea of transitioning from full attention to sliding window attention has been studied in previous work \parencite zhang2024simlayerkv. More experimental results can be found in Section 3.2.
Comparison with Sliding Window Attention and Attention Sink. Sliding window attention (SWA) and attention sink are two popular sparse attention architectures. We demonstrate that both can be viewed as special cases of MoBA. For sliding window attention \parencite beltagy2020longformer, each query token only attends to its neighboring tokens. This can be interpreted as a variant of MoBA with a gating network that always selects the most recent blocks. Similarly, attention sink \parencite xiao2023efficient, where each query token attends to a combination of initial tokens and the most recent tokens, can be seen as a variant of MoBA with a gating network that always selects both the initial and the recent blocks. The above discussion shows that MoBA has stronger expressive power than sliding window attention and attention sink. Moreover, it shows that MoBA can flexibly approximate many static sparse attention architectures by incorporating specific gating networks.
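To illustrate this reduction, the two static gates can be written down explicitly. `swa_gate` and `sink_gate` are hypothetical names; positions are 1-indexed, and block selection here deliberately ignores affinity scores:

```python
import numpy as np

def swa_gate(pos, n, B, k):
    """Sliding-window attention as a static MoBA gate:
    always select the k most recent non-future blocks."""
    cur = (pos - 1) // B                  # 0-indexed current block
    g = np.zeros(n, dtype=int)
    g[max(0, cur - k + 1): cur + 1] = 1
    return g

def sink_gate(pos, n, B, k):
    """Attention sink as a static MoBA gate:
    the initial block plus the k - 1 most recent blocks."""
    g = swa_gate(pos, n, B, k - 1)
    g[0] = 1
    return g
```

For a query at position 17 with 8 blocks of size 4, `swa_gate` with `k=2` keeps blocks 4-5 and `sink_gate` with `k=3` additionally keeps block 1, matching the two patterns described above.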
Overall, MoBA’s attention mechanism allows the model to adaptively and dynamically focus on the most informative blocks of the context. This is particularly beneficial for tasks involving long documents or sequences, where attending to the entire context may be unnecessary and computationally expensive. MoBA’s ability to selectively attend to relevant blocks enables more nuanced and efficient processing of information.
2.3 Implementation
Algorithm 1 MoBA (Mixture of Block Attention) Implementation
0: Query, key and value matrices $\mathbf{Q},\mathbf{K},\mathbf{V}∈\mathbb{R}^{N× h× d}$ ; MoBA hyperparameters (block size $B$ and top- $k$ ); $h$ and $d$ denote the number of attention heads and head dimension. Also denote $n=N/B$ to be the number of blocks.
1: // Split KV into blocks
2: $\{\tilde{\mathbf{K}}_{i},\tilde{\mathbf{V}}_{i}\}=\text{split\_blocks}(\mathbf{K},\mathbf{V},B)$ , where $\tilde{\mathbf{K}}_{i},\tilde{\mathbf{V}}_{i}∈\mathbb{R}^{B× h× d},i∈[n]$
3: // Compute gating scores for dynamic block selection
4: $\bar{\mathbf{K}}=\text{mean\_pool}(\mathbf{K},B)∈\mathbb{R}^{n× h× d}$
5: $\mathbf{S}=\mathbf{Q}\bar{\mathbf{K}}^{\top}∈\mathbb{R}^{N× h× n}$
6: // Select blocks with causal constraint (no attention to future blocks)
7: $\mathbf{M}=\text{create\_causal\_mask}(N,n)$
8: $\mathbf{G}=\text{topk}(\mathbf{S}+\mathbf{M},k)$
9: // Organize attention patterns for computational efficiency
10: $\mathbf{Q}^{s},\tilde{\mathbf{K}}^{s},\tilde{\mathbf{V}}^{s}=\text{get\_self\_attn\_block}(\mathbf{Q},\tilde{\mathbf{K}},\tilde{\mathbf{V}})$
11: $\mathbf{Q}^{m},\tilde{\mathbf{K}}^{m},\tilde{\mathbf{V}}^{m}=\text{index\_select\_moba\_attn\_block}(\mathbf{Q},\tilde{\mathbf{K}},\tilde{\mathbf{V}},\mathbf{G})$
12: // Compute attentions separately
13: $\mathbf{O}^{s}=\text{flash\_attention\_varlen}(\mathbf{Q}^{s},\tilde{\mathbf{K}}^{s},\tilde{\mathbf{V}}^{s},\text{causal=True})$
14: $\mathbf{O}^{m}=\text{flash\_attention\_varlen}(\mathbf{Q}^{m},\tilde{\mathbf{K}}^{m},\tilde{\mathbf{V}}^{m},\text{causal=False})$
15: // Combine results with online softmax
16: $\mathbf{O}=\text{combine\_with\_online\_softmax}(\mathbf{O}^{s},\mathbf{O}^{m})$
17: return $\mathbf{O}$
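To make the control flow concrete, here is a naive single-head NumPy rendering of Algorithm 1 under the semantics described above. It gathers each query's selected blocks and runs one dense softmax per query instead of the varlen FlashAttention and online-softmax steps, so it is an illustrative sketch rather than our optimized kernel; `moba_naive` is a hypothetical name:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moba_naive(Q, K, V, B, k):
    """Naive single-head MoBA: Q, K, V have shape (N, d) with N % B == 0."""
    N, d = Q.shape
    n = N // B
    pooled = K.reshape(n, B, d).mean(axis=1)      # lines 3-4: mean-pooled keys
    O = np.zeros_like(Q)
    for t in range(N):
        cur = t // B
        s = pooled @ Q[t]                         # line 5: gating scores
        s[cur:] = -np.inf                         # lines 6-7: mask current/future blocks
        top = [b for b in np.argsort(s)[-k:] if np.isfinite(s[b])]
        idx = [j for b in sorted(top) for j in range(b * B, (b + 1) * B)]
        idx += list(range(cur * B, t + 1))        # current block with causal mask
        w = softmax(Q[t] @ K[idx].T)              # dense stand-in for lines 10-16
        O[t] = w @ V[idx]
    return O

# sanity check: with k >= n - 1, every past block is selected, so MoBA
# coincides with full (unscaled) causal attention
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 8, 4))
O = moba_naive(Q, K, V, B=2, k=3)
ref = np.stack([softmax(Q[t] @ K[:t + 1].T) @ V[:t + 1] for t in range(8)])
```

The sanity check at the end exercises the full/sparse equivalence discussed above: with enough selected blocks, MoBA degenerates to full causal attention.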
Figure 2: Efficiency of MoBA vs. full attention (implemented with Flash Attention). (a) 1M model speedup evaluation: computation time scaling of MoBA versus Flash Attention on a 1M-context model with increasing sequence lengths (8K-1M). (b) Fixed sparsity ratio scaling: computation time comparison between MoBA and Flash Attention across increasing sequence lengths (8K-10M), maintaining a constant sparsity ratio of $95.31\%$ (fixed 64 MoBA blocks with varying block size and fixed top-k=3).
We provide a high-performance implementation of MoBA by incorporating optimization techniques from FlashAttention \parencite dao2022flashattention and MoE \parencite rajbhandari2022deepspeed. Figure 2 demonstrates the high efficiency of MoBA, while we defer the detailed experiments on efficiency and scalability to Section 3.4. Our implementation consists of five major steps:
- Determine the assignment of query tokens to KV blocks according to the gating network and causal mask.
- Arrange the ordering of query tokens based on their assigned KV blocks.
- Compute attention outputs for each KV block and the query tokens assigned to it. This step can be optimized by FlashAttention with varying lengths.
- Re-arrange the attention outputs back to their original ordering.
- Combine the corresponding attention outputs using online Softmax (i.e., tiling), as a query token may attend to its current block and multiple historical KV blocks.
The algorithmic workflow is formalized in Algorithm 1 and visualized in Figure 1b, illustrating how MoBA can be implemented based on MoE and FlashAttention. First, the KV matrices are partitioned into blocks (Lines 1-2). Next, the gating score is computed according to Equation 6, which measures the relevance between query tokens and KV blocks (Lines 3-7). A top- $k$ operator is applied on the gating score (together with the causal mask), resulting in a sparse query-to-KV-block mapping matrix ${\bm{G}}$ to represent the assignment of queries to KV blocks (Line 8). Then, query tokens are arranged based on the query-to-KV-block mapping, and block-wise attention outputs are computed (Lines 9-12). Notably, attention to historical blocks (Lines 11 and 14) and the current block attention (Lines 10 and 13) are computed separately, as additional causality needs to be maintained in the current block attention. Finally, the attention outputs are rearranged back to their original ordering and combined with online softmax (Line 16) \parencite milakov2018onlinenormalizercalculationsoftmax,liu2023blockwiseparalleltransformerlarge.
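The final combining step is the standard online-softmax merge. A minimal NumPy sketch, with hypothetical helpers `partial_attn` and `combine_online_softmax` (not our actual kernels), shows why merging per-chunk results reproduces the exact softmax:

```python
import numpy as np

def partial_attn(q, K, V):
    """Attention over one KV chunk in online-softmax form: returns the
    chunk's max logit m, normalizer l, and unnormalized output o."""
    logits = q @ K.T
    m = logits.max()
    e = np.exp(logits - m)
    return m, e.sum(), e @ V

def combine_online_softmax(parts):
    """Merge per-chunk (m, l, o) states into the exact softmax output
    by rescaling every chunk with exp(m_i - m) before summing."""
    m = max(p[0] for p in parts)
    l = sum(p[1] * np.exp(p[0] - m) for p in parts)
    o = sum(p[2] * np.exp(p[0] - m) for p in parts)
    return o / l

# e.g. merging a query's current-block output with one historical block
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(4,)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
out = combine_online_softmax([partial_attn(q, K[:3], V[:3]),
                              partial_attn(q, K[3:], V[3:])])

# identical to attention computed over the full K, V in one pass
logits = q @ K.T
w = np.exp(logits - logits.max())
w = w / w.sum()
ref = w @ V
```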
3 Experiments
3.1 Scaling Law Experiments and Ablation Studies
In this section, we conduct scaling law experiments and ablation studies to validate some key design choices of MoBA.
| Params | Layers | Heads | Hidden Size | Training Tokens | Block Size | Top-k |
| --- | --- | --- | --- | --- | --- | --- |
| 568M | 14 | 14 | 1792 | 10.8B | 512 | 3 |
| 822M | 16 | 16 | 2048 | 15.3B | 512 | 3 |
| 1.1B | 18 | 18 | 2304 | 20.6B | 512 | 3 |
| 1.5B | 20 | 20 | 2560 | 27.4B | 512 | 3 |
| 2.1B | 22 | 22 | 2816 | 36.9B | 512 | 3 |
Table 1: Configuration of Scaling Law Experiments
<details>
<summary>x5.png Details</summary>

Log-log plot of LM loss (seqlen=8K) against training compute in PFLOP/s-days, showing the MoBA training curves together with the fitted MoBA and full attention scaling-law projections. Loss decreases with compute for all curves, and the two projections are nearly indistinguishable.
</details>
(a)
<details>
<summary>x6.png Details</summary>

Log-log plot of trailing LM loss (seqlen=32K, last 2K) against training compute in PFLOP/s-days, showing MoBA and full attention training curves with their fitted projections. Both decrease with compute; full attention sits slightly lower, with the gap narrowing as compute grows.
</details>
(b)
| Metric | MoBA | Full Attention |
| --- | --- | --- |
| LM loss (seqlen=8K) | $2.625\times C^{-0.063}$ | $2.622\times C^{-0.063}$ |
| Trailing LM loss (seqlen=32K, last 2K) | $1.546\times C^{-0.108}$ | $1.464\times C^{-0.097}$ |
(c)
Figure 3: Scaling law comparison between MoBA and full attention. (a) LM loss on validation set (seqlen=8K); (b) trailing LM loss on validation set (seqlen=32K, last 2K tokens); (c) fitted scaling law curves.
Scalability w.r.t. LM Loss.
To assess the effectiveness of MoBA, we perform scaling law experiments by comparing the validation loss of language models trained using either full attention or MoBA. Following the Chinchilla scaling law \parencite hoffmann2022training, we train five language models of varying sizes with a sufficient number of training tokens to ensure that each model achieves its training optimum. Detailed configurations of the scaling law experiments can be found in Table 1. Both MoBA and full attention models are trained with a sequence length of 8K. For MoBA models, we set the block size to 512 and select the top-3 blocks for attention, resulting in a sparse attention pattern with sparsity up to $1-\frac{512\times 3}{8192}=81.25\%$ (since top-$k$=3, each query token can attend to at most two history blocks plus its current block). In particular, MoBA serves as a drop-in alternative to full attention: it does not introduce new parameters or remove existing ones. This design simplifies our comparison, as the only difference across all experiments lies in the attention modules, while all other hyperparameters, including the learning rate and batch size, remain constant. As shown in Figure 3a, the validation loss curves for MoBA and full attention display very similar scaling trends. Specifically, the validation loss differences between these two attention mechanisms remain within $10^{-3}$. This suggests that MoBA achieves scaling performance comparable to full attention, despite its sparse attention pattern with sparsity up to 81.25%.
Long Context Scalability.
However, LM loss may be skewed by the data length distribution \parencite an2024does, which is typically dominated by short sequences. To fully assess the long-context capability of MoBA, we evaluate the LM loss of the trailing tokens (trailing LM loss for short), i.e., the LM loss computed over the last few tokens of each sequence. We count this loss only for sequences that reach the maximum sequence length, to avoid biases that may arise from very short sequences. A detailed discussion of trailing-token scaling can be found in Appendix A.1.
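As a sketch (assuming per-token losses are already available as a matrix, with the array layout and argument names invented here for illustration), the trailing LM loss is just an average over the last positions of the sequences that reach the maximum length:

```python
import numpy as np

def trailing_lm_loss(token_losses, seq_lens, max_len, last=2048):
    """Average per-token LM loss over the last `last` positions, counting only
    sequences that actually reach the maximum sequence length.

    token_losses: (n_seqs, max_len) array of per-token losses (padded rows unused).
    seq_lens:     (n_seqs,) actual sequence lengths."""
    # keep only sequences that fill the full context window
    full = token_losses[np.asarray(seq_lens) == max_len]
    return full[:, -last:].mean()
```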
These metrics provide insight into the model's ability to generate the final portion of a sequence, which can be particularly informative for tasks involving long-context understanding. Therefore, we adopt a modified experimental setting by increasing the maximum sequence length from 8K to 32K. This adjustment leads to an even sparser attention pattern for MoBA, with a sparsity level of up to $1-\frac{512\times 3}{32768}=95.31\%$. As shown in Figure 3b, although MoBA exhibits a marginally higher trailing LM loss than full attention in all five experiments, the loss gap progressively narrows. This experiment demonstrates the long-context scalability of MoBA.
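The sparsity figures quoted in this section follow directly from the configuration; a one-line helper (hypothetical, for illustration only) makes the arithmetic explicit:

```python
def moba_sparsity(seq_len, block_size, top_k):
    # fraction of the KV cache a query does NOT attend to
    # (an upper bound, reached when the full top-k budget is usable)
    return 1 - (block_size * top_k) / seq_len

print(moba_sparsity(8192, 512, 3))    # 0.8125   (the 8K setting)
print(moba_sparsity(32768, 512, 3))   # 0.953125 (the 32K setting)
```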
Ablation Study on Fine-Grained Block Segmentation.
We further ablate the block granularity of MoBA. We carry out a series of experiments using a 1.5B-parameter model with a 32K context length. The hyperparameters of block size and top-$k$ are adjusted to maintain a consistent level of attention sparsity. Specifically, we divide the 32K context into 8, 16, 32, 64, and 128 blocks, and correspondingly select 2, 4, 8, 16, and 32 blocks, ensuring an attention sparsity of 75% across these configurations. As shown in Figure 4, MoBA's performance is significantly affected by block granularity. Specifically, there is a loss difference of $10^{-2}$ between the coarsest-grained setting (selecting 2 blocks from 8) and the settings with finer granularity. These findings suggest that fine-grained segmentation is a general technique for enhancing the performance of models within the MoE family, including MoBA.
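The five granularity settings above can be checked to share the same sparsity budget, since top-$k$ and the number of blocks scale together (a small verification script, not part of the experiments themselves):

```python
# (top_k, n_blocks) settings from the ablation; block_size = 32768 // n_blocks
settings = [(2, 8), (4, 16), (8, 32), (16, 64), (32, 128)]

for top_k, n_blocks in settings:
    block_size = 32768 // n_blocks
    # attended fraction = top_k * block_size / 32768 = top_k / n_blocks
    sparsity = 1 - top_k * block_size / 32768
    assert sparsity == 0.75, (top_k, n_blocks)
```

All five settings keep 25% of the context attendable, so any loss difference in Figure 4 is attributable to granularity alone rather than sparsity.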
<details>
<summary>x7.png Details</summary>

LM loss on the validation set for MoBA under different block segmentation settings (top-k/#blocks: 2/8, 4/16, 8/32, 16/64, 32/128), against a flat full attention baseline (~2.242). The coarsest 2/8 setting shows markedly higher loss (~2.258); finer granularities drop to roughly 2.240-2.243, close to the baseline.
</details>
Figure 4: Fine-Grained Block Segmentation. LM loss on the validation set for MoBA with different block granularities.
3.2 Hybrid of MoBA and Full Attention
<details>
<summary>x8.png Details</summary>

Position-wise LM loss (positions 0K-30K) for MoBA/full hybrid, MoBA, and full attention. All three curves decrease with position and converge at long positions; the hybrid recipe tracks full attention closely, while MoBA-only remains marginally higher at trailing positions.
</details>
</details>
(a)
<details>
<summary>x9.png Details</summary>

SFT LM loss versus the number of full attention layers (1, 3, 5, 10) in the layer-wise hybrid. The hybrid's loss falls from ~1.137 at 1 layer to ~1.077 at 10 layers, approaching the constant full attention level (~1.076), while MoBA-only stays at ~1.147.
</details>
(b)
<details>
<summary>x10.png Details</summary>

SFT trailing LM loss (seqlen=32K, last 2K) versus the number of full attention layers in the layer-wise hybrid. The hybrid's loss falls from ~1.17 at 1 layer to ~1.085 at 10 layers, matching the constant full attention level (~1.085), while MoBA-only stays at ~1.175.
</details>
(c)
Figure 5: Hybrid of MoBA and full attention. (a) position-wise LM loss for MoBA, full attention, and MoBA/full hybrid training; (b) SFT LM loss w.r.t the number of full attention layers in layer-wise hybrid; (c) SFT trailing LM loss (seqlen=32K, last 2K) w.r.t the number of full attention layers in layer-wise hybrid.
As discussed in Section 2, we design MoBA as a flexible substitute for full attention: it can switch from/to full attention with minimal overhead while achieving comparable long-context performance. In this section, we first show that seamlessly transitioning between full attention and MoBA is an effective recipe for efficient long-context pre-training. We then discuss the layer-wise hybrid strategy, mainly targeting the performance of supervised fine-tuning (SFT).
MoBA/Full Hybrid Training.
We train three models, each with 1.5B parameters, on 30B tokens with a context length of 32K tokens. For the hyperparameters of MoBA, the block size is set to 2048, and the top-k parameter is set to 3. The detailed training recipes are as follows:
- MoBA/full hybrid: This model is trained using a two-stage recipe. In the first stage, MoBA is used to train on 90% of the tokens. In the second stage, the model switches to full attention for the remaining 10% of the tokens.
- Full attention: This model is trained using full attention throughout the entire training.
- MoBA: This model is trained exclusively using MoBA.
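A minimal sketch of the two-stage switch in the hybrid recipe (the helper name and interface are hypothetical; only the 90%/10% split comes from the recipe above):

```python
def attention_mode(tokens_seen, total_tokens, moba_fraction=0.9):
    """Pick the attention variant at a given point of the MoBA/full hybrid
    recipe: MoBA for the first `moba_fraction` of training tokens, then
    full attention for the remainder."""
    return "moba" if tokens_seen < moba_fraction * total_tokens else "full"
```

Because MoBA is a drop-in substitute for full attention with identical parameters, the switch requires no optimizer or architecture changes, which is what makes this schedule cheap to apply.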
We evaluate their long-context performance via position-wise language model (LM) loss, a fine-grained metric that evaluates the LM loss at each position within a sequence. Unlike the vanilla LM loss, which averages the LM loss across all positions, the position-wise LM loss breaks down the loss at each position separately. Similar metrics have been suggested by previous studies \parencite xiong2023effectivelongcontextscalingfoundation,reid2024gemini, which observe that position-wise LM loss follows a power-law trend relative to context length. As shown in Figure 5a, the MoBA-only recipe results in higher position-wise losses for trailing tokens. Importantly, our MoBA/full hybrid recipe reaches a loss nearly identical to that of full attention. This result highlights the effectiveness of the MoBA/full hybrid training recipe in balancing training efficiency with model performance. More interestingly, we observe no significant loss spikes during the switch between MoBA and full attention, again demonstrating the flexibility and robustness of MoBA.
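The metric itself reduces to which axis the averaging runs over; a minimal sketch (assuming per-token losses for max-length sequences are stacked into a matrix):

```python
import numpy as np

def position_wise_lm_loss(token_losses):
    """Average per-token loss at each position separately across sequences,
    rather than pooling all positions into a single scalar.

    token_losses: (n_seqs, seq_len) array -> (seq_len,) array."""
    return token_losses.mean(axis=0)
```

Averaging the resulting curve over positions recovers the vanilla LM loss, so the position-wise view adds information (e.g., the trailing-token behaviour in Figure 5a) without changing the headline number.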
Layer-wise Hybrid.
This flexibility of MoBA encourages us to explore a more sophisticated strategy: the layer-wise hybrid of MoBA and full attention. We investigate this strategy with a particular focus on its application during supervised fine-tuning (SFT). The motivation stems from our observation that MoBA sometimes results in suboptimal performance during SFT, as shown in Figure 5b. We speculate that this may be attributed to the loss masking employed in SFT: prompt tokens are typically excluded from the loss calculation, which poses a sparse-gradient challenge for sparse attention methods like MoBA, as it may hinder the backpropagation of gradients, initially calculated from the unmasked tokens, throughout the entire context. To address this issue, we propose a hybrid approach that switches the last several Transformer layers from MoBA to full attention, while the remaining layers continue to employ MoBA. As shown in Figure 5b and Figure 5c, this strategy significantly reduces the SFT loss.
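The layer-wise hybrid amounts to a per-layer attention assignment; a minimal sketch (hypothetical helper, using the 32-layer / last-3-full configuration from Section 3.3 as the example):

```python
def layer_attention_types(n_layers, n_full_last):
    """Assign MoBA to all layers except the last `n_full_last`,
    which keep full attention."""
    assert 0 <= n_full_last <= n_layers
    return ["moba"] * (n_layers - n_full_last) + ["full"] * n_full_last

# e.g., the Llama-8B-1M-MoBA setting: 29 MoBA layers, last 3 full attention
types = layer_attention_types(32, 3)
```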
3.3 Large Language Modeling Evaluation
<details>
<summary>x11.png Details</summary>

Flowchart of the training recipe: Llama3.1 8B → 256K continual pre-training → 512K continual pre-training → 1M continual pre-training, branching into 32K SFT → 256K SFT → 1M SFT, which yields the final instruct model.
</details>
Figure 6: The continual pre-training and SFT recipes.
We conduct a thorough assessment of MoBA across a variety of real-world downstream tasks, evaluating its performance in comparison to full attention models. For ease of verification, our experiments begin with the Llama 3.1 8B base model as the starting point for long-context pre-training. This model, termed Llama-8B-1M-MoBA, is initially trained with a context length of 128K tokens, and we gradually increase the context length to 256K, 512K, and 1M tokens during continual pre-training. To ease this transition, we apply the position interpolation method \parencite chen2023extendingcontextwindowlarge at the start of the 256K continual pre-training stage. This technique enables us to extend the effective context length from 128K tokens to 1M tokens. After completing the 1M continual pre-training, MoBA is activated for 100B tokens. We set the block size to 4096 and the top-$k$ parameter to 12, leading to an attention sparsity of up to $1-\frac{4096\times 12}{1M}=95.31\%$. To preserve some full attention capability, we adopt the layer-wise hybrid strategy: the last three layers remain full attention, while the other 29 layers are switched to MoBA. For supervised fine-tuning, we follow a similar strategy that gradually increases the context length from 32K to 1M. The baseline full attention model (termed Llama-8B-1M-Full) follows the same training strategy, as shown in Figure 6, with the only difference being the use of full attention throughout the process. This allows us to directly compare the performance of MoBA with that of full attention under equivalent training conditions.
The evaluation is performed on several widely used long-context benchmarks. In particular, across all evaluation tasks, MoBA is used for prefill only, while we switch to full attention during generation for better performance. As shown in Table 2, Llama-8B-1M-MoBA exhibits a performance that is highly comparable to that of Llama-8B-1M-Full. It is particularly noteworthy that in the longest benchmark, RULER, where MoBA operates at a sparsity level of up to $1-\frac{4096× 12}{128K}=62.5\%$ , Llama-8B-1M-MoBA nearly matches the performance of Llama-8B-1M-Full, with a score of 0.7818 compared to 0.7849. For context lengths of up to 1M tokens, we evaluate the model using the traditional Needle in the Haystack benchmark. As shown in Figure 7, Llama-8B-1M-MoBA demonstrates satisfactory performance even with an extended context length of 1 million tokens.
| Benchmark | Llama-8B-1M-MoBA | Llama-8B-1M-Full |
| --- | --- | --- |
| AGIEval [0-shot] | 0.5144 | 0.5146 |
| BBH [3-shot] | 0.6573 | 0.6589 |
| CEval [5-shot] | 0.6273 | 0.6165 |
| GSM8K [5-shot] | 0.7278 | 0.7142 |
| HellaSWAG [0-shot] | 0.8262 | 0.8279 |
| Loogle [0-shot] | 0.4209 | 0.4016 |
| Competition Math [0-shot] | 0.4254 | 0.4324 |
| MBPP [3-shot] | 0.5380 | 0.5320 |
| MBPP Sanitized [0-shot] | 0.6926 | 0.6615 |
| MMLU [0-shot] | 0.4903 | 0.4904 |
| MMLU Pro [5-shot][CoT] | 0.4295 | 0.4328 |
| OpenAI HumanEval [0-shot][pass@1] | 0.6951 | 0.7012 |
| SimpleQA [0-shot] | 0.0465 | 0.0492 |
| TriviaQA [0-shot] | 0.5673 | 0.5667 |
| LongBench @32K [0-shot] | 0.4828 | 0.4821 |
| RULER @128K [0-shot] | 0.7818 | 0.7849 |
Table 2: Performance comparison between MoBA and full attention across different evaluation benchmarks.
<details>
<summary>x12.png Details</summary>

Needle-in-a-Haystack heatmap of retrieval score (0-100, red to green) over context lengths from 32K to 1024K and needle positions from 0% to 100% of the context. The map is almost uniformly green, indicating near-perfect retrieval across all lengths and positions.
</details>
Figure 7: Performance of Llama-8B-1M-MoBA on the Needle in a Haystack benchmark (up to 1M context length).
3.4 Efficiency and Scalability
The above experimental results show that MoBA achieves comparable performance not only in language model loss but also in real-world tasks. To further investigate its efficiency, we compare the forward-pass time of the attention layer in the two models trained in Section 3.3, Llama-8B-1M-MoBA and Llama-8B-1M-Full. We focus solely on the attention layer, as all other layers (e.g., FFN) have identical FLOPs in both models. As shown in Figure 2a, MoBA is faster than full attention across all context lengths, demonstrating sub-quadratic computational complexity. In particular, it achieves a speedup of up to 6.5x when prefilling 1M tokens.
We also explore the length scalability of MoBA by gradually increasing the context length to 10 million tokens. To maintain constant attention sparsity, we keep the top-k value and the number of MoBA blocks fixed while proportionally increasing the block size. To reach the 10M context length, we extended tensor parallelism \parencite shoeybi2019megatron to the query-head level: we broadcast key and value tensors across distributed query heads, effectively addressing GPU memory limitations while preserving computational efficiency. As shown in Figure 2b, MoBA demonstrates superior efficiency compared to standard FlashAttention when scaling to longer sequences; at 10M tokens, MoBA achieves a 16x reduction in attention computation time. The inset, which focuses on shorter sequences (32K to 512K), shows that although both methods perform comparably at smaller scales, MoBA's computational advantage becomes increasingly evident as sequences grow longer, highlighting its particular strength in processing extremely long sequences.
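The constant-sparsity scaling rule above can be sketched in a few lines; `num_blocks=64` and `top_k=8` are illustrative values, not the paper's configuration:

```python
def scaled_block_size(context_len: int, num_blocks: int) -> int:
    """Keep the number of MoBA blocks fixed; the block size then grows
    proportionally with the context length."""
    assert context_len % num_blocks == 0
    return context_len // num_blocks

def attended_fraction(context_len: int, num_blocks: int, top_k: int) -> float:
    """Fraction of KV tokens each query attends to. This reduces to
    top_k / num_blocks, independent of the context length."""
    block_size = scaled_block_size(context_len, num_blocks)
    return top_k * block_size / context_len

# Scaling from 1M to 10M tokens leaves the attended fraction unchanged.
for n in (1_048_576, 10_485_760):
    print(attended_fraction(n, num_blocks=64, top_k=8))  # 0.125 both times
```

Because the fraction collapses to `top_k / num_blocks`, fixing those two numbers is sufficient to hold sparsity constant as the context grows.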
Overall, the high efficiency of MoBA can be attributed to two key innovations: (1) the block sparse attention mechanism, and (2) the optimized implementation combining Mixture-of-Experts (MoE) and FlashAttention, as described in Section 2.3. These techniques effectively address the quadratic complexity limitation of full attention, reducing the computational complexity to a more economical sub-quadratic scale.
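The block sparse mechanism behind these numbers can be illustrated with a minimal NumPy sketch: mean-pool the keys of each block, score a query against the pooled block representations, and run ordinary softmax attention over only the top-k selected blocks. This is a simplified single-head, non-causal illustration, not the paper's fused MoE + FlashAttention implementation from Section 2.3:

```python
import numpy as np

def moba_attention(q, k, v, block_size, top_k):
    """Minimal single-head MoBA sketch: gate the query to its top-k KV
    blocks via mean-pooled block keys, then run ordinary softmax attention
    over only the selected blocks. Non-causal, illustration only.
    q: (d,) one query; k, v: (n, d); n must be divisible by block_size."""
    n, d = k.shape
    num_blocks = n // block_size
    block_keys = k.reshape(num_blocks, block_size, d).mean(axis=1)  # (B, d)
    gate_scores = block_keys @ q                                    # (B,)
    chosen = np.argsort(gate_scores)[-top_k:]                       # top-k block ids
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                          for b in chosen])
    logits = k[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v[idx]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=8), rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
out = moba_attention(q, k, v, block_size=16, top_k=2)  # attends to 32 of 64 tokens
print(out.shape)
```

Setting `top_k` equal to the number of blocks recovers full attention exactly, which is the seamless full/sparse transition the paper highlights.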
4 Related Work
The development of efficient attention \parencite tay2020efficient mechanisms has been a critical area of research in natural language processing, particularly with the rise of Large Language Models (LLMs). As the demand for handling longer sequences and reducing computational costs grows, efficient attention techniques have emerged as a promising solution to reduce the quadratic complexity of self-attention mechanisms while maintaining model performance.
Static Sparse Patterns: Significant efforts, such as Sparse Transformer \parencite child2019generating, Star-Transformer \parencite guo2019star, BlockBERT \parencite qiu2019blockwise, Longformer \parencite beltagy2020longformer, GMAT \parencite gupta2020gmat, ETC \parencite ainslie2020etc, BigBird \parencite zaheer2020big, LongT5 \parencite guo2021longt5 and LongNet \parencite ding2023longnet, have been dedicated to the design of static attention patterns in LLMs. Their choices of static attention patterns can encompass strided and fixed attention, window attention, global token attention, random attention, dilated attention, block sparse attention, or any combinations of them. In the realm of multimodal models, static sparse attention mechanisms have also been developed, such as axial attention \parencite ho2019axial for 2D images and spatial-temporal attention \parencite opensora for 3D videos.
Dynamic Sparse Patterns: In contrast to static patterns, dynamic sparse attention techniques adaptively determine which tokens to attend to. Reformer \parencite kitaev2020reformer and Routing Transformer \parencite roy2021efficient employ locality-sensitive hashing (LSH) and K-means, respectively, to cluster tokens, and attend to clusters rather than the full context. Memorizing Transformers \parencite wu2022memorizing and Unlimiformer \parencite bertsch2024unlimiformer dynamically attend to tokens selected by k-nearest-neighbor (kNN) algorithms. CoLT5 \parencite ainslie2023colt5 designs a routing module to select the most important queries and keys. Sparse Sinkhorn Attention \parencite tay2020sparse learns to permute blocks of the input sequence, allowing dynamic block sparse attention computation.
Training-free Sparse Attention: In addition to the approaches discussed above, which train sparse attention models, there are also strategies that incorporate sparse attention to accelerate the two primary stages of model inference: the prefill stage, the decode stage, or both. During prefill, the complete prompt is available for attention profiling, which allows for the exploration of more intricate sparse attention patterns. For instance, MoA \parencite fu2024moa, MInference \parencite jiang2024minference, and SeerAttention \parencite gao2024seerattention investigate sparse attention configurations such as A-shape, vertical-slash, and dynamic block sparsity. In the context of decode optimization, considerable work has been dedicated to compressing and pruning the KV cache to balance the quality and speed of text generation. Notable efforts in this area include H2O \parencite zhang2024h2o, StreamingLLM \parencite xiao2023efficient, TOVA \parencite oren2024tova, FastGen \parencite ge2023fastgen, and Quest \parencite tang2024quest. Quest, in particular, can be viewed as MoBA with a smaller block size and a specialized block representation function that combines min and max pooling. Another work closely related to MoBA is LongHeads \parencite lu2024longheads, which can be viewed as MoBA with a top-1 gating network, meaning that each query selects only the single most relevant KV block for attention.
Beyond Traditional Attention Architecture: Another line of research investigates novel model architectures that deviate from the conventional attention mechanism. Because these methods change the architecture, they require training models from scratch and cannot reuse pre-trained Transformer-based models. Studies in this domain have explored architectures inspired by Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), State Space Models (SSMs), or Linear Attention \parencite katharopoulos2020transformers. Examples of such models include Hyena \parencite poli2023hyena, Performer \parencite choromanski2020rethinking, Linformer \parencite wang2020linformer, RWKV \parencite peng2023rwkv, Mamba \parencite gu2023mamba, RetNet \parencite sun2023retentive, etc.
In summary, the landscape of efficient attention techniques is diverse, encompassing sparse patterns that range from static to dynamic, optimization objectives that span from training to inference, and architectures that extend from traditional attention mechanisms to innovative alternatives. Each method presents unique advantages and trade-offs, and the choice of technique often depends on the specific requirements of the application, such as the maximum sequence length, computational resources, and the desired balance between efficiency and performance. As research in this area continues to evolve, it is expected that these methods will play a crucial role in enabling LLMs to tackle increasingly complex tasks while maintaining efficiency and scalability.
5 Conclusion
In this paper, we introduce Mixture of Block Attention (MoBA), a novel attention architecture inspired by the principles of Mixture of Experts (MoE) that aims to enhance the efficiency and scalability of large language models (LLMs) for long-context tasks. MoBA addresses the computational challenges associated with traditional attention mechanisms by partitioning the context into blocks and employing a dynamic gating mechanism to selectively route query tokens to the most relevant KV blocks. This approach not only reduces computational complexity but also maintains model performance. Moreover, it allows for seamless transitions between full and sparse attention. Through extensive experiments, we demonstrated that MoBA achieves performance comparable to full attention while significantly improving computational efficiency. Our results show that MoBA can scale effectively to long contexts, maintaining low LM losses and high performance on various benchmarks. Additionally, MoBA’s flexibility allows it to be integrated with existing models without substantial training cost, making it a practical continual pre-training solution for enhancing long-context capabilities in LLMs. In summary, MoBA represents a significant advancement in efficient attention, offering a balanced approach between performance and efficiency. Future work may explore further optimizations of MoBA’s block-selection strategies, investigate its application to other modalities, and study its potential for improving generalization in complex reasoning tasks.
\printbibliography[title=References]
Appendix A Appendix
A.1 Long Context Scalability
To address the bias in the natural data distribution that favors short contexts, we strategically segmented the overall sequences into discrete segments based on their actual positions. For example, the segment spanning positions 30K-32K reflects only losses from documents longer than 30K tokens, with the loss masked to positions 30K through 32K. This approach ensures a more balanced and representative evaluation across different context lengths. In our exploration of long-context scalability, we made a pivotal discovery: the trailing tokens account for the majority of the performance discrepancy between the full-context baseline and the newly proposed sparse attention architectures. Consequently, we streamlined the long-context scaling process by focusing on trailing-token scaling. This not only simplifies the computational requirements but also significantly enhances the efficiency and effectiveness of investigating long-context scenarios. This finding holds substantial implications for the development of more efficient and scalable attention mechanisms in the future.
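One reading of the bucketing scheme above (the helper name and the exact skip condition are our assumptions) is a per-document mask that averages the loss only over positions inside the bucket and skips documents too short to reach it:

```python
import numpy as np

def bucketed_lm_loss(token_losses, lo, hi):
    """Mean LM loss over positions [lo, hi) of one document, or None when
    the document is too short to cover the bucket (so short documents do
    not bias the long-context buckets).
    token_losses: (doc_len,) per-token negative log-likelihoods."""
    if len(token_losses) < hi:
        return None
    return float(np.mean(token_losses[lo:hi]))

# e.g. the 30K-32K bucket only scores documents that reach 32K tokens,
# and the loss is averaged over positions 30K-32K alone.
losses = np.linspace(3.0, 2.0, 40_000)  # synthetic per-token losses
print(bucketed_lm_loss(losses, lo=30_720, hi=32_768))
```

Aggregating these per-document bucket means across a corpus yields one loss curve per position range, which is what the scaling-law figures below plot.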
[Figure: log-log plot of LM loss over positions 0-2K versus training compute (PFLOP/s-days). Multiple MoBA projection curves start above the full-attention projection, decline steeply, and converge toward it as compute increases.]
(a) Scaling law (0-2k)
[Figure: log-log plot of LM loss over positions 2-4K versus training compute (PFLOP/s-days). MoBA projection curves start above the full-attention projection and converge toward it as compute increases.]
(b) Scaling law (2-4k)
[Figure: log-log plot of LM loss over positions 4-6K versus training compute (PFLOP/s-days). MoBA projection curves decline steeply from a higher starting loss and approach the full-attention projection as compute increases.]
(c) Scaling law (4-6k)
[Figure: log-log plot of LM loss over positions 6-8K versus training compute (PFLOP/s-days). MoBA projection curves start above the full-attention projection and converge toward it as compute increases.]
(d) Scaling law (6-8k)
[Figure: log-log plot of LM loss over positions 8-10K versus training compute (PFLOP/s-days). MoBA projection curves decline rapidly and converge with the full-attention projection around 1 PFLOP/s-day.]
(e) Scaling law (8-10k)
[Figure: log-log plot of LM loss over positions 10-12K versus training compute (PFLOP/s-days). MoBA and full-attention projection curves decrease with compute and track each other closely.]
(f) Scaling law (10-12k)
[Figure: log-log plot of LM loss over positions 12-14K versus training compute (PFLOP/s-days). MoBA projection curves start above the full-attention projection and become comparable to it at roughly 1 PFLOP/s-day and beyond.]
(g) Scaling law (12-14k)
<details>
<summary>x20.png Details</summary>

Log-log plot: LM loss over positions 14k-16k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(h) Scaling law (14-16k)
Figure 8: Scaling laws for positions 0-16k
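The dashed projections in these panels are power-law fits of the form $L(C) = a \times C^{-b}$, where $C$ is training compute in PFLOP/s-days. In log-log coordinates such a fit is linear, so its coefficients can be recovered with ordinary least squares. A minimal sketch of this fitting step, using synthetic points rather than the paper's data:

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space."""
    # log(loss) = log(a) - b * log(compute), a straight line in log-log axes
    slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(log_a), -slope  # (a, b), with b > 0 for a decreasing curve

# Synthetic samples generated from the 0K-2K law in Table 3 (3.075 * C**-0.078)
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])  # PFLOP/s-days
loss = 3.075 * compute ** -0.078

a, b = fit_power_law(compute, loss)
print(f"loss ~ {a:.3f} * C^(-{b:.3f})")
```

On noise-free samples the fit recovers the generating coefficients exactly; on real loss curves one would fit each position bucket's runs separately to obtain the per-bucket laws tabulated below.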
<details>
<summary>x21.png Details</summary>

Log-log plot: LM loss over positions 16k-18k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(i) Scaling law (16-18k)
<details>
<summary>x22.png Details</summary>

Log-log plot: LM loss over positions 18k-20k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(j) Scaling law (18-20k)
<details>
<summary>x23.png Details</summary>

Log-log plot: LM loss over positions 20k-22k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(k) Scaling law (20-22k)
<details>
<summary>x24.png Details</summary>

Log-log plot: LM loss over positions 22k-24k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(l) Scaling law (22-24k)
<details>
<summary>x25.png Details</summary>

Log-log plot: LM loss over positions 24k-26k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(m) Scaling law (24-26k)
<details>
<summary>x26.png Details</summary>

Log-log plot: LM loss over positions 26k-28k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(n) Scaling law (26-28k)
<details>
<summary>x27.png Details</summary>

Log-log plot: LM loss over positions 28k-30k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(o) Scaling law (28-30k)
<details>
<summary>x28.png Details</summary>

Log-log plot: LM loss over positions 30k-32k versus training compute (PFLOP/s-days). Solid lines are individual training runs; dashed lines are the fitted MoBA and full-attention scaling-law projections, which track each other closely.
</details>
(p) Scaling law (30-32k)
Figure 8: Scaling laws for positions 16-32k
Table 3: Loss scaling with different positions
| Position | MoBA | Full Attention |
| --- | --- | --- |
| 0K - 2K | $3.075× C^{-0.078}$ | $3.068× C^{-0.078}$ |
| 2K - 4K | $2.415× C^{-0.084}$ | $2.411× C^{-0.083}$ |
| 4K - 6K | $2.085× C^{-0.081}$ | $2.077× C^{-0.081}$ |
| 6K - 8K | $1.899× C^{-0.092}$ | $1.894× C^{-0.092}$ |
| 8K - 10K | $1.789× C^{-0.091}$ | $1.774× C^{-0.089}$ |
| 10K - 12K | $1.721× C^{-0.092}$ | $1.697× C^{-0.087}$ |
| 12K - 14K | $1.670× C^{-0.089}$ | $1.645× C^{-0.088}$ |
| 14K - 16K | $1.630× C^{-0.089}$ | $1.600× C^{-0.087}$ |
| 16K - 18K | $1.607× C^{-0.090}$ | $1.567× C^{-0.087}$ |
| 18K - 20K | $1.586× C^{-0.091}$ | $1.542× C^{-0.087}$ |
| 20K - 22K | $1.571× C^{-0.093}$ | $1.519× C^{-0.086}$ |
| 22K - 24K | $1.566× C^{-0.089}$ | $1.513× C^{-0.085}$ |
| 24K - 26K | $1.565× C^{-0.091}$ | $1.502× C^{-0.085}$ |
| 26K - 28K | $1.562× C^{-0.095}$ | $1.493× C^{-0.088}$ |
| 28K - 30K | $1.547× C^{-0.097}$ | $1.471× C^{-0.091}$ |
| 30K - 32K | $1.546× C^{-0.108}$ | $1.464× C^{-0.097}$ |
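To see what the fitted laws above imply, each row's two expressions can be evaluated at a chosen compute budget $C$; because the exponents differ more at longer positions, the gap between the two columns shrinks as $C$ grows. A small sketch using coefficients copied from the 0K-2K and 30K-32K rows (an illustration of reading the table, not a new measurement):

```python
# Evaluate a fitted scaling law L(C) = a * C**(-b) from Table 3
# at a compute budget C given in PFLOP/s-days.
def loss(a, b, C):
    return a * C ** (-b)

for C in (0.1, 1.0, 10.0):
    head = (loss(3.075, 0.078, C), loss(3.068, 0.078, C))  # 0K - 2K row
    tail = (loss(1.546, 0.108, C), loss(1.464, 0.097, C))  # 30K - 32K row
    print(f"C={C:5.1f}  0-2K: {head[0]:.3f} vs {head[1]:.3f}   "
          f"30-32K: {tail[0]:.3f} vs {tail[1]:.3f}")
```

At the head positions the two laws are essentially indistinguishable at every budget, while at 30K-32K the column with the steeper exponent closes most of its initial gap by $C = 10$, consistent with the convergence visible in the panels above.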