# MoBA: Mixture of Block Attention for Long-Context LLMs
> Co-corresponding authors: Xinyu Zhou, Jiezhong Qiu.
## Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic growth in computational complexity inherent in traditional attention mechanisms imposes a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance on complex reasoning tasks remains inadequately explored.
In this work, we propose a solution that adheres to the “less structure” principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
## 1 Introduction
The pursuit of artificial general intelligence (AGI) has driven the development of large language models (LLMs) to unprecedented scales, with the promise of handling complex tasks that mimic human cognition. A pivotal capability for achieving AGI is the ability to process, understand, and generate long sequences, which is essential for a wide range of applications, from historical data analysis to complex reasoning and decision-making processes. This growing demand for extended context processing can be seen not only in the popularity of long input prompt understanding, as showcased by models like Kimi kimi, Claude claude and Gemini reid2024gemini, but also in recent explorations of long chain-of-thought (CoT) output capabilities in Kimi k1.5 team2025kimi, DeepSeek-R1 guo2025deepseek, and OpenAI o1/o3 guan2024deliberative.
However, extending the sequence length in LLMs is non-trivial due to the quadratic growth in computational complexity associated with the vanilla attention mechanism waswani2017attention. This challenge has spurred a wave of research aimed at improving efficiency without sacrificing performance. One prominent direction capitalizes on the inherent sparsity of attention scores. This sparsity arises both mathematically — from the softmax operation, where various sparse attention patterns have been studied jiang2024minference — and biologically watson2025human, where sparse connectivity is observed in brain regions related to memory storage.
Existing approaches often leverage predefined structural constraints, such as sink-based xiao2023efficient or sliding window attention beltagy2020longformer, to exploit this sparsity. While these methods can be effective, they tend to be highly task-specific, potentially hindering the model’s overall generalizability. Alternatively, a range of dynamic sparse attention mechanisms, exemplified by Quest tang2024quest, Minference jiang2024minference, and RetrievalAttention liu2024retrievalattention, select subsets of tokens at inference time. Although such methods can reduce computation for long sequences, they do not substantially alleviate the intensive training costs of long-context models, making it challenging to scale LLMs efficiently to contexts on the order of millions of tokens. Another promising alternative has recently emerged in the form of linear attention models, such as Mamba dao2024transformers, RWKV peng2023rwkv, peng2024eagle, and RetNet sun2023retentive. These approaches replace canonical softmax-based attention with linear approximations, thereby reducing the computational overhead for long-sequence processing. However, due to the substantial differences between linear and conventional attention, adapting existing Transformer models typically incurs high conversion costs mercat2024linearizing, wang2024mamba, bick2025transformers, zhang2024lolcats or requires training entirely new models from scratch li2025minimax. More importantly, evidence of their effectiveness in complex reasoning tasks remains limited.
Consequently, a critical research question arises: How can we design a robust and adaptable attention architecture that retains the original Transformer framework while adhering to a “less structure” principle, allowing the model to determine where to attend without relying on predefined biases? Ideally, such an architecture would transition seamlessly between full and sparse attention modes, thus maximizing compatibility with existing pre-trained models and enabling both efficient inference and accelerated training without compromising performance.
Thus, we introduce Mixture of Block Attention (MoBA), a novel architecture that builds upon the innovative principles of Mixture of Experts (MoE) shazeer2017outrageously and applies them to the attention mechanism of the Transformer model. MoE has been used primarily in the feedforward network (FFN) layers of Transformers lepikhin2020gshard,fedus2022switch, zoph2022st, but MoBA pioneers its application to long-context attention, allowing dynamic selection of historically relevant blocks of keys and values for each query token. This approach not only enhances the efficiency of LLMs but also enables them to handle longer and more complex prompts without a proportional increase in resource consumption. MoBA addresses the computational inefficiency of traditional attention mechanisms by partitioning the context into blocks and employing a gating mechanism to selectively route query tokens to the most relevant blocks. This block sparse attention significantly reduces the computational cost, paving the way for more efficient processing of long sequences. The model’s ability to dynamically select the most informative blocks of keys leads to improved performance and efficiency, which is particularly beneficial for tasks involving extensive contextual information.
In this paper, we detail the architecture of MoBA, first describing its block partitioning and routing strategy, and then analyzing its computational efficiency relative to traditional attention mechanisms. We further present experimental results that demonstrate MoBA’s superior performance in tasks requiring the processing of long sequences. Our work contributes a novel approach to efficient attention computation, pushing the boundaries of what is achievable with LLMs in handling complex and lengthy inputs.
## 2 Method
[Figure 1a: two query tokens ($q_1$, $q_2$) are passed through a router (gating network) that assigns each to a distinct pair of the four key/value blocks; each query’s attention scores are then computed only over its selected blocks.]
(a)
[Figure 1b: Q and K pass through RoPE; the MoBA gating module (partition to blocks, mean pooling, MatMul, top-k gating) emits selected block indices, which index-select the K and V blocks fed, together with Q, into a varlen Flash-Attention kernel that produces the attention output.]
(b)
Figure 1: Illustration of mixture of block attention (MoBA). (a) A running example of MoBA; (b) Integration of MoBA into Flash Attention.
In this work, we introduce a novel architecture, termed Mixture of Block Attention (MoBA), which extends the capabilities of the Transformer model by dynamically selecting historical segments (blocks) for attention computation. MoBA is inspired by techniques of Mixture of Experts (MoE) and sparse attention. The former technique has been predominantly applied to the feedforward network (FFN) layers within the Transformer architecture, while the latter has been widely adopted in scaling Transformers to handle long contexts. Our method is innovative in applying the MoE principle to the attention mechanism itself, allowing for more efficient and effective processing of long sequences.
### 2.1 Preliminaries: Standard Attention in Transformer
We first revisit standard attention in Transformers. For simplicity, we consider the case where a single query token ${\bm{q}}∈ℝ^1× d$ attends to $N$ key and value tokens, denoted ${\bm{K}},{\bm{V}}∈ℝ^N× d$ , respectively. The standard attention is computed as:
$$
Attn({\bm{q}},{\bm{K}},{\bm{V}})=Softmax\left({\bm{q}}{\bm{K}}^⊤\right){\bm{V}}, \tag{1}
$$
where $d$ denotes the dimension of a single attention head. We focus on the single-head scenario for clarity. The extension to multi-head attention involves concatenating the outputs from multiple such single-head attention operations.
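As a concrete reference point, Equation 1 can be sketched in a few lines of plain Python. This is a minimal single-head, single-query sketch: list-based tensors rather than a real tensor library, and no $1/\sqrt{d}$ scaling, mirroring the simplified form of Equation 1.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn(q, K, V):
    # Eq. (1): a single query q (length-d list) attends to all N keys/values.
    # K and V are N x d nested lists; scaling is omitted, as in Eq. (1).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    probs = softmax(scores)
    d = len(q)
    return [sum(p * v[j] for p, v in zip(probs, V)) for j in range(d)]
```

With two identical keys, the output is simply the mean of the two values, which makes the sketch easy to sanity-check by hand.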
### 2.2 MoBA Architecture
Different from standard attention, where each query token attends to the entire context, MoBA enables each query token to attend to only a subset of keys and values:
$$
MoBA({\bm{q}},{\bm{K}},{\bm{V}})=Softmax\left({\bm{q}}{\bm{K}}[I]^⊤\right){\bm{V}}[I], \tag{2}
$$
where $I⊆[N]$ is the index set of the selected keys and values.
The key innovation in MoBA is the block partitioning and selection strategy. We divide the full context of length $N$ into $n$ blocks, where each block represents a subset of subsequent tokens. Without loss of generality, we assume that the context length $N$ is divisible by the number of blocks $n$ . We further denote $B=\frac{N}{n}$ to be the block size and
$$
I_i=\left[(i-1)× B+1,\; i× B\right] \tag{3}
$$
to be the range of the $i$ -th block. By applying the top- $k$ gating mechanism from MoE, we enable each query to selectively focus on a subset of tokens from different blocks, rather than the entire context:
$$
I=\bigcup_{g_i>0}I_i. \tag{4}
$$
The model employs a gating mechanism, as $g_i$ in Equation 4, to select the most relevant blocks for each query token. The MoBA gate first computes the affinity score $s_i$ measuring the relevance between query ${\bm{q}}$ and the $i$ -th block, and applies a top- $k$ gating among all blocks. More formally, the gate value for the $i$ -th block $g_i$ is computed by
$$
g_i=\begin{cases}1 & s_i∈Topk\left(\{s_j\mid j∈[n]\},k\right)\\ 0 & \text{otherwise}\end{cases}, \tag{5}
$$
where $Topk(·,k)$ denotes the set containing the $k$ highest scores among the affinity scores calculated for each block. In this work, the score $s_i$ is computed by the inner product between ${\bm{q}}$ and the mean pooling of ${\bm{K}}[I_i]$ along the sequence dimension:
$$
s_i=⟨{\bm{q}},mean\_pool({\bm{K}}[I_i])⟩ \tag{6}
$$
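Equations 3, 5, and 6 can be put together in a short pure-Python sketch. The function name `moba_gate` and the list-based tensors are our illustrative simplifications (single head, no causal handling, which the text addresses next); indices here are 0-based rather than the 1-based convention of Equation 3.

```python
def moba_gate(q, K, block_size, k):
    # Partition the keys into contiguous blocks of size B (Eq. 3),
    # mean-pool each block along the sequence dimension, score blocks by
    # inner product with q (Eq. 6), and keep the top-k blocks (Eq. 5).
    n = len(K) // block_size
    d = len(q)
    scores = []
    for i in range(n):
        block = K[i * block_size:(i + 1) * block_size]
        pooled = [sum(row[j] for row in block) / block_size for j in range(d)]
        scores.append(sum(qj * pj for qj, pj in zip(q, pooled)))
    topk = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    gates = [1 if i in topk else 0 for i in range(n)]
    return gates, scores
```

For example, with four keys split into two blocks and $k=1$, a query aligned with the first block’s keys yields gate values $(1, 0)$: only the first block is selected for attention.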
A Running Example. We provide a running example of MoBA at Figure 1a, where we have two query tokens and four KV blocks. The router (gating network) dynamically selects the top two blocks for each query to attend. As shown in Figure 1a, the first query is assigned to the first and second blocks, while the second query is assigned to the third and fourth blocks.
It is important to maintain causality in autoregressive language models, as they generate text by next-token prediction based on previous tokens. This sequential generation process ensures that a token cannot influence tokens that come before it, thus preserving the causal relationship. MoBA preserves causality through two specific designs:
Causality: No Attention to Future Blocks. MoBA ensures that a query token cannot be routed to any future block. By limiting the attention scope to current and past blocks, MoBA adheres to the autoregressive nature of language modeling. More formally, denoting $pos({\bm{q}})$ as the position index of the query ${\bm{q}}$ , we set $s_i=-∞$ and $g_i=0$ for any block $i$ such that $pos({\bm{q}})<i× B$ .
Current Block Attention and Causal Masking. We define the “current block” as the block that contains the query token itself. Routing to the current block could also violate causality, since mean pooling across the entire block can inadvertently include information from future tokens. To address this, we enforce that each token is routed to its current block and apply a causal mask during the current block attention. This strategy not only avoids any leakage of information from subsequent tokens but also encourages attention to the local context. More formally, we set $g_i=1$ for the block $i$ where the position of the query token $pos({\bm{q}})$ is within the interval $I_i$ . From the perspective of Mixture-of-Experts (MoE), the current block attention in MoBA is akin to the role of shared experts in modern MoE architectures dai2024deepseekmoe, yang2024qwen2, where static routing rules are combined with dynamic expert selection.
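The two causality rules can be sketched as follows, again in pure Python with 0-indexed positions and blocks; `moba_gates` is our hypothetical helper, not the paper’s implementation. Future blocks receive $s_i=-∞$ and $g_i=0$, the query’s current block is always selected, and top-$k$ routing runs over the remaining past blocks (the causal mask inside the current block’s attention is left to the attention kernel itself).

```python
import math

def moba_gates(q, pos, K, block_size, k):
    # Gate values for a query at position `pos` under MoBA's causality rules:
    # blocks lying entirely in the future are masked out, the current block is
    # always routed, and top-k selection runs over the remaining past blocks.
    n = len(K) // block_size
    d = len(q)
    cur = pos // block_size            # index of the block containing the query
    scores = []
    for i in range(n):
        if i * block_size > pos:       # block starts after the query: future
            scores.append(-math.inf)
            continue
        block = K[i * block_size:(i + 1) * block_size]
        pooled = [sum(row[j] for row in block) / block_size for j in range(d)]
        scores.append(sum(a * b for a, b in zip(q, pooled)))
    gates = [0] * n
    gates[cur] = 1                     # current block is always selected
    past = [i for i in range(n) if i != cur and scores[i] > -math.inf]
    for i in sorted(past, key=lambda i: scores[i], reverse=True)[:k]:
        gates[i] = 1
    return gates
```

For a query in the second of three blocks with $k=1$, the gate selects the current block plus the single past block, while the future block is always masked out.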
Next, we discuss some additional key design choices of MoBA, such as its block segmentation strategy and the hybrid of MoBA and full attention.
Fine-Grained Block Segmentation. The positive impact of fine-grained expert segmentation on model performance has been well-documented in the Mixture-of-Experts (MoE) literature dai2024deepseekmoe, yang2024qwen2. In this work, we explore the potential advantage of applying a similar fine-grained segmentation technique to MoBA. Although inspired by MoE, MoBA segments along the context-length dimension rather than the FFN intermediate hidden dimension. Therefore, our investigation aims to determine whether MoBA can also benefit when we partition the context into finer-grained blocks. More experimental results can be found in Section 3.1.
Hybrid of MoBA and Full Attention. MoBA is designed as a drop-in substitute for full attention, maintaining exactly the same number of parameters. This design allows smooth transitions between full attention and MoBA. Specifically, at the initialization stage, each attention layer can select either full attention or MoBA, and this choice can be dynamically altered during training if necessary. A similar idea of transitioning full attention to sliding window attention has been studied in previous work zhang2024simlayerkv. More experimental results can be found in Section 3.2.
Comparing to Sliding Window Attention and Attention Sink. Sliding window attention (SWA) and attention sink are two popular sparse attention architectures. We demonstrate that both can be viewed as special cases of MoBA. For sliding window attention beltagy2020longformer, each query token only attends to its neighboring tokens. This can be interpreted as a variant of MoBA with a gating network that keeps selecting the most recent blocks. Similarly, attention sink xiao2023efficient, where each query token attends to a combination of initial tokens and the most recent tokens, can be seen as a variant of MoBA with a gating network that always selects both the initial and the recent blocks. The above discussion shows that MoBA has stronger expressive power than sliding window attention and attention sink. Moreover, it shows that MoBA can flexibly approximate many static sparse attention architectures by incorporating specific gating networks.
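The reduction of SWA and attention sink to MoBA amounts to replacing the learned gating with fixed gating rules over block indices. The two sketches below (our illustrative names, operating on a 0-indexed current-block index) make this concrete:

```python
def swa_gate(n_blocks, cur, k):
    # Sliding-window attention as a MoBA gate: always select the k most
    # recent blocks, up to and including the current block `cur`.
    chosen = set(range(max(0, cur - k + 1), cur + 1))
    return [1 if i in chosen else 0 for i in range(n_blocks)]

def sink_gate(n_blocks, cur, k):
    # Attention sink as a MoBA gate: always select the initial block in
    # addition to the k most recent blocks.
    chosen = {0} | set(range(max(0, cur - k + 1), cur + 1))
    return [1 if i in chosen else 0 for i in range(n_blocks)]
```

Because both are constant (input-independent) gating functions, MoBA’s learned top-$k$ gate strictly generalizes them.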
Overall, MoBA’s attention mechanism allows the model to adaptively and dynamically focus on the most informative blocks of the context. This is particularly beneficial for tasks involving long documents or sequences, where attending to the entire context may be unnecessary and computationally expensive. MoBA’s ability to selectively attend to relevant blocks enables more nuanced and efficient processing of information.
### 2.3 Implementation
Algorithm 1 MoBA (Mixture of Block Attention) Implementation
0: Query, key and value matrices $Q,K,V∈ℝ^N× h× d$ ; MoBA hyperparameters (block size $B$ and top- $k$ ); $h$ and $d$ denote the number of attention heads and head dimension. Also denote $n=N/B$ to be the number of blocks.
1: // Split KV into blocks
2: $\{\tilde{K}_i,\tilde{V}_i\}=split\_blocks(K,V,B)$ , where $\tilde{K}_i,\tilde{V}_i∈ℝ^B× h× d, i∈[n]$
3: // Compute gating scores for dynamic block selection
4: $\bar{K}=mean\_pool(K,B)∈ℝ^n× h× d$
5: $S=Q\bar{K}^⊤∈ℝ^N× h× n$
6: // Select blocks with causal constraint (no attention to future blocks)
7: $M=create\_causal\_mask(N,n)$
8: $G=topk(S+M,k)$
9: // Organize attention patterns for computation efficiency
10: $Q^s,\tilde{K}^s,\tilde{V}^s=get\_self\_attn\_block(Q,\tilde{K},\tilde{V})$
11: $Q^m,\tilde{K}^m,\tilde{V}^m=index\_select\_moba\_attn\_block(Q,\tilde{K},\tilde{V},G)$
12: // Compute attentions separately
13: $O^s=flash\_attention\_varlen(Q^s,\tilde{K}^s,\tilde{V}^s,causal=True)$
14: $O^m=flash\_attention\_varlen(Q^m,\tilde{K}^m,\tilde{V}^m,causal=False)$
15: // Combine results with online softmax
16: $O=combine\_with\_online\_softmax(O^s,O^m)$
17: return $O$
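The final combine step of Algorithm 1 merges the self-attention output and the routed-block output with online softmax: each partial attention keeps its running max and sum-of-exponentials, so the two can be rescaled into one correctly normalized result. A minimal pure-Python sketch (helper names are ours; the real kernel operates tile-wise inside Flash-Attention):

```python
import math

def partial_attn(q, K, V):
    # Unnormalized attention over one chunk of keys/values: returns the
    # exp-weighted value sum, plus the running max m and the local
    # sum-of-exponentials z needed for later rescaling.
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    d = len(q)
    out = [sum(e * v[j] for e, v in zip(exps, V)) for j in range(d)]
    return out, m, z

def combine_online_softmax(o1, m1, z1, o2, m2, z2):
    # Merge two partial results (e.g. the current-block output O^s and the
    # routed-block output O^m) into one correctly normalized attention output.
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    z = a1 * z1 + a2 * z2
    return [(a1 * x + a2 * y) / z for x, y in zip(o1, o2)]
```

Splitting the keys/values into two chunks, computing each chunk with `partial_attn`, and merging with `combine_online_softmax` reproduces the full softmax attention output exactly, which is why the two varlen Flash-Attention calls in Algorithm 1 can be computed separately and combined afterwards.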
[Figure: computation time (ms) vs. sequence length (32K–1M, log scale) for MoBA vs. Flash Attention; Flash Attention grows super-linearly (roughly 940 ms at 1M tokens) while MoBA grows roughly linearly (roughly 150 ms at 1M tokens), a gap of over 6x at the longest length.]
(a)
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Computation Time vs. Sequence Length for Flash Attention and MoBA
Line chart comparing computation time (s) for Flash Attention (light blue, dashed) and MoBA (dark blue, dashed) as sequence length grows from 1M to 10M, with an inset zooming in on 32K–512K (both methods are sub-second there, e.g. ~0.2 s vs. ~0.01 s at 512K). On the main axes, Flash Attention grows super-linearly (≈0.5 s at 1M, ≈15 s at 4M, ≈80–90 s at 10M) while MoBA stays near-linear (≈0.1 s at 1M, ≈1 s at 4M, ≈5 s at 10M), making MoBA more than an order of magnitude faster at 10M tokens.
</details>
(b)
Figure 2: Efficiency of MoBA vs. full attention (implemented with Flash Attention). (a) 1M model speedup evaluation: computation time scaling of MoBA versus Flash Attention on a 1M model with increasing sequence lengths (8K–1M). (b) Fixed sparsity ratio scaling: computation time scaling comparison between MoBA and Flash Attention across increasing sequence lengths (8K–10M), maintaining a constant sparsity ratio of $95.31\%$ (fixed 64 MoBA blocks with varying block size and fixed top-k = 3).
We provide a high-performance implementation of MoBA by incorporating optimization techniques from FlashAttention dao2022flashattention and MoE rajbhandari2022deepspeed. Figure 2 demonstrates the high efficiency of MoBA; we defer the detailed experiments on efficiency and scalability to Section 3.4. Our implementation consists of five major steps:
- Determine the assignment of query tokens to KV blocks according to the gating network and causal mask.
- Arrange the ordering of query tokens based on their assigned KV blocks.
- Compute attention outputs for each KV block and the query tokens assigned to it. This step can be optimized with variable-length (varlen) FlashAttention kernels.
- Re-arrange the attention outputs back to their original ordering.
- Combine the corresponding attention outputs using online Softmax (i.e., tiling), as a query token may attend to its current block and multiple historical KV blocks.
The algorithmic workflow is formalized in Algorithm 1 and visualized in Figure 1b, illustrating how MoBA can be implemented based on MoE and FlashAttention. First, the KV matrices are partitioned into blocks (Lines 1-2). Next, the gating score is computed according to Equation 6, which measures the relevance between query tokens and KV blocks (Lines 3-7). A top-$k$ operator is applied to the gating scores (together with a causal mask), resulting in a sparse query-to-KV-block mapping matrix ${\bm{G}}$ that represents the assignment of queries to KV blocks (Line 8). Then, query tokens are arranged based on the query-to-KV-block mapping, and block-wise attention outputs are computed (Lines 9-12). Notably, attention to historical blocks (Lines 11 and 14) and attention to the current block (Lines 10 and 13) are computed separately, as additional causality must be maintained within the current block. Finally, the attention outputs are rearranged back to their original ordering and combined with online softmax (Line 16) milakov2018onlinenormalizercalculationsoftmax,liu2023blockwiseparalleltransformerlarge.
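As a concrete illustration of the workflow above, the following is a minimal single-head NumPy sketch of MoBA's forward pass. It is not the paper's fused implementation: the query-sorting, varlen-FlashAttention kernels, and online-softmax combination (Steps 2-5) are replaced by a naive per-query gather for clarity, mean pooling serves as the block-level gating feature, and, following the top-k = 3 convention described later, the current block is always attended (causally) and counts toward top-k. All names here are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moba_attention(q, k, v, block_size, top_k):
    """Single-head MoBA sketch: each query attends to the causal prefix of its
    own block plus the (top_k - 1) highest-scoring historical KV blocks."""
    n, d = q.shape
    n_blocks = n // block_size
    kb = k.reshape(n_blocks, block_size, d)  # partition KV into blocks
    vb = v.reshape(n_blocks, block_size, d)
    gate_keys = kb.mean(axis=1)              # mean-pooled key per block (gating feature)
    out = np.zeros_like(q)
    for i in range(n):
        cur = i // block_size
        scores = q[i] @ gate_keys.T          # gating score for each KV block
        scores[cur:] = -np.inf               # causal mask: no current/future blocks via gate
        hist = [b for b in np.argsort(scores)[::-1][:top_k - 1]
                if np.isfinite(scores[b])]   # top-(k-1) historical blocks
        # gather selected historical KV plus the causal prefix of the current block
        kcat = np.concatenate([kb[b] for b in hist] + [k[cur * block_size : i + 1]])
        vcat = np.concatenate([vb[b] for b in hist] + [v[cur * block_size : i + 1]])
        w = softmax(q[i] @ kcat.T / np.sqrt(d))
        out[i] = w @ vcat
    return out
```

With top-k set to the total number of blocks, the sketch reduces exactly to full causal attention, mirroring the paper's point that MoBA can transition seamlessly between sparse and full attention.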
## 3 Experiments
### 3.1 Scaling Law Experiments and Ablation Studies
In this section, we conduct scaling law experiments and ablation studies to validate some key design choices of MoBA.
| Model size | # Layers | # Heads | Hidden size | Training tokens | Block size | Top-k |
| --- | --- | --- | --- | --- | --- | --- |
| 568M | 14 | 14 | 1792 | 10.8B | 512 | 3 |
| 822M | 16 | 16 | 2048 | 15.3B | 512 | 3 |
| 1.1B | 18 | 18 | 2304 | 20.6B | 512 | 3 |
| 1.5B | 20 | 20 | 2560 | 27.4B | 512 | 3 |
| 2.1B | 22 | 22 | 2816 | 36.9B | 512 | 3 |
Table 1: Configuration of Scaling Law Experiments
<details>
<summary>x5.png Details</summary>

### Visual Description
## Log-Log Plot: LM Loss vs. Training Compute
Log-log plot of LM loss (seqlen=8K) against training compute (PFLOP/s-days, roughly 0.1 to 10). Solid curves show the five training runs; dashed lines show the fitted power-law projections for MoBA (blue) and full attention (red). All series follow a clean downward power law, the two projections track each other closely across the plotted range, and the empirical curves cluster more tightly around them as compute increases.
</details>
(a)
<details>
<summary>x6.png Details</summary>

### Visual Description
## Log-Log Plot: Trailing LM Loss vs. Training Compute
Log-log plot of trailing LM loss (seqlen=32K, last 2K tokens) against compute (PFLOP/s-days). Each model size contributes a paired MoBA (blue, dashed) and full attention (red, dashed) line, and all pairs slope downward as power laws. Full attention sits consistently slightly below MoBA at equal compute, but the gap narrows toward the high-compute end, where the paired lines converge.
</details>
(b)
| Metric | MoBA | Full Attention |
| --- | --- | --- |
| LM loss (seqlen=8K) | $2.625\times C^{-0.063}$ | $2.622\times C^{-0.063}$ |
| Trailing LM loss (seqlen=32K, last 2K) | $1.546\times C^{-0.108}$ | $1.464\times C^{-0.097}$ |
(c)
Figure 3: Scaling law comparison between MoBA and full attention. (a) LM loss on the validation set (seqlen=8K); (b) trailing LM loss on the validation set (seqlen=32K, last 2K tokens); (c) fitted scaling-law curves.
#### Scalability w.r.t. LM Loss.
To assess the effectiveness of MoBA, we perform scaling law experiments comparing the validation loss of language models trained with either full attention or MoBA. Following the Chinchilla scaling law hoffmann2022training, we train five language models of varying sizes with a sufficient number of training tokens to ensure that each model reaches its training optimum. Detailed configurations of the scaling law experiments can be found in Table 1. Both MoBA and full attention models are trained with a sequence length of 8K. For MoBA models, we set the block size to 512 and select the top-3 blocks for attention, resulting in a sparse attention pattern with sparsity up to $1-\frac{512\times 3}{8192}=81.25\%$ (since top-k = 3, each query token can attend to at most two history blocks plus its current block). In particular, MoBA serves as a drop-in alternative to full attention: it neither introduces new parameters nor removes existing ones. This design simplifies the comparison, as the only difference across all experiments lies in the attention modules, while all other hyperparameters, including the learning rate and batch size, remain constant. As shown in Figure 3a, the validation loss curves for MoBA and full attention display very similar scaling trends. Specifically, the validation loss differences between the two attention mechanisms remain within a range of 1e-3. This suggests that MoBA achieves scaling performance comparable to full attention, despite its sparse attention pattern with sparsity up to 81.25%.
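The sparsity figures used throughout this section follow from a one-line calculation; a quick sanity check for both the 8K setting here and the 32K setting used in the long-context experiments:

```python
block_size, top_k = 512, 3  # MoBA configuration used in the scaling law runs

# 8K pretraining setting
sparsity_8k = 1 - block_size * top_k / 8192
print(sparsity_8k)   # 0.8125

# 32K long-context setting
sparsity_32k = 1 - block_size * top_k / 32768
print(sparsity_32k)  # 0.953125
```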
#### Long Context Scalability.
However, LM loss may be skewed by the data length distribution an2024does, which is typically dominated by short sequences. To fully assess the long-context capability of MoBA, we evaluate the LM loss of trailing tokens (trailing LM loss, for short), i.e., the LM loss over the last few tokens of each sequence. We count this loss only for sequences that reach the maximum sequence length, to avoid biases that may arise from very short sequences. A detailed discussion of trailing-token scaling can be found in Appendix A.1.
These metrics provide insight into the model's ability to generate the final portion of a sequence, which is particularly informative for tasks involving long-context understanding. We therefore adopt a modified experimental setting, increasing the maximum sequence length from 8K to 32K. This adjustment leads to an even sparser attention pattern for MoBA, with sparsity up to $1-\frac{512\times 3}{32768}=95.31\%$. As shown in Figure 3b, although MoBA exhibits a marginally higher trailing LM loss than full attention in all five experiments, the loss gap progressively narrows. This experiment demonstrates the long-context scalability of MoBA.
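The trailing-loss bookkeeping described above can be sketched as follows; `trailing_lm_loss` is a hypothetical helper assuming per-token losses are already computed, and it skips sequences shorter than the maximum length, as the text specifies.

```python
import numpy as np

def trailing_lm_loss(per_token_losses, max_seqlen, trailing):
    """Average LM loss over the last `trailing` tokens, counting only
    sequences that reach `max_seqlen` (shorter ones would bias the metric)."""
    tails = [np.asarray(seq)[-trailing:]
             for seq in per_token_losses if len(seq) >= max_seqlen]
    if not tails:
        return float("nan")
    return float(np.mean(np.concatenate(tails)))

# the full-length sequence contributes its last 2 tokens; the short one is skipped
print(trailing_lm_loss([[1.0] * 8, [9.0] * 4], max_seqlen=8, trailing=2))  # 1.0
```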
#### Ablation Study on Fine-Grained Block Segmentation.
We further ablate the block granularity of MoBA through a series of experiments using a 1.5B-parameter model with a 32K context length. The block size and top-k hyperparameters are adjusted jointly to maintain a consistent level of attention sparsity. Specifically, we divide the 32K context into 8, 16, 32, 64, and 128 blocks, and correspondingly select 2, 4, 8, 16, and 32 blocks, ensuring an attention sparsity of 75% across all configurations. As shown in Figure 4, MoBA's performance is significantly affected by block granularity: there is a performance gap of about 1e-2 between the coarsest-grained setting (selecting 2 blocks from 8) and the finer-grained settings. These findings suggest that fine-grained segmentation is a general technique for enhancing the performance of models in the MoE family, including MoBA.
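The ablation grid can be checked numerically; each (top-k, #blocks) pairing from the text keeps sparsity pinned at 75% while the block size shrinks by 2× per step:

```python
seqlen = 32 * 1024
configs = [(2, 8), (4, 16), (8, 32), (16, 64), (32, 128)]  # (top-k, #blocks)
for top_k, n_blocks in configs:
    block_size = seqlen // n_blocks
    sparsity = 1 - top_k * block_size / seqlen
    print(f"top-{top_k} of {n_blocks} blocks (block_size={block_size}): "
          f"sparsity={sparsity:.2f}")
# every line reports sparsity=0.75; block_size runs 4096, 2048, 1024, 512, 256
```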
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: LM Loss vs. MoBA Block Segmentation Settings
Line chart of validation LM loss against MoBA segmentation settings (top-k/#blocks: 2/8, 4/16, 8/32, 16/64, 32/128), with a flat full-attention baseline at ≈2.242 (red, dashed) for reference. The coarsest setting (2/8) performs worst (≈2.258); all finer settings fall within about 1e-3 of the baseline, with the minimum (≈2.240) at 8/32, showing that MoBA converges to full-attention quality as granularity increases.
</details>
Figure 4: Fine-grained block segmentation: LM loss on the validation set vs. MoBA with different block granularities.
### 3.2 Hybrid of MoBA and Full Attention
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: LM Loss vs. Position for Three Attention Mechanisms
### Overview
The image is a line chart comparing the performance of three different attention mechanisms—MoBA/Full Hybrid, MoBA, and Full Attention—by plotting their Language Model (LM) Loss against sequence Position. The chart demonstrates how the loss decreases for all three methods as the position index increases, with all three following a very similar, steeply decaying curve.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Title:** `Position`
* **Scale:** Linear, from `0K` to `30K`.
* **Major Tick Marks:** `0K`, `5K`, `10K`, `15K`, `20K`, `25K`, `30K`.
* **Y-Axis:**
* **Title:** `LM Loss`
* **Scale:** Linear, from `1.2` to `3.0`.
* **Major Tick Marks:** `1.2`, `1.4`, `1.6`, `1.8`, `2.0`, `2.2`, `2.4`, `2.6`, `2.8`, `3.0`.
* **Legend:** Located in the top-right corner of the plot area.
* **Green line with circle markers:** `MoBA/Full Hybrid`
* **Blue line with diamond markers:** `MoBA`
* **Red dashed line with square markers:** `Full Attention`
* **Background:** A light gray grid is present for easier value estimation.
### Detailed Analysis
**Trend Verification:** All three data series exhibit the same fundamental trend: a sharp, convex decay in LM Loss as Position increases. The loss is highest at the beginning of the sequence (Position 0K) and decreases rapidly, with the rate of decrease slowing significantly after approximately Position 10K.
**Data Point Extraction (Approximate Values):**
* **At Position 0K:** All three lines start at approximately the same point, with an LM Loss of ~`2.8`.
* **At Position 5K:** Loss has dropped sharply to ~`1.85` for all series.
* **At Position 10K:** Loss is approximately `1.55`.
* **At Position 15K:** Loss is approximately `1.45`.
* **At Position 20K:** Loss is approximately `1.38`.
* **At Position 25K:** Loss is approximately `1.34`.
* **At Position 30K:** Loss is approximately `1.32`.
**Cross-Referencing Legend with Lines:**
* The **green line (MoBA/Full Hybrid)** consistently appears as the lowest line on the chart across all positions, though the difference is very small.
* The **blue line (MoBA)** is generally in the middle.
* The **red dashed line (Full Attention)** is generally the highest of the three, but again, the separation is minimal.
* The lines are so close that at many points, especially beyond Position 15K, the markers overlap significantly.
### Key Observations
1. **Uniform Decay Pattern:** The primary observation is the nearly identical performance trajectory of all three methods. The choice between MoBA, Full Attention, or their hybrid appears to have a negligible impact on the LM Loss vs. Position curve in this specific evaluation.
2. **Steep Initial Improvement:** The most significant reduction in loss occurs within the first 10,000 positions, after which the curves flatten into a long tail of gradual improvement.
3. **Minimal Differentiation:** While the `MoBA/Full Hybrid` (green) line is visually the lowest, the practical difference between the three methods is extremely small, likely within a margin of error for such measurements. The lines converge further as position increases.
### Interpretation
This chart suggests that for the task or model being evaluated, the core attention mechanism (whether it's the "Full Attention" baseline, the "MoBA" variant, or a hybrid) is not the dominant factor influencing how language model loss evolves with sequence position. The universal decay pattern indicates that all tested methods are equally effective at capturing the necessary contextual information to reduce prediction error as the sequence grows.
The steep initial drop implies that the most critical contextual dependencies for the model are established relatively early in the sequence. The long, flat tail suggests that beyond a certain context length (~10K-15K tokens), additional positions yield diminishing returns in loss reduction for this specific setup.
From a technical standpoint, the key takeaway is the **robustness of the loss-position relationship** across these architectural variants. If the goal is to optimize LM Loss with respect to sequence position, efforts might be better directed toward other factors (e.g., data quality, training regimen, or model scale) rather than fine-tuning between these specific attention mechanisms, as they yield functionally equivalent results in this metric. The chart serves as evidence that MoBA and its hybrid are viable alternatives to Full Attention, achieving comparable performance.
</details>
(a)
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Graph: LM Loss vs. Number of Hybrid Full Layers
### Overview
The image is a line graph comparing the Language Model (LM) Loss of three different model architectures or attention mechanisms as a function of the number of "Hybrid Full Layers" used. The graph demonstrates how the loss metric changes for one method ("Layer-wise Hybrid") as a specific parameter increases, while the other two methods ("Full Attention" and "MoBA") serve as constant baselines.
### Components/Axes
* **Chart Type:** Line graph with markers.
* **X-Axis:**
* **Title:** "Number of Hybrid Full Layers"
* **Scale/Markers:** Categorical with four discrete points: "1 layer", "3 layer", "5 layer", "10 layer".
* **Y-Axis:**
* **Title:** "LM Loss"
* **Scale:** Linear, ranging from approximately 1.075 to 1.145. Major tick marks are at 1.08, 1.09, 1.10, 1.11, 1.12, 1.13, and 1.14.
* **Legend:**
* **Position:** Center-right of the plot area.
* **Series 1:** "Layer-wise Hybrid" - Represented by a blue dashed line with circular markers.
* **Series 2:** "Full Attention" - Represented by a solid red line.
* **Series 3:** "MoBA" - Represented by a solid gray line.
* **Background:** White with a light gray grid.
### Detailed Analysis
**1. Layer-wise Hybrid (Blue Dashed Line with Circles):**
* **Trend:** Shows a clear, consistent downward (improving) trend as the number of hybrid full layers increases.
* **Data Points (Approximate):**
* At **1 layer**: LM Loss ≈ 1.136
* At **3 layer**: LM Loss ≈ 1.118
* At **5 layer**: LM Loss ≈ 1.109
* At **10 layer**: LM Loss ≈ 1.077
* **Visual Check:** The line slopes downward from left to right. The blue color and circular markers match the legend entry for "Layer-wise Hybrid".
**2. Full Attention (Solid Red Line):**
* **Trend:** Perfectly horizontal (constant). This indicates its performance is independent of the "Number of Hybrid Full Layers" parameter, serving as a fixed baseline.
* **Value:** LM Loss is constant at approximately **1.075**. This is the lowest (best) loss value on the chart.
* **Visual Check:** The red line is positioned at the very bottom of the plot area, matching the "Full Attention" legend.
**3. MoBA (Solid Gray Line):**
* **Trend:** Perfectly horizontal (constant). Also serves as a fixed baseline.
* **Value:** LM Loss is constant at approximately **1.145**. This is the highest (worst) loss value on the chart.
* **Visual Check:** The gray line is positioned at the very top of the plot area, matching the "MoBA" legend.
### Key Observations
1. **Inverse Relationship:** For the "Layer-wise Hybrid" method, there is a strong inverse relationship between the number of hybrid full layers and LM Loss. More layers lead to significantly lower loss.
2. **Performance Convergence:** The "Layer-wise Hybrid" method's performance improves from being worse than "Full Attention" but better than "MoBA" at 1 layer, to nearly matching the "Full Attention" baseline at 10 layers.
3. **Baseline Spread:** There is a substantial performance gap (≈0.07 in LM Loss) between the two constant baselines, "Full Attention" (best) and "MoBA" (worst).
4. **Diminishing Returns:** The per-layer rate of improvement for "Layer-wise Hybrid" slows slightly: the drop from 1 to 3 layers is ≈0.018 over 2 layers (≈0.009 per layer), while the drop from 5 to 10 layers is ≈0.032 over 5 layers (≈0.006 per layer).
### Interpretation
This graph presents a technical evaluation likely from a machine learning research paper. It demonstrates the efficacy of a "Layer-wise Hybrid" attention mechanism.
* **What the data suggests:** The "Layer-wise Hybrid" approach is a tunable method where increasing a specific architectural parameter (hybrid full layers) directly improves model performance (reduces LM Loss). Its goal appears to be to approximate the performance of the "Full Attention" mechanism, which is often considered a gold standard but may be computationally expensive.
* **How elements relate:** The "Full Attention" and "MoBA" lines act as critical reference points. They establish the performance ceiling (Full Attention) and floor (MoBA) for this experiment. The "Layer-wise Hybrid" line is the variable under test, showing its trajectory between these bounds.
* **Notable implications:** The key takeaway is that the "Layer-wise Hybrid" method is effective and scalable. At 10 layers, it achieves a loss nearly identical to "Full Attention," suggesting it could be a viable, potentially more efficient alternative. The constant, poor performance of "MoBA" highlights it as an inferior method in this specific comparison. The graph argues for the value of increasing hybrid layers in this architecture.
</details>
(b)
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Graph: Comparison of LM Trailing Loss Across Attention Mechanisms
### Overview
The image is a line graph comparing the performance of three different attention mechanisms in a language model, measured by trailing loss on a specific dataset. The graph plots loss against the number of hybrid full layers used in one of the methods.
### Components/Axes
* **Chart Type:** Line graph with markers.
* **X-Axis:**
* **Title:** "Number of Hybrid Full Layers"
* **Scale/Markers:** Categorical with four discrete points: "1layer", "3layer", "5layer", "10layer".
* **Y-Axis:**
* **Title:** "LM trailing loss (wepile-30K, last 2K)"
* **Scale:** Linear, ranging from approximately 1.08 to 1.18. Major tick marks are at 1.10, 1.12, 1.14, 1.16.
* **Legend:**
* **Placement:** Center-right of the plot area.
* **Series 1:** "Layer-wise Hybrid" - Represented by a blue dashed line with circular markers.
* **Series 2:** "Full Attention" - Represented by a solid red line.
* **Series 3:** "MoBA" - Represented by a solid gray line.
### Detailed Analysis
The graph displays three distinct data series:
1. **Layer-wise Hybrid (Blue Dashed Line):**
* **Trend:** Shows a clear, steep downward slope, indicating that loss decreases significantly as the number of hybrid full layers increases.
* **Data Points (Approximate):**
* At 1 layer: Loss ≈ 1.170
* At 3 layers: Loss ≈ 1.128
* At 5 layers: Loss ≈ 1.102
* At 10 layers: Loss ≈ 1.087
* **Spatial Grounding:** The line starts at the top-left of the plotted data and descends towards the bottom-right, converging with the red line at the 10-layer mark.
2. **Full Attention (Red Solid Line):**
* **Trend:** A perfectly horizontal line, indicating constant performance regardless of the "Number of Hybrid Full Layers" parameter (which likely does not apply to this baseline method).
* **Data Point (Approximate):** Constant loss ≈ 1.085 across all x-axis categories.
* **Spatial Grounding:** This line runs along the very bottom of the chart, serving as the performance baseline.
3. **MoBA (Gray Solid Line):**
* **Trend:** A perfectly horizontal line, indicating constant performance.
* **Data Point (Approximate):** Constant loss ≈ 1.170 across all x-axis categories.
* **Spatial Grounding:** This line runs along the very top of the chart, representing the highest (worst) loss value shown.
### Key Observations
* The "Layer-wise Hybrid" method's performance improves dramatically with more hybrid layers, moving from a loss value similar to "MoBA" at 1 layer to a value nearly matching "Full Attention" at 10 layers.
* "Full Attention" represents the lowest (best) loss on the chart, serving as a performance target.
* "MoBA" represents the highest (worst) loss and is unaffected by the hybrid layer parameter.
* The most significant performance gain for "Layer-wise Hybrid" occurs between 1 and 3 layers (a drop of ~0.042). The rate of improvement slows as more layers are added.
### Interpretation
This graph demonstrates the efficacy of the "Layer-wise Hybrid" attention mechanism. The data suggests that by increasing the number of full attention layers within a hybrid model, one can systematically reduce language modeling loss, approaching the performance of a full attention model. This is likely a trade-off between computational cost (full attention is expensive) and model performance.
The flat lines for "Full Attention" and "MoBA" indicate they are static baselines in this experiment. "Full Attention" is the gold standard for performance but is computationally intensive. "MoBA" (here, the pure variant with no hybrid full attention layers) performs poorly on this specific metric ("trailing loss on the last 2K tokens of wepile-30K"). The "Layer-wise Hybrid" approach offers a tunable middle ground, where performance can be scaled by allocating more resources (hybrid layers) to full attention computation. The convergence at 10 layers implies that with sufficient hybrid layers, the hybrid model can match full attention's quality on this task.
</details>
(c)
Figure 5: Hybrid of MoBA and full attention. (a) position-wise LM loss for MoBA, full attention, and MoBA/full hybrid training; (b) SFT LM loss w.r.t the number of full attention layers in layer-wise hybrid; (c) SFT trailing LM loss (seqlen=32K, last 2K) w.r.t the number of full attention layers in layer-wise hybrid.
As discussed in Section 2, we design MoBA as a flexible substitute for full attention, so that it can easily switch from/to full attention with minimal overhead and achieve comparable long-context performance. In this section, we first show that seamlessly transitioning between full attention and MoBA can be a solution for efficient long-context pre-training. We then discuss the layer-wise hybrid strategy, mainly for improving the performance of supervised fine-tuning (SFT).
#### MoBA/Full Hybrid Training.
We train three models, each with 1.5B parameters, on 30B tokens with a context length of 32K tokens. For the hyperparameters of MoBA, the block size is set to 2048, and the top-k parameter is set to 3. The detailed training recipes are as follows:
- MoBA/full hybrid: This model is trained using a two-stage recipe. In the first stage, MoBA is used to train on 90% of the tokens. In the second stage, the model switches to full attention for the remaining 10% of the tokens.
- Full attention: This model is trained using full attention throughout the entire training.
- MoBA: This model is trained exclusively using MoBA.
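The two-stage hybrid recipe above amounts to a simple token-budget switch. A minimal sketch (the 90%/10% split and the 30B-token budget come from the recipe above; the function name and step granularity are our own illustration, not the MoBA codebase):

```python
# Minimal sketch of the MoBA/full hybrid schedule: train with MoBA for
# the first 90% of the token budget, then switch to full attention.
# (Illustrative only; names are assumed, not from the MoBA codebase.)

def attention_mode(tokens_seen: int,
                   total_tokens: int = 30_000_000_000,
                   moba_fraction: float = 0.9) -> str:
    """Return the attention variant to use at this point in training."""
    return "moba" if tokens_seen < moba_fraction * total_tokens else "full"

print(attention_mode(0))               # moba
print(attention_mode(27_000_000_000))  # full (the last 10% of tokens)
```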
We evaluate their long-context performance via position-wise language model (LM) loss, a fine-grained metric that evaluates the LM loss at each position within a sequence. Unlike the vanilla LM loss, which is computed by averaging the LM loss across all positions, the position-wise LM loss breaks down the loss for each position separately. Similar metrics have been suggested by previous studies xiong2023effectivelongcontextscalingfoundation,reid2024gemini, which observed that position-wise LM loss follows a power-law trend relative to context length. As shown in Figure 5a, the MoBA-only recipe results in higher position-wise losses for trailing tokens. Importantly, our MoBA/full hybrid recipe reaches a loss nearly identical to that of full attention. This result highlights the effectiveness of the MoBA/full hybrid training recipe in balancing training efficiency with model performance. More interestingly, we have not observed significant loss spikes during the switch between MoBA and full attention, again demonstrating the flexibility and robustness of MoBA.
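Position-wise LM loss can be computed by averaging per-token cross-entropy over the batch while keeping the position dimension intact. A minimal NumPy sketch (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def position_wise_lm_loss(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """logits: (batch, seq, vocab) unnormalized scores; targets: (batch, seq) ints.
    Returns a (seq,) array: mean negative log-likelihood at each position."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    b, s = targets.shape
    # Gather the log-probability of each target token.
    nll = -logp[np.arange(b)[:, None], np.arange(s)[None, :], targets]
    return nll.mean(axis=0)  # average over the batch only, not over positions
```

Averaging this vector over positions recovers the vanilla LM loss; plotting it against position produces curves like those in Figure 5a.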
#### Layer-wise Hybrid.
This flexibility of MoBA encourages us to explore a more sophisticated strategy: a layer-wise hybrid of MoBA and full attention. We investigate this strategy with a particular focus on its application during supervised fine-tuning (SFT). The motivation stems from our observation that MoBA sometimes results in suboptimal performance during SFT, as shown in Figure 5b. We speculate that this may be attributed to the loss masking employed in SFT: prompt tokens are typically excluded from the loss calculation, which can pose a sparse-gradient challenge for sparse attention methods like MoBA, because it may hinder the backpropagation of gradients, which are initially computed from the unmasked tokens, throughout the entire context. To address this issue, we propose a hybrid approach: switching the last several Transformer layers from MoBA to full attention, while the remaining layers continue to employ MoBA. As shown in Figure 5b and Figure 5c, this strategy significantly reduces SFT loss.
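The layer-wise hybrid amounts to a per-layer choice of attention variant. A minimal sketch (function and label names are our own, assumed for illustration) that keeps the last few layers on full attention:

```python
# Assign an attention variant to each Transformer layer: the last
# `num_full` layers use full attention, the rest use MoBA.
# (Illustrative sketch; names are assumed, not from the MoBA codebase.)

def layer_attention_plan(num_layers: int, num_full: int) -> list:
    assert 0 <= num_full <= num_layers
    return ["moba"] * (num_layers - num_full) + ["full"] * num_full

# E.g., a 32-layer model keeping its last 3 layers on full attention:
print(layer_attention_plan(32, 3).count("moba"))  # 29
```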
### 3.3 Large Language Modeling Evaluation
<details>
<summary>x11.png Details</summary>

### Visual Description
## Diagram: LLM Training Pipeline Flowchart
### Overview
The image displays a flowchart illustrating a multi-stage training pipeline for a large language model (LLM). The process begins with a base model and progresses through phases of continued pretraining at increasing context lengths, followed by supervised fine-tuning (SFT) at varying context lengths, culminating in a final "Instruct Model." The flow is directional, moving from left to right in the top row, then down, and finally from right to left in the bottom row.
### Components/Axes
The diagram consists of eight rectangular boxes with rounded corners, connected by directional arrows. Each box contains text describing a model or a training stage. The arrows indicate the sequence and flow of the process.
**Box Content (in order of flow):**
1. **Top Row, Leftmost:** `LLama3.1 8B`
2. **Top Row, Second:** `256K Continue Pretrain`
3. **Top Row, Third:** `512K Continue Pretrain`
4. **Top Row, Rightmost:** `1M Continue Pretrain`
5. **Bottom Row, Rightmost:** `32K SFT`
6. **Bottom Row, Second from Right:** `256K SFT`
7. **Bottom Row, Third from Right:** `1M SFT`
8. **Bottom Row, Leftmost:** `Instruct Model`
**Flow Direction:**
* The top row flows sequentially from left to right: Box 1 → Box 2 → Box 3 → Box 4.
* A single arrow points downward from Box 4 (top-right) to Box 5 (bottom-right).
* The bottom row flows sequentially from right to left: Box 5 → Box 6 → Box 7 → Box 8.
### Detailed Analysis
The pipeline describes a two-phase training process:
**Phase 1: Continued Pretraining (Top Row)**
* **Starting Point:** The base model is `LLama3.1 8B`.
* **Process:** The model undergoes "Continue Pretrain" in three successive stages.
* **Key Variable:** The context length (likely measured in tokens) increases at each stage: `256K` → `512K` → `1M`. This suggests the model is being progressively adapted to handle much longer sequences of text.
**Phase 2: Supervised Fine-Tuning (SFT) (Bottom Row)**
* **Process:** Following the final pretraining stage, the model enters a series of "SFT" (Supervised Fine-Tuning) stages.
* **Key Variable:** The context length varies during fine-tuning: `32K` → `256K` → `1M`. The sequence starts with a shorter context (`32K`) and increases back to the maximum (`1M`).
* **End Product:** The final output of the entire pipeline is labeled `Instruct Model`, indicating a model fine-tuned to follow instructions.
### Key Observations
1. **Context Length Scaling:** The primary technical detail being communicated is the scaling of the model's context window. The pipeline explicitly shows training at 256K, 512K, and 1 million (1M) token contexts.
2. **Two-Phase Structure:** The process is clearly divided into a pretraining extension phase and a fine-tuning phase.
3. **Non-Linear SFT Context:** While pretraining context length increases monotonically, the SFT phase starts at a lower context (`32K`) before scaling back up. This could indicate a training strategy where the model is first fine-tuned on shorter, potentially higher-quality instruction data before being adapted to the full long-context capability.
4. **Model Origin:** The starting point is specified as `LLama3.1 8B`, identifying the base model architecture and size (8 billion parameters).
### Interpretation
This flowchart documents a sophisticated training recipe for creating a long-context instruction-following model. The process suggests that simply pretraining a model on long sequences is not sufficient. To create a usable "Instruct Model," a dedicated fine-tuning phase is required, which itself involves training at multiple context lengths.
The sequence implies a logical progression:
1. **Foundation:** Start with a capable base model (`LLama3.1 8B`).
2. **Capability Expansion:** Systematically extend its core ability to process very long documents (up to 1M tokens) through continued pretraining.
3. **Alignment & Refinement:** Fine-tune the model to follow instructions, a process that also involves re-exposing it to varying context lengths, possibly to ensure the instruction-following behavior is robust across the entire supported context window.
The diagram serves as a high-level technical specification, answering the question: "What were the key stages and context lengths used to train this specific instruct model?" It provides a reproducible blueprint for the training pipeline.
</details>
Figure 6: The continual pre-training and SFT recipes.
We conduct a thorough assessment of MoBA across a variety of real-world downstream tasks, evaluating its performance in comparison to full attention models. For ease of verification, our experiments begin with the Llama 3.1 8B Base Model, which is used as the starting point for long-context pre-training. This model, termed Llama-8B-1M-MoBA, is initially trained with a context length of 128K tokens, and we gradually increase the context length to 256K, 512K, and 1M tokens during continual pre-training. To ease this transition, we apply the position interpolation method chen2023extendingcontextwindowlarge at the start of the 256K continual pre-training stage. This technique enables us to extend the effective context length from 128K tokens to 1M tokens. After completing the 1M continual pre-training, MoBA is activated for 100B tokens. We set the block size to 4096 and the top-k parameter to 12, leading to an attention sparsity of up to $1-\frac{4096\times 12}{1\mathrm{M}}=95.31\%$. To preserve some full attention capability, we adopt the layer-wise hybrid strategy: the last three layers remain full attention, while the remaining 29 layers are switched to MoBA. For supervised fine-tuning, we follow a similar strategy that gradually increases the context length from 32K to 1M. The baseline full attention model (termed Llama-8B-1M-Full) also follows a similar training strategy, as shown in Figure 6, with the only difference being the use of full attention throughout the process. This approach allows us to directly compare the performance of MoBA with that of full attention models under equivalent training conditions.
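The sparsity figures quoted above follow directly from the block size, top-k, and context length (taking 1M = 1024 × 1024 tokens and 128K = 128 × 1024 tokens); a quick check:

```python
def moba_sparsity(block_size: int, top_k: int, context_len: int) -> float:
    """Fraction of KV positions a query does NOT attend to under MoBA."""
    return 1 - (block_size * top_k) / context_len

print(round(100 * moba_sparsity(4096, 12, 1024 * 1024), 2))  # 95.31 (% at 1M)
print(round(100 * moba_sparsity(4096, 12, 128 * 1024), 2))   # 62.5  (% at 128K)
```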
The evaluation is performed on several widely used long-context benchmarks. In particular, across all evaluation tasks, MoBA is used for prefill only, while we switch to full attention during generation for better performance. As shown in Table 2, Llama-8B-1M-MoBA exhibits performance highly comparable to that of Llama-8B-1M-Full. It is particularly noteworthy that on the longest benchmark, RULER, where MoBA operates at a sparsity level of up to $1-\frac{4096\times 12}{128\mathrm{K}}=62.5\%$, Llama-8B-1M-MoBA nearly matches the performance of Llama-8B-1M-Full, with a score of 0.7818 compared to 0.7849. For context lengths of up to 1M tokens, we evaluate the model using the traditional Needle in the Haystack benchmark. As shown in Figure 7, Llama-8B-1M-MoBA demonstrates satisfactory performance even with an extended context length of 1 million tokens.
| Benchmark | Llama-8B-1M-MoBA | Llama-8B-1M-Full |
| --- | --- | --- |
| AGIEval [0-shot] | 0.5144 | 0.5146 |
| BBH [3-shot] | 0.6573 | 0.6589 |
| CEval [5-shot] | 0.6273 | 0.6165 |
| GSM8K [5-shot] | 0.7278 | 0.7142 |
| HellaSWAG [0-shot] | 0.8262 | 0.8279 |
| Loogle [0-shot] | 0.4209 | 0.4016 |
| Competition Math [0-shot] | 0.4254 | 0.4324 |
| MBPP [3-shot] | 0.5380 | 0.5320 |
| MBPP Sanitized [0-shot] | 0.6926 | 0.6615 |
| MMLU [0-shot] | 0.4903 | 0.4904 |
| MMLU Pro [5-shot][CoT] | 0.4295 | 0.4328 |
| OpenAI HumanEval [0-shot][pass@1] | 0.6951 | 0.7012 |
| SimpleQA [0-shot] | 0.0465 | 0.0492 |
| TriviaQA [0-shot] | 0.5673 | 0.5667 |
| LongBench @32K [0-shot] | 0.4828 | 0.4821 |
| RULER @128K [0-shot] | 0.7818 | 0.7849 |
Table 2: Performance comparison between MoBA and full Attention across different evaluation benchmarks.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: Needle in a Haystack Evaluation
### Overview
The image displays a heatmap titled "Needle in a Haystack Evaluation." It visualizes performance scores across a two-dimensional grid defined by "Context Length" on the x-axis and "Start of Needle (percent)" on the y-axis. The entire grid is uniformly colored in a bright green, indicating consistently high scores across all tested conditions.
### Components/Axes
* **Title:** "Needle in a Haystack Evaluation" (centered at the top).
* **X-Axis:**
* **Label:** "Context Length" (centered below the axis).
* **Scale:** Linear scale with discrete tick marks.
* **Tick Values (from left to right):** 32000, 64000, 96000, 128000, 160000, 192000, 224000, 256000, 288000, 320000, 352000, 384000, 416000, 448000, 480000, 512000, 544000, 576000, 608000, 640000, 672000, 704000, 736000, 768000, 800000, 832000, 864000, 896000, 928000, 960000, 992000, 1024000.
* **Y-Axis:**
* **Label:** "Start of Needle (percent)" (rotated 90 degrees, left of the axis).
* **Scale:** Linear scale from 0 to 100.
* **Tick Values (from bottom to top):** 0, 7, 14, 21, 29, 36, 43, 50, 57, 64, 71, 79, 86, 93, 100.
* **Legend / Color Scale:**
* **Label:** "Score" (right of the color bar).
* **Placement:** Vertical bar on the far right of the chart.
* **Scale:** Continuous gradient from red (bottom) to green (top).
* **Tick Values (from bottom to top):** 0, 20, 40, 60, 80, 100.
* **Color Mapping:** Red corresponds to a score of 0, transitioning through orange and yellow to green, which corresponds to a score of 100.
### Detailed Analysis
* **Data Grid:** The heatmap consists of a grid of rectangular cells. Each cell's color represents the "Score" for a specific combination of Context Length (x-axis) and Start of Needle percentage (y-axis).
* **Observed Data Pattern:** Every single cell in the grid is filled with the same bright green color. This color matches the top of the "Score" color bar, corresponding to a value of approximately 100.
* **Trend Verification:** There is no visual trend (slope, gradient, or variation) across the grid. The color is uniform both horizontally (across all Context Lengths) and vertically (across all Needle Start percentages). This indicates a flat, perfect performance profile.
### Key Observations
1. **Perfect Uniformity:** The most striking feature is the complete lack of variation. The score appears to be 100 for every tested data point.
2. **Comprehensive Testing:** The evaluation covers a wide range of context lengths (from 32k to over 1 million tokens) and all possible needle insertion positions (0% to 100% through the context).
3. **No Failures or Degradation:** There are no cells showing yellow, orange, or red colors, which would indicate lower scores, performance degradation, or failure cases.
### Interpretation
This heatmap presents the results of a "Needle in a Haystack" test, a common benchmark for evaluating a language model's ability to retrieve a specific piece of information (the "needle") from a long, irrelevant context (the "haystack").
The data suggests that the model or system being evaluated achieved a **perfect score (100)** across all tested conditions. This implies flawless retrieval accuracy regardless of:
* **Context Length:** Performance did not degrade as the haystack grew from 32,000 to 1,024,000 tokens.
* **Needle Position:** The model could find the needle with equal success whether it was placed at the very beginning, middle, or end of the context.
**Potential Implications:**
* **Benchmark Ceiling:** The model may have reached the performance ceiling for this specific test, indicating the test may no longer be challenging enough for it.
* **Test Design:** The result could point to a highly effective model architecture for information retrieval, or it could raise questions about the test's difficulty or construction (e.g., if the "needle" was too obvious or the "haystack" too structured).
* **Lack of Diagnostic Value:** For the purpose of identifying weaknesses or failure modes, this particular evaluation run provides no discriminatory data, as all outcomes are identical.
**Note on Uncertainty:** The interpretation is based on the visual evidence of uniform color matching the top of the scale. The exact numerical score for each cell is inferred to be 100 based on the color bar, but the chart does not provide explicit numerical labels within the grid cells.
</details>
Figure 7: Performance of Llama-8B-1M-MoBA on the Needle in the Haystack benchmark (up to 1M context length).
### 3.4 Efficiency and Scalability
The above experimental results show that MoBA achieves performance comparable to full attention, not only in terms of LM loss but also on real-world tasks. To further investigate its efficiency, we compare the forward-pass time of the attention layer in the two models trained in Section 3.3, Llama-8B-1M-MoBA and Llama-8B-1M-Full. We focus solely on the attention layer, as all other layers (e.g., FFN) have identical FLOPs in both models. As shown in Figure 2a, MoBA is more efficient than full attention across all context lengths, demonstrating sub-quadratic computational complexity. In particular, it achieves a speedup ratio of up to 6.5x when prefilling 1M tokens.
We also explore the length scalability of MoBA by gradually increasing the context length to 10 million tokens. To maintain a constant attention sparsity, we keep the top-k value and the number of MoBA blocks fixed while proportionally increasing the block size. To reach the 10M context length, we expanded tensor parallelism shoeybi2019megatron to the query-head level. Specifically, we broadcast the key and value tensors across distributed query heads, effectively addressing GPU memory limitations while preserving computational efficiency. As shown in Figure 2b, MoBA demonstrates superior efficiency compared to standard FlashAttention when scaling to longer sequences. Specifically, at 10M tokens, MoBA achieves a 16x reduction in attention computation time. The inset graph, focusing on shorter sequences (32K to 512K), shows that although both methods perform comparably at smaller scales, MoBA's computational advantage becomes increasingly evident as sequences grow longer, highlighting its particular strength in processing extremely long sequences.
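A back-of-the-envelope FLOPs ratio helps put these speedups in context. The sketch below is a naive estimate of attention-score computation only: it ignores the gating network, causal masking, kernel launch overheads, and memory traffic, which is why measured speedups (e.g., 6.5x at 1M) are lower than this idealized bound:

```python
def attention_cost_ratio(context_len: int, block_size: int, top_k: int) -> float:
    """Full-attention score FLOPs divided by MoBA score FLOPs (idealized)."""
    attended = min(context_len, block_size * top_k)  # KV tokens each query scores
    return context_len / attended

# With the Section 3.3 configuration (block size 4096, top-k 12) at 1M tokens:
print(round(attention_cost_ratio(1024 * 1024, 4096, 12), 1))  # 21.3 (theoretical)
```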
Overall, the high efficiency of MoBA can be attributed to two key innovations: (1) the block sparse attention mechanism, and (2) the optimized implementation combining Mixture-of-Experts (MoE) and FlashAttention, as described in Section 2.3. These techniques effectively address the quadratic complexity limitation of full attention, reducing the computational complexity to a more economical sub-quadratic scale.
## 4 Related Work
The development of efficient attention mechanisms tay2020efficient has been a critical area of research in the field of natural language processing, particularly with the rise of Large Language Models (LLMs). As the demand for handling longer sequences and reducing computational costs grows, efficient attention techniques have emerged as a promising solution to reduce the quadratic complexity of self-attention mechanisms while maintaining model performance.
Static Sparse Patterns: Significant efforts, such as Sparse Transformer child2019generating, Star-Transformer guo2019star, BlockBERT qiu2019blockwise, Longformer beltagy2020longformer, GMAT gupta2020gmat, ETC ainslie2020etc, BigBird zaheer2020big, LongT5 guo2021longt5 and LongNet ding2023longnet, have been dedicated to the design of static attention patterns in LLMs. Their choices of static attention patterns can encompass strided and fixed attention, window attention, global token attention, random attention, dilated attention, block sparse attention, or any combinations of them. In the realm of multimodal models, static sparse attention mechanisms have also been developed, such as axial attention ho2019axial for 2D images and spatial-temporal attention opensora for 3D videos.
Dynamic Sparse Patterns: Different from static patterns, dynamic sparse attention techniques adaptively determine which tokens to attend to. Reformer kitaev2020reformer and Routing Transformer roy2021efficient respectively employ locality-sensitive hashing (LSH) and K-means to cluster tokens, and attend to clusters rather than the full context. Memorizing Transformers wu2022memorizing and Unlimiformer bertsch2024unlimiformer dynamically attend to tokens selected by k-nearest-neighbor (kNN) algorithms. CoLT5 ainslie2023colt5 designs a routing module to select the most important queries and keys. Sparse Sinkhorn Attention tay2020sparse learns to permute blocks from the input sequence, allowing dynamic block sparse attention computation.
Training-free Sparse Attention: In addition to the previously discussed approaches that study training sparse attention models, there are also strategies designed to incorporate sparse attention mechanisms to enhance the efficiency of the two primary stages of model inference — either the prefill stage or the decode stage, or both of them. During the prefill optimization phase, the complete prompt can be utilized for attention profiling, which allows for the exploration of more intricate sparse attention patterns. For instance, MoA fu2024moa, Minference jiang2024minference, and SeerAttention gao2024seerattention have investigated sparse attention configurations such as A-shape, vertical-slash, and dynamic block sparsity. In the context of decode optimization, considerable work has been dedicated to compressing and pruning the KV-cache to achieve a balance between the quality and speed of text generation. Notable efforts in this area include H2O zhang2024h2o, StreamingLLM xiao2023efficient, TOVA oren2024tova, FastGen ge2023fastgen and Quest tang2024quest. Quest, in particular, can be viewed as MoBA with a smaller block size and a specialized block representation function which combines both min and max pooling. Another work closely related to MoBA is Longheads lu2024longheads which can be viewed as MoBA with a top-1 gating network, meaning that each query selects the most relevant KV blocks for attention.
Beyond Traditional Attention Architecture: Another line of research investigates novel model architectures that deviate from the conventional attention mechanism. As architectures change, these methods require training models from scratch and are unable to reuse pre-trained Transformer-based models. Studies in this domain have explored architectures inspired by Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), State Space Models (SSMs), or Linear Attention katharopoulos2020transformers. Examples of such models include Hyena poli2023hyena, Performer choromanski2020rethinking, Linformer wang2020linformer, RWKV peng2023rwkv, Mamba gu2023mamba, RetNet sun2023retentive, etc.
In summary, the landscape of efficient attention techniques is diverse, encompassing sparse patterns that range from static to dynamic, optimization objectives that span from training to inference, and architectures that extend from traditional attention mechanisms to innovative alternatives. Each method presents unique advantages and trade-offs, and the choice of technique often depends on the specific requirements of the application, such as the maximum sequence length, computational resources, and the desired balance between efficiency and performance. As research in this area continues to evolve, it is expected that these methods will play a crucial role in enabling LLMs to tackle increasingly complex tasks while maintaining efficiency and scalability.
## 5 Conclusion
In this paper, we introduce Mixture of Block Attention (MoBA), a novel attention architecture inspired by the principles of Mixture of Experts (MoE) that aims to enhance the efficiency and scalability of large language models (LLMs) for long-context tasks. MoBA addresses the computational challenges associated with traditional attention mechanisms by partitioning the context into blocks and employing a dynamic gating mechanism to selectively route query tokens to the most relevant KV blocks. This approach not only reduces computational complexity but also maintains model performance. Moreover, it allows for seamless transitions between full and sparse attention. Through extensive experiments, we demonstrated that MoBA achieves performance comparable to full attention while significantly improving computational efficiency. Our results show that MoBA can scale effectively to long contexts, maintaining low LM losses and high performance on various benchmarks. Additionally, MoBA’s flexibility allows it to be integrated with existing models without substantial training cost, making it a practical continual pre-training solution for enhancing long-context capabilities in LLMs. In summary, MoBA represents a significant advancement in efficient attention, offering a balanced approach between performance and efficiency. Future work may explore further optimizations of MoBA’s block-selection strategies, investigate its application to other modalities, and study its potential for improving generalization in complex reasoning tasks.
## References
## Appendix A
### A.1 Long Context Scalability
To correct for the bias of the natural data distribution toward short contexts, we partition each sequence into buckets by absolute position. For example, the bucket spanning positions 30K-32K reflects only the losses of documents longer than 30K tokens, and within those documents the loss is computed only over positions 30K to 32K. This ensures a balanced and representative evaluation across context lengths. In our exploration of long-context scalability, we made a pivotal discovery: the trailing tokens account for the majority of the performance discrepancy between the full-attention baseline and the newly proposed sparse attention architectures. Consequently, we streamlined the long-context scaling study by focusing on trailing-token scaling. This not only simplifies the computational requirements but also significantly improves the efficiency and effectiveness of investigating long-context scenarios. This finding has substantial implications for the development of more efficient and scalable attention mechanisms.
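As a minimal sketch of this bucketing scheme (our own illustration; the function and variable names are hypothetical), the per-bucket loss can be computed from per-document, per-position loss arrays as follows:

```python
import numpy as np

def bucketed_lm_loss(token_losses, bucket=(30_000, 32_000)):
    """Average per-token LM loss restricted to one position bucket.

    `token_losses` is a list of 1-D arrays, one per document, holding the
    per-position loss. Only documents long enough to reach the bucket
    contribute, and only their tokens at positions [lo, hi) are counted,
    mirroring the masking scheme described above.
    """
    lo, hi = bucket
    total, count = 0.0, 0
    for doc in token_losses:
        if len(doc) <= lo:      # document never reaches this bucket
            continue
        span = doc[lo:hi]       # mask: keep only positions lo..hi-1
        total += span.sum()
        count += len(span)
    return total / count if count else float("nan")
```

A short document never reaches the 30K-32K bucket and is skipped entirely, so the bucket average reflects only long-context behavior.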
<details>
<summary>x13.png Details</summary>
Log-log plot of LM loss (bits per byte) versus training compute (PFLOP/s-days). Dashed lines show the fitted MoBA and full-attention projections; the MoBA projection has the steeper slope, and the two lines cross near 1 PFLOP/s-day, beyond which MoBA projects the lower loss.
</details>
(a) Scaling law (0-2k)
<details>
<summary>x14.png Details</summary>
Log-log plot of LM loss versus training compute (PFLOP/s-days) for trailing tokens at positions 2k-4k. Solid lines are empirical training curves; dashed lines are the fitted MoBA and full-attention projections, which track each other closely, with MoBA slightly lower over most of the range and the two converging at high compute.
</details>
(b) Scaling law (2-4k)
<details>
<summary>x15.png Details</summary>
Log-log plot of LM loss versus training compute (PFLOP/s-days) for trailing tokens at positions 4k-6k. The dashed MoBA and full-attention projections cross near 1 PFLOP/s-day, with MoBA projecting the lower loss beyond the crossover.
</details>
(c) Scaling law (4-6k)
<details>
<summary>x16.png Details</summary>
Log-log plot of LM loss versus training compute (PFLOP/s-days) for trailing tokens at positions 6k-8k. The MoBA curves descend steeply and converge onto the full-attention power-law projection around 1-2 PFLOP/s-days.
</details>
(d) Scaling law (6-8k)
<details>
<summary>x17.png Details</summary>
Log-log plot of LM loss versus training compute (PFLOP/s-days) for trailing tokens at positions 8k-10k. The MoBA curves start higher at low compute, drop steeply, and converge with the full-attention projection near 1 PFLOP/s-day.
</details>
(e) Scaling law (8-10k)
<details>
<summary>x18.png Details</summary>

### Visual Description
\n
## Line Graph: LM Loss vs. Compute (PFLOP/s-days)
### Overview
The image is a log-log line graph comparing the scaling behavior of two projection methods for Language Model (LM) loss as a function of computational resources. The graph demonstrates how model loss decreases with increased training compute, measured in PetaFLOP/s-days.
### Components/Axes
* **Chart Type:** Log-log line plot.
* **X-Axis (Horizontal):**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic, ranging from `10^-1` (0.1) to `10^1` (10).
* **Major Ticks:** `10^-1`, `10^0` (1), `10^1`.
* **Y-Axis (Vertical):**
* **Label:** `LM Loss (10k, 128k)`
* **Scale:** Logarithmic, ranging from `10^0` (1) to `6 x 10^0` (6).
* **Major Ticks:** `10^0`, `2 x 10^0`, `3 x 10^0`, `4 x 10^0`, `6 x 10^0`.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`--`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`--`).
* **Data Series:**
* There are multiple solid lines (approximately 6-7) in varying shades of purple, blue, and red. These represent individual model training runs or data series.
* Two prominent dashed projection lines (blue and red) overlay the solid lines, representing the fitted scaling laws.
### Detailed Analysis
* **General Trend:** All lines, both solid and dashed, exhibit a strong downward slope from left to right. This indicates a consistent inverse relationship: as the computational budget (`PFLOP/s-days`) increases, the `LM Loss` decreases.
* **Projection Lines (Dashed):**
* The **blue dashed line (`MoBA Projection`)** starts at approximately `LM Loss ≈ 2.5` at `PFLOP/s-days = 0.1`. It follows a smooth, slightly convex curve downward, passing near `Loss ≈ 1.8` at `1 PFLOP/s-day` and ending near `Loss ≈ 1.2` at `10 PFLOP/s-days`.
* The **red dashed line (`Full Attention Projection`)** starts slightly lower than the blue line at `0.1 PFLOP/s-days`, at approximately `Loss ≈ 2.4`. It follows a very similar trajectory, remaining just below the blue line for most of the range, and converges with it near `10 PFLOP/s-days` at `Loss ≈ 1.2`.
* **Solid Data Lines:** The solid lines represent actual data points. They are tightly clustered and generally follow the path of the projection lines, though with more local variation (wiggles). They all originate from high loss values (above `6 x 10^0`) at low compute (`< 0.1 PFLOP/s-days`) and converge into a narrow band as compute increases.
* **Key Intersection Points (Approximate):**
* At `1 PFLOP/s-day`, the cluster of solid lines and the projection lines are centered around `LM Loss ≈ 1.8`.
* At `10 PFLOP/s-days`, the projections and the trend of the solid lines converge at `LM Loss ≈ 1.2`.
### Key Observations
1. **Consistent Scaling Law:** The tight alignment of the solid data lines with the smooth dashed projections strongly suggests that LM loss follows a predictable power-law scaling with respect to compute.
2. **Minimal Difference Between Projections:** The `MoBA Projection` (blue) and `Full Attention Projection` (red) are nearly indistinguishable across the entire plotted range. The red line is marginally lower, but the difference is minimal and within the noise of the solid data lines.
3. **Convergence at High Compute:** The projections and data trends suggest that the difference between the two methods, if any, becomes negligible as the computational budget scales into the tens of PFLOP/s-days.
4. **Log-Log Linearity:** The approximately straight-line behavior on this log-log plot is characteristic of a power-law relationship (`Loss ∝ Compute^(-α)`).
### Interpretation
This graph provides empirical evidence for the scaling hypothesis in large language models. It demonstrates that investing more computational resources (`PFLOP/s-days`) during training leads to predictable and significant reductions in model loss (a proxy for capability).
The primary finding is the striking similarity between the `MoBA` and `Full Attention` projection curves. This suggests that, within the observed compute regime, the MoBA (likely a Memory-efficient or Mixture-of-Experts based Attention) architecture achieves a training efficiency nearly identical to that of a standard Full Attention mechanism. This is a significant result, as it implies potential architectural optimizations (like MoBA) can be adopted without sacrificing the fundamental scaling efficiency of the model.
The graph does not show a clear "knee" or point of diminishing returns within the plotted range (`0.1` to `10 PFLOP/s-days`), indicating that further performance gains are likely achievable with even more compute. The tight clustering of the solid lines also indicates high reproducibility in the scaling behavior across different training runs or model configurations.
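The power-law behavior noted above can be recovered numerically: `Loss = a * C^(-α)` is linear in log-log space, so an ordinary least-squares line fit on the logs yields the exponent. A minimal sketch, using made-up (compute, loss) points for illustration rather than values from the figure:

```python
import numpy as np

# Hypothetical (compute, loss) points of the kind read off a log-log
# scaling plot; real values would come from actual training runs.
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])  # PFLOP/s-days
loss = np.array([2.4, 2.0, 1.8, 1.45, 1.2])     # LM loss

# A power law L = a * C^(-alpha) is a straight line in log-log space:
# log L = log a - alpha * log C, so a degree-1 polyfit on the logs
# recovers the exponent (slope) and prefactor (intercept).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope
a = np.exp(intercept)

predicted = a * compute ** (-alpha)  # fitted curve at the sample points
```

The fit is only as good as the power-law assumption; deviations of `predicted` from `loss` indicate where the log-log line breaks down.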
</details>
(f) Scaling law (10-12k)
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Chart: Language Model Loss vs. Compute Scale (Projected)
### Overview
The image is a log-log line chart projecting the relationship between language model (LM) loss and computational scale, measured in PFLOP/s-days. It compares two projection methodologies: "MoBA Projection" and "Full Attention Projection." The chart illustrates how model loss decreases as computational investment increases, with multiple curves suggesting different starting conditions or model configurations converging toward a common scaling trend.
### Components/Axes
* **Chart Type:** Log-Log Line Chart.
* **X-Axis:**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic, ranging from `10^-1` (0.1) to `10^1` (10).
* **Major Ticks:** `10^-1`, `10^0`, `10^1`.
* **Y-Axis:**
* **Label:** `LM Loss (12k-14k)`
* **Scale:** Logarithmic, ranging from `10^0` (1) to `6 x 10^0` (6).
* **Major Ticks:** `10^0`, `2 x 10^0`, `3 x 10^0`, `4 x 10^0`, `6 x 10^0`.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`--`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`--`).
* **Data Series:**
* **Full Attention Projection (Red Dashed Line):** A single, straight line sloping downward from left to right. It originates near `(x=0.1, y≈2.2)` and terminates near `(x=10, y≈1.2)`. Its linearity on a log-log plot indicates a power-law relationship.
* **MoBA Projection (Blue Solid Lines):** A family of approximately 6-7 distinct solid blue curves. **Note:** The legend indicates a dashed blue line, but the plotted lines are solid. This is a visual discrepancy. Each curve starts at a different, higher loss value on the left side of the chart (varying between y≈2.5 and y≈6 at x=0.1) and slopes downward steeply. As they move right (increasing compute), they converge and asymptotically approach the path of the red "Full Attention Projection" line, merging with it around `x=1` to `x=2`.
### Detailed Analysis
* **Trend Verification:**
* **Full Attention (Red):** Exhibits a consistent, linear downward slope across the entire compute range. This represents a stable, predictable scaling law where loss decreases proportionally with increased compute.
* **MoBA (Blue):** Each curve shows a steep initial decline in loss for small increases in compute (left side of chart). The rate of loss reduction (slope) is much steeper than the red line initially. As compute increases, the slope of each blue curve flattens, and they all converge onto the trajectory defined by the red line.
* **Data Points & Convergence:**
* At `x = 0.1 PFLOP/s-days`: Blue curves are spread between `y ≈ 2.5` and `y ≈ 6.0`. The red line is at `y ≈ 2.2`.
* At `x = 1 PFLOP/s-days`: Most blue curves have descended to between `y ≈ 1.6` and `y ≈ 1.8`, closely approaching the red line at `y ≈ 1.6`.
* At `x = 10 PFLOP/s-days`: All lines (blue and red) appear to converge at approximately `y ≈ 1.2`.
* **Spatial Grounding:** The legend is placed in the top-right, avoiding overlap with the data. The convergence zone where blue lines meet the red line is in the center-right portion of the plot area, between `x=1` and `x=2`.
### Key Observations
1. **Convergence to a Scaling Law:** The most prominent feature is the convergence of all "MoBA Projection" curves onto the single "Full Attention Projection" line at higher compute scales (`>1 PFLOP/s-days`).
2. **Diminishing Returns:** The steep initial slope of the blue curves indicates high efficiency (large loss reduction per unit of compute) at lower scales. The flattening slope demonstrates diminishing returns as compute increases.
3. **Power-Law Behavior:** The straight red line on the log-log plot confirms that the projected loss follows a power-law relationship with compute (`Loss ∝ (Compute)^-α`).
4. **Initial Condition Variance:** The multiple blue curves starting at different loss values suggest that the "MoBA" method's performance at low compute is sensitive to some initial parameter (e.g., model size, data mixture, or training recipe), but this variance becomes irrelevant at high compute.
### Interpretation
This chart presents a technical argument about the scalability of two different methods ("MoBA" and "Full Attention") for training large language models.
* **What the data suggests:** It demonstrates that while the "MoBA" method may have variable and often worse (higher) loss at low computational budgets, its scaling trajectory ultimately matches that of the "Full Attention" method. The "Full Attention" line acts as a fundamental scaling limit or target.
* **Relationship between elements:** The "Full Attention Projection" serves as a benchmark or theoretical baseline. The "MoBA Projection" curves illustrate a practical method that, despite initial inefficiencies, is predicted to achieve the same optimal scaling behavior when given sufficient compute resources.
* **Notable Implication:** The key takeaway is that the choice of method ("MoBA" vs. "Full Attention") may not affect the ultimate model quality achievable at very large scale (high PFLOP/s-days), but it significantly impacts the efficiency and loss trajectory during the earlier, lower-compute phases of training. This has implications for cost and resource allocation in model development. The convergence suggests a universal scaling law governs the final performance, regardless of the initial path taken.
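Because a straight line on a log-log plot fixes the power law, its exponent can be estimated from just two points read off a projection curve. A small sketch, using the approximate values quoted above (these are visual estimates, not exact measurements):

```python
import math

def scaling_exponent(c1, l1, c2, l2):
    """Exponent alpha of a power law L = a * C^(-alpha),
    recovered from two (compute, loss) points on the line."""
    return math.log(l1 / l2) / math.log(c2 / c1)

# Approximate points read off the Full Attention projection:
# (0.1 PFLOP/s-days, loss ~2.2) and (10 PFLOP/s-days, loss ~1.2).
alpha = scaling_exponent(0.1, 2.2, 10.0, 1.2)  # ~0.13
```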
</details>
(g) Scaling law (12-14k)
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: LLM Loss vs. Compute (PFLOP/s-days)
### Overview
The image is a line chart plotted on a log-log scale, comparing the projected loss of two different Large Language Model (LLM) architectures as a function of computational resources. The chart demonstrates a scaling law relationship, where model loss decreases as the amount of compute (measured in PFLOP/s-days) increases.
### Components/Axes
* **Chart Type:** 2D line chart with logarithmic scales on both axes.
* **X-Axis:**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic (base 10).
* **Range & Markers:** The visible axis spans from approximately `10^-1` (0.1) to `10^1` (10). Major tick marks are present at `10^-1`, `10^0` (1), and `10^1`.
* **Y-Axis:**
* **Label:** `LM Loss (14k-16k)`
* **Scale:** Logarithmic (base 10).
* **Range & Markers:** The visible axis spans from `10^0` (1) to `6 × 10^0` (6). Major tick marks are present at `10^0`, `2 × 10^0`, `3 × 10^0`, `4 × 10^0`, and `6 × 10^0`.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`--`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`--`).
### Detailed Analysis
The chart plots two data series, each represented by a dashed line.
1. **MoBA Projection (Blue Dashed Line):**
* **Trend:** The line shows a strong, consistent downward slope from left to right, indicating that loss decreases significantly as compute increases.
* **Data Points (Approximate):**
* At ~`0.1` PFLOP/s-days, Loss is ~`2.2`.
* At ~`1` PFLOP/s-days, Loss is ~`1.5`.
* At ~`10` PFLOP/s-days, Loss is ~`1.1`.
* **Spatial Grounding:** This line originates from the upper-left quadrant and descends diagonally towards the bottom-right, remaining above the red line for the entire visible range until the far right.
2. **Full Attention Projection (Red Dashed Line):**
* **Trend:** The line also shows a consistent downward slope, but it is less steep than the blue line. It starts at a lower loss value for a given compute level compared to the blue line.
* **Data Points (Approximate):**
* At ~`0.1` PFLOP/s-days, Loss is ~`2.0`.
* At ~`1` PFLOP/s-days, Loss is ~`1.4`.
* At ~`10` PFLOP/s-days, Loss is ~`1.1`.
* **Spatial Grounding:** This line originates from the middle-left area and descends diagonally, positioned below the blue line. The two lines appear to converge and nearly intersect at the far right of the chart, near `10` PFLOP/s-days.
### Key Observations
* **Convergence:** The primary observation is the convergence of the two projection lines. The "MoBA Projection" starts with a higher loss but improves at a faster rate with increased compute, eventually matching the performance of the "Full Attention Projection" at approximately `10` PFLOP/s-days.
* **Scaling Efficiency:** The steeper slope of the MoBA line suggests it has a more favorable scaling exponent with respect to compute in this regime. It gains more performance per additional unit of compute compared to the Full Attention model.
* **Log-Log Linearity:** Both projections appear as nearly straight lines on this log-log plot, which is characteristic of power-law scaling relationships commonly observed in neural network training (e.g., the Chinchilla scaling laws).
### Interpretation
This chart presents a technical projection comparing the computational efficiency of two LLM architectures: "MoBA" and "Full Attention."
* **What the data suggests:** The data suggests that while the Full Attention architecture may be more efficient (lower loss) at lower compute budgets, the MoBA architecture is projected to scale more efficiently. Given sufficient computational resources (around 10 PFLOP/s-days in this projection), MoBA is expected to achieve parity with Full Attention.
* **How elements relate:** The relationship is a direct comparison of scaling laws. The x-axis (compute) is the independent variable, and the y-axis (loss) is the dependent performance metric. The two lines represent different model families or architectural choices, with their slopes indicating their respective scaling efficiencies.
* **Notable implications:** This type of analysis is crucial for resource allocation in AI research. It implies that investing in the MoBA architecture could be more beneficial for long-term scaling, as it promises better returns on large compute investments. The convergence point is a critical threshold where the architectural advantage shifts. The chart does not show data points, only projections, so these are theoretical scaling curves based on empirical fits or modeling.
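The convergence point discussed above has a closed form: two power laws `a1*C^(-α1)` and `a2*C^(-α2)` intersect at `C* = (a1/a2)^(1/(α1-α2))`. A sketch using the chart's approximate endpoint values (illustrative readings, not exact data):

```python
import math

def fit_power_law(c1, l1, c2, l2):
    """Fit L = a * C^(-alpha) through two (compute, loss) points."""
    alpha = math.log(l1 / l2) / math.log(c2 / c1)
    a = l1 * c1 ** alpha
    return a, alpha

def crossover_compute(a1, alpha1, a2, alpha2):
    """Compute C* where a1*C^(-alpha1) equals a2*C^(-alpha2)."""
    return (a1 / a2) ** (1.0 / (alpha1 - alpha2))

# Illustrative points: MoBA starts higher but improves faster,
# matching Full Attention near 10 PFLOP/s-days.
a_moba, al_moba = fit_power_law(0.1, 2.2, 10.0, 1.1)
a_full, al_full = fit_power_law(0.1, 2.0, 10.0, 1.1)
c_star = crossover_compute(a_moba, al_moba, a_full, al_full)  # ~10
```

Since both fitted lines pass through the shared point (10, 1.1) by construction, the crossover lands at 10 PFLOP/s-days, consistent with the convergence described above.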
</details>
(h) Scaling law (14-16k)
Figure 8: Scaling laws for positions 0-16k
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Chart with Multiple Curves: LLM Loss vs. Compute
### Overview
The image is a technical line chart plotted on a log-log scale. It compares the scaling behavior of two different projection methods for Large Language Models (LLMs) as a function of computational resources. The chart shows multiple solid-line curves, each representing a different model configuration or size, alongside two dashed-line trend projections.
### Components/Axes
* **Chart Type:** Line chart with multiple series on a log-log scale.
* **X-Axis:**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic (base 10).
* **Range:** Approximately from `10^-1` (0.1) to `10^1` (10).
* **Major Tick Marks:** Visible at `10^-1`, `10^0` (1), and `10^1`.
* **Y-Axis:**
* **Label:** `LM Loss (16k-18k)`
* **Scale:** Logarithmic (base 10).
* **Range:** Approximately from `10^0` (1) to `6 x 10^0` (6).
* **Major Tick Marks:** Visible at `10^0`, `2 x 10^0`, `3 x 10^0`, `4 x 10^0`, and `6 x 10^0`.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Items:**
1. `MoBA Projection` - Represented by a blue dashed line (`--`).
2. `Full Attention Projection` - Represented by a red dashed line (`--`).
* **Data Series (Solid Lines):** There are approximately 7-8 distinct solid curves in various colors (dark blue, purple, magenta, etc.). These are not individually labeled in the legend. They all follow a similar pattern: starting at high loss values on the left (low compute) and decreasing as they move to the right (higher compute).
### Detailed Analysis
* **Trend Verification:**
* **Solid Curves (All):** Each solid curve slopes steeply downward from left to right, indicating that LLM loss decreases significantly as the computational budget (PFLOP/s-days) increases. The curves are roughly parallel in their steep descent phase.
* **MoBA Projection (Blue Dashed Line):** This line has a shallow, consistent downward slope across the entire x-axis range. It starts at a loss value of approximately `2.2 x 10^0` at `10^-1` PFLOP/s-days and ends near `1.2 x 10^0` at `10^1` PFLOP/s-days.
* **Full Attention Projection (Red Dashed Line):** This line also slopes downward but is positioned slightly below the MoBA line for most of the chart. It starts at a loss value of approximately `2.0 x 10^0` at `10^-1` PFLOP/s-days and converges with the MoBA line near `1.2 x 10^0` at `10^1` PFLOP/s-days.
* **Data Point Relationships:**
* The solid curves appear to represent actual model training runs or more detailed scaling laws. They descend much more steeply than the two dashed projection lines.
* All solid curves intersect and merge with the dashed projection lines in the region between `10^0` (1) and `10^1` (10) PFLOP/s-days. This suggests the projections model the asymptotic behavior at high compute.
* At the lowest compute (`~10^-1`), the solid curves show a wide spread of loss values, from roughly `2.5 x 10^0` to over `6 x 10^0`. This spread narrows dramatically as compute increases.
### Key Observations
1. **Steep Initial Descent:** The primary solid curves show a very strong scaling relationship: small increases in compute at the low end yield massive reductions in loss.
2. **Convergence at High Compute:** All data series, both solid and dashed, converge to a very similar loss value (approximately `1.2`) at the highest compute level shown (`10^1` PFLOP/s-days). This indicates diminishing returns; adding more compute beyond this point yields smaller improvements.
3. **Projection Comparison:** The `Full Attention Projection` (red dashed) consistently predicts a slightly lower loss than the `MoBA Projection` (blue dashed) for the same compute budget, until they converge at the far right.
4. **Lack of Series Labels:** The individual solid curves are not labeled, making it impossible to determine what specific variable (e.g., model size, data size, architecture variant) they represent without external context.
### Interpretation
This chart is a **scaling law analysis** for LLMs. It visually answers the question: "How does model performance (measured by loss) improve as we allocate more computational resources?"
* **What the data suggests:** The steep solid curves demonstrate the "classical" scaling regime where performance improves rapidly with compute. The dashed lines represent projected scaling laws, likely extrapolated from a smaller set of data points. The convergence suggests a fundamental limit or a point of severe diminishing returns for the given model family and training setup.
* **Relationship between elements:** The solid lines are the empirical data or high-fidelity simulations. The dashed lines are simplified, predictive models of that data. The chart's purpose is to validate these projection methods (MoBA vs. Full Attention) by showing how well they match the trend of the solid curves, especially in the high-compute extrapolation region.
* **Notable Anomaly/Insight:** The most significant insight is the **dramatic change in scaling efficiency**. The slope of the solid curves is much steeper than the slope of the dashed projections at low compute. This implies that the simple power-law projected by the dashed lines does not fully capture the dynamics at lower compute budgets, where other factors may be dominant. The projections become accurate only in the high-compute, low-loss asymptotic regime.
* **Model Selection:** The chart presents observed data (solid lines) alongside competing explanatory models (dashed projections), letting the viewer judge which projection better explains the observed trend and can be trusted for extrapolation. The `Full Attention Projection` appears to be a marginally better fit (slightly lower loss) across the range before convergence. The lack of labels on the solid lines remains a gap, preventing a full understanding of what is being scaled.
</details>
(i) Scaling law (16-18k)
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Chart: LLM Loss vs. Compute (PFLOP/s-days)
### Overview
The image is a technical line chart comparing the training loss of a large language model against the amount of compute used, measured in PFLOP/s-days. It plots multiple training runs (solid lines) and projects their future performance using two different methods (dashed lines). The chart uses logarithmic scales on both axes.
### Components/Axes
* **Chart Type:** Log-Log Line Chart.
* **Y-Axis (Vertical):**
* **Label:** `LM Loss (18k-20k)`
* **Scale:** Logarithmic, ranging from `10^0` (1) to `6 x 10^0` (6).
* **Major Ticks:** 10^0, 2x10^0, 3x10^0, 4x10^0, 6x10^0.
* **X-Axis (Horizontal):**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic, ranging from `10^-1` (0.1) to `10^1` (10).
* **Major Ticks:** 10^-1, 10^0, 10^1.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`--`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`--`).
* **Data Series (Solid Lines):**
* There are **five distinct solid lines** in varying colors (from left to right/top to bottom): a dark purple line, a medium blue line, a lighter blue line, a red line, and a dark red/brown line.
* **Important Note:** These solid lines are **not explicitly labeled** in the legend. They likely represent different model configurations, training runs, or architectures whose performance is being tracked and projected.
### Detailed Analysis
1. **Trend of Solid Lines (Observed Data):**
* All five solid lines exhibit a **consistent downward slope** from left to right.
* This indicates a **strong inverse relationship**: as the compute budget (PFLOP/s-days) increases, the LLM Loss decreases.
* The lines are roughly parallel on this log-log plot, suggesting a similar power-law scaling relationship between loss and compute for all runs, though with different constants (offsets).
* The leftmost (dark purple) line starts at the highest loss, near the top of the axis at ~0.08 PFLOP/s-days, and descends steeply before ending at ~2 PFLOP/s-days.
* The rightmost (dark red) line starts at a lower loss at ~0.1 PFLOP/s-days and extends further, ending at ~4 PFLOP/s-days.
2. **Projection Lines (Extrapolated Data):**
* **MoBA Projection (Blue Dashed):** This line originates from the endpoint of one of the central solid blue lines (at approximately 2 PFLOP/s-days). It projects a continued, slightly shallower downward slope out to 10 PFLOP/s-days.
* **Full Attention Projection (Red Dashed):** This line originates from the endpoint of the rightmost solid red line (at approximately 4 PFLOP/s-days). It projects a downward slope that is **steeper than the MoBA projection**, converging to a similar loss value at 10 PFLOP/s-days.
* **Cross-Reference:** The blue dashed line aligns with the blue solid line family, and the red dashed line aligns with the red solid line family, confirming the legend mapping.
### Key Observations
* **Scaling Law Confirmation:** The chart visually demonstrates the empirical scaling laws for LLMs, where loss follows a power-law decrease with increased compute.
* **Projection Divergence:** The two projection methods (MoBA vs. Full Attention) predict different paths to similar final performance. The Full Attention projection suggests a more efficient use of additional compute (steeper slope) from its starting point compared to the MoBA projection.
* **Performance Hierarchy:** The solid lines are stacked, indicating that different model configurations achieve different loss values for the same amount of compute. The configuration represented by the dark red line appears most efficient (lowest loss for a given compute).
* **Convergence Point:** Both projection lines, despite different slopes, appear to converge at the far right of the chart (~10 PFLOP/s-days), suggesting a predicted performance plateau or similar asymptotic limit under both projection methods.
### Interpretation
This chart is a tool for **predicting the return on investment (ROI) of additional compute** for training the model. The solid lines provide empirical evidence of past performance, while the dashed lines are crucial for **capacity planning and resource allocation**.
The key insight is the comparison between the "MoBA Projection" and "Full Attention Projection." MoBA (Mixture of Block Attention) and Full Attention represent two different attention mechanisms. The chart suggests that while both predict similar ultimate performance at 10 PFLOP/s-days, the **Full Attention method is projected to be more compute-efficient in the extrapolated region**, achieving the same loss with fewer additional PFLOP/s-days from its last observed point.
The absence of labels for the solid lines is a significant limitation. It prevents us from knowing which specific model variants or training techniques correspond to the observed data. However, the clear grouping (blue family vs. red family) and the alignment of projections with these families imply that the projections are specifically modeling the future behavior of the two most relevant or promising configurations from the observed set.
In essence, the chart argues: "We have trained several models (solid lines). Based on their performance, if we continue training using the MoBA approach (blue dashed), we expect this outcome. If we use the Full Attention approach (red dashed), we expect a slightly better outcome for the same additional compute." This framing is critical when deciding how to allocate future compute budgets.
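The budgeting argument above amounts to inverting a fitted power law: given `L = a * C^(-α)`, the compute needed to hit a target loss is `C = (a / L_target)^(1/α)`. A sketch with made-up coefficients (`a` and `alpha` here are illustrative, not fitted to the paper's data):

```python
def loss_at(a, alpha, compute):
    """Projected loss under the power law L = a * C^(-alpha)."""
    return a * compute ** (-alpha)

def compute_for_loss(a, alpha, target_loss):
    """Invert the power law: compute needed to reach a target loss."""
    return (a / target_loss) ** (1.0 / alpha)

# Hypothetical fitted coefficients for one projection curve.
a, alpha = 1.8, 0.13

# PFLOP/s-days required to drive loss down to 1.2 under this fit.
budget = compute_for_loss(a, alpha, 1.2)

# Round trip: evaluating the law at that budget recovers the target.
assert abs(loss_at(a, alpha, budget) - 1.2) < 1e-9
```

The inverse form makes the cost of small loss reductions explicit: because `1/α` is large when `α` is small, each further decrement in target loss multiplies the required compute.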
</details>
(j) Scaling law (18-20k)
<details>
<summary>x23.png Details</summary>

### Visual Description
## Line Chart: LLM Training Loss Projections vs. Compute
### Overview
The image is a log-log line chart comparing the projected training loss of Large Language Models (LLMs) against computational resources. It displays multiple empirical loss curves (solid lines) and two theoretical projection lines (dashed). The chart illustrates the scaling behavior of model performance with increased compute.
### Components/Axes
* **Chart Type:** Log-Log Line Chart.
* **X-Axis:**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic.
* **Tick Marks (Approximate):** `10^-1` (0.1), `10^0` (1), `10^1` (10), `10^2` (100).
* **Y-Axis:**
* **Label:** `LM Loss (20k-22k)`
* **Scale:** Logarithmic.
* **Tick Marks (Approximate):** `10^0` (1), `2 x 10^0` (2), `3 x 10^0` (3), `4 x 10^0` (4), `6 x 10^0` (6).
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`--`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`--`).
* **Data Series (Solid Lines):** There are approximately 7-8 solid lines in various colors (including shades of purple, blue, red, and gray). These are not explicitly labeled in the legend and likely represent empirical training runs or different model configurations.
### Detailed Analysis
* **Empirical Data (Solid Lines):**
* **Trend:** All solid lines slope steeply downward from left to right, indicating that LLM loss decreases significantly as the computational budget (PFLOP/s-days) increases.
* **Shape:** The curves are convex on the log-log plot, showing a diminishing returns relationship. The rate of loss improvement slows at higher compute values.
* **Convergence:** The solid lines appear to converge towards a similar region at the far right of the chart (high compute, ~100 PFLOP/s-days), suggesting a potential performance floor or asymptotic behavior.
* **Spread:** At lower compute values (e.g., 0.1 PFLOP/s-days), there is a wide vertical spread in loss values (from ~2 to >6), indicating high variance in efficiency or model quality at smaller scales.
* **Projection Lines (Dashed Lines):**
* **MoBA Projection (Blue Dashed):**
* **Trend:** A straight line sloping downward on the log-log plot, representing a power-law relationship.
* **Position:** It starts at a loss of ~2.2 at 0.1 PFLOP/s-days and ends at a loss of ~1.0 at 100 PFLOP/s-days. It lies *above* the Full Attention Projection line across the entire range.
* **Full Attention Projection (Red Dashed):**
* **Trend:** Also a straight, downward-sloping line on the log-log plot.
* **Position:** It starts at a loss of ~2.1 at 0.1 PFLOP/s-days and ends at a loss of ~1.0 at 100 PFLOP/s-days. It lies *below* the MoBA Projection line, suggesting a more optimistic (lower loss) forecast for the same compute.
### Key Observations
1. **Power-Law Scaling:** The straight dashed projection lines confirm that LLM loss is modeled to follow a power-law scaling with compute.
2. **Projection Divergence:** The two projection methods (MoBA vs. Full Attention) diverge more noticeably at lower compute levels and converge at very high compute (~100 PFLOP/s-days), where both predict a loss near 1.0.
3. **Empirical vs. Projected:** The solid empirical curves are generally steeper than the dashed projection lines at lower compute, suggesting that initial gains from scaling may outpace the projected power-law rate before settling into it.
4. **Performance Floor:** The clustering of all lines (empirical and projected) in the bottom-right corner suggests a strong consensus that pushing loss significantly below ~1.0 requires exponentially more compute.
### Interpretation
This chart is a technical visualization of **AI scaling laws**, specifically for LLMs. It demonstrates the fundamental principle that increasing computational resources (measured in PFLOP/s-days) leads to predictable, power-law reductions in model loss (a key performance metric).
* **What the data suggests:** The primary takeaway is that while more compute always helps, the efficiency of that compute (the loss reduction per added unit) diminishes. The comparison between "MoBA Projection" and "Full Attention Projection" likely evaluates two different architectural or methodological approaches for predicting this scaling. The "Full Attention" projection appears more optimistic, predicting slightly lower loss for the same compute budget.
* **How elements relate:** The solid lines provide real-world context against the theoretical dashed projections. Their convergence at high compute validates the core scaling hypothesis but also highlights that the exact trajectory (the path to that convergence) can vary based on model design and training methodology.
* **Notable Anomalies:** The significant spread of the solid lines at low compute is notable. It implies that at smaller scales, factors other than raw compute (like data quality, architecture details, or hyperparameter tuning) have a massive impact on performance. This variance collapses at scale, where compute becomes the dominant factor.
* **Forecasting Uncertainty:** The space between the two dashed lines represents a zone of uncertainty in the forecast: the gap between the MoBA and Full Attention projections bounds how much the choice of attention mechanism is expected to matter at any given compute budget.
</details>
(k) Scaling law (20-22k)
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: LLM Loss vs. Computational Cost (PFLOP/s-days)
### Overview
The image is a technical line chart comparing the performance scaling of two different projection methods for Large Language Models (LLMs). It plots model loss (a measure of error, where lower is better) against a metric of computational cost. The chart uses logarithmic scales on both axes to display data spanning multiple orders of magnitude.
### Components/Axes
* **Chart Type:** Log-log line chart.
* **X-Axis:**
* **Label:** `PFLOP/s-days`
* **Scale:** Logarithmic.
* **Range:** Approximately from `10^-1` (0.1) to `10^1` (10).
* **Markers:** Major ticks at `10^-1`, `10^0` (1), and `10^1`.
* **Y-Axis:**
* **Label:** `LM Loss (22k-24k)`
* **Scale:** Logarithmic.
* **Range:** From `10^0` (1) to `6 × 10^0` (6).
* **Markers:** Major ticks at `10^0`, `2 × 10^0`, `3 × 10^0`, `4 × 10^0`, and `6 × 10^0`.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Entry 1:** `MoBA Projection` - Represented by a blue dashed line (`---`).
* **Entry 2:** `Full Attention Projection` - Represented by a red dashed line (`---`).
* **Data Series:** The chart contains multiple solid lines (approximately 5-6) in varying shades of purple and blue. These are not explicitly labeled in the legend but represent different model configurations or training runs. Two dashed projection lines (blue and red) are overlaid.
### Detailed Analysis
* **Trend of Solid Lines (Model Runs):** All solid lines exhibit a strong, consistent downward trend from left to right. They start at high loss values (between ~4 and >6) at low computational cost (~0.1 PFLOP/s-days) and decrease rapidly as cost increases. The curves begin to flatten and converge as they move towards the right side of the chart.
* **Projections (dashed):** Both projection lines are straight on the log-log axes, indicating power-law scaling; the red `Full Attention Projection` sits consistently slightly below the blue `MoBA Projection` across the whole compute range.
* **Training runs (solid):** The runs start spread out (loss roughly 4 to above 6 at ~0.1 PFLOP/s-days), descend steeply, then converge onto the shallower projected slopes between ~1 and ~10 PFLOP/s-days, ending tightly grouped just above a loss of 1.
* **Takeaway:** The empirical curves validate the fitted power laws at higher compute; a small persistent gap favors the full-attention projection and narrows as compute grows.
</details>
(l) Scaling law (22-24k)
<details>
<summary>x25.png Details</summary>

### Visual Description
Log-log line chart of LM loss for this position range versus training compute (`PFLOP/s-days`, roughly 10^-1 to 10^1 on the x-axis; loss roughly 1 to 6 on the y-axis). About seven or eight unlabeled solid lines trace empirical training runs; the blue dashed `MoBA Projection` and red dashed `Full Attention Projection` are fitted power laws, straight on these axes, with the full-attention line slightly below the MoBA line. The runs lie above both projections at low compute, show diminishing returns, and converge toward the projections as compute approaches 10 PFLOP/s-days.
</details>
(m) Scaling law (24-26k)
<details>
<summary>x26.png Details</summary>

### Visual Description
Log-log line chart with the same layout as the preceding panels: training compute in `PFLOP/s-days` (10^-1 to 10^1) on the x-axis, LM loss (1 to 6) on the y-axis, solid lines for unlabeled training runs, and dashed power-law projections for `MoBA` (blue) and `Full Attention` (red), the latter consistently slightly lower. The solid curves drop steeply from losses above 6 at low compute, flatten with diminishing returns, and meet the projections near a loss of 1 at ~10 PFLOP/s-days.
</details>
(n) Scaling law (26-28k)
<details>
<summary>x27.png Details</summary>

### Visual Description
Log-log line chart following the same layout: `PFLOP/s-days` on the x-axis (roughly 0.05 to 20), LM loss on the y-axis (1 to ~7), with solid empirical training curves and dashed power-law projections for `MoBA` (blue) and `Full Attention` (red). The full-attention projection is again slightly below the MoBA projection throughout, and the solid curves bend from a steep initial descent onto the projected slopes between roughly 1 and 10 PFLOP/s-days.
</details>
(o) Scaling law (28-30k)
<details>
<summary>x28.png Details</summary>

### Visual Description
Log-log line chart with the same layout as the other panels: `PFLOP/s-days` (10^-1 to 10^1) on the x-axis, LM loss (1 to 6) on the y-axis, solid empirical training curves, and dashed power-law projections for `MoBA` (blue) and `Full Attention` (red). For this longest position range the two projections and all solid runs converge tightly at the high-compute end, ending in a narrow band near a loss of 1.0-1.1 at ~10 PFLOP/s-days.
</details>
(p) Scaling law (30-32k)
Figure 8: Scaling laws for positions 16-32k
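The dashed projections in these panels are power-law fits of the form $L(C) = a \times C^{b}$, which plot as straight lines on log-log axes. A minimal sketch of how such a projection can be fitted from (compute, loss) samples, using synthetic points generated from the 30-32K fit rather than the paper's actual measurements:

```python
import numpy as np

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b, done linearly in log-log space."""
    b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(log_a), b

# Synthetic samples lying exactly on L(C) = 1.546 * C**-0.108 (illustrative only).
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
loss = 1.546 * compute ** -0.108

a, b = fit_power_law(compute, loss)
print(a, b)  # recovers a ≈ 1.546, b ≈ -0.108
```

Because the fit is linear in log space, noisy measurements at low compute (where the solid curves sit above the projection) would pull the fit upward; the paper's projections are presumably fitted to the converged high-compute regime.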
Table 3: Fitted loss scaling laws at different positions
| Position | MoBA | Full Attention |
| --- | --- | --- |
| 0K - 2K | $3.075 \times C^{-0.078}$ | $3.068 \times C^{-0.078}$ |
| 2K - 4K | $2.415 \times C^{-0.084}$ | $2.411 \times C^{-0.083}$ |
| 4K - 6K | $2.085 \times C^{-0.081}$ | $2.077 \times C^{-0.081}$ |
| 6K - 8K | $1.899 \times C^{-0.092}$ | $1.894 \times C^{-0.092}$ |
| 8K - 10K | $1.789 \times C^{-0.091}$ | $1.774 \times C^{-0.089}$ |
| 10K - 12K | $1.721 \times C^{-0.092}$ | $1.697 \times C^{-0.087}$ |
| 12K - 14K | $1.670 \times C^{-0.089}$ | $1.645 \times C^{-0.088}$ |
| 14K - 16K | $1.630 \times C^{-0.089}$ | $1.600 \times C^{-0.087}$ |
| 16K - 18K | $1.607 \times C^{-0.090}$ | $1.567 \times C^{-0.087}$ |
| 18K - 20K | $1.586 \times C^{-0.091}$ | $1.542 \times C^{-0.087}$ |
| 20K - 22K | $1.571 \times C^{-0.093}$ | $1.519 \times C^{-0.086}$ |
| 22K - 24K | $1.566 \times C^{-0.089}$ | $1.513 \times C^{-0.085}$ |
| 24K - 26K | $1.565 \times C^{-0.091}$ | $1.502 \times C^{-0.085}$ |
| 26K - 28K | $1.562 \times C^{-0.095}$ | $1.493 \times C^{-0.088}$ |
| 28K - 30K | $1.547 \times C^{-0.097}$ | $1.471 \times C^{-0.091}$ |
| 30K - 32K | $1.546 \times C^{-0.108}$ | $1.464 \times C^{-0.097}$ |
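Each fitted law can be evaluated directly to compare the two projections at a given budget. A small sketch using the 30-32K row (the column assignment, MoBA first and full attention second, is an assumption consistent with the MoBA projection sitting above the full-attention projection in the panels above):

```python
def loss(a, b, C):
    """Evaluate a fitted scaling law L(C) = a * C**b, with C in PFLOP/s-days."""
    return a * C ** b

# 30K-32K fits from Table 3 (column assignment assumed: MoBA first, full attention second).
moba = lambda C: loss(1.546, -0.108, C)
full = lambda C: loss(1.464, -0.097, C)

for C in (0.1, 1.0, 10.0):
    print(f"C={C:>4}: MoBA={moba(C):.3f}  full={full(C):.3f}  gap={moba(C) - full(C):+.3f}")
```

Evaluated this way, the gap between the two fits shrinks from roughly 0.15 loss at 0.1 PFLOP/s-days to about 0.03 at 10 PFLOP/s-days, matching the visual convergence of the projection lines at high compute.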