# Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
## Abstract
In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, because the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) judiciously distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach determines the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah achieves up to a 7.6 $\times$ speedup in the dominant LLM decoding phase over the sequential decoding approach on the NVIDIA Jetson NX.
Index Terms: Edge Intelligence, LLM Inference, Speculative Decoding, Unified Memory Architecture

Footnote: Jinhui Wei and Ye Huang contributed equally to this work and are considered co-first authors. Yuhui Zhou participated in this work as a research intern at Sun Yat-sen University. *Corresponding author: Jiangsu Du (dujiangsu@mail.sysu.edu.cn).
## I Introduction
Deploying large language models (LLMs) directly on end-user devices, such as mobile phones and laptops, is highly attractive. In-situ inference can eliminate data privacy concerns and reduce reliance on external infrastructures, thereby boosting intelligence and autonomy. However, achieving fast LLM inference remains a significant challenge, constrained by both algorithmic parallelism and hardware capability.
To begin with, end-user devices typically process only a single request at a time, and the traditional decoding approach, shown in Figure 1, generates tokens sequentially, one at a time. With the KV cache technique [1], the dominant decoding phase of LLM inference has very limited parallelism while still requiring all model weights to be loaded in each iteration. Thus, the limited parallelism and memory-bandwidth-bound nature prevent full utilization of modern hardware. Alternatively, the speculative decoding approach [2, 3, 4] adopts a draft-then-verify decoding process. In each decoding step, speculative decoding first drafts multiple tokens as predictions for future steps and then verifies in parallel whether to accept the drafted tokens. It parallelizes the inherently sequential process, showing strong potential to accelerate token generation by trading compute for bandwidth.
After the algorithmic parallelism is enhanced, LLM inference becomes bounded by the limited hardware capability of end-user devices. To fully leverage these devices, it is promising to distribute workloads across their heterogeneous processing units [5, 6, 7, 8]. For instance, the GPU of the Apple M4 Pro [9] delivers 17.04 TFLOPS of FP16 performance, while its CPU provides around 38 TFLOPS of FP16 performance. However, existing systems [10, 3] do not align with the computing pattern of speculative decoding or the unified memory architecture of end-user devices, resulting in inefficiency.
Figure 1: The autoregressive process of LLM inference with the KV cache technique. The decoding phase generally dominates the execution time.
First, existing model partitioning approaches [11, 12] for distributing workloads across devices focus excessively on minimizing inter-device communication overhead and overlook the opportunity for different processing units to access the same memory. When LLM inference is distributed across multiple devices, explicit communication is required, making the reduction of communication overhead a top priority. In contrast, end-user devices typically integrate multiple heterogeneous processing units on a single chip and organize them within a unified memory architecture. Well-known examples include Apple's M-series SoCs [9], Intel Core Ultra [13], and NVIDIA Jetson [14]. Under the unified memory architecture, minimizing memory access by carefully organizing the partitioning strategy and data placement becomes the top priority instead.
Second, existing systems treat speculative decoding the same as traditional decoding with a large batch size, paying little attention to its specific computing pattern, particularly its hybrid computational demands. The attention mechanism in traditional decoding calculates the correlation between every pair of tokens, whereas speculative decoding only needs to calculate correlations between a subset of token pairs. Given the extremely high capability of accelerators in cloud systems and the high cost of transferring data to other processing units in a discrete memory architecture, this sparsity is typically handled as dense computation using a mask mechanism. This oversight misses opportunities to exploit the computing affinity of different processing units, such as the CPU.
Third, algorithmic studies on speculative decoding primarily focus on increasing the acceptance length (the average number of tokens accepted in a single step), overlooking the balance between acceptance length and limited computing resources. Verifying more combinations of drafted tokens yields a longer acceptance length but requires more FLOPs to generate a single token. Compared to cloud hardware, end-user devices offer fewer resources for parallel verification, and verifying too many combinations can degrade performance. Moreover, modern hardware widely leverages vectorization techniques, enabling the parallel processing of multiple data elements. It is therefore crucial to select a verification width tailored to the specific hardware, achieving the optimal balance between acceptance length and hardware parallelism.
In this paper, we introduce Ghidorah, a system specifically designed to accelerate single-sample LLM inference in end-user devices with a unified memory architecture. It explores speculative decoding approaches to tackle the memory-bandwidth limitations inherent in the single-sample decoding process. Next, it distributes the novel workloads across multiple heterogeneous processing units in end-user devices, taking into account the hardware architecture and computing patterns. In summary, we make the following contributions:
- We identify the performance bottlenecks in the single-sample LLM inference and explore speculative decoding approaches.
- We present Ghidorah, a large language model inference system for end-user devices. It incorporates the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach to better employ the speculative decoding and align with the unified memory architecture.
- We implement customized sparse computation optimizations for ARM CPUs.
Experimental results show that Ghidorah achieves up to a 7.6 $\times$ speedup over the sequential decoding approach on the NVIDIA Jetson NX. This improvement is attributed to a 3.27 $\times$ algorithmic enhancement and a 2.31 $\times$ parallel speedup.
## II Background and Motivation
### II-A LLM Generative Task Inference
LLMs are primarily used for generative tasks, where inference involves generating new text based on a given prompt. As shown in Figure 1, the model predicts the next token iteratively until the end-of-sequence (EoS) token is generated. By maintaining the KV cache in memory, modern LLM serving systems eliminate most redundant computation during this iterative process. The inference process is divided into prefill and decode phases based on whether the KV cache has been generated. The prefill phase handles the newly incoming prompt and initializes the KV cache. Since the prompt typically consists of many tokens, this phase has high parallelism. With the KV cache in place, each step of the decode phase processes only the single token generated by the previous step and appends to the KV cache. In end-user scenarios, single-sample inference processes only one token during each iteration of the decoding phase. The decoding phase is therefore heavily constrained by memory bandwidth and cannot fully exploit hardware capability. Moreover, the decoding phase typically dominates the overall inference process.
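To make the prefill/decode split concrete, the minimal sketch below runs a toy autoregressive loop with an explicit KV cache; the `model(ids, kv_cache)` interface, the greedy sampling, and the tensor shapes are illustrative assumptions rather than the interface of any particular framework.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, eos_id, max_new_tokens=128):
    """Toy autoregressive loop: one prefill pass, then per-token decode steps.

    `model(ids, kv_cache)` is a hypothetical interface returning the logits of
    the last position and the updated KV cache.
    """
    # Prefill: all prompt tokens are processed in one highly parallel pass,
    # which also builds the initial KV cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_id = logits.argmax(dim=-1)           # greedy sampling for simplicity
    generated = [next_id]

    # Decode: each step feeds exactly one new token, yet still reads all model
    # weights, which is why this phase is memory-bandwidth-bound.
    for _ in range(max_new_tokens - 1):
        if next_id.item() == eos_id:
            break
        logits, kv_cache = model(next_id.view(1, 1), kv_cache=kv_cache)
        next_id = logits.argmax(dim=-1)
        generated.append(next_id)
    return torch.cat([t.view(1) for t in generated])
```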
### II-B Collaborative Inference and Transformer Structure
Distributing inference workload across various devices or processing units is a common strategy for enhancing performance.
#### II-B 1 Data, Pipeline and Sequence Parallelism
Data parallelism (DP) [15] partitions workloads along the sample dimension, allowing each device to perform inference independently. Pipeline parallelism (PP) [7] horizontally partitions the model into consecutive stages along the layer dimension, with each stage placed on a device. Since DP and PP are mainly used in high-throughput scenarios and cannot reduce the execution time of a single sample, they are not suitable for end-user scenarios. Sequence parallelism (SP) [16, 12] is designed for long-context inference and divides the input along the sequence dimension. It requires each device to load the entire model weights, making it advantageous only for extremely long sequences. We do not consider it within the scope of our work.
#### II-B 2 Tensor Parallelism and Transformer Structure
Figure 2: The partitioning [11] of Transformer model in the matrix form.
Tensor parallelism partitions model weights across different devices [11, 12] or processing units [6, 17], and it is promising for reducing the execution time of a single sample. The Transformer [18] structure is the primary backbone of LLMs, consisting of multiple stacked Transformer layers, all configured identically. In a Transformer layer, the primary components are the multi-head attention (MHA) block and the multi-layer perceptron (MLP) block. Here we ignore element-wise operations such as dropout, residual addition, and layernorm. In the MHA block, the first linear layer generates the query (Q), key (K), and value (V) matrices for each attention head. Each head independently conducts self-attention as in Equation 1, and the head outputs are concatenated and further processed by the next linear layer. The MLP block involves two consecutive linear layers.
$$
X=QK^{T},\qquad A=\mathrm{softmax}(X),\qquad O=AV \tag{1}
$$
Figure 2 shows a single Transformer layer partitioned using the cloud TP solution of Megatron-LM [11]. For every two consecutive linear layers, it partitions the weights of the first by columns and those of the second by rows. An AllReduce communication is inserted as the synchronization point for every pair of linear layers. The attention module in the MHA block is partitioned by heads, with different attention heads assigned to different devices.
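The column/row split can be checked numerically. The sketch below is a single-process illustration of this Megatron-style partitioning of an MLP block, with ReLU standing in for the actual activation and arbitrary shapes; the sum over partitions plays the role of the AllReduce.

```python
import torch

def mlp_tensor_parallel(x, w1, w2, n_parts=2):
    """Megatron-style split of two consecutive linear layers (biases omitted).

    w1 is split by columns and w2 by rows, so each partition computes an
    independent partial result; summing the partials stands in for the single
    AllReduce synchronization point.
    """
    w1_cols = torch.chunk(w1, n_parts, dim=1)    # column split of the first layer
    w2_rows = torch.chunk(w2, n_parts, dim=0)    # row split of the second layer
    partials = [torch.relu(x @ a) @ b for a, b in zip(w1_cols, w2_rows)]
    return sum(partials)                         # "AllReduce" across partitions

# The partitioned result matches the unpartitioned computation.
x, w1, w2 = torch.randn(1, 16), torch.randn(16, 64), torch.randn(64, 16)
assert torch.allclose(mlp_tensor_parallel(x, w1, w2),
                      torch.relu(x @ w1) @ w2, atol=1e-5)
```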
### II-C Speculative Decoding
(a) Medusa Verification Tree.
(b) Medusa [3].
(c) Draft model [2].
(d) Lookahead [4].
Figure 3: Sparsity visualization of the $X=Q\times K^{T}$ for speculative decoding approaches. Light yellow indicates points that require computation, while dark blue represents points that do not need computation.
To overcome the sequential limitation, speculative decoding is introduced as an alternative. To achieve higher parallelism in decoding steps, it primarily employs a predict-then-verify mechanism, in which predictions for multiple future steps are drafted in parallel, verified by the target LLM, and generation falls back to the longest validated prefix. There are independent drafting [2, 19] and self-drafting [3, 20, 4] approaches. Independent drafting employs a smaller model with significantly reduced computation to generate predictions. Self-drafting makes predictions by appending extra heads or using Jacobi iteration. After multiple candidates are drafted, the candidate tokens are combined under a given pattern and fed into the target model for verification. Notably, self-drafting approaches combine prediction and verification into a single step. For example, Medusa [3] predicts candidates for future token slots by appending additional heads at the tail of the LLM. In Figure 3(a), Head1 has two candidates and Head2 has three candidates. We aim to verify which combinations of these tokens can be accepted. Here the verification width in Figure 3(a) is 8, and only a subset of token correlations needs to be calculated in the attention module.
We observe significant sparsity within the attention module when performing these speculative decoding approaches. Under a given pattern, a token needs to compute relationships only with certain other tokens. This sparsity is typically handled as dense computation using the mask mechanism. As illustrated in Figures 3(b), 3(c), and 3(d), we present the sparsity visualization of the $Q\times K^{T}$ result matrix for three popular approaches, namely the draft model [2], Medusa [3], and Lookahead [4]. Only the light yellow points represent computations worth performing, highlighting the significant sparsity in the right part. The softmax and $A\times V$ operations exhibit the same sparsity pattern.
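A minimal sketch of the two ways of handling this sparsity is given below, under made-up shapes and a stand-in tree pattern: the dense masked computation commonly used in cloud systems, and a COO-style computation that evaluates only the token pairs allowed by the verification tree.

```python
import torch

def masked_dense_scores(q, k_cache, k_new, tree_mask):
    """Dense QK^T with a verification-tree mask (the usual cloud-side approach)."""
    k = torch.cat([k_cache, k_new], dim=0)                  # (ctx + width, d)
    scores = q @ k.t()                                       # everything is computed
    keep = torch.cat([torch.ones(q.size(0), k_cache.size(0), dtype=torch.bool),
                      tree_mask], dim=1)
    return scores.masked_fill(~keep, float("-inf"))

def sparse_new_token_scores(q, k_new, tree_mask):
    """Only the token pairs allowed by the verification tree are evaluated (COO style)."""
    rows, cols = tree_mask.nonzero(as_tuple=True)
    vals = (q[rows] * k_new[cols]).sum(dim=-1)               # one dot product per pair
    return rows, cols, vals

# Toy setup: 64 cached tokens, verification width 8, head dimension 32.
q, k_cache, k_new = torch.randn(8, 32), torch.randn(64, 32), torch.randn(8, 32)
tree_mask = torch.tril(torch.ones(8, 8, dtype=torch.bool))  # stand-in tree pattern
```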
### II-D Unified Memory Architecture
Figure 4: The unified memory architecture and the discrete architecture.
End-user devices [21, 9, 13, 14, 22, 23], such as consumer electronics and embedded systems, prioritize power efficiency, cost-effectiveness, and compact design, usually adopting the unified memory architecture. We show the unified memory architecture and discrete architecture in Figure 4. End-user devices typically adopt a unified DRAM architecture shared across multiple processing units, rather than discrete memory modules. This design eliminates explicit data transfer via PCIe bus and enables a zero-copy mechanism between processing units. While this significantly reduces data synchronization overhead, concurrent memory access still requires manual consistency management by the programmer. On the NVIDIA Jetson Xavier NX, memory page synchronization between processing units takes less than 0.1 milliseconds when accessing memory written by another unit.
## III Ghidorah Design
### III-A Overview
Figure 5: The overview of Ghidorah.
Ghidorah is specifically designed for single-sample LLM generative task inference on end-user devices with a unified memory architecture. It primarily focuses on accelerating the dominant decoding phase. Since the decoding phase is constrained by memory bandwidth, Ghidorah explores speculative decoding approaches to enhance parallelism and alleviate the bandwidth bottleneck. This manuscript takes the multi-head drafting approach, Medusa [3], as the default. Ghidorah distributes the speculative decoding workloads across all processing units and chooses the optimal verification width for the end-user device to maximize inference speed.
Figure 5 illustrates the overview of Ghidorah. Leveraging the unified memory architecture of end-user devices and the computation pattern of speculative decoding, Ghidorah employs a novel hetero-core model parallelism (HCMP) architecture and an architecture-aware profiling (ARCA) approach. Functionally, the HCMP architecture focuses on providing runtime support, while the ARCA approach is primarily designed for the preprocessing phase. The HCMP architecture provides guidance for model partitioning and includes customized sparse matrix multiplication optimizations. Building on this runtime support, the ARCA approach performs a preprocessing phase that runs once before deployment. It runs inference on calibration data across the heterogeneous processing units of the end-user device. By comprehensively considering acceptance length, hardware parallelism, and resource contention, this phase formulates the speculative and partitioning strategies that maximize performance.
### III-B Hetero-core Model Parallelism Architecture
Figure 6: The illustration of Hetero-Core Model Parallelism.
The HCMP architecture is designed for efficiently distributing speculative decoding workloads across multiple heterogeneous processing units organized in the unified memory architecture. Figure 6 illustrates the HCMP architecture by presenting an inference example conducted across two processing units, CPU and GPU. It shows the memory layout and computation processes separately. Instead of focusing on reducing inter-device communication, the HCMP architecture prioritizes minimizing memory access and exposing computation affinity across processing units.
#### III-B 1 Linear Layer - Split weights by column
The existing tensor parallelism approach [11] is designed to distribute workloads across multiple separate devices, requiring explicit inter-device communication to resolve data dependencies. Reducing the frequency of tensor synchronization can significantly improve parallel efficiency. When this setup is migrated to end-user devices, for every two consecutive linear layers the weights of the first are split by columns and those of the second by rows, so only a single all-reduce synchronization is required for the two layers. However, this all-reduce operation necessitates additional memory access: it must read activations generated by different processing units and write the combined results back. The HCMP architecture instead opts to split the weights of all linear layers by columns. As shown in Figure 6, the CPU and GPU receive the same input, multiply it by different columns of the weights, and then write the outputs to their designated memory regions without enforcing consistency between them. This allows the result to be used by the next operation without any additional data access.
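As a single-process illustration (not Ghidorah's actual kernels), the sketch below splits one linear layer by columns, with each simulated processing unit writing its slice into its own region of a shared output buffer; the split ratios and shapes are arbitrary assumptions.

```python
import torch

def column_split_linear(x, w, splits=(0.75, 0.25)):
    """HCMP-style column-only split, simulated in a single process.

    Each "processing unit" multiplies the same input by its own column slice of
    the weights and writes into its designated slice of a shared output buffer
    in unified memory; no all-reduce is needed because the next layer consumes
    the concatenated activation directly.
    """
    out = torch.empty(x.size(0), w.size(1), dtype=x.dtype)   # shared output buffer
    bounds = [0] + [int(round(sum(splits[:i + 1]) * w.size(1)))
                    for i in range(len(splits))]
    bounds[-1] = w.size(1)                                    # last unit takes the rest
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        out[:, lo:hi] = x @ w[:, lo:hi]                       # per-unit column slice
    return out
```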
#### III-B 2 Attention Module - Split each head with affinity
The HCMP architecture is designed to exploit computation affinity across different processing units, taking into account the hybrid computational intensity inherent in speculative decoding. The existing tensor parallelism approach divides the computation of the attention module by heads, with each device responsible for different heads. However, since only the correlations among a subset of tokens need to be verified in speculative decoding, each head of the attention module contains both dense and sparse computation. As shown in Figure 6, each attention head can be divided into a dense part, primarily the multiplication with the KV cache, and a sparse part, which mainly involves multiplication with the newly generated keys and values. The dense part is prioritized for processing units with high parallelism, such as the GPU, while the sparse part is prioritized for processing units with low parallelism, such as the CPU. To achieve load balance, each partition may optionally include a portion of the other part's computation. As shown in Figures 3(b), (c), and (d), the left boundary of the sparse part tends to be denser and can be preferentially included in the dense part.
Furthermore, inspired by Ring Attention [24] and FlashAttention [25], we introduce the online softmax technique to extend the span that can be computed continuously. As shown in Equation 1, the computation of $O=AV$ on one processing unit cannot begin until $X=QK^{T}$ on another processing unit is completed, as the softmax operation requires data from all processing units. With the online softmax technique [24, 25], each processing unit can have its own softmax operation, aligning their results by applying a scaling factor at the end of the attention module. This scaling operation can be fused with the reduce operation, introducing almost no overhead.
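For completeness, a minimal sketch of the merge step is shown below, assuming each partition returns its un-normalized output together with its row-wise score maximum and normalizer (the variable names are ours, not Ghidorah's); rescaling by the global maximum recovers the exact softmax result.

```python
import torch

def merge_partial_softmax(o1, m1, s1, o2, m2, s2):
    """Merge two partial attention outputs computed over disjoint key sets.

    Each partition i supplies its un-normalized output o_i = exp(X_i - m_i) @ V_i,
    its row-wise score maximum m_i, and its normalizer s_i = sum(exp(X_i - m_i)).
    Rescaling by the global maximum recovers the exact softmax(QK^T) V result.
    """
    m = torch.maximum(m1, m2)                         # global row-wise maximum
    a1, a2 = torch.exp(m1 - m), torch.exp(m2 - m)     # per-partition scaling factors
    s = a1 * s1 + a2 * s2                             # global normalizer
    return (a1.unsqueeze(-1) * o1 + a2.unsqueeze(-1) * o2) / s.unsqueeze(-1)
```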
#### III-B 3 Customized ARM SpMM Optimization
Figure 7: ARM SpMM Optimization.
ARM CPUs are widely used in end-user devices due to their power efficiency, and we implement customized optimizations for the sparse computations in the self-attention module, i.e., $QK^{T}$ and $AV$. In detail, knowing which token correlations need to be verified, we generate indices in the COO sparse format before performing inference. We optimize the sparse operations from two aspects, namely vectorization and memory access reduction.
Figure 7 illustrates the optimization. In the $QK^{T}$ computation, the data access for both $Q$ and $K$ is row-wise, supporting continuous data access. Leveraging the ARM NEON instruction set, we use 128-bit vector instructions to combine memory accesses with consecutive FMA calculations. For the output matrix, each result value is kept in a register until it is fully accumulated, reducing load/store operations. In the $AV$ computation, we adjust the execution order to ensure continuous memory access to matrix $V$. Instead of multiplying with each column of matrix $V$, each non-zero element $A[i,j]$ is multiplied with row $j$ of matrix $V$ and accumulated into row $i$ of the output matrix $O$; once all non-zero elements in row $i$ have been processed, row $i$ of the output is complete. Also, since register capacity is limited, we divide this process into blocks, processing only part of the elements from $V$ to $O$ at a time, so that intermediate results stay in registers and load/store operations are reduced.
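The reordering of the $AV$ product is easiest to see in scalar form. The sketch below (plain NumPy, without the NEON vectorization or the blocking described above) shows the row-contiguous accumulation; the array names and COO layout are assumptions for illustration.

```python
import numpy as np

def spmm_av_row_major(rows, cols, vals, V, n_rows):
    """Sparse A (COO) times dense V with row-contiguous access to V and O.

    Every non-zero A[i, j] is multiplied with the whole row V[j] and accumulated
    into row i of O; once all non-zeros of row i are consumed, row i of O is
    complete. The NEON kernel additionally blocks this loop so the accumulator
    rows stay in registers.
    """
    O = np.zeros((n_rows, V.shape[1]), dtype=V.dtype)
    for i, j, a in zip(rows, cols, vals):
        O[i] += a * V[j]      # contiguous row read of V, contiguous row write of O
    return O
```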
### III-C Architecture-aware Profiling Approach
The ARCA approach comprehensively considers the acceptance length, hardware parallelism and resource contention to finally generate the speculative strategy and the partitioning strategy. Initially, for a specific speculative decoding approach, we explore verification trees of all verification widths for the optimal acceptance lengths by combining accuracy-based estimation with brute-force search. Next, we determine the verification width and partitioning ratio for the final deployment, which is both parallelism- and contention-aware.
#### III-C 1 Speculative Strategy Determination
The speculative strategy in Ghidorah consists of the verification width and the verification tree. The verification width is the computing budget of a single decoding step, defined as the total number of tokens to be verified, while the verification tree determines the specific combinations of tokens to be verified. Both influence the acceptance length. As the verification width increases, it becomes more likely that acceptable tokens are identified; however, if the width is too large, the marginal benefit to the acceptance length diminishes. Also, some specific verification routes are more likely to be accepted. For instance, the closer a head is to Head0, the more likely its tokens are to be accepted. We design verification trees for different verification widths.
Figure 8: The verification tree determination with a verification width of 16.
Figure 8 illustrates the tree design process with a verification width of 16 (nodes), where the candidate tokens are generated by 4 heads of Medusa [3]. In detail, we use a calibration dataset to measure the accuracies of the top predictions of the different heads. We then estimate the accuracy of a candidate sequence by multiplying the accuracies of its constituent tokens, and obtain the overall expected acceptance length by accumulating the expected acceptance lengths of all candidate sequences. To maximize this expectation, we add the nodes with the highest accuracies one by one until the given verification width is reached. In Figure 8, the estimated tree is indicated by red lines. We further employ a brute-force search around the estimated tree and compare the real acceptance lengths to determine the final tree; as shown in Figure 8, the search covers leaf nodes and nodes at the same level.
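As a rough illustration of the accuracy-based estimation, the sketch below greedily grows a verification tree from per-head calibration accuracies, under the simplifying assumption that the expected acceptance length is the sum of the path probabilities of the selected nodes; the names and data layout are hypothetical, and the subsequent brute-force refinement is not shown.

```cpp
#include <queue>
#include <vector>

// acc[h][k]: calibration accuracy of the top-(k+1) prediction of Medusa head h.
// A node's probability is the product of accuracies along its path from the root.
struct Node { double prob; int head; int rank; int parent; };

// Greedily pick the `width` nodes with the highest path probabilities.
// A node only enters the frontier after its parent has been accepted,
// so the result is always a valid tree.
std::vector<Node> build_estimated_tree(
    const std::vector<std::vector<double>>& acc, int width) {
  auto cmp = [](const Node& a, const Node& b) { return a.prob < b.prob; };
  std::priority_queue<Node, std::vector<Node>, decltype(cmp)> frontier(cmp);

  for (int k = 0; k < static_cast<int>(acc[0].size()); ++k)
    frontier.push({acc[0][k], 0, k, -1});          // children of the root (head 0)

  std::vector<Node> tree;
  while (static_cast<int>(tree.size()) < width && !frontier.empty()) {
    Node n = frontier.top();
    frontier.pop();
    int id = static_cast<int>(tree.size());
    tree.push_back(n);                             // accept the most probable node
    if (n.head + 1 < static_cast<int>(acc.size())) // expand it with the next head
      for (int k = 0; k < static_cast<int>(acc[n.head + 1].size()); ++k)
        frontier.push({n.prob * acc[n.head + 1][k], n.head + 1, k, id});
  }
  return tree;                                     // the estimated verification tree
}
```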
#### III-C 2 Parallelism-Aware Profiling
The ARCA approach is parallelism-aware. Modern processing units leverage vectorization techniques to boost computational efficiency, which causes LLM inference latency to change in a stepwise fashion as the batch size increases, a phenomenon referred to as wave quantization [26] on NVIDIA hardware. Consequently, different processing units exhibit distinct performance sweet spots. In our preliminary experiments, we observe that setting the candidate verification widths to powers of two, specifically 2, 4, 8, 16, 32, and 64, aligns well with the vectorization capabilities of these devices and results in more efficient execution. This observation can be explained as follows: the execution time of general matrix multiplication (GeMM) operations, including linear layers and attention, generally dominates LLM workloads. When applying model parallelism, the hidden-state dimension is partitioned across processing units, while the token dimension (the verification width) remains unchanged. As a result, the split GeMM operations executed on different processing units share the same token dimension and collectively span the hidden-state dimension. Next, we determine the verification trees for these candidate widths and evaluate the resulting speculative strategies with the runtime support; this determination process is contention-aware.
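A toy example of the resulting shapes (the 0.75 ratio and the layer size are arbitrary assumptions): when one linear layer of a hidden-size-4096 model is split between the GPU and CPU, both partial GeMMs keep the token dimension equal to the verification width and only divide the hidden dimension.

```cpp
#include <cstdio>

int main() {
  const int width        = 16;    // verification width: token dimension of the GeMM
  const int hidden       = 4096;  // model hidden size (assumed, e.g. a 7B model)
  const double gpu_ratio = 0.75;  // assumed GPU share of the hidden dimension

  const int gpu_cols = static_cast<int>(hidden * gpu_ratio);
  const int cpu_cols = hidden - gpu_cols;

  // Both partial GeMMs share the same token dimension (width) and together
  // cover the full hidden dimension, so a vectorization-friendly width
  // benefits both processing units simultaneously.
  std::printf("GPU GeMM: [%d x %d] x [%d x %d]\n", width, hidden, hidden, gpu_cols);
  std::printf("CPU GeMM: [%d x %d] x [%d x %d]\n", width, hidden, hidden, cpu_cols);
  return 0;
}
```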
#### III-C 3 Contention-Aware Profiling
Although the partitioning guidance of the HCMP architecture does not increase the overall memory access within a single step of speculative decoding, simultaneous access from different processing units leads to a resource contention problem. Since this contention affects different processing units to varying degrees, it is difficult to determine the partitioning ratio analytically. Therefore, the ARCA approach initializes the partitioning strategy based on the individual execution times of the processing units and determines the final partitioning strategy for a given verification width through gradual adjustments. Besides, as the context (KV cache) length impacts the sparsity ratio in the attention module, we additionally vary the context length when profiling the attention module to support its dynamic partitioning.
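The following sketch illustrates one way such an initialization-plus-adjustment could look; the helper `measure`, the step sizes, and the hill-climbing loop are our assumptions rather than the exact ARCA procedure. The GPU's initial share is derived from the standalone execution times and then refined by measuring co-executed steps.

```cpp
#include <algorithm>
#include <functional>
#include <initializer_list>

// t_gpu_alone / t_cpu_alone: standalone times of one decoding step on each unit.
// measure(r): latency of one co-executed step with GPU share r of the hidden dim.
double tune_partition_ratio(double t_gpu_alone, double t_cpu_alone,
                            const std::function<double(double)>& measure) {
  // Initialization: work inversely proportional to standalone execution time,
  // i.e. contention is ignored at first.
  double best_ratio = t_cpu_alone / (t_gpu_alone + t_cpu_alone);
  double best_time  = measure(best_ratio);

  // Gradual adjustment: probe neighboring ratios with shrinking steps, since
  // memory contention shifts the optimum away from the initial guess.
  for (double step = 0.05; step >= 0.01; step /= 2.0) {
    for (double delta : {+step, -step}) {
      double cand = std::min(1.0, std::max(0.0, best_ratio + delta));
      double t = measure(cand);
      if (t < best_time) { best_time = t; best_ratio = cand; }
    }
  }
  return best_ratio;  // GPU share of the hidden dimension for this width
}
```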
## IV Evaluation
We implement a prototype system of Ghidorah in C/C++. We utilize the CUDA kernels from the highly optimized Transformer inference system FasterTransformer [27] for the GPU and rebuild the ARM CPU code based on CTranslate2 [28]. In this section, we evaluate Ghidorah from two aspects: 1) Verification efficiency: we evaluate the acceptance length of our speculative strategies, and 2) Overall performance: we compare the single-sample inference of Ghidorah with that of state-of-the-art approaches. Here we focus only on the dominant decoding phase of LLM inference.
### IV-A Experimental Setup
Model and Speculative Decoding. We evaluate Ghidorah with Vicuna-7B [29] and speculative decoding approach Medusa [3]. Medusa originally offers a 5-head version of Vicuna-7B and we design our own verification trees to evaluate the acceptance length under different verification widths. Vicuna-7B is a fine-tuned version from LLaMA-7B [30], which is a widely-recognized model architecture in the AI community.
Datasets. We evaluate Ghidorah for the acceptance length across a broad spectrum of datasets, including: 1) MT-Bench [29]: a diverse collection of multi-turn questions, 2) GSM8K [31]: a set of mathematical problems, 3) HumanEval [32]: code completion and infilling tasks, and 4) MBPP [33]: instruction-based code generation.
Node testbed. We use the NVIDIA Jetson NX, a CPU-GPU edge device with a unified memory architecture. It consists of an embedded 384-core Volta GPU and a 6-core ARM v8.2 CPU. We lock the GPU at 204 MHz and the CPU at 1.9 GHz to simulate end-user devices with more balanced capabilities across heterogeneous processing units. Specifically, we use g++ 7.5, CUDA 10.2, and Arm Performance Libraries 24.10.
Metrics. We use the acceptance length and decoding throughput as the evaluation metrics. The acceptance length refers to the average number of tokens that the target model accepts in a single decoding step. The decoding throughput is the number of tokens generated in a fixed interval.
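The two metrics are directly related: if $\tau(w)$ denotes the acceptance length at verification width $w$ and $T(w)$ the latency of one decoding step, the decoding throughput is approximately

$$\text{throughput}(w) \approx \frac{\tau(w)}{T(w)},$$

so a larger verification width only helps when the gain in $\tau(w)$ outweighs the growth of $T(w)$.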
Baseline Methods. We compare Ghidorah with the following approaches:
- Sequential: The sequential decoding approach running on the GPU, which is generally limited by the algorithmic parallelism.
- Medusa: The Medusa decoding approach running on the GPU; it adopts our verification trees under the given verification widths.
- Medusa+EM (Medusa + EdgeNN [34] + Megatron [11]): The Medusa decoding approach distributed across the CPU and GPU using the tensor-parallel partitioning guidance of Megatron-LM. Additionally, we apply the zero-copy optimization to reduce synchronization overhead and determine the partitioning ratio based on the execution times of the processing units, as in EdgeNN.
### IV-B Acceptance Length Evaluation
TABLE I: Acceptance length under given verification widths.
| Dataset / Verification width | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MT-bench | 1 | 1.72 | 2.28 | 2.59 | 2.93 | 3.19 | 3.34 |
| GSM8K | 1 | 1.76 | 2.43 | 2.69 | 3.08 | 3.34 | 3.56 |
| MBPP | 1 | 1.78 | 2.54 | 2.89 | 3.27 | 3.55 | 3.74 |
| HumanEval | 1 | 1.77 | 2.49 | 2.80 | 3.19 | 3.48 | 3.71 |
We present the acceptance lengths for various datasets under different verification widths. Notably, MT-bench is used as the calibration dataset for determining the verification trees. These verification trees are then applied to the three other datasets to obtain their acceptance lengths. Table I shows the results. Larger verification widths generally result in longer acceptance lengths, but the benefit weakens as the verification width grows and comes at the cost of significantly increased computational effort. Furthermore, we find that these verification trees exhibit strong generality, achieving even better acceptance lengths on GSM8K, MBPP, and HumanEval when migrated from MT-bench. This can be attributed to MT-bench's comprehensive nature, as it includes conversations across eight different categories.
Figure 9: The overall performance under different verification widths.
### IV-C Overall Performance Evaluation
This section compares the decoding throughput of Ghidorah against that of the baseline methods with verification widths ranging from 4 to 64. We select a context length (KV cache length) of approximately 256. Figure 9 shows the normalized results. In all cases shown in Figure 9, Ghidorah delivers a significant speedup in decoding throughput; in particular, when the verification width is 16, Ghidorah's speedup reaches up to 7.6 $\times$ over sequential decoding.
The impressive performance improvement of Ghidorah can be attributed to two main factors. First, the increased acceptance length enabled by speculative decoding. Since the sequential decoding approach is memory-bandwidth-bound, its per-step execution time is similar to that of Medusa, yet its acceptance length is fixed at 1. In contrast, Medusa generates multiple tokens in a single step within a similar execution time, and as the verification width increases, its acceptance length also grows. Second, the collaborative operation of multiple heterogeneous processing units better serves the Medusa decoding workload. Comparing Medusa, Medusa+EM, and Ghidorah, we find that utilizing heterogeneous processing units yields a better acceleration effect than merely running Medusa on the GPU. Taking the MBPP dataset as an instance, Ghidorah achieves an average speedup of 2.06 $\times$ over running Medusa solely on the GPU, and an average speedup of 1.20 $\times$ over Medusa+EM. The improvement over Medusa+EM is attributed to reduced memory access and improved computing affinity.
Longer acceptance lengths do not always translate into higher throughput, making it essential for the ARCA approach to balance algorithmic parallelism with hardware capability. As shown in Figure 9, Medusa and Ghidorah achieve their best throughput at verification widths of 64 and 16, respectively. This is because the CPU and GPU have different sweet spots: the GPU maintains a similar execution time for verification widths from 4 to 64 and thus benefits from the longer acceptance length at width 64, whereas the CPU maintains a similar execution time only from 4 to 16. Since increasing the verification width from 16 to 64 yields only a 0.47 increase in acceptance length, Ghidorah attains its optimal throughput at a verification width of 16.
### IV-D Dynamic Partitioning and Sparse Optimization
(a) The attention module performance.
(b) ARM Optim.
Figure 10: Performance of Sparse Optimization
This section presents the effectiveness of the dynamic partitioning and the sparsity optimization. Since dynamic partitioning only impacts the attention module, Figure 10(a) reports the attention module performance with a verification width of 64. Notably, the attention module becomes dominant in the Transformer model as the context length increases. Static refers to the configuration in which all sparse computations are processed on the CPU and all dense computations on the GPU. Dynamic indicates that sparse computations are primarily handled by the CPU and dense computations primarily by the GPU, with the partitioning dynamically guided by the profiling results. Dynamic partitioning demonstrates obvious improvements when dealing with large context lengths.
Figure 10(b) illustrates the performance of the sparse component executed with three different strategies: a naive sparse implementation, our optimized sparse implementation, and a dense implementation that treats the sparse computation as dense. The optimized sparse implementation outperforms the naive sparse and dense implementations by 3.49 $\times$ and 1.90 $\times$, respectively. Moreover, without our optimization, the naive sparse execution cannot even outperform the dense one.
## V Related Work
Collaborative Execution of Transformer. Tensor Parallelism [1] and Pipeline Parallelism [35] effectively address memory limitations by aggregating memory from more devices and have gained widespread adoption for inference in datacenters. They are also used in end-user environments. Galaxy [12] hides communication within computation in the tensor parallelism approach for edge devices, while DeTransformer [36] fundamentally mitigates data dependency by decoupling the original model. Petals [37] utilizes pipeline parallelism to distribute LLM inference across geographically distributed end-user devices.
Integrated CPU-GPU Scheduling and Execution. A large body of prior work focuses on improving device utilization on integrated CPU-GPU architectures. CoDL [5] and EdgeNN [34] perform collaborative convolutional neural network inference on mobile CPUs and GPUs with a unified memory architecture. AsyMo [38] schedules workloads based on the computing power of asymmetric mobile CPU cores to achieve higher throughput. Pipe-It [7] achieves higher throughput by organizing ARM big.LITTLE cores in a pipelined manner. However, none of these approaches are designed for Transformer-based models.
Speculative Decoding Systems. Speculative decoding is an innovative paradigm designed to accelerate LLM inference. Existing LLM inference systems [10] have started to integrate speculative decoding approaches, treating them as ordinary LLM inference with a large batch size. Sequoia [39] explores ultra-wide verification widths for advanced hardware in the cloud. LLMCad [40] adopts the independent-drafting speculative decoding approach to accommodate memory-constrained devices: it utilizes a smaller model to generate drafts, thereby reducing data loading time for the larger model.
## VI Conclusion
Ghidorah focuses on deploying LLM inference on end-user devices with the unified memory architecture. To fully exploit the hardware capability, Ghidorah leverages speculative decoding approaches to enhance the algorithmic parallelism and then distributes this novel workload across multiple heterogeneous processing units. With the help of the HCMP architecture and ARCA approach, it can achieve up to 7.6 $\times$ speedup in the dominant decoding phase. In the future, Ghidorah will be open-sourced and extended to support more speculative decoding approaches and a wider range of end-user devices.
## References
- [1] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
- [2] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi et al., “Specinfer: Accelerating large language model serving with tree-based speculative inference and verification,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 932–949.
- [3] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple llm inference acceleration framework with multiple decoding heads,” arXiv preprint arXiv:2401.10774, 2024.
- [4] Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Break the sequential dependency of LLM inference using lookahead decoding,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=eDjvSFOkXw
- [5] F. Jia, D. Zhang, T. Cao, S. Jiang, Y. Liu, J. Ren, and Y. Zhang, “Codl: efficient CPU-GPU co-execution for deep learning inference on mobile devices,” in MobiSys ’22: The 20th Annual International Conference on Mobile Systems, Applications and Services, Portland, Oregon, 27 June 2022 - 1 July 2022, N. Bulusu, E. Aryafar, A. Balasubramanian, and J. Song, Eds. ACM, 2022, pp. 209–221. [Online]. Available: https://doi.org/10.1145/3498361.3538932
- [6] C. Zhang, F. Zhang, K. Chen, M. Chen, B. He, and X. Du, “Edgenn: Efficient neural network inference for cpu-gpu integrated edge devices,” in 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2023, pp. 1193–1207.
- [7] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, “High-throughput cnn inference on embedded arm big. little multicore processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2254–2267, 2019.
- [8] Y. Cho, P. Jiyeon, N. Florian, J. Changyeon, T. R. Gross, and B. Egger, “Dopia: online parallelism management for integrated cpu/gpu architectures,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 32–45.
- [9] “Apple m4,” https://en.wikipedia.org/wiki/Apple_M4, 2024.
- [10] “vllm github,” https://github.com/vllm-project/vllm, 2024.
- [11] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
- [12] S. Ye, J. Du, L. Zeng, W. Ou, X. Chu, Y. Lu, and X. Chen, “Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference,” arXiv preprint arXiv:2405.17245, 2024.
- [13] “Intel core ultra processor family,” https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html, 2024.
- [14] “Meet jetson, the platform for ai at the edge,” https://developer.nvidia.com/embedded-computing, 2024.
- [15] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” Advances in Neural Information Processing Systems, vol. 27, 2014.
- [16] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models,” 2023. [Online]. Available: https://arxiv.org/abs/2309.14509
- [17] F. Jia, D. Zhang, T. Cao, S. Jiang, Y. Liu, J. Ren, and Y. Zhang, “Codl: efficient cpu-gpu co-execution for deep learning inference on mobile devices.” in MobiSys, vol. 22, 2022, pp. 209–221.
- [18] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
- [19] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F. Kagy, and R. Agarwal, “Distillspec: Improving speculative decoding via knowledge distillation,” arXiv preprint arXiv:2310.08461, 2023.
- [20] Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE: speculative sampling requires rethinking feature uncertainty,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=1NdN7eXyb4
- [21] “Apple a17,” https://en.wikipedia.org/wiki/Apple_A17, 2024.
- [22] “Amd reveals next-gen desktop processors for extreme pc gaming and creator performance,” https://www.amd.com/en/newsroom/press-releases, 2024.
- [23] “Qualcomm snapdragon,” https://en.wikipedia.org/wiki/Qualcomm_Snapdragon, 2024.
- [24] H. Liu et al., “Ring attention with blockwise transformers for near-infinite context,” arXiv preprint arXiv:2310.01889, 2023.
- [25] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.
- [26] “Wave quantization,” https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#wave-quant, 2024.
- [27] Nvidia, “Nvidia fastertransformer: Transformer related optimization, including bert, gpt,” in Github, 2023.
- [28] “Ctranslate2,” https://github.com/OpenNMT/CTranslate2, 2024.
- [29] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46 595–46 623, 2023.
- [30] H. Touvron et al., “Llama: Open and efficient foundation language models,” 2023. [Online]. Available: https://arxiv.org/abs/2302.13971
- [31] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [32] M. Chen, J. Tworek, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- [33] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021.
- [34] C. Zhang, F. Zhang, K. Chen, M. Chen, B. He, and X. Du, “Edgenn: Efficient neural network inference for CPU-GPU integrated edge devices,” in 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023. IEEE, 2023, pp. 1193–1207. [Online]. Available: https://doi.org/10.1109/ICDE55515.2023.00096
- [35] A. Agrawal et al., “Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 117–134.
- [36] Y. Wei, S. Ye, J. Jiang, X. Chen, D. Huang, J. Du, and Y. Lu, “Communication-efficient model parallelism for distributed in-situ transformer inference,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6.
- [37] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin, Y. Belkada, A. Chumachenko, P. Samygin, and C. Raffel, “Petals: Collaborative inference and fine-tuning of large models,” arXiv preprint arXiv:2209.01188, 2022.
- [38] M. Wang, S. Ding, T. Cao, Y. Liu, and F. Xu, “Asymo: scalable and efficient deep-learning inference on asymmetric mobile cpus,” in ACM MobiCom ’21: The 27th Annual International Conference on Mobile Computing and Networking, New Orleans, Louisiana, USA, October 25-29, 2021. ACM, 2021, pp. 215–228. [Online]. Available: https://doi.org/10.1145/3447993.3448625
- [39] Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen, “Sequoia: Scalable, robust, and hardware-aware speculative decoding,” arXiv preprint arXiv:2402.12374, 2024.
- [40] D. Xu, W. Yin, X. Jin, Y. Zhang, S. Wei, M. Xu, and X. Liu, “Llmcad: Fast and scalable on-device large language model inference,” CoRR, vol. abs/2309.04255, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.04255