2505.23219
# Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
Abstract
In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, because the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) judiciously distributing workloads across multiple heterogeneous processing units to maximize the utilization of computing power. Ghidorah comprises the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach determines the optimal speculative strategy and partitioning strategy, balancing the acceptance rate against parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah achieves up to a 7.6$\times$ speedup in the dominant LLM decoding phase over the sequential decoding approach on the NVIDIA Jetson NX.
Index Terms: Edge Intelligence, LLM Inference, Speculative Decoding, Unified Memory Architecture

Footnote: Jinhui Wei and Ye Huang contributed equally to this work and are considered co-first authors. Yuhui Zhou participated in this work as a research intern at Sun Yat-sen University. * Corresponding author: Jiangsu Du (dujiangsu@mail.sysu.edu.cn).
I Introduction
Deploying large language models (LLMs) directly on end-user devices, such as mobile phones and laptops, is highly attractive. In-situ inference can eliminate data privacy concerns and reduce reliance on external infrastructures, thereby boosting intelligence and autonomy. However, achieving fast LLM inference remains a significant challenge, constrained by both algorithmic parallelism and hardware capability.
To begin with, end-user devices typically process only a single request at a time, and the traditional decoding approach, shown in Figure 1, generates tokens sequentially, one per step. With the KV cache technique [1], the dominant decoding phase of LLM inference has very limited parallelism, yet it requires loading all model weights in every iteration. The limited parallelism and memory-bandwidth-bound nature thus prevent full utilization of modern hardware. Alternatively, the speculative decoding approach [2, 3, 4] adopts a draft-then-verify decoding process. In each decoding step, it first drafts multiple tokens as predictions for future steps and then verifies in parallel whether the drafted tokens can be accepted. This parallelizes the inherently sequential process, showing strong potential to accelerate token generation by trading compute for bandwidth.
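The draft-then-verify loop can be sketched in a few lines. The `draft_model`/`target_model` functions below are toy stand-ins for real networks (the arithmetic inside them is purely illustrative), but the control flow — cheap sequential drafting followed by one parallel verification pass that keeps the longest correct prefix plus the target model's own fix-up token — mirrors the process described above:

```python
# Toy stand-ins for the real networks: both map a token sequence to the
# next token. The formulas are arbitrary; only the control flow matters.
def draft_model(tokens):
    return (tokens[-1] * 3 + 1) % 50          # cheap, sometimes wrong

def target_model(tokens):
    # agrees with the draft most of the time, diverges when last % 4 == 0
    return (tokens[-1] * 3 + 1) % 50 if tokens[-1] % 4 else (tokens[-1] + 7) % 50

def speculative_step(tokens, k=4):
    """One speculative decoding step: draft k tokens, verify with the target.

    Verification of all k drafted positions happens in ONE target-model
    pass on real hardware (simulated here by a loop); the step accepts the
    longest prefix matching what the target model would itself produce.
    """
    draft, ctx = [], list(tokens)
    for _ in range(k):                        # sequential but cheap drafting
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    accepted, ctx = [], list(tokens)
    for t in draft:                           # the parallel verification pass
        correct = target_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)          # fix-up token from the same pass
            break
    return accepted                           # always >= 1 token per step
```

A step that accepts all drafts emits k+1 tokens for a single target-model pass, which is exactly how bandwidth-bound sequential decoding is traded for parallel compute.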
After the algorithmic parallelism is enhanced, LLM inference becomes limited by the hardware capability of end-user devices. To fully leverage these devices, it is promising to distribute workloads across their heterogeneous processing units [5, 6, 7, 8]. For instance, the GPU of the Apple M4 Pro [9] delivers 17.04 TFLOPS of FP16 performance, while its CPU delivers around 38 TFLOPS of FP16 performance. However, existing systems [10, 3] align with neither the computing pattern of speculative decoding nor the unified memory architecture of end-user devices, resulting in inefficiency.
Figure 1: The autoregressive process of LLM inference with the KV cache technique. The decoding phase generally dominates the execution time.
First, existing model partitioning approaches [11, 12] for distributing workloads across devices focus excessively on minimizing inter-device communication overhead and overlook the opportunity for different processing units to access the same memory. When LLM inference is distributed across multiple devices, explicit communication is required, making the reduction of communication overhead the top priority. In contrast, end-user devices typically integrate multiple heterogeneous processing units on a single chip and organize them within a unified memory architecture; well-known examples include Apple's M-series SoCs [9], Intel Core Ultra [13], and NVIDIA Jetson [14]. Under a unified memory architecture, the top priority instead becomes minimizing memory accesses by carefully organizing the partitioning strategy and data placement.
Second, existing systems treat speculative decoding the same as traditional decoding with a large batch size, paying little attention to its specific computing pattern, particularly its hybrid computational demands. The attention mechanism in traditional decoding calculates the correlation between every pair of tokens, whereas speculative decoding only requires correlations between a subset of token pairs. Given the extremely high capability of accelerators in cloud systems and the high cost of transferring data to other processing units in a discrete memory architecture, this sparsity is often handled as dense computation using a mask mechanism. This oversight misses opportunities to exploit the computing affinity of different processing units, such as the CPU.
Third, algorithmic studies on speculative decoding primarily focus on increasing the acceptance length (the average number of tokens accepted in a single step), overlooking the balance between acceptance length and limited computing resources. Verifying more combinations of drafted tokens yields a longer acceptance length but costs more FLOPs per generated token. Compared to cloud hardware, end-user devices offer fewer resources for parallel verification, and verifying too many combinations can degrade performance. In addition, modern hardware widely leverages vectorization, enabling parallel processing of multiple data elements. It is therefore crucial to select a verification width tailored to the specific hardware, achieving the optimal balance between acceptance length and hardware parallelism.
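This trade-off can be made concrete under a simple, assumed model: drafted tokens are accepted independently with probability p (a geometric acceptance model), and verification is absorbed by hardware parallelism up to a hypothetical vector width before its cost grows stepwise. All constants below are illustrative assumptions, not measurements:

```python
import math

def expected_accepted(width, p=0.7):
    # Toy acceptance model: each drafted token in a chain of `width` is
    # accepted independently with probability p, so the expectation is
    # sum_{i=1..width} p^i, plus one token the target model emits anyway
    # at the first rejected slot.
    return sum(p ** i for i in range(1, width + 1)) + 1

def step_time(width, base=1.0, vector_width=4, unit=0.25):
    # Toy cost model: verifying up to `vector_width` tokens is free
    # (hidden by hardware parallelism); each further group of
    # `vector_width` tokens adds `unit` to the step time.
    return base + unit * math.ceil(max(width - vector_width, 0) / vector_width)

def best_width(widths=range(1, 17), p=0.7):
    # Choose the verification width maximizing tokens per unit time.
    return max(widths, key=lambda w: expected_accepted(w, p) / step_time(w))
```

With these assumed constants the throughput-optimal width coincides with the hardware's parallel width: beyond it, the marginal acceptance gain (which decays geometrically) no longer pays for the extra verification time.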
In this paper, we introduce Ghidorah, a system specifically designed to accelerate single-sample LLM inference in end-user devices with a unified memory architecture. It explores speculative decoding approaches to tackle the memory-bandwidth limitations inherent in the single-sample decoding process. Next, it distributes the novel workloads across multiple heterogeneous processing units in end-user devices, taking into account the hardware architecture and computing patterns. In summary, we make the following contributions:
- We identify the performance bottlenecks in the single-sample LLM inference and explore speculative decoding approaches.
- We present Ghidorah, a large language model inference system for end-user devices. It incorporates the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach to better employ the speculative decoding and align with the unified memory architecture.
- We implement customized sparse computation optimizations for ARM CPUs.
Experimental results show that Ghidorah achieves up to a 7.6$\times$ speedup over the sequential decoding approach on the NVIDIA Jetson NX. This improvement is attributed to a 3.27$\times$ algorithmic enhancement and a 2.31$\times$ parallel speedup.
II Background and Motivation
II-A LLM Generative Task Inference
LLMs are primarily used for generative tasks, where inference involves generating new text from a given prompt. As shown in Figure 1, the model predicts the next token and iterates until the end-of-sequence (EoS) token is produced. By maintaining the KV cache in memory, modern LLM serving systems eliminate most redundant computation during this iterative process. The inference process is divided into the prefill and decode phases, according to whether the KV cache has been generated. The prefill phase handles the newly incoming prompt and initializes the KV cache; since the prompt typically consists of many tokens, this phase has high parallelism. With the KV cache in place, each step of the decode phase processes only the one new token generated by the previous step and appends its entry to the KV cache. In end-user scenarios, single-sample inference processes only a single token during each iteration of the decoding phase, so the phase is heavily constrained by memory bandwidth and cannot fully exploit the hardware's capability. Moreover, the decoding phase typically dominates the overall inference time.
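The asymmetry between the two phases can be sketched in pure Python. This is a deliberately simplified model (the `embed` projection and the averaging "attention" below are placeholders, not a real Transformer); it only shows how prefill fills the cache in one parallel pass, while each decode step appends a single entry yet reads the entire cache:

```python
class TinyAttentionCache:
    """Stand-in for a per-layer KV cache: one (key, value) entry per token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def embed(token):
    # Placeholder for the real K/V projections of a Transformer layer.
    return float(token), float(token) * 0.5

def prefill(cache, prompt):
    # The whole prompt is processed in one pass: high parallelism.
    for tok in prompt:
        cache.append(*embed(tok))

def decode_step(cache, new_token):
    # Each decode step adds ONE new token but reads the whole cache (and,
    # in a real model, all weights), so it is memory-bandwidth-bound.
    cache.append(*embed(new_token))
    # toy "attention": average over all cached values
    return sum(cache.values) / len(cache.values)
```

The compute per decode step is tiny relative to the data it must stream in, which is precisely why the decode phase underutilizes parallel hardware.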
II-B Collaborative Inference and Transformer Structure
Distributing inference workload across various devices or processing units is a common strategy for enhancing performance.
II-B 1 Data, Pipeline and Sequence Parallelism
Data parallelism (DP) [15] partitions workloads along the sample dimension, allowing each device to perform inference independently. Pipeline parallelism (PP) [7] horizontally partitions the model into consecutive stages along the layer dimension, with each stage placed on a device. Since DP and PP are mainly used in high-throughput scenarios and cannot reduce the execution time of a single sample, they are not suitable for end-user scenarios. Sequence parallelism (SP) [16, 12] is designed for long-context inference and divides the input along the sequence dimension. It requires each device to load the entire model weights, making it advantageous only for extremely long sequences, so we do not consider it within the scope of our work.
II-B 2 Tensor Parallelism and Transformer Structure
Figure 2: The partitioning [11] of Transformer model in the matrix form.
Tensor parallelism partitions model weights across different devices [11, 12] or processing units [6, 17], and it is promising for reducing the execution time of a single sample. The Transformer [18] structure is the primary backbone of LLMs, consisting of multiple identically configured, stacked Transformer layers. In a Transformer layer, the primary components are the multi-head attention (MHA) block and the multi-layer perceptron (MLP) block; here we ignore element-wise operations such as dropout, residual addition, and layernorm. In the MHA block, the first linear layer generates the query (Q), key (K), and value (V) matrices for each attention head. Each head independently conducts self-attention as in Equation 1, and the heads' outputs are concatenated and further processed by the next linear layer. The MLP block consists of two consecutive linear layers.
$$
X=QK^{T},\quad A=\mathrm{softmax}(X),\quad O=AV \tag{1}
$$
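A direct NumPy transcription of Equation 1 for a single head might look as follows (the common $1/\sqrt{d}$ scaling is omitted to match the equation as written, and a per-row max subtraction is added only for numerical stability):

```python
import numpy as np

def self_attention(Q, K, V):
    """One attention head, following Eq. (1): X = QK^T, A = softmax(X), O = AV.

    Q, K, V are (seq_len, head_dim) arrays; the output has V's shape.
    """
    X = Q @ K.T                                 # correlations between token pairs
    X = X - X.max(axis=-1, keepdims=True)       # stability: softmax is shift-invariant
    A = np.exp(X) / np.exp(X).sum(axis=-1, keepdims=True)
    return A @ V
```

With all-zero queries the softmax rows become uniform, so each output row is simply the mean of the value rows — a handy sanity check.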
Figure 2 shows a single Transformer layer partitioned with the cloud TP solution of Megatron-LM [11]. For every two consecutive linear layers, it partitions the weights of the first by columns and those of the second by rows, and an AllReduce communication is inserted as the synchronization point after each such pair. The attention module in the MHA block is partitioned by heads, with different attention heads assigned to different devices.
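The column/row split and its single AllReduce synchronization point can be checked numerically. The sketch below simulates two "devices" with NumPy array slices and uses a plain sum to stand in for AllReduce; an elementwise activation between the layers would not change the result, since it acts on the column-partitioned intermediate:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((4, d))        # activations (seq_len, hidden)
W1 = rng.standard_normal((d, 2 * d))   # first linear layer
W2 = rng.standard_normal((2 * d, d))   # second linear layer

# Unpartitioned reference: two consecutive linear layers.
ref = (X @ W1) @ W2

# Megatron-style 2-way split: W1 by columns, W2 by rows. Each "device"
# computes a partial output; one AllReduce (here: a sum) restores it.
W1_cols = np.split(W1, 2, axis=1)      # column partition of the first layer
W2_rows = np.split(W2, 2, axis=0)      # row partition of the second layer
partials = [(X @ W1_cols[i]) @ W2_rows[i] for i in range(2)]
out = partials[0] + partials[1]        # the AllReduce synchronization point

assert np.allclose(out, ref)           # identical result, one comm per pair
```

The identity holds because $W_1 W_2 = \sum_i W_1^{(i)} W_2^{(i)}$ for matching column/row partitions, which is why only one synchronization per layer pair is needed.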
II-C Speculative Decoding
(a) Medusa Verification Tree.
(b) Medusa [3].
(c) Draft model [2].
(d) Lookahead [4].
Figure 3: Sparsity visualization of $X=Q\times K^{T}$ for speculative decoding approaches. Light yellow indicates points that require computation, while dark blue represents points that do not.
To overcome the sequential limitation, speculative decoding is introduced as an alternative. To achieve higher parallelism across decoding steps, it employs a predict-then-verify mechanism: predictions for multiple future steps are made in parallel, and the LLM then falls back to the longest validated prefix. There are independent-drafting [2, 19] and self-drafting [3, 20, 4] approaches. Independent drafting employs a smaller model with significantly reduced computation to generate predictions, while self-drafting makes predictions by appending extra heads or using Jacobi iteration. After drafting, the candidate tokens are combined under a given pattern and fed into the target model for verification. Notably, self-drafting approaches combine prediction and verification into a single step. For example, Medusa [3] predicts candidates for future token slots by appending additional heads at the tail of the LLM. In Figure 3(a), Head 1 has two candidates and Head 2 has three, and we aim to verify which combinations of these tokens can be accepted. The verification width in Figure 3(a) is thus 8, and only a subset of token correlations needs to be calculated in the attention module.
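For the Figure 3(a) example, the candidate tree and its sparse attention pattern can be enumerated directly. The token strings below are hypothetical placeholders, but the slot count and the ancestor-only mask follow the description above: Head 1 contributes its candidates as slots, each (Head 1, Head 2) pair contributes another, and a slot may attend only to nodes on its own tree path:

```python
from itertools import product

head1 = ["I", "It"]            # hypothetical Head 1 candidates
head2 = ["is", "am", "the"]    # hypothetical Head 2 candidates

# Flatten the candidate tree into verification slots: every Head-1 node
# plus every (Head-1, Head-2) pair -> 2 + 2*3 = 8 slots in one pass.
slots = [(h1,) for h1 in head1] + [(h1, h2) for h1, h2 in product(head1, head2)]
assert len(slots) == 8         # the verification width of Figure 3(a)

def tree_mask(slots):
    """Ancestor-only attention mask: slot i may attend to slot j iff
    j's path is a prefix of i's path (including i itself)."""
    n = len(slots)
    mask = [[False] * n for _ in range(n)]
    for i, si in enumerate(slots):
        for j, sj in enumerate(slots):
            mask[i][j] = si[: len(sj)] == sj
    return mask
```

Only 14 of the 64 slot pairs need computing here, which is the sparsity that a mask-based dense implementation silently wastes.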
Fundamentally, we observe significant sparsity within the attention module when executing these speculative decoding approaches: under a given pattern, a token needs to compute correlations only with certain other tokens. This sparsity is typically handled as dense computation via the mask mechanism. Figures 3(b), 3(c), and 3(d) visualize the sparsity of the $Q\times K^{T}$ result matrix for three popular approaches, namely the draft model [2], Medusa [3], and Lookahead [4]. Only the light yellow points represent computations worth performing, highlighting the significant sparsity in the right part of each plot. The softmax and $A\times V$ operations exhibit the same sparsity pattern.
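The gap between mask-based dense computation and genuinely sparse computation can be illustrated with a toy pattern; the `need` mask below is randomly generated for demonstration rather than taken from a real speculative-decoding tree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Toy sparsity pattern: True where the correlation is actually needed
# (a random causal subset, standing in for a verification-tree mask).
need = np.tril(np.ones((n, n), dtype=bool)) & (rng.random((n, n)) < 0.5)
np.fill_diagonal(need, True)

# Dense-with-mask path: compute everything, then discard masked entries.
dense = Q @ K.T
dense_flops = n * n * (2 * d)

# Sparse path: compute only the needed dot products.
sparse = np.zeros((n, n))
rows, cols = np.nonzero(need)
for i, j in zip(rows, cols):
    sparse[i, j] = Q[i] @ K[j]
sparse_flops = len(rows) * (2 * d)

assert np.allclose(sparse[need], dense[need])   # same useful entries
assert sparse_flops < dense_flops               # at a fraction of the FLOPs
```

On accelerators the dense path is often still faster because of regular memory access, but on CPUs with flexible control flow the sparse path can pay off — the computing-affinity argument made above.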
II-D Unified Memory Architecture
<details>
<summary>x7.png Details</summary>

</details>
Figure 4: The unified memory architecture and the discrete architecture.
End-user devices [21, 9, 13, 14, 22, 23], such as consumer electronics and embedded systems, prioritize power efficiency, cost-effectiveness, and compact design, and therefore usually adopt the unified memory architecture. Figure 4 contrasts the unified memory architecture with the discrete architecture. End-user devices typically share a single DRAM across multiple processing units rather than attaching discrete memory modules to each unit. This design eliminates explicit data transfer over the PCIe bus and enables a zero-copy mechanism between processing units. While this significantly reduces data synchronization overhead, concurrent memory access still requires manual consistency management by the programmer. On the NVIDIA Jetson Xavier NX, memory page synchronization between processing units takes less than 0.1 milliseconds when accessing memory written by another unit.
III Ghidorah Design
III-A Overview
<details>
<summary>x8.png Details</summary>

### Visual Description
## Diagram: Hetero-Core Model Parallelism
### Overview
The image presents a diagram illustrating a hetero-core model parallelism strategy, encompassing both preprocessing and runtime support phases. It details the flow from calibration datasets and LLM speculative decoding to partitioning strategies and customized ARM SpMM.
### Components/Axes
**Preprocessing Phase (Top)**:
* **Input**: Calibration dataset (represented by stacked cylinders) and LLM + Speculative Decoding (represented by a neural network).
* Legend (top-right):
* Green circle: It, I, As
* Blue circle: Is, the
* Orange circle: Difficult, is
* Red circle: not, difficult, a
* **Processing Units**: CPU, GPU, NPU, XPU (listed below the neural network). Images of a CPU, GPU, and Intel Core chip are shown.
* **Strategies**: Verification width & Tree Speculative Strategy, and Parallelism- & Contention-aware Partitioning Strategy.
* **Central Theme**: Architecture-aware Profiling (connecting the two strategies).
**Runtime Support (Bottom)**:
* **Operator Placement**: CPU (green), GPU (orange), XPU (blue).
* **Network Representation**: A network diagram with nodes colored according to operator placement.
* **Optimization Goals**: Memory access reduction, Computation affinity.
* **Partitioning Guide**: A dashed box containing the text "Partitioning Guide".
* **Hardware Utilization**:
* GPU Time (green): Represented by stacked blocks labeled "PE" and "GPU Cache".
* CPU Time (blue): Represented by stacked blocks labeled "core" and "CPU Cache".
* Synchronization (red): Represented by stacked blocks.
* "dram" is written vertically between the CPU Time and Synchronization blocks.
* **Vectorization**: Vectorization and Memory access reduction are listed as goals.
* **Customization**: Customized ARM SpMM (Sparse Matrix-Matrix Multiplication).
* **Central Theme**: Hetero-Core Model Parallelism.
### Detailed Analysis
**Preprocessing Phase**:
1. The process begins with a "Calibration dataset" and "LLM + Speculative Decoding".
2. These inputs are processed using CPU, GPU, NPU, and XPU.
3. Two strategies are employed: "Verification width & Tree Speculative Strategy" and "Parallelism- & Contention-aware Partitioning Strategy".
4. "Architecture-aware Profiling" serves as the central theme connecting these strategies.
**Runtime Support**:
1. "Operator Placement" is visualized with green (CPU), orange (GPU), and blue (XPU) nodes in a network diagram.
2. The network diagram aims for "Memory access reduction" and "Computation affinity".
3. A "Partitioning Guide" is used.
4. Hardware utilization is shown with "GPU Time", "CPU Time", and "Synchronization" blocks.
5. "Vectorization" and "Memory access reduction" are key goals.
6. The process culminates in "Customized ARM SpMM".
### Key Observations
* The diagram illustrates a two-phase approach: Preprocessing and Runtime Support.
* Heterogeneous computing resources (CPU, GPU, NPU, XPU) are utilized.
* Optimization goals include memory access reduction, computation affinity, and vectorization.
* Partitioning strategies are central to both phases.
### Interpretation
The diagram outlines a comprehensive approach to hetero-core model parallelism. The preprocessing phase focuses on profiling and partitioning, while the runtime support phase focuses on operator placement, memory access reduction, and efficient hardware utilization. The use of speculative decoding and customized ARM SpMM suggests an emphasis on performance optimization for specific hardware architectures. The diagram highlights the importance of architecture-aware profiling and partitioning strategies in achieving efficient model parallelism across heterogeneous computing resources.
</details>
Figure 5: The overview of Ghidorah.
Ghidorah is specifically designed for single-sample LLM generative inference on end-user devices with a unified memory architecture, and it primarily focuses on accelerating the dominant decoding phase. Since the decoding phase is constrained by memory bandwidth, Ghidorah explores speculative decoding approaches to enhance parallelism and alleviate the bandwidth bottleneck. This manuscript takes the multi-head drafting approach, Medusa [3], as the default. Ghidorah distributes the speculative decoding workloads across all processing units and chooses the optimal verification width for the end-user device to maximize inference speed.
Figure 5 illustrates the overview of Ghidorah. Leveraging the unified memory architecture of end-user devices and the computation pattern of speculative decoding, Ghidorah employs a novel hetero-core model parallelism (HCMP) architecture and an architecture-aware profiling (ARCA) approach. Functionally, the HCMP architecture provides runtime support, while the ARCA approach drives the preprocessing phase. The HCMP architecture guides model partitioning and includes customized sparse matrix multiplication optimizations. On top of this runtime support, the ARCA approach implements a preprocessing phase that runs once before deployment: it performs inference with calibration data on the device's heterogeneous processing units and, by jointly considering acceptance length, hardware parallelism, and resource contention, formulates the speculative and partitioning strategies that maximize performance.
III-B Hetero-core Model Parallelism Architecture
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: Hardware Acceleration of Attention Mechanism
### Overview
The image is a diagram illustrating the hardware acceleration of an attention mechanism, showing how computations are distributed across GPU, DRAM, and CPU. It details the flow of data through linear layers and an attention module, highlighting the splitting of weights and heads for parallel processing.
### Components/Axes
* **Title:** Hardware Acceleration of Attention Mechanism (inferred)
* **Vertical Labels (Left):** GPU, Dram, CPU
* **Horizontal Labels (Top):** Linear Layer, Attention Module, Linear Layer, Linear Layer, Linear Layer
* **Legend (Bottom):**
* Yellow Dashed Line: GPU Output
* Red Dashed Line: CPU Output
* Light Blue Fill: Split Weights by Column
* Light Green Fill: Split Each Head with Affinity
* **Attention Module Components:** cache\_V, cache\_K, V, K, Q, Online Softmax Reduce
* **Linear Layer Components:** W^B, W^C, W^D
### Detailed Analysis
The diagram is divided into vertical sections representing different stages of computation, and horizontal sections representing the hardware on which the computations are performed.
**GPU Section:**
* **Linear Layer 1 (Leftmost):** Input data (represented by stacked gray rectangles) is processed through a linear layer. The output is a blue rectangle.
* **Attention Module:**
* `cache_V` and `cache_K` (orange boxes) are inputs.
* The output of the attention module is a large orange rectangle after a Softmax operation.
* **Linear Layers 2-4:** The output from the attention module is processed through three subsequent linear layers. Each layer takes a blue rectangle as input and produces stacked gray rectangles as output.
**DRAM Section:**
* **Linear Layer 1:** Input data (stacked gray rectangles) is processed, resulting in a blue and green rectangle.
* **Attention Module:**
* `cache_V` and `cache_K` (orange boxes with red dashed borders) are inputs.
* V, K, and Q are computed.
* A matrix operation is performed, followed by "Online Softmax Reduce".
* **Linear Layers 2-4:** The output from the attention module is processed through three subsequent linear layers. Each layer takes a blue and green rectangle as input and produces stacked gray rectangles as output. The boxes containing W^B, W^C, and W^D have red dashed borders.
**CPU Section:**
* **Linear Layer 1:** Input data (stacked gray rectangles) is processed, resulting in a green rectangle.
* **Attention Module:**
* A matrix operation is performed, followed by a Softmax operation.
* **Linear Layers 2-4:** The output from the attention module is processed through three subsequent linear layers. Each layer takes a green rectangle as input and produces stacked gray rectangles as output.
**Data Flow:**
* Dashed gray lines indicate the flow of data between different hardware components and layers.
* Arrows indicate the direction of data flow.
### Key Observations
* The attention module is the most complex part of the diagram, involving multiple computations and data transformations.
* The GPU handles the initial and final linear layers, as well as part of the attention module.
* The DRAM handles intermediate computations within the attention module.
* The CPU handles the initial linear layer and part of the attention module.
* The splitting of weights and heads is indicated by the light blue and light green fills, respectively.
### Interpretation
The diagram illustrates a hardware-accelerated attention mechanism, where different parts of the computation are offloaded to different hardware components (GPU, DRAM, CPU) to improve performance. The attention module, being the most computationally intensive, is distributed across these components. The splitting of weights and heads further enables parallel processing, maximizing throughput. The use of GPU for initial and final linear layers suggests that these operations are well-suited for parallel processing on GPUs. The DRAM likely serves as a high-bandwidth memory for intermediate computations. The CPU handles specific parts of the computation, possibly those that are less amenable to parallelization or require more complex control logic. The diagram highlights the importance of hardware-software co-design for efficient implementation of attention mechanisms in deep learning models.
</details>
Figure 6: The illustration of Hetero-Core Model Parallelism.
The HCMP architecture is designed for efficiently distributing speculative decoding workloads across multiple heterogeneous processing units organized in the unified memory architecture. Figure 6 illustrates the HCMP architecture by presenting an inference example conducted across two processing units, CPU and GPU. It shows the memory layout and computation processes separately. Instead of focusing on reducing inter-device communication, the HCMP architecture prioritizes minimizing memory access and exposing computation affinity across processing units.
III-B 1 Linear Layer - Split weights by column
The existing tensor parallelism approach [11] is designed to distribute workloads across multiple separate devices, requiring explicit inter-device communication to resolve data dependencies; reducing the frequency of tensor synchronization can significantly improve parallel efficiency. Migrated to end-user devices, this setup splits, for every two linear layers, the weights of the first layer by columns and those of the second by rows, so only a single all-reduce synchronization is needed for the pair. However, this all-reduce necessitates additional memory access: it must read the activations generated by different processing units and write the combined results back. The HCMP architecture instead splits the weights of all linear layers by columns. As shown in Figure 6, the CPU and GPU receive the same input, multiply it by different columns of the weights, and write their outputs to designated memory regions without enforcing consistency between them. The result can thus be used by the next operation without any additional data access.
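The column-split scheme can be sketched with NumPy. The `W_gpu`/`W_cpu` names and the 75% ratio below are illustrative, not the system's actual partition; the point is that each unit writes a disjoint slice of the output buffer, so no all-reduce (and no extra read/combine/write pass) is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, width = 16, 12, 4       # width = verification width (tokens)
x = rng.standard_normal((width, d_in))
W = rng.standard_normal((d_in, d_out))

# Column split: both units read the same input, multiply it by their own
# slice of weight columns, and write to disjoint output regions.
ratio = 0.75                          # e.g. 75% of columns on the GPU
cut = int(d_out * ratio)
W_gpu, W_cpu = W[:, :cut], W[:, cut:]

out = np.empty((width, d_out))
out[:, :cut] = x @ W_gpu              # "GPU" partition
out[:, cut:] = x @ W_cpu              # "CPU" partition
```

Together the two slices reproduce the unsplit layer output exactly, with no combination step.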
III-B 2 Attention Module - Split each head with affinity
The HCMP architecture exploits computation affinity across different processing units, taking into account the hybrid computational intensity inherent in speculative decoding. The existing tensor parallelism approach divides the attention module by heads, with each device processing different heads. However, since only the correlations between partial tokens need to be verified in speculative decoding, each head of the attention module contains both dense and sparse computation. As shown in Figure 6, each attention head can be divided into a dense part, primarily the multiplication with the KV cache, and a sparse part, which mainly involves multiplication with the newly generated keys and values. The dense part is preferentially assigned to processing units with high parallelism, such as the GPU, while the sparse part is preferentially assigned to processing units with low parallelism, such as the CPU. To achieve load balance, each partition may optionally include a portion of the other part's computation. As shown in Figure 3(b)(c)(d), the left boundary of the sparse part tends to be denser and can be preferentially included in the dense part.
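A minimal sketch of this per-head split, assuming a simple causal mask as a stand-in for the actual verification-tree mask: the score columns over cached tokens form a regular dense GeMM, while the columns over the new candidate tokens are sparse.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_cache, n_new = 8, 32, 8
Q = rng.standard_normal((n_new, d))
K_cache = rng.standard_normal((n_cache, d))   # keys already in the KV cache
K_new = rng.standard_normal((n_new, d))       # keys of the candidate tokens

# Dense part: every candidate attends to every cached token -> a regular
# GeMM, well suited to a high-parallelism unit such as the GPU.
S_dense = Q @ K_cache.T                       # (n_new, n_cache)

# Sparse part: candidates attend only to their tree ancestors -> few
# entries are needed, a better fit for the CPU. A causal mask stands in
# for the verification-tree mask here.
tree_mask = np.tril(np.ones((n_new, n_new), dtype=bool))
S_sparse = np.where(tree_mask, Q @ K_new.T, -np.inf)

scores = np.concatenate([S_dense, S_sparse], axis=1)
```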
Furthermore, inspired by Ring Attention [24] and FlashAttention [25], we introduce the online softmax technique to extend the span that can be computed continuously. As shown in Equation 1, the computation of $O=AV$ on one processing unit cannot begin until $X=QK^{T}$ on another processing unit is completed, because the softmax operation requires data from all processing units. With the online softmax technique [24, 25], each processing unit performs its own softmax, and the results are aligned by applying a scaling factor at the end of the attention module. This scaling can be fused with the reduce operation, introducing almost no overhead.
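The online-softmax reduce can be checked numerically. In this sketch each "unit" softmaxes its own slice of the score row, and the final reduce rescales both partial results by $\exp(m_i - m)$ toward the common running maximum, matching a monolithic softmax over the full row:

```python
import numpy as np

def partial_softmax(S, V):
    """Per-unit softmax statistics: un-normalized output, local max, local sum."""
    m = S.max(axis=-1, keepdims=True)
    e = np.exp(S - m)
    return e @ V, m, e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d = 4, 8
S1, S2 = rng.standard_normal((n, 16)), rng.standard_normal((n, 16))
V1, V2 = rng.standard_normal((16, d)), rng.standard_normal((16, d))

# Each processing unit runs softmax over its own slice of the scores...
O1, m1, s1 = partial_softmax(S1, V1)
O2, m2, s2 = partial_softmax(S2, V2)

# ...and the reduce step rescales both partial results to a common max.
m = np.maximum(m1, m2)
a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
O = (a1 * O1 + a2 * O2) / (a1 * s1 + a2 * s2)

# Reference: the monolithic softmax over the concatenated score row.
S = np.concatenate([S1, S2], axis=1)
V = np.concatenate([V1, V2], axis=0)
ref = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (ref / ref.sum(axis=-1, keepdims=True)) @ V
```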
III-B 3 Customized ARM SpMM Optimization
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Attention Mechanism Data Flow
### Overview
The image illustrates the data flow within an attention mechanism, likely in a neural network. It shows matrix operations involving Query (Q), Key (K), Value (V), and Output (O) matrices, highlighting the use of cache and register memory.
### Components/Axes
* **Matrices:** The diagram features several matrices labeled as Q, K<sup>T</sup>, QK<sup>T</sup>+mask, A, V, and O. Each matrix is represented as a grid of cells.
* **Labels:** The labels "Cache" and "Register" indicate memory access patterns. Arrows point to specific rows or columns within the matrices to show where data is being accessed from cache or register.
* **Operators:** Multiplication symbols (x) and equality symbols (=) denote matrix operations.
### Detailed Analysis
The diagram can be broken down into two main rows, each representing a step in the attention mechanism.
**Row 1:**
1. **Q (Query):** A 7x7 matrix. A horizontal row in the middle is highlighted in orange, with an arrow labeled "Cache" pointing to it from the left.
2. **K<sup>T</sup> (Key Transpose):** A 7x7 matrix. A vertical column on the left is highlighted in green, with an arrow labeled "Cache" pointing to it from the top.
3. **QK<sup>T</sup>+mask:** A 7x7 matrix. The matrix contains a pattern of dark gray and light blue cells. A single cell in the top-left corner is highlighted in blue, with an arrow labeled "Register" pointing to it from the left.
**Row 2:**
1. **A:** A 7x7 matrix. The matrix contains a pattern of dark gray and light blue cells. A single cell in the top-left corner is highlighted in blue, with an arrow labeled "Register" pointing to it from the left.
2. **V (Value):** A 7x7 matrix. A horizontal row at the top is highlighted in yellow, with an arrow labeled "Cache" pointing to it from the right.
3. **O (Output):** A 7x7 matrix. A horizontal row in the middle is highlighted with varying shades of purple, with an arrow labeled "Register" pointing to it from the right.
### Key Observations
* The "Cache" arrows indicate that specific rows or columns of the matrices are being accessed from the cache memory.
* The "Register" arrows indicate that single elements of the matrices are being accessed from the register memory.
* The "QK<sup>T</sup>+mask" and "A" matrices have a similar pattern of dark gray and light blue cells, suggesting a masking operation.
* The "O" matrix has a gradient of purple, indicating a weighted sum or aggregation of values.
### Interpretation
The diagram illustrates the core operations of an attention mechanism. The Query (Q) and Key (K<sup>T</sup>) matrices are used to compute attention weights, which are then applied to the Value (V) matrix to produce the Output (O). The "mask" in "QK<sup>T</sup>+mask" suggests that certain elements are being masked out during the attention weight calculation. The use of cache and register memory highlights the importance of memory access patterns in optimizing the performance of the attention mechanism. The diagram suggests that the attention mechanism is processing data in a row-wise or column-wise manner, as indicated by the "Cache" arrows. The final output "O" is a weighted combination of the "V" matrix, where the weights are determined by the attention mechanism.
</details>
Figure 7: ARM SpMM Optimization.
ARM CPUs are widely used in end-user devices due to their power efficiency, and we implement customized optimizations for the sparse computations in the self-attention module, i.e., $QK^{T}$ and $AV$. In detail, knowing which token correlations are to be verified, we generate the index in the COO sparsity format before performing inference. We optimize the sparse operations from two aspects: vectorization and memory access reduction.
Figure 7 illustrates the optimization. In the $QK^{T}$ computation, both $Q$ and $K$ are accessed row-wise, which supports continuous data access. Leveraging the NEON instruction set of ARM, we use 128-bit vector instructions to combine memory accesses in consecutive FMA calculations. Each value of the output matrix is kept in a register until its final result is fully accumulated, reducing load/store operations. In the $AV$ computation, we adjust the execution order to ensure continuous memory access to matrix $V$: instead of multiplying with each column of $V$, each non-zero element $A[i,j]$ is multiplied with row $j$ of $V$ and accumulated into row $i$ of the output matrix $O$; once all non-zero elements in row $i$ have been processed, row $i$ of the output is complete. Since register capacity is limited, we further divide this process into blocks, processing only partial elements from $V$ to $O$ at a time, so that intermediate results stay in registers and load/store operations are reduced.
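The reordered $AV$ access pattern can be expressed compactly. This is a NumPy sketch of the traversal order only; the real kernel vectorizes with NEON registers and blocks the loop as described above:

```python
import numpy as np

def spmm_av(rows, cols, vals, V):
    """Row-ordered COO SpMM: O = A @ V for a sparse attention matrix A.

    Instead of dotting each row of A with every column of V, each
    non-zero A[i, j] scales row j of V and accumulates into row i of O,
    so V is always read row-by-row (contiguous memory access).
    """
    O = np.zeros((max(rows) + 1, V.shape[1]))
    for i, j, a in zip(rows, cols, vals):
        O[i] += a * V[j]        # accumulate into output row i
    return O

# A small masked attention matrix in COO form.
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.3, 0.5]])
rows, cols = np.nonzero(A)
V = np.random.default_rng(3).standard_normal((3, 4))
O = spmm_av(rows, cols, A[rows, cols], V)
```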
III-C Architecture-aware Profiling Approach
The ARCA approach jointly considers the acceptance length, hardware parallelism, and resource contention to generate the speculative strategy and the partitioning strategy. Initially, for a specific speculative decoding approach, we explore verification trees across all verification widths for the optimal acceptance lengths by combining accuracy-based estimation with brute-force search. Next, we determine the verification width and partitioning ratio for the final deployment in a parallelism- and contention-aware manner.
III-C 1 Speculative Strategy Determination
The speculative strategy in Ghidorah consists of the verification width and the verification tree. The verification width is the computing budget of a single decoding step, defined as the total number of tokens to be verified, while the verification tree determines the specific combinations of tokens to be verified. Both influence the acceptance length. As the verification width increases, acceptable tokens become more likely to be identified; however, beyond a certain width, the benefit to the acceptance length becomes limited. Moreover, some verification routes are more likely to be accepted than others; for instance, the closer a head is to HEAD0, the more easily its tokens are accepted. We therefore design verification trees for different verification widths.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Dependency Tree Diagram: Estimated Tree vs. All Combinations
### Overview
The image is a diagram illustrating a dependency tree structure, comparing an "Estimated Tree" with "All Combinations" within a "Brute-force search space." The diagram is organized into four levels labeled Head0 to Head3, representing hierarchical relationships between words.
### Components/Axes
* **Levels:** The diagram is divided into four horizontal levels, labeled Head0, Head1, Head2, and Head3 from top to bottom. These levels likely represent different stages or depths in the dependency parsing process.
* **Nodes:** Each level contains nodes representing words or phrases. The words are: "Since" (Head0), "it", "is", "there" (Head1), "is", "am", "a" (Head2), and "time", "a", "the" (Head3). Some nodes at Head3 are empty.
* **Edges:** Edges connect the nodes, indicating dependencies between words.
* **Estimated Tree:** Represented by solid red arrows, showing the most likely dependency relationships.
* **All Combinations:** Represented by dashed black arrows, showing all possible dependency relationships.
* **Brute-force search space:** Represented by a shaded orange area, highlighting the combinations considered during the search.
* **Legend:** Located at the bottom of the image.
* Red Arrow: "Estimated Tree"
* Dashed Black Arrow: "All Combinations"
* Orange Shaded Area: "Brute-force search space"
### Detailed Analysis
* **Head0:** Contains a single node with the word "Since".
* **Head1:** Contains three nodes: "it", "is", and "there". The "Since" node in Head0 has red arrows (Estimated Tree) pointing to each of these nodes.
* **Head2:** Contains nine nodes: three "is", three "am", and three "a". The "it" node in Head1 has red arrows (Estimated Tree) pointing to the first "is", "am", and "a" nodes. The "is" node in Head1 has red arrows (Estimated Tree) pointing to the second "is", "am", and "a" nodes. The "there" node in Head1 has red arrows (Estimated Tree) pointing to the third "is", "am", and "a" nodes. Dashed black arrows (All Combinations) also connect the Head1 nodes to the Head2 nodes.
* **Head3:** Contains a mix of filled and empty nodes. The filled nodes contain the words "time", "a", "the", "time", "a", "the". The first "is" node in Head2 has red arrows (Estimated Tree) pointing to the first "time", "a", and "the" nodes. The first "am" node in Head2 has red arrows (Estimated Tree) pointing to the second "time", "a", and "the" nodes. The first "a" node in Head2 has red arrows (Estimated Tree) pointing to the third "time", "a", and "the" nodes. Dashed black arrows (All Combinations) also connect the Head2 nodes to the Head3 nodes.
### Key Observations
* The "Estimated Tree" represents a simplified, likely correct, dependency structure.
* "All Combinations" shows the complexity of considering every possible dependency.
* The "Brute-force search space" highlights the computational cost of exploring all combinations.
* The diagram shows how the number of possible combinations increases as the depth of the tree increases.
### Interpretation
The diagram illustrates the challenge of dependency parsing, where the goal is to find the correct dependency structure among a vast number of possibilities. The "Estimated Tree" represents a heuristic or statistical approach to finding the most likely structure, while "All Combinations" represents the exhaustive search space that a brute-force approach would need to explore. The "Brute-force search space" visually emphasizes the computational complexity of dependency parsing and the need for efficient algorithms to find the correct structure. The diagram suggests that the algorithm starts with the word "Since" and then tries to find the dependencies between the words in the sentence. The red arrows show the most likely dependencies, while the black arrows show all possible dependencies. The orange shaded area shows the combinations that are considered during the search.
</details>
Figure 8: The verification tree determination with a verification width of 16.
Figure 8 illustrates the tree design process with a verification width of 16 (nodes), involving candidate tokens generated from 4 Medusa [3] heads. In detail, we use a calibration dataset to measure the accuracies of the top predictions of each head. We then estimate the accuracy of a candidate sequence as the product of the accuracies of its constituent tokens, and obtain the overall expected acceptance length by accumulating the expected acceptance lengths of the candidate sequences. To maximize this expectation, we add the nodes with the highest estimated accuracies one by one until the given verification width is reached. In Figure 8, the estimated tree is indicated by red lines. We further employ a brute-force search based on the estimated tree, comparing real acceptance lengths to determine the final tree; in Figure 8, we search over leaf nodes and nodes in the same level.
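The greedy estimation step can be sketched as follows. Here `head_top_acc` holds hypothetical per-head top-k accuracies measured on a calibration set; because a child's path accuracy (the product along its route) never exceeds its parent's, popping nodes from a max-heap always adds parents before children:

```python
import heapq

def estimate_tree(head_top_acc, width):
    """Greedily add the candidate nodes with the highest estimated path
    accuracy until `width` nodes are chosen.

    head_top_acc[h][k] is the accuracy of the (k+1)-th top prediction of
    head h. A node's path accuracy is the product of accuracies along
    its route, and the sum of chosen path accuracies approximates the
    expected acceptance length.
    """
    # Frontier entries: (neg path accuracy, head, rank, parent path).
    frontier = [(-head_top_acc[0][k], 0, k, ())
                for k in range(len(head_top_acc[0]))]
    heapq.heapify(frontier)
    chosen, expected = [], 0.0
    while frontier and len(chosen) < width:
        neg_p, h, k, path = heapq.heappop(frontier)
        p = -neg_p
        chosen.append(path + ((h, k),))
        expected += p                      # each accepted node adds one token
        if h + 1 < len(head_top_acc):      # expand children on the next head
            for k2, acc in enumerate(head_top_acc[h + 1]):
                heapq.heappush(frontier, (-(p * acc), h + 1, k2, path + ((h, k),)))
    return chosen, expected
```

For example, with two heads whose top-2 accuracies are `[0.8, 0.5]` and `[0.6, 0.3]` and a width of 3, the greedy pass picks the two first-head candidates plus the best child of the strongest one.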
III-C 2 Parallelism-Aware Profiling
The ARCA approach is parallelism-aware. Modern processing units leverage vectorization to boost computational efficiency, leading to stepwise improvements in LLM inference latency as the batch size increases, a phenomenon referred to as wave quantization [26] on NVIDIA hardware. Consequently, different processing units exhibit distinct performance sweet spots. In our preliminary experiments, we observe that setting candidate verification widths to powers of two, specifically 2, 4, 8, 16, 32, and 64, aligns well with the vectorization capabilities of these devices, resulting in more efficient execution. This can be explained as follows: the execution time of general matrix multiplication (GeMM) operations, including the linear layers and attention, generally dominates LLM workloads. When applying model parallelism, the hidden dimension is partitioned across processing units while the token dimension (the verification width) remains unchanged, so the split GeMM operations on different processing units share the same token dimension and collectively span the hidden dimension. For each candidate width we then determine its verification tree and evaluate the resulting speculative strategy with the runtime support; this evaluation is contention-aware.
III-C 3 Contention-Aware Profiling
Although the partitioning guidance of the HCMP architecture does not increase the overall memory access within a single speculative decoding step, simultaneous access from different processing units leads to resource contention. Since this contention affects different processing units to varying degrees, the partitioning ratio is difficult to determine analytically. The ARCA approach therefore initializes the partitioning strategy based on the individual execution times of the processing units and determines the final partitioning strategy for a given verification width through gradual adjustment. Moreover, since the context (KV cache) length affects the sparsity ratio in the attention module, we additionally adjust the context length for the attention module to support its dynamic partitioning.
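The gradual adjustment can be illustrated with a toy cost model (the `toy_step` numbers are hypothetical, standing in for real profiling runs). Starting from the ratio implied by the solo execution times, the search shifts work toward whichever unit finishes earlier until the two are balanced:

```python
def profile_partition(run_step, r, lr=0.02, iters=50):
    """run_step(r) -> (t_gpu, t_cpu) measured with contention at ratio r.

    Nudges the GPU work fraction r toward balance: if the CPU is the
    straggler, move work to the GPU, and vice versa.
    """
    for _ in range(iters):
        t_gpu, t_cpu = run_step(r)
        if abs(t_gpu - t_cpu) / max(t_gpu, t_cpu) < 0.01:
            break                            # balanced within 1%
        r += lr if t_cpu > t_gpu else -lr    # shift work to the faster unit
        r = min(max(r, 0.0), 1.0)
    return r

def toy_step(r):
    # Hypothetical cost model: splitting the work makes both units contend
    # for DRAM bandwidth, and the CPU suffers more from the contention.
    c = 0.4 * min(r, 1.0 - r)
    return 2.0 * r * (1.0 + c), 6.0 * (1.0 - r) * (1.0 + 2.0 * c)

# Initialize from solo execution times (GPU solo 2.0 s, CPU solo 6.0 s).
r0 = 6.0 / (2.0 + 6.0)
r = profile_partition(toy_step, r0, lr=0.005)
```

Because contention penalizes the CPU more in this toy model, the tuned ratio ends up slightly above the contention-free initialization.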
IV Evaluation
We implement a prototype of Ghidorah in C/C++. We use the CUDA kernels from the highly optimized Transformer inference system FasterTransformer [27] for the GPU and rebuild the ARM CPU code based on CTranslate2 [28]. In this section, we evaluate Ghidorah from two aspects: 1) verification efficiency, i.e., the acceptance length of our speculative strategies, and 2) overall performance, comparing the single-sample inference of Ghidorah with that of state-of-the-art approaches. We focus on the dominant decoding phase of LLM inference.
IV-A Experimental Setup
Model and Speculative Decoding. We evaluate Ghidorah with Vicuna-7B [29] and speculative decoding approach Medusa [3]. Medusa originally offers a 5-head version of Vicuna-7B and we design our own verification trees to evaluate the acceptance length under different verification widths. Vicuna-7B is a fine-tuned version from LLaMA-7B [30], which is a widely-recognized model architecture in the AI community.
Datasets. We evaluate Ghidorah for the acceptance length across a broad spectrum of datasets, including: 1) MT-Bench [29]: a diverse collection of multi-turn questions, 2) GSM8K [31]: a set of mathematical problems, 3) HumanEval [32]: code completion and infilling tasks, and 4) MBPP [33]: instruction-based code generation.
Node testbed. We use the NVIDIA Jetson Xavier NX, a CPU-GPU edge device with a unified memory architecture. It consists of an embedded 384-core Volta GPU and a 6-core ARM v8.2 CPU. We lock the GPU at 204 MHz and the CPU at 1.9 GHz to simulate end-user devices with more balanced capabilities across heterogeneous processing units. Specifically, we use g++ 7.5, CUDA 10.2, and Arm Performance Libraries 24.10.
Metrics. We use acceptance length and decoding throughput as the evaluation metrics. The acceptance length is the average number of tokens that the target model accepts in a single decoding step. The decoding throughput is the number of tokens generated within a fixed interval.
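For concreteness, both metrics reduce to simple ratios (a trivial sketch; the variable names are ours):

```python
def acceptance_length(accepted_per_step):
    # Average number of tokens the target model accepts per decoding step.
    return sum(accepted_per_step) / len(accepted_per_step)

def decoding_throughput(tokens_generated, interval_seconds):
    # Tokens generated per second over a fixed measurement interval.
    return tokens_generated / interval_seconds
```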
Baseline Methods. We compare Ghidorah with the following approaches:
- Sequential: The sequential decoding approach running on the GPU, which is generally limited by the algorithmic parallelism.
- Medusa: The Medusa decoding approach running on the GPU, adopting our verification trees at the given verification widths.
- Medusa+EM (Medusa + EdgeNN [34] + Megatron [11]): The Medusa decoding approach distributed across the CPU and GPU using the tensor-parallelism partitioning of Megatron-LM. Additionally, we apply zero-copy optimization to reduce synchronization overhead and determine the partitioning ratio based on the execution times of the processing units, as in EdgeNN.
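The Megatron-style partitioning used by the Medusa+EM baseline can be sketched as a column split of each weight matrix: each processing unit computes its own slice, and with unified memory the partial results simply sit side by side without copies (a pure-Python sketch; the naive GEMM and split ratio are illustrative only):

```python
def matmul(A, B):
    # Naive dense GEMM, for illustration only.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_gemm(x, W, gpu_share):
    # Column-split W as in Megatron-LM tensor parallelism: the GPU would
    # compute the first slice, the CPU the remainder.
    n = len(W[0])
    k = round(n * gpu_share)
    W_gpu = [row[:k] for row in W]   # slice assigned to the GPU
    W_cpu = [row[k:] for row in W]   # slice assigned to the CPU
    y_gpu, y_cpu = matmul(x, W_gpu), matmul(x, W_cpu)
    # Unified memory: "concatenation" is just the two results laid out
    # adjacently; no device-to-device copy is needed.
    return [rg + rc for rg, rc in zip(y_gpu, y_cpu)]
```

The split result matches the unpartitioned GEMM exactly, which is what makes the partitioning transparent to the rest of the model.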
IV-B Acceptance Length Evaluation
TABLE I: Acceptance length under given verification widths.
| Dataset \ Width | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MT-bench | 1 | 1.72 | 2.28 | 2.59 | 2.93 | 3.19 | 3.34 |
| GSM8K | 1 | 1.76 | 2.43 | 2.69 | 3.08 | 3.34 | 3.56 |
| MBPP | 1 | 1.78 | 2.54 | 2.89 | 3.27 | 3.55 | 3.74 |
| Human-eval | 1 | 1.77 | 2.49 | 2.80 | 3.19 | 3.48 | 3.71 |
We present the acceptance lengths for various datasets under different verification widths. Notably, MT-bench is used as the calibration dataset for determining the verification trees; these trees are then applied to the other three datasets to obtain their acceptance lengths. Table I shows the results. Larger verification widths generally yield longer acceptance lengths, but the benefit weakens as the verification width grows and comes at the cost of significantly increased computation. Furthermore, these verification trees exhibit strong generality, achieving even better acceptance lengths on GSM8K, MBPP, and Human-eval when migrated from MT-Bench. This can be attributed to MT-Bench's comprehensive nature, as it includes conversations across eight different categories.
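A back-of-the-envelope model helps explain why the benefit weakens: if draft position $i$ is accepted with probability $p_i$, and a draft token counts only when its whole prefix is accepted, the expected acceptance length is $1 + p_1 + p_1 p_2 + \dots$, so each added position contributes a geometrically shrinking amount. A single-path sketch (this ignores Medusa's tree structure, and the probabilities below are illustrative):

```python
def expected_acceptance_length(head_probs):
    # One token per step always comes from the target model itself;
    # draft position i is accepted only if positions 1..i all match,
    # so its contribution is the product of the prefix probabilities.
    exp_len, prefix = 1.0, 1.0
    for p in head_probs:
        prefix *= p
        exp_len += prefix
    return exp_len
```

For example, with per-position acceptance probabilities 0.8, 0.5, 0.5, the third draft position adds only 0.2 expected tokens, mirroring the diminishing returns visible in Table I.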
Figure 9: The overall performance under different verification widths.
IV-C Overall Performance Evaluation
This section compares the decoding throughput of Ghidorah against the baseline methods with verification widths from 4 to 64. We select a context length (KV cache length) of approximately 256. Figure 9 shows the normalized results. In all cases shown in Figure 9, Ghidorah delivers a significant speedup in decoding throughput; in particular, when the verification width is 16, its speedup reaches up to 7.6× compared with sequential decoding.
The impressive performance improvement of Ghidorah can be attributed to two main factors. First, the increased acceptance length enabled by speculative decoding. Since the sequential decoding approach is memory-bandwidth-bound, its execution time is similar to that of Medusa, yet its acceptance length is fixed at 1. The Medusa method, in contrast, generates multiple tokens in a single step within similar execution time, and as the verification width increases, its acceptance length grows. Second, the collaborative operation of multiple heterogeneous processing units better handles the Medusa decoding workload. Comparing Medusa, Medusa+EM, and Ghidorah, we find that utilizing heterogeneous processing units yields a better acceleration effect than merely running Medusa on the GPU. Taking the MBPP dataset as an example, Ghidorah achieves an average speedup of 2.06× over running Medusa solely on the GPU, and an average speedup of 1.20× over Medusa+EM. The improvement over Medusa+EM is attributed to reduced memory access and improved computing affinity.
Longer acceptance lengths do not always result in better throughput, making it essential for the ARCA approach to balance the algorithmic parallelism with hardware capabilities. As shown in Figure 9, Medusa and Ghidorah achieve their best throughput at verification widths of 64 and 16, respectively. This is because the CPU and GPU have different sweet spots: the GPU maintains a similar execution time from verification width 4 to 64 and attains a longer acceptance length at 64, whereas the CPU maintains a similar execution time only from 4 to 16. Since increasing the verification width from 16 to 64 yields only a 0.47 increase in acceptance length, Ghidorah achieves its optimal throughput at a verification width of 16.
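This trade-off amounts to maximizing tokens per second, i.e. acceptance length divided by step time. A sketch using the MBPP acceptance lengths from Table I together with *hypothetical* step times (in practice these come from ARCA profiling; the numbers below are ours and merely encode the "CPU slows down beyond width 16" behavior described above):

```python
def best_verification_width(acc_len, step_time):
    # Throughput per step ~ accepted tokens / wall time; pick the width
    # that maximizes this ratio.
    return max(acc_len, key=lambda w: acc_len[w] / step_time[w])

acc_len = {4: 2.54, 8: 2.89, 16: 3.27, 32: 3.55, 64: 3.74}   # MBPP, Table I
step_time = {4: 1.0, 8: 1.0, 16: 1.0, 32: 1.3, 64: 1.8}      # hypothetical
```

Under these assumed step times the maximizer is width 16, even though width 64 has the longest acceptance length.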
IV-D Dynamic Partitioning and Sparse Optimization
(a) The attention module performance.
(b) ARM Optim.
Figure 10: Performance of Sparse Optimization
This section presents the effectiveness of the dynamic partitioning and the sparsity optimization. Since dynamic partitioning only affects the attention module, Figure 10(a) reports the attention module performance with a verification width of 64. Notably, the attention module becomes dominant in the Transformer model as the context length increases. Static means that all sparse computation is processed on the CPU and all dense computation on the GPU. Dynamic means that sparse computation is primarily handled by the CPU and dense computation primarily by the GPU, with the partitioning adjusted dynamically according to profiling results. Dynamic partitioning demonstrates obvious improvements for large context lengths.
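A minimal sketch of such a profiling-guided split for the attention module: given measured per-token attention costs on each unit, the KV cache is divided so that both units finish their partial attention at roughly the same time (the names and cost numbers are our own assumptions):

```python
def split_context(kv_len, cpu_cost_per_tok, gpu_cost_per_tok):
    # Assign each unit a share of the KV cache inversely proportional to
    # its per-token cost, so the two partial computations finish together.
    gpu_tokens = round(kv_len * cpu_cost_per_tok /
                       (cpu_cost_per_tok + gpu_cost_per_tok))
    return gpu_tokens, kv_len - gpu_tokens
```

For instance, if the CPU is twice as expensive per token, the GPU receives two thirds of the context, balancing the two finish times.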
Figure 10(b) illustrates the performance of the sparse component under three strategies: a naive sparse implementation, our optimized sparse implementation, and a dense implementation that treats the sparse computation as dense. The optimized sparse implementation outperforms the naive sparse and dense implementations by 3.49× and 1.90×, respectively. Moreover, without our optimization, the naive sparse execution cannot even outperform the dense one.
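The sparsity in question stems from tree attention during verification: each draft token may attend only to itself and its ancestors in the verification tree, so the attention mask is far from dense. A sketch of how such a mask follows from the tree's parent pointers (the parent-array encoding is our assumption, not necessarily Ghidorah's internal representation):

```python
def tree_attention_mask(parent):
    # parent[i] is the index of node i's parent in the draft tree (-1 = root).
    # Node i may attend to itself and every ancestor on its path to the root.
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parent[j]
    return mask
```

The fraction of True entries relative to the dense n×n mask is the sparsity ratio; since it shrinks as the tree widens, treating the computation as dense wastes work, which is why the ARM-side sparse kernel is worth optimizing.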
V Related Work
Collaborative Execution of Transformer. Tensor Parallelism [1] and Pipeline Parallelism [35] effectively address memory limitations by aggregating memory across multiple devices, and have gained widespread adoption for inference in datacenters. They are also used in end-user environments. Galaxy [12] hides communication within computation in its tensor-parallelism approach for edge devices, while DeTransformer [36] fundamentally mitigates data dependency by decoupling the original model. Petals [37] utilizes pipeline parallelism to distribute LLM inference across geographically distributed end-user devices.
Integrated CPU-GPU Scheduling and Execution. There has been a large amount of prior work focusing on improving the device utilization on integrated CPU-GPU architecture. CoDL [5] and EdgeNN [34] perform collaborative convolutional neural network inference on mobile CPU and GPU with unified memory architecture. Asymo [38] schedules workloads based on the computing power of asymmetric mobile CPU cores to achieve higher throughput. Pipe-It [7] achieves higher throughput by organizing ARM big.Little cores in a pipelined manner. However, none of these approaches are designed for Transformer-based models.
Speculative Decoding System. Speculative decoding is an innovative paradigm for accelerating LLM inference. Existing LLM inference systems [10] have started to integrate speculative decoding approaches, treating them as ordinary LLM inference with a large batch size. Sequoia [39] explores ultra-wide verification widths for advanced hardware in the cloud. LLMCad [40] adopts the independent-drafting speculative decoding approach to accommodate memory-constrained devices: it utilizes a smaller model to generate drafts, thereby reducing data loading time for the larger model.
VI Conclusion
Ghidorah focuses on deploying LLM inference on end-user devices with the unified memory architecture. To fully exploit the hardware capability, Ghidorah leverages speculative decoding approaches to enhance the algorithmic parallelism and then distributes this novel workload across multiple heterogeneous processing units. With the help of the HCMP architecture and ARCA approach, it can achieve up to 7.6× speedup in the dominant decoding phase. In the future, Ghidorah will be open-sourced and extended to support more speculative decoding approaches and a wider range of end-user devices.
References
- [1] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
- [2] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi et al., “Specinfer: Accelerating large language model serving with tree-based speculative inference and verification,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 932–949.
- [3] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple llm inference acceleration framework with multiple decoding heads,” arXiv preprint arXiv:2401.10774, 2024.
- [4] Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Break the sequential dependency of LLM inference using lookahead decoding,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=eDjvSFOkXw
- [5] F. Jia, D. Zhang, T. Cao, S. Jiang, Y. Liu, J. Ren, and Y. Zhang, “Codl: efficient CPU-GPU co-execution for deep learning inference on mobile devices,” in MobiSys ’22: The 20th Annual International Conference on Mobile Systems, Applications and Services, Portland, Oregon, 27 June 2022 - 1 July 2022, N. Bulusu, E. Aryafar, A. Balasubramanian, and J. Song, Eds. ACM, 2022, pp. 209–221. [Online]. Available: https://doi.org/10.1145/3498361.3538932
- [6] C. Zhang, F. Zhang, K. Chen, M. Chen, B. He, and X. Du, “Edgenn: Efficient neural network inference for cpu-gpu integrated edge devices,” in 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2023, pp. 1193–1207.
- [7] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, “High-throughput cnn inference on embedded arm big. little multicore processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2254–2267, 2019.
- [8] Y. Cho, P. Jiyeon, N. Florian, J. Changyeon, T. R. Gross, and B. Egger, “Dopia: online parallelism management for integrated cpu/gpu architectures,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 32–45.
- [9] “Apple m4,” https://en.wikipedia.org/wiki/Apple_M4, 2024.
- [10] “vllm github,” https://github.com/vllm-project/vllm, 2024.
- [11] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
- [12] S. Ye, J. Du, L. Zeng, W. Ou, X. Chu, Y. Lu, and X. Chen, “Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference,” arXiv preprint arXiv:2405.17245, 2024.
- [13] “Intel core ultra processor family,” https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html, 2024.
- [14] “Meet jetson, the platform for ai at the edge,” https://developer.nvidia.com/embedded-computing, 2024.
- [15] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” Advances in Neural Information Processing Systems, vol. 27, 2014.
- [16] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models,” 2023. [Online]. Available: https://arxiv.org/abs/2309.14509
- [17] F. Jia, D. Zhang, T. Cao, S. Jiang, Y. Liu, J. Ren, and Y. Zhang, “Codl: efficient cpu-gpu co-execution for deep learning inference on mobile devices.” in MobiSys, vol. 22, 2022, pp. 209–221.
- [18] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
- [19] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F. Kagy, and R. Agarwal, “Distillspec: Improving speculative decoding via knowledge distillation,” arXiv preprint arXiv:2310.08461, 2023.
- [20] Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE: speculative sampling requires rethinking feature uncertainty,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=1NdN7eXyb4
- [21] “Apple a17,” https://en.wikipedia.org/wiki/Apple_A17, 2024.
- [22] “Amd reveals next-gen desktop processors for extreme pc gaming and creator performance,” https://www.amd.com/en/newsroom/press-releases, 2024.
- [23] “Qualcomm snapdragon,” https://en.wikipedia.org/wiki/Qualcomm_Snapdragon, 2024.
- [24] H. Liu et al., “Ring attention with blockwise transformers for near-infinite context,” arXiv preprint arXiv:2310.01889, 2023.
- [25] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.
- [26] “Wave quantization,” https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#wave-quant, 2024.
- [27] Nvidia, "FasterTransformer: Transformer related optimization, including BERT, GPT," GitHub repository, 2023.
- [28] “Ctranslate2,” https://github.com/OpenNMT/CTranslate2, 2024.
- [29] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46 595–46 623, 2023.
- [30] H. Touvron et al., “Llama: Open and efficient foundation language models,” 2023. [Online]. Available: https://arxiv.org/abs/2302.13971
- [31] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [32] M. Chen, J. Tworek, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- [33] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021.
- [34] C. Zhang, F. Zhang, K. Chen, M. Chen, B. He, and X. Du, “Edgenn: Efficient neural network inference for CPU-GPU integrated edge devices,” in 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023. IEEE, 2023, pp. 1193–1207. [Online]. Available: https://doi.org/10.1109/ICDE55515.2023.00096
- [35] A. Agrawal et al., “Taming $\{$ Throughput-Latency $\}$ tradeoff in $\{$ LLM $\}$ inference with $\{$ Sarathi-Serve $\}$ ,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 117–134.
- [36] Y. Wei, S. Ye, J. Jiang, X. Chen, D. Huang, J. Du, and Y. Lu, “Communication-efficient model parallelism for distributed in-situ transformer inference,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6.
- [37] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin, Y. Belkada, A. Chumachenko, P. Samygin, and C. Raffel, “Petals: Collaborative inference and fine-tuning of large models,” arXiv preprint arXiv:2209.01188, 2022.
- [38] M. Wang, S. Ding, T. Cao, Y. Liu, and F. Xu, “Asymo: scalable and efficient deep-learning inference on asymmetric mobile cpus,” in ACM MobiCom ’21: The 27th Annual International Conference on Mobile Computing and Networking, New Orleans, Louisiana, USA, October 25-29, 2021. ACM, 2021, pp. 215–228. [Online]. Available: https://doi.org/10.1145/3447993.3448625
- [39] Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen, “Sequoia: Scalable, robust, and hardware-aware speculative decoding,” arXiv preprint arXiv:2402.12374, 2024.
- [40] D. Xu, W. Yin, X. Jin, Y. Zhang, S. Wei, M. Xu, and X. Liu, “Llmcad: Fast and scalable on-device large language model inference,” CoRR, vol. abs/2309.04255, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.04255