# LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
## Abstract
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism.
Inspired by ideas from speculative decoding, we present LLM-42 (many developers set 42 as a random seed to ensure reproducibility, a nod to The Hitchhiker's Guide to the Galaxy; the number itself is arbitrary but widely regarded as a playful tradition), a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly reuses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
## 1 Introduction
LLMs are becoming increasingly powerful [openai2022gpt4techreport, kaplan2020scalinglaws]. However, a key challenge users face with LLMs is their non-determinism [he2025nondeterminism, atil2024nondeterminism, yuan2025fp32death, song2024greedy]: the same model can produce different outputs across different runs of a given prompt, even with identical decoding parameters and hardware. Enabling determinism in LLM inference (aka deterministic inference) has gained significant traction recently for multiple reasons. It helps developers isolate subtle implementation bugs that arise only under specific batching choices; improves reward stability in reinforcement-learning training [zhang2025-deterministic-tp]; and is essential for integration testing in large-scale systems. Moreover, determinism underpins scientific reproducibility [song2024greedy] and enables traceability [rainbird2025deterministic]. Consequently, users have increasingly requested support for deterministic inference in LLM serving engines [Charlie2025, Anadkat2025consistent].
He et al. [he2025nondeterminism] showed that non-determinism in LLM inference at the system level stems from the non-associativity of floating-point arithmetic. (LLM outputs can also vary due to differences in sampling strategies, e.g., top-k, top-p, or temperature. Our goal is to ensure that output is deterministic for fixed sampling hyper-parameters; different hyper-parameters can produce different outputs, and this behavior is intentional.) This effect manifests in practice because most core LLM operators, including matrix multiplications, attention, and normalization, rely on reduction operations, and GPU kernels for these operators choose different reduction schedules to maximize performance at different batch sizes. The same study also proposed batch-invariant computation as a means to eliminate non-determinism. In this approach, a kernel processes each input token using a single, universal reduction strategy independent of batching. Popular LLM serving systems such as vLLM and SGLang have recently adopted this approach [SGLangTeam2025, vllm-batch-invariant-2025].
While batch-invariant computation guarantees determinism, we find that this approach is fundamentally over-constrained. Enforcing a fixed reduction strategy for every token, regardless of model phase or batch geometry, strips GPU kernels of the very parallelism they are built to exploit. For example, the batch-invariant GEMM kernels provided by He et al. do not use the split-K strategy that is otherwise commonly used to accelerate GEMMs at low batch sizes [tritonfusedkernel-splitk-meta, nvidia_cutlass_blog]. Furthermore, most GPU kernels are not batch-invariant to begin with, so insisting on batch-invariant execution effectively demands new kernel implementations, increasing engineering and maintenance overhead. Finally, batch-invariant execution makes determinism the default for all requests, even when determinism is undesirable or even harmful [det-inf-kills].
Our observations suggest that determinism can be enabled with a simpler approach. (O1) Token-level inconsistencies are rare: as long as a sequence remains in a consistent state, the next emitted token is mostly identical across runs; sequence-level divergence arises mainly from autoregressive decoding after the first inconsistent token. (O2) Most GPU kernels already use shape-consistent reduction schedules: they apply the same reduction strategy on all inputs of a given shape, but potentially different reduction strategies on inputs of different shapes. (O3) Determinism requires only position-consistent reductions: a particular token position must use the same reduction schedule across runs, but different positions within or across sequences can use different reduction schedules. (O4) Real-world LLM systems require determinism only for select tasks (e.g., evaluation, auditing, continuous-integration pipelines), while creative workloads benefit from the stochasticity of LLMs.
Based on these observations, we introduce LLM-42, a scheduling-based approach to deterministic LLM inference inspired by speculative decoding. In speculative decoding [specdecoding-icml2023, chen2023accelerating, xia2023speculative, specinfer-2024], a fast path produces multiple candidate tokens while a verifier validates their correctness. We observe that the same structure can be repurposed to enable determinism (see Figure 1): the fast path can optimize for high-throughput token generation while a verifier enforces determinism. Table 1 highlights key differences between conventional speculative decoding and our approach.
<details>
<summary>figures_h100_pcie/diagram/banner.png Details</summary>

Diagram of LLM-42's workflow. A user prompt feeds a "Generate N tokens auto-regressively" stage that runs regular decoding with dynamic batching. Generated tokens are then replayed N at a time in parallel under a fixed input shape (no dynamic batching) and compared against the fast-path output: matching tokens are accepted, while mismatched tokens are sent back for recomputation, forming a generate-replay-compare loop.
</details>
Figure 1: Overview of LLM-42.
LLM-42 employs a decode-verify-rollback protocol that decouples regular decoding from determinism enforcement. It generates candidate output tokens using standard fast-path autoregressive decoding, whose output is largely, but not provably, consistent across runs (O1). Determinism is guaranteed by a verifier that periodically replays a fixed-size window of recently generated tokens. Because the input shape is fixed during verification, replayed tokens follow a consistent reduction order (O2) and serve as a deterministic reference execution. Tokens are released to the user only after verification. When the verifier detects a mismatch, LLM-42 rolls the sequence back to the last matching token and resumes decoding from this known-consistent state. Two factors make this approach practical: (1) verification is low-cost: like prefill, it is typically 1-2 orders of magnitude more efficient than the decode phase; and (2) rollbacks are infrequent: more than half of all requests complete without any rollback, and only a small fraction require multiple rollbacks.
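The decode-verify-rollback loop can be sketched with toy stand-ins for the model. This is a hypothetical sketch, not the system's actual implementation: `toy_next_token` plays the role of the fixed-shape deterministic reference, and `flaky_next_token` mimics a fast path whose output occasionally drifts due to batch-dependent reduction orders.

```python
def toy_next_token(ctx):
    # Deterministic "model": the next token is a pure function of the context.
    return (sum(ctx) * 31 + 7) % 101

def flaky_next_token(ctx):
    # Fast path: mostly agrees with the reference, but drifts at some steps
    # (simulating batch-dependent floating-point accumulation).
    t = toy_next_token(ctx)
    return (t + 1) % 101 if len(ctx) % 7 == 0 else t

def decode_verify_rollback(prompt, n_tokens, window=4):
    committed = list(prompt)
    while len(committed) - len(prompt) < n_tokens:
        # 1. Fast path: speculate a window of candidate tokens.
        pending = []
        for _ in range(window):
            pending.append(flaky_next_token(committed + pending))
        # 2. Verifier: replay the window against the deterministic reference
        #    and accept the longest matching prefix.
        ok = 0
        for i in range(window):
            if pending[i] != toy_next_token(committed + pending[:i]):
                break
            ok += 1
        committed += pending[:ok]
        # 3. Rollback: on a mismatch, emit the verifier's token and resume
        #    decoding from this known-consistent state.
        if ok < window:
            committed.append(toy_next_token(committed))
    return committed[len(prompt):len(prompt) + n_tokens]
```

Because every committed token either matched the reference or was produced by it, the output equals a pure reference rollout, no matter how often the fast path drifts.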
| Speculative decoding | LLM-42 |
| --- | --- |
| Fast path generates tokens using some form of approximation | No approximation; only floating-point rounding errors |
| Low acceptance rate and hence limited speculation depth (2-8 tokens) | High acceptance rate and hence longer speculation (32-64 tokens) |
| Typically uses separate draft and target models | Uses the same model for decoding and verification |
Table 1: Speculative decoding vs. LLM-42.
By decoupling determinism from token generation, LLM-42 enables determinism to be enforced selectively, preserving the natural variability and creativity of LLM outputs where appropriate. This separation also helps performance: the fast path can use whatever batch sizes and reduction schedules are most efficient; prefill computation can follow a different reduction strategy than decode (O3), and its execution need not be verified (in our design, prefill is deterministic by construction); and verification can be skipped entirely for requests that do not need determinism (O4).
The efficiency of our approach critically depends on the size of the verification window, i.e., the number of tokens verified together. Smaller windows incur high verification overhead, since their computation is largely memory-bound, but require less recomputation on verification failures. In contrast, larger windows incur lower verification cost by being compute-bound, but increase recomputation cost by triggering longer rollbacks on mismatches. To balance this trade-off, we introduce grouped verification: instead of verifying a large window of a single request, we verify smaller fixed-size windows of multiple requests together. This design preserves the low rollback cost of small windows while amortizing verification overhead. Overall, we make the following contributions:
- We present the first systematic analysis of batch-invariant computation, highlighting the performance and engineering costs associated with this approach.
- We present LLM-42, an alternative approach to enabling determinism in LLM inference.
- We implement LLM-42 on top of SGLang and show that its overhead is proportional to the fraction of traffic that requires determinism: it retains near-peak performance when deterministic traffic is low, whereas SGLang in deterministic mode incurs overhead of up to 56%. Our source code will be available at https://github.com/microsoft/llm-42.
## 2 Background and Motivation
<details>
<summary>figures_h100_pcie/diagram/llm-arch.png Details</summary>

Diagram of a decoder-only autoregressive model. Input tokens pass through an embedding layer and then through L repeated blocks, each containing an Attention module (with a KV cache), an All-Reduce + RMSNorm step, an FFN, and another All-Reduce + RMSNorm step. A sampler produces the output token, which is fed back as the next input, forming the autoregressive loop.
</details>
Figure 2: High-level architecture of LLMs.
This section introduces LLMs, explains the source of non-determinism in LLM inference, and quantifies the cost of deterministic inference based on batch-invariant computation.
### 2.1 Large Language Models
Transformer-based LLMs comprise a stack of identical decoder blocks, each containing a self-attention module, a feed-forward network (FFN), normalization layers, and communication primitives (Figure 2). The hidden state of every token flows through these blocks sequentially, but within each block the computation is highly parallel: attention performs matrix multiplications over the key-value (KV) cache, FFNs apply two large GEMMs surrounding a nonlinearity, and normalization applies a per-token reduction over the hidden dimension.
Inference happens in two distinct phases: prefill and decode. Prefill processes prompt tokens in parallel, generating the KV cache for all input tokens; this phase is dominated by large parallel computation. Decode is autoregressive: each step consumes the most recent token, updates the KV cache with a single new key and value, and produces the next output token. Decode is therefore sequential within a sequence but parallel across other requests in the batch.
### 2.2 Non-determinism in LLM Inference
In finite precision, arithmetic operations such as accumulation are non-associative, meaning that $(a+b)+c \neq a+(b+c)$ in general. Non-determinism in LLM inference stems from this non-associativity when combined with dynamic batching, a standard technique for high-throughput inference. With dynamic batching, the same request may be co-located with different sets of requests across different runs, resulting in varying batch sizes. Further, GPU kernels adapt their parallelization strategies, and consequently their reduction strategies, based on the input sizes. As a result, the same logical operation can be evaluated with different floating-point accumulation orders depending on the batch it appears in, leading to inconsistent numerical results.
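This non-associativity is easy to reproduce. The sketch below emulates float32 arithmetic with Python's `struct` module (rounding every intermediate result to 32-bit precision) and shows that summing the same three values in two different orders yields two different answers:

```python
import struct

def f32(x):
    # Round a Python float to the nearest IEEE-754 float32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def add32(a, b):
    # Single float32 addition: compute exactly, then round to float32.
    return f32(a + b)

a, b, c = 1e8, 1.0, -1e8

# (a + b) + c: 1e8 + 1.0 rounds back to 1e8, because 1.0 is below the
# float32 spacing (8.0) at magnitude 1e8; adding c then gives 0.0.
left = add32(add32(a, b), c)

# (a + c) + b: 1e8 + (-1e8) is exactly 0.0, and adding b gives 1.0.
other = add32(add32(a, c), b)

print(left, other)  # 0.0 1.0
```

A GPU kernel that changes its reduction order with batch size is effectively switching between such orderings, which is why the same token can receive slightly different logits across runs.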
Reductions are common in LLM inference, appearing in matrix multiplications (GEMMs), attention, normalization, and collective communication such as AllReduce. GEMM is the most common and time-consuming operator. High-performance GEMM implementations on GPUs use hierarchical tiling and parallel reductions to improve occupancy and memory reuse. A common optimization is split-K parallelism, where the reduction dimension is partitioned across multiple thread blocks. Each block computes a partial result, which is then combined via an additional reduction step. Whether split-K is used, and how many splits are chosen, depends on the shape of the input matrices. These choices directly change the reduction tree as shown in Figure 3, producing different numerical results for a given token across runs. Similarly, attention kernels split work across the key-value dimension to increase SM utilization, followed by a reduction to combine partial results. Normalization operators reduce across feature dimensions. As inference progresses, batch-dependent reduction choices in these operators introduce small numerical drifts that propagate across kernels, layers, and decoding steps, eventually influencing output tokens even when the sampling hyper-parameters are fixed.
<details>
<summary>figures_h100_pcie/diagram/gemm_wo_splitk.png Details</summary>

Diagram of a dot product A x B = C without split-K: the element-wise products A1\*B1 through A4\*B4 are combined through one fixed accumulation order into C.
</details>
(a) Without split-K.
<details>
<summary>figures_h100_pcie/diagram/gemm_w_splitk.png Details</summary>

Diagram of the same dot product with split-K: the element-wise products are partitioned into groups, partial sums are computed in parallel, and an extra reduction combines the partial sums into C, changing the reduction tree.
</details>
(b) With split-K.
Figure 3: GEMM kernels compute dot products using standard accumulation or split-K parallelization. While split-K increases parallelism, it alters the reduction tree based on K.
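The effect of the reduction tree can be illustrated with a toy dot product, again emulating float32 with `struct`. This is a sketch: real kernels accumulate in registers and shared memory, but the ordering issue is the same. Sequential accumulation and a two-way split-K-style reduction of the very same products disagree:

```python
import struct

def f32(x):
    # Round to the nearest IEEE-754 float32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def dot_sequential(products):
    # One thread block: left-to-right float32 accumulation.
    acc = 0.0
    for p in products:
        acc = f32(acc + p)
    return acc

def dot_splitk(products, splits=2):
    # Split-K style: partition the reduction dimension, compute partial
    # sums in parallel, then combine them in an extra reduction step.
    n = len(products) // splits
    partials = [dot_sequential(products[i * n:(i + 1) * n])
                for i in range(splits)]
    return dot_sequential(partials)

# Element-wise products A_i * B_i of a toy dot product.
products = [1e8, 1.0, -1e8, 1.0]

print(dot_sequential(products))        # 1.0
print(dot_splitk(products, splits=2))  # 0.0
```

Both results are "correct" up to rounding; the point is only that the answer depends on the shape of the reduction tree, so a shape-dependent choice of tree makes the output batch-dependent.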
### 2.3 Defeating Non-determinism
He et al. at Thinking Machines Lab recently introduced an approach named batch-invariant computation to enable deterministic LLM inference [he2025nondeterminism]. In this approach, a GPU kernel is constrained to use a single, universal reduction strategy for all tokens, eliminating batch-dependent reductions. Both SGLang and vLLM use this approach [SGLangTeam2025, vllm-batch-invariant-2025]: these systems either deploy the batch-invariant kernels provided by He et al. [tmops2025] or implement new kernels [batch-inv-inf-vllm, det-infer-batch-inv-ops-sglang].
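The contrast between a batch-adaptive and a batch-invariant kernel can be sketched with a float32-emulating reducer. The split policies below are hypothetical stand-ins (real kernels choose tile shapes and split factors with far more complex heuristics): the adaptive reducer changes its split factor with batch size and so can return different values for the very same row, while the invariant reducer always uses one schedule.

```python
import struct

def f32(x):
    # Round to the nearest IEEE-754 float32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def reduce32(values, splits):
    # Float32 accumulation over `splits` chunks whose partial sums are
    # combined at the end (split-K style); splits=1 is plain accumulation.
    def seq(vals):
        acc = 0.0
        for v in vals:
            acc = f32(acc + v)
        return acc
    n = len(values) // splits
    return seq([seq(values[i * n:(i + 1) * n]) for i in range(splits)])

def adaptive_reduce(row, batch_size):
    # Non-batch-invariant: use split-K only at small batch sizes
    # (a hypothetical heuristic mimicking real kernel selection).
    return reduce32(row, splits=2 if batch_size < 4 else 1)

def invariant_reduce(row):
    # Batch-invariant: one universal schedule, regardless of batching.
    return reduce32(row, splits=2)

row = [1e8, 1.0, -1e8, 1.0]
print(adaptive_reduce(row, batch_size=1))   # 0.0
print(adaptive_reduce(row, batch_size=64))  # 1.0 (same row, different result)
print(invariant_reduce(row))                # 0.0 at any batch size
```

The invariant reducer buys consistency by forgoing the schedule that is fastest at each batch size; that performance cost is what this section quantifies.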
<details>
<summary>x1.png Details</summary>

Line chart of GEMM throughput (TFLOPS, log scale) vs. number of tokens (1 to 8192), comparing a non-batch-invariant cuBLAS kernel against a batch-invariant Triton kernel. cuBLAS outperforms the batch-invariant kernel at every token count, by roughly 2x at large token counts (about 305 vs. 145 TFLOPS at 8192 tokens), with both curves showing diminishing returns as token count grows.
</details>
(a) GEMM
(b) RMSNorm
Figure 4: Performance comparison between batch-invariant vs. non-batch-invariant kernels.
While batch-invariant computation eliminates non-determinism, we find that it is a poor fit for real LLM serving systems. By enforcing a universal reduction strategy across all executions, it couples determinism to kernel design and sacrifices performance opportunities. It also turns determinism into a fixed tax paid by every request: dynamic batching aggregates requests, but batch-invariant kernels eliminate the very optimizations, such as split-K and shape-aware tiling, that make batching effective in the first place. Worse, because existing kernels are not batch-invariant, adopting this approach requires maintaining a parallel kernel stack solely for determinism. We also quantify the performance cost of this approach below.
GEMM. Figure 4(a) compares the throughput of cuBLAS-based GEMMs used in PyTorch against Triton-based batch-invariant kernels developed by He et al. The matrix dimensions correspond to the down-projection operation of the Llama-3.1-8B-Instruct model's feed-forward network. On our GPU, cuBLAS (via torch.mm) reaches up to 527 TFLOPS, whereas the batch-invariant kernel peaks at 194 TFLOPS, a slowdown of 63%. This gap arises because the Triton-based batch-invariant implementation does not use split-K or exploit newer hardware features such as Tensor Memory Accelerators [TMA_Engine] or advanced techniques like warp specialization [fa-3], all of which are leveraged by PyTorch through vendor-optimized cuBLAS kernels.
RMSNorm. Figure 4(b) compares RMSNorm execution time for a varying number of tokens across three implementations: the Python-based version used in SGLang, Thinking Machines' Triton-based kernel, and SGLang's default CUDA kernel. The first two are batch-invariant; the CUDA kernel is not. The Python implementation is up to $7\times$ slower than the non-batch-invariant CUDA kernel due to unfused primitive operations and poor shared-memory utilization. The Triton kernel performs substantially better but remains up to 50% slower than the fused CUDA implementation, which benefits from optimized reductions and kernel fusion. These overheads are amplified at high batch sizes or long context lengths, where normalization may account for a nontrivial fraction of inference time [gond2025tokenweave].
Figure 5: Decode throughput under different scenarios.
End-to-end throughput. Figure 5 measures token generation throughput (tokens per second) under three scenarios: (1) 10 requests running in non-deterministic mode, (2) 11 requests running in non-deterministic mode, and (3) 11 requests running in deterministic mode where only one of them requires deterministic output. With 10 concurrent non-deterministic requests, the system generates 845 tokens/s. The batch size increases to 11 when a new request arrives, and if decoding continues non-deterministically, throughput improves to 931 tokens/s (a jump of about 10%). In contrast, if the new request requires determinism, the entire batch is forced to execute through the slower batch-invariant kernels, causing throughput to collapse by 56% to about 415 tokens/s, penalizing every in-flight request for a single deterministic one. This behavior is undesirable because it couples the performance of all requests to that of the slowest request.
Overall, these results show that batch-invariant execution incurs a substantial performance penalty. While it may be feasible to improve the performance of batch-invariant kernels, doing so would require extensive model- and hardware-specific kernel development. This engineering and maintenance burden makes the approach difficult to sustain in practice. This may be why deterministic inference is largely confined to debugging and verification today, rather than being deployed for real-world LLM serving.
## 3 Observations
In this section, we distill a set of concrete observations about non-determinism, GPU kernels and LLM use-cases. These observations expose why batch-invariant computation is overly restrictive and motivate a more general approach to enable determinism in LLM inference.
Observation-1 (O1). If a sequence is already in a consistent state, the next emitted token is usually consistent even under dynamic batching. However, once a token diverges, autoregressive decoding progressively amplifies this difference over subsequent steps.
This is because tokens become inconsistent only when floating-point drift is large enough to alter the effective decision made by the sampler, e.g., by changing the relative ordering or acceptance of high-probability candidates under the decoder's sampling policy (e.g., greedy or stochastic sampling). In practice, such boundary-crossing events are rare, as numerical drift typically perturbs logits only slightly. However, autoregressive decoding amplifies even a single such deviation: once a token differs, all subsequent tokens may diverge. Since a single request typically produces hundreds to thousands of output tokens, two sequence-level outputs can look dramatically different even if the initial divergence is caused by a single token flip induced by a different reduction order across runs.
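To make the boundary-crossing intuition concrete, the following toy sketch (pure Python, not from the paper) shows that a tiny logit perturbation flips the greedy (argmax) choice only when the top-two logits are nearly tied; the magnitudes and the `greedy` helper are illustrative.

```python
# Toy illustration: a small logit drift changes the greedy token only when
# the top-2 logits are nearly tied, i.e., near the sampler's decision boundary.
def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

# Well-separated logits: a drift of 1e-3 cannot change the argmax.
safe = [2.00, 0.50, -1.00]
drifted_safe = [x + d for x, d in zip(safe, [0.0, 1e-3, 0.0])]
assert greedy(safe) == greedy(drifted_safe) == 0

# Near-tied logits: the same tiny drift crosses the decision boundary,
# flipping the emitted token and, under autoregression, the rest of the output.
tied = [1.0000, 0.9995, -1.0000]
drifted_tied = [x + d for x, d in zip(tied, [0.0, 1e-3, 0.0])]
assert greedy(tied) == 0 and greedy(drifted_tied) == 1
```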
To demonstrate this phenomenon empirically, we conduct an experiment using the Llama-3.1-8B-Instruct model on the ShareGPT dataset. We first execute 350 requests with batch size one, i.e., without dynamic batching, to obtain reference ("ground-truth") output tokens. We then re-run the same requests under dynamic batching at a load of 6 queries per second and compare each request's output against its reference. In both runs, we fix the output length to 512 tokens.
Figure 6: Length of the first and second consistent span (number of tokens that match with the ground-truth) for different requests under dynamic batching.
We quantify divergence using two metrics. The first consistent span of a request measures the number of initial output tokens that match exactly across the two runs, while the second consistent span measures the number of matching tokens between the first and second divergence points. Figure 6 shows these metrics for 80 requests. In the common case, hundreds of initial tokens are identical across both runs, with some requests exhibiting an exact match of all 512 tokens in the first consistent span. However, once a single token diverges, the sequence rapidly drifts: the second consistent span is near zero for most requests, indicating that divergence quickly propagates through the remainder of the output.
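The two metrics can be computed directly from the token sequences. The sketch below is one plausible implementation (the paper does not give code); it interprets the second consistent span as the matching run that begins immediately after the first diverging token.

```python
def consistent_spans(reference, output):
    """Return (first_span, second_span): the number of tokens matching the
    reference before the first divergence point, and the number of matching
    tokens between the first and second divergence points."""
    n = min(len(reference), len(output))
    first = 0
    while first < n and reference[first] == output[first]:
        first += 1
    # Second span: the matching run starting right after the first mismatch.
    i = first + 1
    second = 0
    while i + second < n and reference[i + second] == output[i + second]:
        second += 1
    return first, second

ref = [5, 9, 9, 2, 7, 7, 3]
out = [5, 9, 9, 4, 7, 7, 0]   # diverges at index 3, re-matches for 2 tokens
assert consistent_spans(ref, out) == (3, 2)
assert consistent_spans(ref, ref) == (7, 0)  # exact match: full first span
```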
Figure 7: A position-invariant kernel produces the same output for a given input element irrespective of its position in the batch, as long as the total batch size is fixed. In this example, $T_{1}^{\prime}=T_{1}^{\prime\prime}$ if the kernel is position-invariant.
| Category | Operator | Batch-invariant | Position-invariant |
| --- | --- | --- | --- |
| Matmul | cuBLAS GEMM | ✗ | ✓ |
| Attention | FlashAttention-3† | ✗ | ✓ |
| Communication | Multimem-based AllReduce‡ | ✗ | ✓ |
| | Ring-based AllReduce | ✗ | ✓ |
| | Tree-based AllReduce§ | ✗ | ✓ |
| Normalization | RMSNorm¶ | ✗ | ✓ |
| | Fused RMSNorm + Residual¶ | ✗ | ✓ |
Table 2: Invariance properties of common inference operators († num_splits=1, ‡ CUDA 13.0+, § specific NCCL settings, ¶ vLLM/SGLang defaults).
Observation-2 (O2). Most GPU kernels use uniform, shape-consistent reductions: they apply the same reduction strategy to all elements within a given batch. Moreover, the strategy remains fixed for all batches of the same shape, changing only when the shape changes.
The simplest example of this is GEMM kernels. For a given input matrix $A$ of size $M \times K$, a GEMM kernel reduces along the $K$ dimension using the same reduction order (say $R$) for all $M$ rows. Moreover, it applies the same reduction order $R$ to all input matrices of size $M \times K$. We refer to such kernels as position-invariant. Position invariance implies that, with a fixed total batch size, an input element's output is independent of its position in the batch. Note that such guarantees do not hold for kernels that implement reductions via atomic operations. Fortunately, kernels used in the LLM forward pass do not use atomic reductions. Figure 7 shows an example of a position-invariant kernel and Table 2 shows the invariance properties of common LLM operators.
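The distinction between batch-invariance and position-invariance can be reproduced numerically. The sketch below is pure Python, not a GPU kernel: the chunk size stands in for the kernel's reduction strategy $R$ and is chosen from the input shape alone, so outputs are consistent within a shape but change across batch sizes; the inputs are contrived to make reduction order visible in float64.

```python
# Minimal sketch of a shape-consistent reduction (not a real GPU kernel):
# the partial-sum chunking depends only on the input *shape*, mimicking a
# kernel that picks its reduction strategy R from the batch dimensions.
def dot(row, col, chunk):
    partials = [sum(row[i] * col[i] for i in range(s, min(s + chunk, len(row))))
                for s in range(0, len(row), chunk)]
    return sum(partials)

def matvec(rows, col):
    # Strategy chosen from the shape alone: small batches reduce in chunks
    # of 2 (a stand-in for split-K), large batches in a single pass.
    chunk = 2 if len(rows) <= 4 else len(col)
    return [dot(r, col, chunk) for r in rows]

col = [1.0, 1.0, 1.0, 1.0]
x = [1e16, 1.0, -1e16, 1.0]   # chosen so fp64 rounding exposes the order

small = matvec([x, x], col)   # batch of 2 -> chunked reduction
large = matvec([x] * 8, col)  # batch of 8 -> single-pass reduction

# Position-invariant: within one run, every copy of x yields the same value.
assert small[0] == small[1] and len(set(large)) == 1
# Not batch-invariant: changing the batch size changes x's output.
assert small[0] != large[0]
```

Here the chunked reduction loses the two `1.0` addends to rounding while the single-pass order keeps one of them, so identical inputs disagree across batch sizes even though every row within a batch agrees.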
The motivation for batch-invariant computation stems from the fact that GPU kernels used in LLMs, while deterministic for a particular input, are not batch-invariant. We observe that position-invariance captures a strictly stronger property than determinism: determinism only requires that the same input produce the same output across runs, whereas position-invariance implies that the output of a given input remains consistent as long as the input size to the kernel remains the same. This allows us to reason about kernel behavior at the level of input shapes, rather than individual input values.
Observation-3 (O3). For deterministic inference, it is sufficient to ensure that a given token position goes through the same reduction strategy across all runs of a given request; the reduction strategies for different token positions, within or across sequences, may differ.
Numerical differences in the output of a token arise from differences in how its own floating-point reductions are performed, not from the numerical values of other co-batched tokens. While batching affects how computations are scheduled and grouped, the computation for a given token position is functionally independent: it consumes the same inputs and executes the same sequence of operations. As a result, interactions across tokens occur only indirectly through execution choicesâsuch as which partial sums are reduced togetherânot through direct data dependencies. Consequently, as long as a token position is always reduced using the same strategy, its output remains consistent regardless of how other token positions are computed.
Observation-4 (O4). Current systems take an all-or-nothing approach: they either enforce determinism for every request or disable it entirely. Such a design is not a natural fit for LLM deployments.
This is because many LLM workloads neither require bit-wise reproducibility nor benefit from it. In fact, controlled stochasticity is often desirable as it enhances the output diversity and creativity of LLMs [integrating_randomness_llm2024, diversity_bias_llm2025, det-inf-kills]. In contrast, requests such as evaluation runs, safety audits, or regression testing require bit-level reproducibility. Overall, these different use-cases imply that enforcing determinism for all requests is overkill.
It is also worth highlighting that this all-or-nothing behavior largely stems from the batch-invariant approach that ties determinism to the kernel design. Since all co-batched tokens go through the same set of kernels, determinism becomes a global property of the batch rather than being selective. While one could run different requests through separate deterministic and non-deterministic kernels, doing so would fragment batches, complicate scheduling, and likely hurt efficiency.
(a) DVR without rollbacks.
(b) DVR with rollbacks.
Figure 8: An example of decode-verify-rollback. After generating a fixed number of tokens through regular decoding, LLM-42 verifies them in parallel via a separate forward pass. (a) all the tokens generated in the decode phase pass verification; LLM-42 accepts all these tokens along with the new verifier-generated token ( $T_{4}$ ). (b) some tokens do not match between decode and verification; LLM-42 accepts only the initial matching tokens ( $T_{1}^{\prime}$ ) and the following verifier-generated token ( $T_{2}$ ); all other tokens are recomputed. In both cases, the verifier replaces the KV cache entries generated by the decode phase with its own.
## 4 LLM-42
Since non-determinism in LLM inference comes from dynamic batching, disabling it would make inference deterministic. However, dynamic batching is arguably the most powerful performance optimization for LLM inference [orca, distserve2024, vllmsosp, sarathi2023]. Therefore, our goal is to enable determinism in the presence of dynamic batching. In this section, we introduce LLM-42, a speculative-decoding-inspired, scheduler-driven approach that achieves this goal.
### 4.1 Overall Design
LLM-42 exploits the observations presented in §3 as follows:
Leveraging O1. Tokens decoded from a consistent state are mostly consistent. Based on this observation, we repurpose the speculative-decoding-style mechanism to enforce determinism via a decode-verify-rollback (DVR) protocol. DVR optimistically decodes tokens using the fast path, and a verifier ensures determinism. Only tokens approved by the verifier are returned to the user, while the few that fail verification are discarded and recomputed. The key, however, is to ensure that the verifier's output itself is always consistent. We leverage O2 to achieve this.
Leveraging O2. Because GPU kernels rely on shape-consistent reductions, we make the verifier deterministic by always operating on a fixed number of tokens T (e.g., verifying T = 10 tokens at a time). The only corner case occurs at the end of a sequence, where fewer than T tokens may remain (for instance, when the 11th token is an end-of-sequence token). We handle this by padding with dummy tokens so that the verifier always processes exactly T tokens.
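The fixed-shape padding rule can be sketched as follows; `PAD_TOKEN_ID` and the window size are illustrative, not the values used by LLM-42:

```python
PAD_TOKEN_ID = 0  # illustrative; any reserved token id works
WINDOW = 10       # verification window size T

def build_verifier_input(tokens: list[int]) -> list[int]:
    """Pad the remaining tokens so the verifier always sees exactly
    WINDOW tokens, keeping its reduction shapes fixed across runs."""
    assert len(tokens) <= WINDOW
    return tokens + [PAD_TOKEN_ID] * (WINDOW - len(tokens))

# End-of-sequence corner case: only three tokens remain in the window.
batch = build_verifier_input([101, 205, 7])
assert len(batch) == WINDOW
```

Because the verifier's input shape never varies, its reduction order, and hence its output, is identical across runs.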
Leveraging O3. Determinism only requires that each token position follow a consistent strategy across runs; different positions can follow different strategies. This observation lets us compute the prefill and decode phases using different strategies. Because prefill is massively parallel even within a single request, it can be made deterministic simply by avoiding arbitrary batching of prefill tokens, eliminating the need for a verifier in this phase. A verifier is required only for tokens generated by the decode phase.
Leveraging O4. LLM-42 decouples determinism from token generation by moving it into a separate verification phase. This makes selective determinism straightforward: only requests that require deterministic inference incur verification, while all other traffic avoids it. We expose this control to users via a new per-request API flag, is_deterministic=True|False, that lets them explicitly request determinism; the default is False.
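The per-request control might look like the following sketch; the `is_deterministic` flag follows the paper's API, while the other fields and payload shape are illustrative:

```python
import json

# Hypothetical request payload: this request opts in to
# decode-verify-rollback and pays the verification cost.
request = {
    "prompt": "Summarize the attached report.",
    "temperature": 0.0,
    "is_deterministic": True,
}

# Non-deterministic traffic simply omits the flag (default False)
# and runs entirely on the fast path.
fast_path_request = {"prompt": "Brainstorm ten taglines."}

print(json.dumps(request))
```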
Figure 5 quantifies the performance benefit of selective determinism. When one out of 11 requests requires determinism, LLM-42 achieves a decode throughput of 911 tokens per second, which is $2.2\times$ higher than SGLang's deterministic-mode throughput and within 3% of its non-deterministic (best-case) throughput. We present a more detailed evaluation in §5.
### 4.2 Decode-verify-rollback (DVR)
DVR performs decoding optimistically by first generating tokens using high-throughput, non-deterministic execution and then correcting any inconsistencies through verification and recomputation. Rather than enforcing determinism upfront, it identifies inconsistencies in the generated sequence on the fly, and recomputes only those tokens that are not guaranteed to be consistent across all possible runs of a given request.
Figure 8 illustrates how DVR operates, assuming a verification window of four tokens. The blue request requires deterministic output, and its first output token $T_{0}$ is produced by the deterministic prefill phase, which avoids arbitrary inter-request batching. LLM-42 then generates candidate tokens $T_{1}^{\prime}$ , $T_{2}^{\prime}$ , and $T_{3}^{\prime}$ using the regular fast path with dynamic batching, where gray requests may be batched arbitrarily. The first input to the verifier must be consistent (here, $T_{0}$ is consistent since it comes from the prefill phase). The verifier replays the four tokens $T_{0}$ and $T_{1}^{\prime}$ - $T_{3}^{\prime}$ as input and produces output tokens $T_{1}$ - $T_{4}$ . Verification has two possible outcomes: (1) all tokens pass verification, or (2) one or more tokens fail verification. We describe these two cases in detail below.
#### Case-1: When verification succeeds.
In the common case, verification reproduces the same tokens that the preceding decode iterations generated. For example, in Figure 8(a), $T_{1}=T_{1}^{\prime}$ , $T_{2}=T_{2}^{\prime}$ , and $T_{3}=T_{3}^{\prime}$ . In this case, LLM-42 accepts all these tokens; in addition, it also accepts the token $T_{4}$ , since $T_{4}$ was generated by the verifier from a consistent state and is therefore consistent.
<details>
<summary>x7.png Details</summary>

Line chart: per-token verification latency (ms, y-axis 0 to 0.8) vs. number of tokens verified (x-axis 16 to 512). Latency drops steeply from roughly 0.76 ms at 16 tokens and flattens to a few hundredths of a millisecond beyond 256 tokens.
</details>
(a) Verification latency
<details>
<summary>x8.png Details</summary>

Step-wise CDF of rollbacks per verification window for window sizes 32, 64, 128, and 256; roughly half of the requests complete without any rollback, and the curves diverge most at rollback ratios between 0.4 and 0.8.
</details>
(b) Rollbacks
<details>
<summary>x9.png Details</summary>

CDF of recomputed tokens per request for window sizes 32, 64, 128, and 256; larger windows shift the distribution toward more recomputed tokens per request.
</details>
(c) CDF of recomputed tokens
<details>
<summary>x10.png Details</summary>

Bar chart: total recompute cost (%) vs. window size: 6.81% at 32, 12.42% at 64, 24.92% at 128, and 46.41% at 256.
</details>
(d) Total recomputation overhead
Figure 9: The cost of verification and recomputation with varying window sizes.
#### Case-2: When verification fails.
Occasionally, a verification pass disagrees with one or more decoded tokens; e.g., only $T_{1}^{\prime}$ matches the verifier's output in Figure 8(b). In this case, LLM-42 commits the output only up to the last matching token ( $T_{1}^{\prime}$ ); all subsequent tokens ( $T_{2}^{\prime}$ and $T_{3}^{\prime}$ ) are rejected. The verifier-generated token that appears immediately after the last matching position ( $T_{2}$ ) is also accepted, and the KV cache is truncated at this position. The next decode iteration resumes from this repaired, consistent state.
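The accept/rollback rule covering both cases can be sketched as follows (a minimal model: tokens as ints, KV entries as opaque list items; all names are illustrative):

```python
def verify_and_commit(decoded: list[int], verified: list[int],
                      verifier_kv: list, kv_cache: list):
    """Commit the longest matching prefix of speculatively decoded
    tokens plus one verifier-generated bonus token; roll back the rest.

    decoded:     tokens T1'..Tn' from the fast path (length n)
    verified:    tokens T1..T(n+1) from the verification pass (length n+1)
    verifier_kv: per-position KV entries produced by the verifier
    kv_cache:    the request's committed KV entries (mutated in place)
    """
    # Longest prefix on which the fast path and the verifier agree.
    match = 0
    while match < len(decoded) and decoded[match] == verified[match]:
        match += 1

    # Matching tokens plus the verifier's token at the next position.
    committed = verified[:match + 1]

    # Replace the speculative KV entries with the verifier's consistent
    # ones, truncating everything past the last committed position.
    del kv_cache[len(kv_cache) - len(decoded):]
    kv_cache.extend(verifier_kv[:match + 1])

    rolled_back = len(decoded) - match  # tokens to recompute
    return committed, rolled_back
```

In the Figure 8(b) scenario, only the first decoded token matches, so one matching token plus one verifier token are committed and the remaining two are recomputed.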
#### Making KV cache consistent.
During the decode phase, the KV cache is populated by fast-path iterations that execute under dynamic batching. Consequently, even when token verification succeeds, the KV cache corresponding to the verified window may still be inconsistent, since it was produced by non-deterministic execution. This inconsistency could affect tokens generated in future iterations despite the current window being verified. Token-level verification alone is therefore insufficient without repairing the KV cache. To address this, we overwrite the KV cache entries produced during decoding with the corresponding entries from the verification pass. This ensures that both the emitted tokens and the KV cache state are consistent for subsequent decode iterations, eliminating downstream divergence.
#### Guaranteed forward progress.
Note that each verification pass produces at least one new consistent output token as shown in Figure 8. As a result, DVR guarantees forward progress even in contrived worst-case scenarios that might trigger rollbacks for every optimistically generated token. We have not observed any such scenario in our experiments.
Overall, DVR follows the high-level structure of speculative decoding, but differs in several important ways, as discussed in Table 1. Under DVR, decoded tokens result from a faithful execution of the model's forward pass, with deviations arising only from floating-point rounding errors. In contrast, speculative execution techniques generate candidate tokens using modified forward computations, for example, by running a smaller or distilled draft model [specdecoding-icml2023, specinfer-2024] or by truncating or pruning attention/context [medusa, fu2023lookahead]. Because these approximations change the computation itself, speculation depth is typically limited to a few (2-8) tokens [medusa, fu2023lookahead, specdecoding-icml2023, specinfer-2024]. In contrast, leveraging O1, DVR can safely speculate over longer (32-64) token sequences. For the same reason, verification in DVR succeeds with high probability, in contrast to the much lower acceptance rates observed in existing speculative decoding schemes. Finally, DVR drafts and verifies with the same model, whereas most speculative decoding approaches require a separate draft model.
### 4.3 Grouped Verification
The efficiency of DVR depends on the size of the verification window and involves a trade-off: smaller windows incur higher verification cost but low recomputation overhead, whereas larger windows reduce verification cost at the expense of increased recomputation. We empirically characterize this trade-off by profiling Llama-3.1-8B-Instruct on an H100-PCIe GPU. We measure the per-token verification cost from the latency of the verification forward pass. To quantify the recomputation cost, we execute 4096 requests from the ShareGPT dataset in an online setting at 12 queries per second. For each window size, we measure the total number of tokens decoded versus the number of tokens returned to the user; the difference between the two is the recomputation overhead.
Figure 9(a) shows the per-token verification cost as a function of the verification window size. For smaller windows, the verification pass is memory-bound: each token performs very little computation, resulting in low arithmetic intensity and poor hardware utilization. Consequently, the verification cost is high, up to 0.75 ms per token. As the verification window grows, the kernel becomes compute-bound, and the per-token cost drops sharply, reaching 0.05 ms at a window size of 512, a $15\times$ reduction.
Figure 9(b) shows that more than half of the requests complete without any rollback (i.e., all their output tokens pass verification), while a small fraction of requests incur frequent rollbacks. Moreover, rollbacks become more common as the window size increases. For example, with a window size of 256, about 40% of requests observe a rollback ratio of 50% or higher, indicating that at least half of their verification steps detect one or more mismatched tokens. In contrast, with a window size of 32, only about 5% of requests reach this level. The number of recomputed tokens per request follows the same trend, as shown in Figure 9(c). Further, Figure 9(d) shows the resulting total recomputation overhead. Note that a verification failure requires recomputing all tokens following the mismatched position within the current window. As a result, larger windows incur higher recomputation overhead on average. Concretely, recomputation overhead is 6.81% at a window size of 32 and increases roughly linearly with window size, reaching 46.41% at a window size of 256.
We propose grouped verification to address this trade-off between verification and recomputation cost. Instead of verifying a large window of a single request (e.g., 256 tokens), we verify small, fixed-size windows of multiple requests together in a single pass (e.g., 8 requests, 32 tokens each). Each request thus retains the rollback properties of a small window, while the verification pass operates on a large effective batch, achieving high utilization and low verification cost.
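The grouping policy can be sketched as a scheduler routine; the data layout and function name are illustrative:

```python
def grouped_verification_batches(pending: dict[str, list[int]],
                                 window: int = 32, group: int = 8):
    """Pack fixed-size verification windows from several requests into
    fused verification passes: each request keeps the rollback behavior
    of a small window, while each pass runs at group * window tokens.

    pending: request id -> unverified tokens accumulated so far
    """
    # Only requests with a full window's worth of unverified tokens
    # are eligible for verification this round.
    ready = [(rid, toks[:window]) for rid, toks in pending.items()
             if len(toks) >= window]
    for i in range(0, len(ready), group):
        yield ready[i:i + group]  # one fused verification pass
```

For example, with 16 eligible requests, a window of 32, and a group of 8, the scheduler issues two verification passes of 256 tokens each, large enough to be compute-bound while keeping per-request rollback scope at 32 tokens.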
### 4.4 Discussion
Guaranteeing determinism requires every operator in the inference pipeline to behave consistently. This section highlights a few subtle sources of inconsistency in commonly used operators. We do not introduce new techniques here; instead, we adopt established approaches to enforce consistent behavior and summarize them for completeness.
#### Attention.
Attention involves reductions at multiple steps: along the sequence dimension (e.g., in FlashDecoding-style sequence parallelism [flashdecoding]) and inside the softmax computation. To make these consistent, we use the FlashAttention-3 attention kernel and set the number of KV splits to one in the verification pass; fast-path decode iterations run as usual. Further performance optimizations are possible following the SGLang implementation (https://lmsys.org/blog/2025-09-22-sglang-deterministic/).
#### Communication collectives.
Classical ring-based AllReduce reduces tokens in different orders depending on their position in the batch. In contrast, multimem-based AllReduce, introduced in CUDA 13.0, follows consistent reduction schedules [nvls-deterministic] and should be preferred whenever supported. On older platforms where multimem/NVLS is unavailable, one can use tree-based AllReduce with fixed NCCL configuration (e.g., setting num_channels to one and using a fixed protocol) to achieve determinism, at the expense of higher communication cost than ring-based AllReduce.
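On older platforms, the tree-based fallback amounts to pinning NCCL's algorithm, channel count, and protocol before initializing the process group. A minimal sketch using NCCL's environment variables (the precise set that yields determinism may vary by NCCL version):

```python
import os

# Pin NCCL to a tree AllReduce with a single channel and a fixed
# protocol so the reduction schedule is identical across runs.
# Must be set before the communicator (e.g., the torch.distributed
# process group) is created.
os.environ["NCCL_ALGO"] = "Tree"
os.environ["NCCL_MIN_NCHANNELS"] = "1"
os.environ["NCCL_MAX_NCHANNELS"] = "1"
os.environ["NCCL_PROTO"] = "Simple"
```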
#### Sampling.
We adopt the sampler from SGLang. When the temperature is zero, it selects the token with the highest logit (argmax); if there are multiple maximal values, it returns the index of the first. For non-greedy sampling, however, the process involves randomness, making outputs sensitive to how random numbers are generated and consumed. To address this, SGLang introduces a new sampling function, multinomial_with_seed, which replaces torch.multinomial, an operator that is inherently non-deterministic under batched execution. The function perturbs logits with Gumbel noise generated from a seeded hash function, ensuring that the same input-seed pair always produces the same sample [SGLangTeam2025].
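The idea behind seeded Gumbel sampling can be sketched in a few lines; this is a minimal illustration of the technique, not SGLang's actual implementation:

```python
import hashlib
import math

def seeded_gumbel_sample(logits: list[float], seed: int, step: int) -> int:
    """Gumbel-max sampling with noise derived from a seeded hash, so the
    same (logits, seed, step) triple always yields the same token,
    independent of batch composition."""
    best, best_score = 0, float("-inf")
    for i, logit in enumerate(logits):
        # Deterministic uniform in (0, 1) from a hash of (seed, step, i).
        h = hashlib.sha256(f"{seed}:{step}:{i}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)
        # argmax(logit + Gumbel(0, 1)) samples from softmax(logits).
        score = logit - math.log(-math.log(u))
        if score > best_score:
            best, best_score = i, score
    return best

# Same logits and seed -> same sample, regardless of batching.
assert seeded_gumbel_sample([0.1, 2.0, 0.3], seed=42, step=0) == \
       seeded_gumbel_sample([0.1, 2.0, 0.3], seed=42, step=0)
```

The Gumbel-max trick makes the argmax itself a faithful multinomial draw, so determinism reduces to making the noise a pure function of (seed, position), which the hash provides.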
Overall, enforcing determinism across the entire inference pipeline requires careful configuration of the attention kernel and selecting an appropriate AllReduce implementation. The sampling module does require a new implementation, but this is a one-time cost since the same sampler is used across all models. In contrast, the bulk of the performance and engineering effort in LLM systems lies in GEMM kernels, which LLM-42 reuses entirely, along with RMSNorm and FusedMoE kernels.
## 5 Evaluation
Our evaluation answers the following questions:
| Dataset | Test Split | Metric | Mean | Median | Std |
| --- | --- | --- | --- | --- | --- |
| ShareGPT | 92812 | Input Length | 304 | 136 | 491 |
| | | Output Length | 192 | 118 | 212 |
| ArXiv | 5941 | Input Length | 7017 | 6435 | 3479 |
| | | Output Length | 198 | 191 | 74 |
Table 3: Datasets and their input/output context lengths.
<details>
<summary>x11.png Details</summary>

Grouped bar chart: offline decode throughput (tokens/s, 0 to 20000) for SGLang non-deterministic, SGLang deterministic, and LLM-42 at deterministic traffic fractions of 2%, 5%, 10%, 20%, 50%, and 100%, across ArXiv, ShareGPT, and synthetic workloads (input 1024 to 4096 tokens, output 256 or 512 tokens); labels above each bar give throughput normalized to a baseline.
</details>
Figure 10: Throughput in offline inference. SGLang has only two modes: all requests run in either deterministic or non-deterministic mode whereas LLM-42 supports selective determinism (% values in the legend reflect the fraction of traffic that requires deterministic output).
Our evaluation answers the following questions:
- How does LLM-42 compare against the baseline's deterministic and non-deterministic modes?
- How does LLM-42 perform under different mixes of deterministic and non-deterministic traffic?
- How do configuration parameters affect the performance of grouped verification in LLM-42?
Models and environment: We evaluate LLM-42 with the Llama-3.1-8B-Instruct model. It has 32 query heads, 8 KV heads, and 32 layers. Experiments were conducted on a system equipped with four NVIDIA H100 PCIe GPUs with 80 GB HBM3 memory and 114 streaming multiprocessors per GPU. The host CPU has 64 physical cores (128 hardware threads) and approximately 1.65 TB of DRAM.
Workloads: We evaluate on synthetic inputs with varying context lengths, following common practice in prior work [nanoflow2024]. We additionally benchmark on the widely used ShareGPT [huggingfacesharegpt] and ArXiv [arxiv-summarization] datasets, whose characteristics are summarized in Table 3. All experiments use the meta-llama/Llama-3.1-8B-Instruct tokenizer. To evaluate LLM-42 under different workload conditions, we vary the fraction of requests that require deterministic output across 2%, 5%, 10%, 20%, 50%, and 100%. While higher deterministic ratios are included for stress testing, we expect that in practical deployments only a small fraction of requests will require determinism. Finally, note that the deterministic request ratio applies only to LLM-42; approaches based on batch-invariant computation support only global determinism.
Serving system baselines: We implement LLM-42 on top of the SGLang serving framework version 0.5.3rc0 and compare performance against SGLang with deterministic execution switched on. To understand the upper limit on performance, we also compare with the non-deterministic version of SGLang. We refer to these two baselines as SGLang-Deterministic and SGLang-Non-Deterministic. We use the FA-3 attention kernel in all our experiments; we set num_splits = 1 in the verification step of LLM-42 and use the default settings for the decode phase.
Metrics: In offline inference, we evaluate throughput as the number of tokens processed per second. For online inference, we evaluate end-to-end request execution latencies and time-to-first-token. To better understand the overhead of our approach, we also report the number of rollbacks and recomputed tokens across different configurations of LLM-42.
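As a concrete illustration, the metrics above can be derived from per-request timing records. The sketch below uses hypothetical field names (`arrival`, `first_token`, `finish`, `tokens`), not the actual log schema of our harness:

```python
# Sketch: deriving throughput, TTFT, and end-to-end latency from per-request
# timing records. Field names are illustrative, not taken from LLM-42.

def compute_metrics(records):
    """records: list of dicts with arrival/first_token/finish times (seconds)
    and the number of generated tokens per request."""
    total_tokens = sum(r["tokens"] for r in records)
    span = max(r["finish"] for r in records) - min(r["arrival"] for r in records)
    throughput = total_tokens / span                              # tokens per second
    ttft = [r["first_token"] - r["arrival"] for r in records]     # time-to-first-token
    e2e = [r["finish"] - r["arrival"] for r in records]           # end-to-end latency
    return throughput, ttft, e2e

# Tiny synthetic example: two requests.
records = [
    {"arrival": 0.0, "first_token": 0.03, "finish": 1.0, "tokens": 256},
    {"arrival": 0.5, "first_token": 0.55, "finish": 2.0, "tokens": 512},
]
tput, ttft, e2e = compute_metrics(records)   # 768 tokens over 2 s -> 384 tokens/s
```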
### 5.1 Offline Inference
For offline inference, we evaluate eight workload configurations: two traces from the ArXiv and ShareGPT datasets and six other configurations with different input and output lengths. Each configuration executes 4096 requests to completion. Figure 10 reports the resulting throughput across systems measured in tokens per second; we summarize the main observations below.
Enabling deterministic inference in SGLang incurs a substantial throughput penalty, ranging from 24% (in=2048, out=512) to 36% (ShareGPT). This slowdown stems from the use of batch-invariant kernels, which are consistently slower than the standard optimized kernels, as shown in Figure 4. In contrast, LLM-42 mostly uses regular optimized kernels and therefore achieves significantly higher throughput. Even in its worst case, when 100% of requests require deterministic output, LLM-42 outperforms SGLang-Deterministic in all but one setting (ShareGPT), where it is only 6% slower.
**Total number of rollbacks**

| Dataset / Config | Deterministic ratio: 2% | 5% | 10% | 20% | 50% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| ArXiv | 70 | 170 | 372 | 697 | 1733 | 3351 |
| ShareGPT | 4 | 5 | 10 | 15 | 44 | 96 |
| in=512, out=256 | 0 | 0 | 0 | 0 | 0 | 0 |
| in=1024, out=256 | 0 | 1 | 5 | 5 | 9 | 24 |
| in=1024, out=512 | 48 | 106 | 225 | 386 | 792 | 1536 |
| in=2048, out=256 | 7 | 22 | 79 | 134 | 378 | 724 |
| in=2048, out=512 | 0 | 0 | 0 | 2 | 2 | 4 |
| in=4096, out=512 | 31 | 79 | 98 | 220 | 538 | 1087 |

**Total number of recomputed tokens**

| Dataset / Config | Deterministic ratio: 2% | 5% | 10% | 20% | 50% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| ArXiv | 1805 | 4414 | 8737 | 13198 | 37687 | 89248 (10.97%) |
| ShareGPT | 139 | 158 | 164 | 396 | 1185 | 2691 (0.32%) |
| in=512, out=256 | 0 | 0 | 0 | 0 | 0 | 0 (0%) |
| in=1024, out=256 | 0 | 44 | 151 | 151 | 288 | 776 (0.07%) |
| in=1024, out=512 | 1724 | 3272 | 6615 | 12392 | 25967 | 50454 (2.41%) |
| in=2048, out=256 | 268 | 767 | 2470 | 4080 | 12173 | 21772 (2.08%) |
| in=2048, out=512 | 0 | 0 | 0 | 81 | 81 | 101 (<0.01%) |
| in=4096, out=512 | 1134 | 2961 | 3598 | 7208 | 17161 | 32430 (1.55%) |
Table 4: Rollback and recomputation statistics (grouped verification over 8 requests, 64 tokens each). Recompute cost is shown only for the column with 100% deterministic traffic.
Crucially, LLM-42's throughput improves monotonically as the fraction of deterministic requests decreases. On ArXiv, LLM-42 operates within 8%, 2%, 1% and 1% of the best-case throughput (SGLang-Non-Deterministic) at 5%, 10%, 20%, and 50% deterministic traffic, respectively. This behavior is expected: LLM-42 imposes no computation overhead on requests that do not require determinism. As a result, LLM-42 is up to 41% faster than SGLang-Deterministic in these regimes. This trend holds across all workload configurations; for example, at 10% deterministic traffic, LLM-42 is 33% faster than SGLang-Deterministic on ShareGPT and up to 48% faster (in=2048, out=256) on other workloads.
Table 4 reports two complementary metrics that characterize the overhead of enforcing determinism in LLM-42: (1) the total number of rollbacks aggregated over all 4096 requests, and (2) the total number of recomputed tokens incurred due to these rollbacks. We find that some configurations incur zero rollbacks even under non-trivial deterministic ratios, and the average case remains low across datasets and input–output lengths. Even in the worst case, ArXiv at 100% deterministic traffic, the system triggers 3351 rollbacks over 4096 requests, i.e., fewer than one rollback per request on average. A similar trend holds for recomputation overhead: the recomputed token fraction is consistently small, with the worst case at 100% deterministic traffic reaching at most 10.97%, while the average case across datasets and configurations is substantially lower. Overall, these results indicate that both rollback frequency and recomputation cost are modest in practice.
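To make the worst-case overhead concrete, the arithmetic behind the ArXiv 100% entry in Table 4 can be checked directly. Note that the total generated-token count is not stated in the table; the sketch below infers it from the reported 10.97% fraction:

```python
# Worked check of the worst-case overhead in Table 4 (ArXiv, 100% deterministic).
rollbacks = 3351
requests = 4096
recomputed_tokens = 89248
recompute_fraction = 0.1097  # 10.97%, as reported in Table 4

rollbacks_per_request = rollbacks / requests           # ~0.82, i.e., < 1 on average
total_tokens = recomputed_tokens / recompute_fraction  # implied total generated tokens (~814k)
```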
<details>
<summary>x12.png Details</summary>

Figure: CDF of request end-to-end latency (ms), comparing SGLang non-deterministic, SGLang deterministic, and LLM-42 at deterministic-traffic ratios of 2%, 5%, 10%, 20%, 50%, and 100%.
</details>
(a) QPS=12
<details>
<summary>x13.png Details</summary>

Figure: CDF of request end-to-end latency (ms), comparing SGLang non-deterministic, SGLang deterministic, and LLM-42 at deterministic-traffic ratios of 2%, 5%, 10%, 20%, 50%, and 100%.
</details>
(b) QPS=14
<details>
<summary>x14.png Details</summary>

Figure: CDF of request end-to-end latency (ms), comparing SGLang non-deterministic, SGLang deterministic, and LLM-42 at deterministic-traffic ratios of 2%, 5%, 10%, 20%, 50%, and 100%.
</details>
(c) QPS=16
<details>
<summary>x15.png Details</summary>

Figure: CDF of request end-to-end latency (ms), comparing SGLang non-deterministic, SGLang deterministic, and LLM-42 at deterministic-traffic ratios of 2%, 5%, 10%, 20%, 50%, and 100%.
</details>
(d) QPS=18
Figure 11: CDF of request end-to-end latency in online inference with varying load for the ShareGPT dataset.
### 5.2 Online Inference
| QPS | Percentile | SGLang-Non-Det. | SGLang-Det. | LLM-42 @2% | @5% | @10% | @20% | @50% | @100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | P50 | 27.0 | 53.4 | 27.4 | 27.6 | 27.9 | 29.3 | 38.5 | 45.4 |
| | P75 | 35.4 | 69.8 | 35.8 | 36.1 | 37.1 | 41.1 | 52.0 | 58.0 |
| | P90 | 50.3 | 107.8 | 52.5 | 54.1 | 54.0 | 57.8 | 68.1 | 73.9 |
| 14 | P50 | 27.9 | 60.1 | 27.9 | 28.2 | 28.9 | 31.1 | 42.5 | 48.2 |
| | P75 | 37.0 | 79.9 | 37.5 | 37.8 | 39.5 | 45.1 | 57.0 | 61.6 |
| | P90 | 55.8 | 127.0 | 57.0 | 56.5 | 58.6 | 63.8 | 75.8 | 82.0 |
| 16 | P50 | 28.8 | 62.5 | 28.8 | 29.2 | 29.8 | 35.2 | 46.1 | 51.2 |
| | P75 | 39.0 | 86.0 | 39.7 | 40.5 | 41.6 | 51.2 | 61.0 | 65.9 |
| | P90 | 60.5 | 138.6 | 61.7 | 60.9 | 63.3 | 74.0 | 83.6 | 87.8 |
| 18 | P50 | 29.8 | 76.2 | 30.2 | 30.6 | 32.2 | 40.7 | 50.7 | 57.1 |
| | P75 | 41.4 | 105.6 | 42.7 | 43.3 | 46.3 | 57.5 | 67.3 | 75.4 |
| | P90 | 65.2 | 171.6 | 65.9 | 67.6 | 70.2 | 79.3 | 90.5 | 101.2 |

Table 5: Time-to-first-token (TTFT) latency in milliseconds under varying load (QPS) for the ShareGPT dataset.
Figure 11 reports the CDF of end-to-end latency for online inference under increasing load on the ShareGPT dataset, with each experiment running 4096 requests. Across all QPS (queries per second) values, SGLang-Deterministic exhibits a pronounced rightward shift with a long tail, reflecting substantially higher median and tail latencies. For example, at 12 QPS, SGLang-Deterministic's median (P50) latency is 4.64 seconds with a P99 of 28 seconds, compared to 2.15 seconds (P50) and 13.2 seconds (P99) for SGLang-Non-Deterministic. This gap widens further under higher load: at 18 QPS, SGLang-Deterministic reaches a P50 latency of 10.6 seconds and a P99 latency of 71.1 seconds, whereas SGLang-Non-Deterministic remains at 2.84 and 17.4 seconds, respectively. SGLang-Non-Deterministic consistently achieves the lowest latency and thus serves as a practical lower bound. In contrast, LLM-42 closely tracks the non-deterministic baseline, with only modest increases in latency as the fraction of deterministic traffic rises from 2% to 100%. At 12 QPS, LLM-42 at 2% deterministic traffic incurs only a 3% increase in median latency over SGLang-Non-Deterministic (2.21 vs. 2.15 seconds), and even at 50% deterministic traffic, its P99 latency is comparable to that of SGLang-Deterministic (at low load) or better (at higher load). This degradation is smooth and monotonic across all loads: at the highest QPS, LLM-42 maintains significantly tighter CDFs and substantially lower tail latency than SGLang-Deterministic. Only at lower QPS and when most requests require deterministic output (100%) does LLM-42 exhibit higher latency than the deterministic baseline. This behavior stems from two implementation artifacts: (1) verification currently induces a global pause that temporarily stalls all in-flight requests, and (2) prefill is not batched in our current prototype, reducing efficiency for short input prompts. We plan to address these limitations in future work.
Across all QPS levels, time-to-first-token (TTFT) latency in LLM-42 also increases monotonically with the fraction of deterministic traffic, with modest overhead at low ratios (2–10%) and higher overhead once deterministic traffic exceeds 20–50%. However, even when all traffic is deterministic, LLM-42 still provides much lower tail TTFT than SGLang-Deterministic: at QPS 18, the P90 TTFT of LLM-42 is 101.2 milliseconds vs. 171.6 milliseconds for SGLang-Deterministic.
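The percentile figures in Table 5 can be reproduced from raw TTFT samples with a simple nearest-rank percentile; the sketch below is a generic formulation, not necessarily the exact aggregation used in our measurement scripts. It also checks the quoted tail gap at QPS 18: a P90 TTFT of 101.2 ms is roughly 41% lower than 171.6 ms.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))  # 1-based nearest rank
    return xs[max(rank, 1) - 1]

# Example: percentiles of a small synthetic latency sample (ms).
data = [27.0, 28.1, 29.5, 31.2, 35.4, 41.0, 50.3, 58.7, 73.9, 101.2]
p50 = percentile(data, 50)  # 35.4 (5th of 10 sorted values)
p90 = percentile(data, 90)  # 73.9 (9th of 10 sorted values)

# Tail gap quoted in the text (QPS 18): LLM-42 @100% vs SGLang-Deterministic.
reduction = 1 - 101.2 / 171.6  # ~0.41, i.e., ~41% lower P90 TTFT
```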
### 5.3 Ablation Study
<details>
<summary>x16.png Details</summary>

Heatmap: P99 end-to-end latency (in seconds) as a function of verification batch size (1–32) and window size (16–512); empty cells were not evaluated.

| Batch size \ Window size | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 615.86 | 316.14 | 155.69 | 56.18 | 65.59 | 99.94 |
| 2 | 302.09 | 113.21 | 44.15 | 43.06 | 65.31 | |
| 4 | 97.83 | 38.57 | 35.47 | 54.03 | | |
| 8 | 36.94 | 34.18 | 46.17 | | | |
| 16 | 34.30 | 45.90 | | | | |
| 32 | 46.36 | | | | | |

Latency peaks at batch size 1 with window size 16 (615.86 s) and bottoms out around 34 s (batch size 8 with window size 32, and batch size 16 with window size 16).
</details>
(a) P99 latency
<details>
<summary>x17.png Details</summary>

### Visual Description
\n
## Heatmap: Recompute Cost vs. Batch Size and Window Size
### Overview
This image presents a heatmap visualizing the relationship between "Window Size" and "Batch Size" in relation to "Recompute Cost (%)". The heatmap displays a grid of cells, each colored according to the corresponding recompute cost percentage. The color scale ranges from approximately 2.82% (dark purple) to 46.41% (dark orange).
### Components/Axes
* **X-axis:** "Window Size" with markers at 16, 32, 64, 128, 256, and 512.
* **Y-axis:** "Batch Size" with markers at 1, 2, 4, 8, 16, and 32.
* **Color Scale (Legend):** Located on the right side of the heatmap, representing "Recompute Cost (%)". The scale ranges from approximately 5% (light blue) to 45% (dark orange).
* **Data Cells:** Each cell represents a specific combination of Window Size and Batch Size, with the recompute cost percentage displayed within the cell.
### Detailed Analysis
The heatmap reports the following recompute costs (%) by Batch Size (rows) and Window Size (columns); "–" marks configurations not shown in the figure:

| Batch Size | 16 | 32 | 64 | 128 | 256 | 512 |
|-----------:|-----:|-----:|------:|------:|------:|------:|
| 32 | 2.82 | 6.40 | – | – | – | – |
| 16 | 3.09 | 6.49 | 13.83 | – | – | – |
| 8 | 3.11 | 6.47 | 14.09 | 27.85 | – | – |
| 4 | 2.97 | 6.47 | 13.86 | 28.96 | 42.28 | – |
| 2 | 3.31 | 6.03 | 12.42 | 24.92 | 46.41 | 42.33 |
| 1 | 3.44 | 6.81 | 12.42 | 24.92 | 46.41 | 42.33 |
**Trends:**
* As Window Size increases, Recompute Cost generally increases for a given Batch Size.
* For smaller Window Sizes (16, 32), Recompute Cost remains relatively low across all Batch Sizes.
* The highest Recompute Costs are observed with Window Size = 256 and Batch Size = 2, and Window Size = 256 and Batch Size = 1, both at approximately 46.41%.
* Recompute Cost decreases when Window Size = 512 and Batch Size = 1 or 2.
### Key Observations
* The lowest recompute cost is 2.82% when Batch Size is 32 and Window Size is 16.
* There is a noticeable increase in recompute cost as the window size increases beyond 128, especially for smaller batch sizes.
* The recompute cost appears to plateau or even decrease slightly for very large window sizes (512), potentially indicating diminishing returns or optimization benefits.
### Interpretation
The heatmap demonstrates the trade-off between Window Size, Batch Size, and Recompute Cost. Larger Window Sizes generally lead to higher recompute costs, likely due to the increased computational complexity of processing larger data segments. However, the decrease in recompute cost at a Window Size of 512 suggests that there might be a point where larger windows become more efficient, possibly due to better utilization of parallel processing or optimized algorithms.
The Batch Size also influences the recompute cost, but its effect is less pronounced than that of the Window Size. The optimal combination of Window Size and Batch Size depends on the specific application and the available computational resources. This data suggests that for minimizing recompute cost, smaller window sizes and moderate batch sizes are preferable. However, if computational resources are abundant, larger window sizes might be acceptable or even beneficial.
The data suggests a non-linear relationship between the parameters. A simple linear model would not accurately capture the observed behavior, particularly the plateauing effect at larger window sizes. Further investigation might be needed to understand the underlying reasons for this behavior and to identify the optimal parameter settings for different scenarios.
</details>
(b) Recompute cost
Figure 12: The effect of different verification strategies on end-to-end latency (left) and recompute cost (right). Batch size denotes the number of requests verified together.
Grouped verification helps LLM-42 reduce both verification and recomputation cost. It is parameterized by (1) the verification window size per request and (2) the number of requests verified together in a single pass. To isolate the impact of each parameter, we run an ablation on the ShareGPT dataset, varying the per-request window size from 16 to 512 tokens and the number of requests verified together from one to 32. For each configuration, we execute 4096 requests in an online setting at 12 QPS; all requests require determinism. Figure 12 summarizes the results: 12(a) reports P99 request latency, while 12(b) shows recomputation overhead.
Without grouped verification (batch size 1, first row), latency exhibits a non-monotonic dependence on window size. Increasing the window initially reduces latency, but latency rises again when the window becomes too large. Concretely, P99 latency drops from 615 seconds at a window size of 16 to 56.18 seconds at 128, then increases to 99.94 seconds at 512. This behavior reflects the fundamental trade-off between verification and recomputation: smaller windows incur higher verification overhead, while larger windows amplify recomputation cost. For example, recomputation cost is 42.33% at window size 512, compared to only 3.44% at window size 16. Without grouped verification, the best performance occurs at a window size of 128, where P99 latency is 56.18 seconds.
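This verification/recomputation trade-off can be illustrated with a toy cost model (the constants below are illustrative assumptions, not measurements from our system): each verification pass pays a fixed launch overhead that is amortized over the window, while a determinism violation anywhere in the window forces a rollback that discards, on average, about half of the speculated tokens.

```python
def expected_cost_per_token(window, pass_overhead=2.0, per_token=1.0,
                            violation_rate=0.002):
    """Toy model of verification cost per committed token.

    pass_overhead : fixed cost of launching one verification pass
    per_token     : cost of verifying one token
    violation_rate: probability that any single token mismatches under
                    the fixed-shape replay, forcing a rollback
    All constants are illustrative, not measured values.
    """
    # Fixed pass overhead is amortized over the window.
    verify = pass_overhead / window + per_token
    # Probability that at least one token in the window mismatches.
    p_rollback = 1 - (1 - violation_rate) ** window
    # A rollback discards ~half a window on average, spread over the
    # window's worth of committed tokens.
    recompute = p_rollback * (window / 2) * per_token / window
    return verify + recompute

for w in [16, 32, 64, 128, 256, 512]:
    print(w, round(expected_cost_per_token(w), 3))
```

Under these assumptions the per-token cost is non-monotonic in the window size: it first falls as the pass overhead is amortized, then rises as rollbacks dominate, mirroring the U-shaped latency curve in the first row of the heatmap.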
Grouped verification substantially lowers end-to-end latency. Even batching just two requests reduces P99 latency to 43.06 seconds (best case in the second row from the bottom). Increasing the batch size yields further gains, with the best overall configurations verifying a total of 256 tokens per step, distributed across 16, 8, or 4 requests, all achieving P99 tail latencies of 34–35 seconds.
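The grouping itself can be sketched as a greedy packing step (a sketch with hypothetical names, not the actual LLM-42 scheduler): take up to `window` tokens from each of up to `max_requests` requests until a fixed token budget is filled, so that a single verification pass covers several requests.

```python
def pack_verification_batch(pending, window, max_requests, token_budget=256):
    """Greedily pack unverified tokens from several requests into one
    verification pass with a fixed total token budget.

    pending      : {request_id: number of unverified tokens}
    window       : max tokens taken from any single request
    max_requests : max number of requests verified together
    Returns {request_id: tokens_to_verify}. Illustrative sketch only.
    """
    batch, used = {}, 0
    # Serve the requests with the longest unverified tails first.
    for rid, n in sorted(pending.items(), key=lambda kv: -kv[1]):
        if len(batch) == max_requests or used == token_budget:
            break
        take = min(n, window, token_budget - used)
        if take > 0:
            batch[rid] = take
            used += take
    return batch

# e.g. four requests with 64-token windows and a 256-token budget
print(pack_verification_batch({"a": 80, "b": 70, "c": 64, "d": 100},
                              window=64, max_requests=4))
```

Packing several requests into one pass amortizes the per-pass verification overhead while keeping each request's window, and hence its rollback exposure, small.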
It is worth noting that the recomputation cost in the offline experiments (typically below 2%, as shown in Table 4) is lower than in the online inference setting (typically more than 6%, as shown in 12(b)). This difference arises because in offline inference all requests are available up front and the system tries to maximize memory utilization, resulting in more stable batch sizes. In contrast, online inference experiences significantly greater batch-size variability due to fluctuating load and request arrival and departure times; naturally, higher recomputation rates correlate with this increased batch-size variability.
### Discussion
Evaluation across models and multiple GPUs: While we report performance results only for Llama-3.1-8B-Instruct in this paper, we have evaluated the correctness of our approach across multiple additional models, including Qwen-4B-Instruct-2507, Qwen3-14B, and Qwen3-30B-A3B-Instruct-2507, and across 1-4 GPUs with tensor parallelism. Our current experimental setup does not support multimem/NVLS and is limited to pairwise NVLink connectivity; consequently, fairly evaluating multi-GPU performance would require a different platform, which we leave to future work. Supporting these models required no additional code changes in LLM-42, which highlights the simplicity of our approach.
Limitations: In our current prototype, the verification pass introduces latency overhead for all requests, even when determinism is enabled selectively. This overhead could be mitigated by confining verification to a subset of GPU SMs or by deferring verification batches, for example by prioritizing regular prefills and decodes over verification. A second limitation is that prefill and decode use different reduction strategies in LLM-42, so the system is not prefill-decode invariant. As a result, LLM-42 currently does not support sharing prefix caches across multiple turns of the same request or across requests. Our implementation also does not yet integrate with speculative-decoding-based LLM inference. Addressing these limitations is interesting future work.
## 6 Related Work
Recent advances in LLM inference systems have focused on optimizing throughput, latency, and resource utilization [orca, vllmsosp, sarathiserve2024, tetriinfer, distserve2024, vattention2024, dynamollm2024, pod-attn]. Orca [orca] pioneered continuous batching, enabling low-latency serving by dynamically scheduling requests at the granularity of individual iterations. vLLM [vllmsosp] introduced PagedAttention, a memory management abstraction that allows higher batch sizes and reuse of KV cache across requests, dramatically improving system throughput. Sarathi-Serve [sarathiserve2024] further improved the throughput-latency trade-off through chunked prefills to better exploit GPU parallelism. DistServe [distserve2024] and Splitwise [patel2023splitwise] proposed disaggregated inference architectures that separate prefill and decode workloads across specialized servers, mitigating interference. Complementary lines of work such as speculative decoding [leviathan2022fast, chen2023accelerating, mamou2024dynamic] and FlexGen [flexgen] pursue orthogonal optimizations by reducing the number of autoregressive steps or offloading memory to CPU and SSDs. While all these systems achieve impressive performance gains, they typically assume non-deterministic execution, leaving the trade-offs between determinism and efficiency unexplored.
Enabling determinism in LLM inference has recently drawn increasing attention. Song et al. [song2024greedy] argued that inference non-determinism can distort evaluation results, calling for reproducibility-aware benchmarking. Yuan et al. [yuan2025fp32death] further linked reproducibility failures to mixed-precision arithmetic and fused kernel inconsistencies, showing their impact on multi-step reasoning accuracy. Rainbird AI [rainbird2025deterministic] emphasized deterministic inference as a requirement for traceability and compliance in enterprise and safety-critical domains.
Recent work by Zhang et al. proposes tensor-parallel-invariant kernels that eliminate training-inference mismatch by ensuring bitwise deterministic results across different tensor parallel sizes for LLM inference [zhang2025-deterministic-tp]. In contrast, existing LLM serving systems typically achieve determinism through batch-invariant computation. We examine the performance and engineering costs of this approach and propose an alternative mechanism for enabling determinism in LLM inference. By leveraging properties of GPU kernels together with characteristics of LLM inference workloads, our approach seeks to reconcile reproducibility with efficiency, a design space that remains largely unexplored in prior work.
## 7 Conclusion
Enabling determinism in LLM inference remains tedious today. Existing systems achieve determinism by enforcing batch-invariant computation, an approach that is cumbersome in practice: it requires rewriting kernels at a time when both hardware and model architectures are evolving rapidly. Moreover, batch-invariant kernels are inherently suboptimal, as they prevent kernels from adapting their parallelism strategies at runtime. We present LLM-42, a simpler alternative that enables determinism in LLM inference by repurposing speculative decoding. LLM-42 minimizes the need to write new kernels and supports selective enforcement of determinism, incurring runtime overhead only for the fraction of traffic that actually requires it.
## References