# LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
Abstract
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism.
Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. (Many developers set 42 as a random seed to ensure reproducibility, a reference to The Hitchhiker's Guide to the Galaxy; the number itself is arbitrary, but the tradition is a playful one.) Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those that violate determinism. LLM-42 mostly reuses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
1 Introduction
LLMs are becoming increasingly powerful [openai2022gpt4techreport, kaplan2020scalinglaws]. However, a key challenge many users face with LLMs is their non-determinism [he2025nondeterminism, atil2024nondeterminism, yuan2025fp32death, song2024greedy]: the same model can produce different outputs across different runs of a given prompt, even with identical decoding parameters and hardware. Enabling determinism in LLM inference (aka deterministic inference) has gained significant traction recently for multiple reasons. It helps developers isolate subtle implementation bugs that arise only under specific batching choices; it improves reward stability in reinforcement-learning training [zhang2025-deterministic-tp]; and it is essential for integration testing in large-scale systems. Moreover, determinism underpins scientific reproducibility [song2024greedy] and enables traceability [rainbird2025deterministic]. Consequently, users have increasingly requested support for deterministic inference in LLM serving engines [Charlie2025, Anadkat2025consistent].
He et al. [he2025nondeterminism] showed that non-determinism in LLM inference at the system level stems from the non-associativity of floating-point arithmetic. (LLM outputs can also vary due to differences in sampling strategies, e.g., top-k, top-p, or temperature. Our goal is to ensure that output is deterministic for fixed sampling hyper-parameters; different hyper-parameters can produce different output, and this behavior is intentional.) This effect manifests in practice because most core LLM operators, including matrix multiplications, attention, and normalization, rely on reduction operations, and GPU kernels for these operators choose different reduction schedules to maximize performance at different batch sizes. The same study also proposed batch-invariant computation as a means to eliminate non-determinism. In this approach, a kernel processes each input token using a single, universal reduction strategy independent of batching. Popular LLM serving systems such as vLLM and SGLang have recently adopted this approach [SGLangTeam2025, vllm-batch-invariant-2025].
While batch-invariant computation guarantees determinism, we find that this approach is fundamentally over-constrained. Enforcing a fixed reduction strategy for every token, regardless of model phase or batch geometry, strips GPU kernels of the very parallelism they are built to exploit. For example, the batch-invariant GEMM kernels provided by He et al. do not use the split-K strategy that is otherwise commonly used to accelerate GEMMs at low batch sizes [tritonfusedkernel-splitk-meta, nvidia_cutlass_blog]. Furthermore, most GPU kernels are not batch-invariant to begin with, so insisting on batch-invariant execution effectively demands new kernel implementations, increasing engineering and maintenance overhead. Finally, batch-invariant execution makes determinism the default for all requests, even when determinism is undesirable or even harmful [det-inf-kills].
Our observations suggest that determinism can be enabled with a simpler approach. (O1) Token-level inconsistencies are rare: as long as a sequence remains in a consistent state, the next emitted token is mostly identical across runs; sequence-level divergence arises mainly from autoregressive decoding after the first inconsistent token. (O2) Most GPU kernels already use shape-consistent reduction schedules: they apply the same reduction strategy on all inputs of a given shape, but potentially different reduction strategies on inputs of different shapes. (O3) Determinism requires only position-consistent reductions: a particular token position must use the same reduction schedule across runs, but different positions within or across sequences can use different reduction schedules. (O4) Real-world LLM systems require determinism only for select tasks (e.g., evaluation, auditing, continuous-integration pipelines), while creative workloads benefit from the stochasticity of LLMs.
Based on these observations, we introduce LLM-42, a scheduling-based approach to deterministic LLM inference inspired by speculative decoding. In speculative decoding [specdecoding-icml2023, chen2023accelerating, xia2023speculative, specinfer-2024], a fast path produces multiple candidate tokens while a verifier validates their correctness. We observe that the same structure can be repurposed to enable determinism (see Figure 1): the fast path can optimize for high-throughput token generation while a verifier enforces determinism. Table 1 highlights key differences between conventional speculative decoding and our approach.
[Figure: figures_h100_pcie/diagram/banner.png. Flowchart: a user prompt feeds "Generate N tokens auto-regressively" (reached either via regular decoding with dynamic batching or via a fixed input shape without dynamic batching), followed by "Replay N tokens in parallel", "Compare N tokens", and "Accept matching tokens", with a feedback edge "Recompute mismatched tokens" looping back to generation.]
Figure 1: Overview of LLM-42.
LLM-42 employs a decode-verify-rollback protocol that decouples regular decoding from determinism enforcement. It generates candidate output tokens using standard fast-path autoregressive decoding, whose output is largely, but not provably, consistent across runs (O1). Determinism is guaranteed by a verifier that periodically replays a fixed-size window of recently generated tokens. Because the input shape is fixed during verification, replayed tokens follow a consistent reduction order (O2) and serve as a deterministic reference execution. Tokens are released to the user only after verification. When the verifier detects a mismatch, LLM-42 rolls the sequence back to the last matching token and resumes decoding from this known-consistent state. Two factors make this approach practical: (1) verification is cheap: like prefill, it is typically one to two orders of magnitude more efficient than the decode phase; and (2) rollbacks are infrequent: more than half of all requests complete without any rollback, and only a small fraction require multiple rollbacks.
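The decode-verify-rollback loop can be summarized with a toy sketch. Everything below (the function names, the hash-based stand-in for a model, and the policy of committing the verifier's token after a mismatch) is our own illustration under simplifying assumptions, not the paper's implementation:

```python
import random

def verified_decode(fast_step, replay, prompt, n_tokens, window=4):
    """Toy decode-verify-rollback loop.

    fast_step: non-deterministic fast-path next-token function.
    replay:    deterministic fixed-shape replay of n positions (the verifier).
    """
    out = []
    while len(out) < n_tokens:
        ctx = prompt + out
        # Fast path: generate a window of candidate tokens autoregressively.
        cand = []
        for _ in range(min(window, n_tokens - len(out))):
            cand.append(fast_step(ctx + cand))
        # Verifier: replay the same window under a fixed input shape.
        ref = replay(ctx, len(cand))
        # Commit the longest matching prefix of candidates.
        n_ok = 0
        while n_ok < len(cand) and cand[n_ok] == ref[n_ok]:
            n_ok += 1
        if n_ok == len(cand):
            out.extend(cand)
        else:
            # Rollback: keep the consistent prefix plus the verifier's own
            # deterministic token (a simplification; the paper resumes the
            # fast path from the known-consistent state).
            out.extend(cand[:n_ok])
            out.append(ref[n_ok])
    return out

# Toy deterministic "model": next token is a hash of the context.
def ref_next(ctx):
    return hash(tuple(ctx)) % 1000

def replay(ctx, n):
    toks = []
    for _ in range(n):
        toks.append(ref_next(ctx + toks))
    return toks

rng = random.Random(42)
def fast_step(ctx):
    t = ref_next(ctx)
    return t + 1 if rng.random() < 0.1 else t  # occasional inconsistency

out = verified_decode(fast_step, replay, [1, 2, 3], 16)
assert out == replay([1, 2, 3], 16)  # output equals the deterministic reference
```

Because only verified tokens are committed, the final output always equals the deterministic replay sequence, regardless of how often the fast path misbehaves.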
| Speculative decoding | LLM-42 |
| --- | --- |
| Fast path generates tokens using some form of approximation | No approximation; only floating-point rounding errors |
| Low acceptance rate and hence limited speculation depth (2-8 tokens) | High acceptance rate and hence longer speculation (32-64 tokens) |
| Typically uses separate draft and target models | Uses the same model for decoding and verification |
Table 1: Speculative decoding vs. LLM-42.
By decoupling determinism from token generation, LLM-42 enables determinism to be enforced selectively, preserving the natural variability and creativity of LLM outputs where appropriate. This separation also helps performance: the fast path can use whatever batch sizes and reduction schedules are most efficient; prefill computation can follow a different reduction strategy than decode (O3), and its execution need not be verified (in our design, prefill is deterministic by construction); and verification can be skipped entirely for requests that do not need determinism (O4).
The efficiency of our approach critically depends on the size of the verification window, i.e., the number of tokens verified together. Smaller windows incur high verification overhead, since their computation is largely memory-bound, but require less recomputation on verification failures. In contrast, larger windows incur lower verification cost by being compute-bound, but increase recomputation cost by triggering longer rollbacks on mismatches. To balance this trade-off, we introduce grouped verification: instead of verifying a large window of a single request, we verify smaller fixed-size windows of multiple requests together. This design preserves the low rollback cost of small windows while amortizing verification overhead. Overall, we make the following contributions:
- We present the first systematic analysis of batch-invariant computation, highlighting the performance and engineering costs associated with this approach.
- We present LLM-42, an alternative approach to enabling determinism in LLM inference.
- We implement LLM-42 on top of SGLang and show that its overhead is proportional to the fraction of traffic that requires determinism; it retains near-peak performance when deterministic traffic is low, whereas SGLang incurs overhead of up to 56% in deterministic mode. Our source code will be available at https://github.com/microsoft/llm-42.
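The grouped-verification idea above can be sketched as a simple batching policy. This is a minimal illustration with an assumed bookkeeping structure (a map from request id to its unverified tokens); it is not the paper's scheduler:

```python
def group_windows(pending, window=8, group=4):
    """Form one verification batch from several requests' small windows.

    pending: dict mapping request id -> list of unverified tokens.
    Returns up to `group` (request_id, tokens) pairs, each a fixed-size
    window, so the verifier always sees the same input shapes.
    """
    batch = []
    for rid, toks in pending.items():
        if len(toks) >= window:          # only full windows keep shapes fixed
            batch.append((rid, toks[:window]))
        if len(batch) == group:
            break
    return batch

# Example: two requests have accumulated full windows, one has not.
pending = {"r1": list(range(12)), "r2": list(range(9)), "r3": list(range(3))}
batch = group_windows(pending)
```

A mismatch in any one request then rolls back only that request's small window, while the batched replay keeps verification compute-bound.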
2 Background and Motivation
[Figure: figures_h100_pcie/diagram/llm-arch.png. Autoregressive LLM architecture: Input, Embedding, then L repeated layers (Attention with KV cache, AllReduce RMSNorm, FFN, AllReduce RMSNorm), followed by the Sampler and Output, with the output fed back as the next input.]
Figure 2: High-level architecture of LLMs.
This section introduces LLMs, explains the source of non-determinism in LLM inference, and quantifies the cost of deterministic inference based on batch-invariant computation.
2.1 Large Language Models
Transformer-based LLMs compose a stack of identical decoder blocks, each containing a self-attention module, a feed-forward network (FFN), normalization layers and communication primitives (Figure 2). The hidden state of every token flows through these blocks sequentially, but within each block the computation is highly parallel: attention performs matrix multiplications over the key-value (KV) cache, FFNs apply two large GEMMs surrounding a nonlinearity, and normalization applies a per-token reduction over the hidden dimension.
Inference happens in two distinct phases, namely prefill and decode. Prefill processes prompt tokens in parallel, generating the KV cache for all input tokens. This phase is dominated by large parallel computation. Decode is autoregressive: each step consumes the most recent token, updates the KV cache with a single new key and value, and produces the next output token. Decode is therefore sequential within a sequence but parallel across the other requests in the batch.
2.2 Non-determinism in LLM Inference
In finite precision, arithmetic operations such as accumulation are non-associative, meaning that $(a+b)+c \neq a+(b+c)$. Non-determinism in LLM inference stems from this non-associativity when combined with dynamic batching, a standard technique to achieve high-throughput inference. With dynamic batching, the same request may be co-located with different sets of requests across different runs, resulting in varying batch sizes. Further, GPU kernels adapt their parallelization strategies, and consequently their reduction strategies, based on the input sizes. As a result, the same logical operation can be evaluated with different floating-point accumulation orders depending on the batch it appears in, leading to inconsistent numerical results.
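Non-associativity is easy to demonstrate in plain Python floats (IEEE 754 double precision):

```python
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed into 1e20 (below its rounding unit),
                     # then cancelled: result 0.0
right = a + (b + c)  # 1e20 cancels first, leaving 0.1

print(left, right)   # 0.0 0.1
assert left != right  # floating-point addition is not associative
```

The same effect, at a much smaller scale, is what different reduction orders introduce inside GPU kernels.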
Reductions are common in LLM inference, appearing in matrix multiplications (GEMMs), attention, normalization and collective communication such as AllReduce. GEMM is the most common and time-consuming operator. High-performance GEMM implementations on GPUs use hierarchical tiling and parallel reductions to improve occupancy and memory reuse. A common optimization is split-K parallelism, where the reduction dimension is partitioned across multiple thread blocks. Each block computes a partial result, which is then combined via an additional reduction step. Whether split-K is used, and how many splits are chosen, depends on the shape of the input matrices. These choices directly change the reduction tree, as shown in Figure 3, producing different numerical results for a given token across runs. Similarly, attention kernels split work across the key-value dimension to increase SM utilization, followed by a reduction to combine partial results. Normalization operators reduce across feature dimensions. As inference progresses, batch-dependent reduction choices in these operators introduce small numerical drifts that propagate across kernels, layers and decoding steps, eventually influencing output tokens even when the sampling hyper-parameters are fixed.
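The effect of the split count on a dot product can be mimicked on the CPU. The sketch below is illustrative only (plain Python floats rather than GPU tiles): varying the number of split-K partitions changes the accumulation order and, typically, the last bits of the result, while any fixed schedule is perfectly reproducible.

```python
import random

rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(4096)]
b = [rng.gauss(0, 1) for _ in range(4096)]

def dot_split_k(a, b, k):
    """Dot product with a split-K-style reduction: partition the K dimension
    into k chunks, reduce each chunk sequentially, then combine partials."""
    n = len(a)
    bounds = [n * i // k for i in range(k + 1)]
    partials = []
    for lo, hi in zip(bounds, bounds[1:]):
        s = 0.0
        for x, y in zip(a[lo:hi], b[lo:hi]):
            s += x * y
        partials.append(s)
    total = 0.0
    for p in partials:
        total += p
    return total

# Different split counts reorder the accumulation and usually change the result...
assert len({dot_split_k(a, b, k) for k in (1, 2, 4, 8, 16)}) > 1
# ...but each fixed schedule is deterministic across calls (and across runs).
assert dot_split_k(a, b, 8) == dot_split_k(a, b, 8)
```

This is exactly observation O3 in miniature: determinism needs a consistent schedule per position, not one universal schedule for all shapes.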
[Figure: figures_h100_pcie/diagram/gemm_wo_splitk.png. Dot product C = A0*B0 + A1*B1 + A2*B2 + A3*B3 computed with a single sequential accumulation tree.]
(a) Without split-K.
[Figure: figures_h100_pcie/diagram/gemm_w_splitk.png. The same dot product with split-K: (A0*B0 + A1*B1) and (A2*B2 + A3*B3) are reduced in parallel, and the two partial sums are combined in a final addition.]
(b) With split-K.
Figure 3: GEMM kernels compute dot products using standard accumulation or split-K parallelization. While split-K increases parallelism, it alters the reduction tree based on K.
2.3 Defeating Non-determinism
He et al. at Thinking Machines Lab recently introduced a new approach named batch-invariant computation to enable deterministic LLM inference [he2025nondeterminism]. In this approach, a GPU kernel is constrained to use a single, universal reduction strategy for all tokens, eliminating batch-dependent reductions. Both SGLang and vLLM use this approach [SGLangTeam2025, vllm-batch-invariance-2025, vllm-batch-invariant-2025]: these systems either deploy the batch-invariant kernels provided by He et al. [tmops2025] or implement new kernels [batch-inv-inf-vllm, det-infer-batch-inv-ops-sglang].
[Figure: x1.png. Log-log plot of GEMM throughput (TFLOPS) vs. number of tokens (1 to 8192). The non-batch-invariant cuBLAS kernel outperforms the batch-invariant Triton kernel at every token count, and both curves plateau at large token counts.]
(a) GEMM
[Figure: x2.png. RMSNorm execution time (ms) vs. number of tokens (1 to 4096) for three implementations. The batch-invariant Python version scales poorly beyond 512 tokens, while the batch-invariant Triton kernel stays close to the non-batch-invariant CUDA kernel, which is fastest at large token counts.]
(b) RMSNorm
Figure 4: Performance comparison between batch-invariant vs. non-batch-invariant kernels.
While batch-invariant computation eliminates non-determinism, we find that it is a poor fit for real LLM serving systems. By enforcing a universal reduction strategy across all executions, it couples determinism to kernel design and sacrifices performance opportunities. It also turns determinism into a fixed tax paid by every request: dynamic batching aggregates requests, but batch-invariant kernels eliminate the very optimizations, such as split-K and shape-aware tiling, that make batching effective in the first place. Worse, because existing kernels are not batch-invariant, adopting this approach requires maintaining a parallel kernel stack solely for determinism. We quantify the performance cost of this approach below.
GEMM. Figure 4(a) compares the throughput of cuBLAS-based GEMMs used in PyTorch against Triton-based batch-invariant kernels developed by He et al. The matrix dimensions correspond to the down-projection operation of the Llama-3.1-8B-Instruct model's feed-forward network. On our GPU, cuBLAS (via torch.mm) reaches up to 527 TFLOPS, whereas the batch-invariant kernel peaks at 194 TFLOPS, a slowdown of 63%. This gap arises because the Triton-based batch-invariant implementation does not use split-K or exploit newer hardware features such as Tensor Memory Accelerators [TMA_Engine] or advanced techniques like warp specialization [fa-3], all of which are leveraged by PyTorch through vendor-optimized cuBLAS kernels.
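The slowdown figure follows directly from the GEMM operation count. A minimal sketch of the arithmetic (the down-projection shapes K=14336 and N=4096 for Llama-3.1-8B, and the 0.913 ms timing used to illustrate 527 TFLOPS at M=4096, are illustrative assumptions, not measurements from the paper):

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    # An (m x k) @ (k x n) GEMM performs 2*m*n*k floating-point operations.
    return 2 * m * n * k / seconds / 1e12

def slowdown(fast_tflops: float, slow_tflops: float) -> float:
    # Relative throughput loss of the slower kernel.
    return 1.0 - slow_tflops / fast_tflops

# At M=4096 tokens, finishing in ~0.913 ms corresponds to ~527 TFLOPS.
print(round(gemm_tflops(4096, 4096, 14336, 0.913e-3)))  # ~527
# 527 TFLOPS (cuBLAS) vs. 194 TFLOPS (batch-invariant Triton):
print(round(slowdown(527.0, 194.0) * 100))  # 63 (percent)
```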
RMSNorm. Figure 4(b) compares RMSNorm execution time for a varying number of tokens across three implementations: the Python-based version used in SGLang, Thinking Machines' Triton-based kernel, and SGLang's default CUDA kernel. The first two are batch-invariant; the CUDA kernel is not. The Python implementation is up to 7× slower than the non-batch-invariant CUDA kernel due to unfused primitive operations and poor shared-memory utilization. The Triton kernel performs substantially better but remains up to 50% slower than the fused CUDA implementation, which benefits from optimized reductions and kernel fusion. These overheads are amplified at high batch sizes or long context lengths, where normalization may account for a nontrivial fraction of inference time [gond2025tokenweave].
<details>
<summary>x3.png Details</summary>

### Visual Description
## Bar Chart: Decode Throughput vs. Batch Size
### Overview
The image is a bar chart comparing the decode throughput (tokens/sec) for different configurations across two batch sizes (10 and 11). The chart compares "SGLang non-deterministic", "SGLang deterministic", and "LLM-42".
### Components/Axes
* **X-axis:** Batch Size, with values 10 and 11.
* **Y-axis:** Decode Throughput (tokens/sec), ranging from 0 to 1200.
* **Legend (top-left):**
* SGLang non-deterministic (light blue)
* SGLang deterministic (light red/brown)
* LLM-42 (light green)
### Detailed Analysis
* **Batch Size 10:**
* SGLang non-deterministic: Approximately 840 tokens/sec.
* SGLang deterministic: Not present for batch size 10.
* LLM-42: Not present for batch size 10.
* **Batch Size 11:**
* SGLang non-deterministic: Approximately 940 tokens/sec.
* SGLang deterministic: Approximately 420 tokens/sec.
* LLM-42: Approximately 520 tokens/sec.
### Key Observations
* For batch size 10, only SGLang non-deterministic data is available.
* For batch size 11, all three configurations (SGLang non-deterministic, SGLang deterministic, and LLM-42) are present.
* SGLang non-deterministic throughput increases from batch size 10 to 11.
### Interpretation
The chart compares the decode throughput of different language model configurations at batch sizes 10 and 11. The data suggests that increasing the batch size from 10 to 11 improves the throughput of SGLang non-deterministic. The introduction of SGLang deterministic and LLM-42 at batch size 11 allows for a direct comparison of their performance. LLM-42 has a slightly higher throughput than SGLang deterministic at batch size 11.
</details>
Figure 5: Decode throughput under different scenarios.
End-to-end throughput. Figure 5 measures token generation throughput (tokens per second) under three scenarios: (1) 10 requests running in non-deterministic mode, (2) 11 requests running in non-deterministic mode, and (3) 11 requests running in deterministic mode where only one of them requires deterministic output. With 10 concurrent non-deterministic requests, the system generates 845 tokens/s. The batch size increases to 11 when a new request arrives; if decoding continues non-deterministically, throughput improves to 931 tokens/s (a jump of about 10%). In contrast, if the new request requires determinism, the entire batch is forced to execute through the slower batch-invariant kernels, causing throughput to collapse by 56% to about 415 tokens/s, penalizing every in-flight request for a single deterministic one. This behavior is undesirable because it couples the performance of all requests to that of the slowest request.
Overall, these results show that batch-invariant execution incurs a substantial performance penalty. While it may be feasible to improve the performance of batch-invariant kernels, doing so would require extensive model- and hardware-specific kernel development. This engineering and maintenance burden makes the approach difficult to sustain in practice. This may be why deterministic inference is largely confined to debugging and verification today, rather than being deployed for real-world LLM serving.
3 Observations
In this section, we distill a set of concrete observations about non-determinism, GPU kernels, and LLM use cases. These observations expose why batch-invariant computation is overly restrictive and motivate a more general approach to enabling determinism in LLM inference.
Observation-1 (O1). If a sequence is already in a consistent state, the next emitted token is usually consistent even under dynamic batching. However, once a token diverges, autoregressive decoding progressively amplifies this difference over subsequent steps.
This is because tokens become inconsistent only when floating-point drift is large enough to alter the effective decision made by the sampler, e.g., by changing the relative ordering or acceptance of high-probability candidates under the decoder's sampling policy (greedy or stochastic). In practice, such boundary-crossing events are rare, as numerical drift typically perturbs logits only slightly. However, autoregressive decoding amplifies even a single such deviation: once a token differs, all subsequent tokens may diverge. Since a single request typically produces hundreds to thousands of output tokens, two sequence-level outputs can look dramatically different even if the initial divergence is caused by a single token flip induced by a different reduction order across runs.
To demonstrate this phenomenon empirically, we conduct an experiment using the Llama-3.1-8B-Instruct model on the ShareGPT dataset. We first execute 350 requests with batch size one, i.e., without dynamic batching, to obtain reference ("ground-truth") output tokens. We then re-run the same requests under dynamic batching at a load of 6 queries per second and compare each request's output against its reference. In both runs, we fix the output length to 512 tokens.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Token Span Analysis
### Overview
The image is a line chart comparing the number of tokens for 'first_consistent_span', 'second_consistent_span', and 'output_span' across different 'Request Ids'. The x-axis represents 'Request Id' ranging from 0 to 80, and the y-axis represents '# Tokens' ranging from 0 to 500.
### Components/Axes
* **X-axis:** 'Request Id', with tick marks at intervals of 10, ranging from 0 to 80.
* **Y-axis:** '# Tokens', with tick marks at intervals of 100, ranging from 0 to 500.
* **Legend:** Located at the top of the chart, indicating:
* 'first_consistent_span' (blue line)
* 'second_consistent_span' (green line)
* 'output_span' (red line)
### Detailed Analysis
* **first_consistent_span (blue line):** This line shows significant fluctuations across different 'Request Ids'. The values range approximately from 20 to 520. The area under the curve is shaded light blue.
* At Request Id 2, the value is approximately 240 tokens.
* At Request Id 5, the value drops to approximately 180 tokens.
* At Request Id 10, the value rises to approximately 510 tokens.
* At Request Id 25, the value drops to approximately 50 tokens.
* At Request Id 35, the value rises to approximately 520 tokens.
* At Request Id 45, the value drops to approximately 180 tokens.
* At Request Id 55, the value rises to approximately 520 tokens.
* At Request Id 65, the value drops to approximately 100 tokens.
* At Request Id 75, the value rises to approximately 200 tokens.
* **second_consistent_span (green line):** This line remains consistently low, close to 0 tokens, with a few spikes.
* At Request Id 12, the value is approximately 80 tokens.
* At Request Id 20, the value is approximately 60 tokens.
* At Request Id 60, the value is approximately 20 tokens.
* **output_span (red line):** This line remains constant at approximately 510 tokens across all 'Request Ids'.
### Key Observations
* The 'first_consistent_span' exhibits high variability in the number of tokens, while 'output_span' remains constant.
* The 'second_consistent_span' generally stays near zero, with occasional spikes.
### Interpretation
The chart illustrates the token usage for different spans across a series of requests. The 'output_span' consistently uses a fixed number of tokens (approximately 510), suggesting a fixed output size. The 'first_consistent_span' shows variable token usage, indicating that the number of tokens required for the first consistent span changes significantly with different requests. The 'second_consistent_span' uses very few tokens, suggesting it plays a minor role in the overall process. The variability in 'first_consistent_span' could be due to the complexity or length of the input requests.
</details>
Figure 6: Length of the first and second consistent span (number of tokens that match with the ground-truth) for different requests under dynamic batching.
We quantify divergence using two metrics. The first consistent span of a request measures the number of initial output tokens that match exactly across the two runs, while the second consistent span measures the number of matching tokens between the first and second divergence points. Figure 6 shows these metrics for 80 requests. In the common case, hundreds of initial tokens are identical across both runs, with some requests exhibiting an exact match of all 512 tokens in the first consistent span. However, once a single token diverges, the sequence rapidly drifts: the second consistent span is near zero for most requests, indicating that divergence quickly propagates through the remainder of the output.
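The two span metrics are easy to compute from the reference and batched outputs. A minimal sketch, with token sequences represented as lists of integer ids:

```python
def consistent_spans(reference, output):
    """First consistent span: number of leading tokens matching the reference.
    Second consistent span: matching tokens between the first and second
    divergence points (or until the end, if there is no second divergence)."""
    n = min(len(reference), len(output))
    divergences = [i for i in range(n) if reference[i] != output[i]]
    if not divergences:
        return n, 0  # fully consistent run
    first = divergences[0]
    second_end = divergences[1] if len(divergences) > 1 else n
    return first, second_end - first - 1

print(consistent_spans([1, 2, 3, 4, 5], [1, 2, 9, 4, 8]))  # (2, 1)
```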
<details>
<summary>figures/position-invariant-kernel.png Details</summary>

### Visual Description
## Diagram: Kernel Runs
### Overview
The image depicts two separate runs (Run-1 and Run-2) of a "Kernel" process. Each run takes different inputs (T0, T1 for Run-1 and T1, T2 for Run-2) and produces corresponding outputs (T0', T1' for Run-1 and T1'', T2'' for Run-2). A red circle highlights T1' in Run-1 and T1'' in Run-2, possibly indicating a point of interest or a change.
### Components/Axes
* **Run-1 (Left Side):**
* Inputs: T0, T1
* Process: Kernel (represented by a light blue rounded rectangle)
* Outputs: T0', T1' (T1' is circled in red)
* **Run-2 (Right Side):**
* Inputs: T1, T2
* Process: Kernel (represented by a light blue rounded rectangle)
* Outputs: T1'', T2'' (T1'' is circled in red)
* **Arrows:** Black arrows indicate the flow of data from inputs to the Kernel and from the Kernel to outputs.
### Detailed Analysis or ### Content Details
* **Run-1:**
* Input T0 flows into the Kernel and produces output T0'.
* Input T1 flows into the Kernel and produces output T1'. T1' is highlighted with a red circle.
* **Run-2:**
* Input T1 flows into the Kernel and produces output T1''. T1'' is highlighted with a red circle.
* Input T2 flows into the Kernel and produces output T2''.
### Key Observations
* The diagram shows two distinct runs of the same "Kernel" process.
* The inputs and outputs are labeled with "T" followed by a subscript and potentially one or two apostrophes.
* The red circles around T1' and T1'' suggest these outputs are significant or different in some way.
* Run-1 takes T0 and T1 as input, while Run-2 takes T1 and T2 as input.
### Interpretation
The diagram likely illustrates a process where a "Kernel" transforms input data. The apostrophes on the outputs (T0', T1', T1'', T2'') likely indicate transformations or versions of the original inputs. The red circles around T1' and T1'' in each run could indicate a specific output being monitored, flagged for further analysis, or representing an error state. The diagram suggests that the Kernel process is being run multiple times with different inputs, and the outputs are being compared or analyzed. The fact that T1 is an input to both runs could indicate a dependency or a common element between the two processes.
</details>
Figure 7: A position-invariant kernel produces the same output for a given input element irrespective of its position in the batch, as long as the total batch size is fixed. In this example, $T_{1}^{\prime}=T_{1}^{\prime\prime}$ if the kernel is position-invariant.
| Category | Operator | Batch-Invariant | Position-Invariant |
| --- | --- | --- | --- |
| Matmul | cuBLAS GEMM | ✗ | ✓ |
| Attention | FlashAttention-3 † | ✓ | ✓ |
| Communication | Multimem-based AllReduce ‡ | ✓ | ✓ |
| | Ring-based AllReduce | ✓ | ✓ |
| | Tree-based AllReduce § | ✓ | ✓ |
| Normalization | RMSNorm ¶ | ✗ | ✓ |
| | Fused RMSNorm + Residual ¶ | ✗ | ✓ |
Table 2: Invariance properties of common inference operators († num_splits=1, ‡ CUDA 13.0+, § specific NCCL settings, ¶ vLLM/SGLang defaults).
Observation-2 (O2). Most GPU kernels use uniform, shape-consistent reductions: they apply the same reduction strategy to all elements within a given batch. Moreover, the strategy remains fixed for all batches of the same shape, changing only when the shape changes.
The simplest example is a GEMM kernel. For a given input matrix A of size M × K, a GEMM kernel reduces over the K dimension for each of the M rows using the same reduction order (say R). Moreover, it applies the same reduction order R to all input matrices of size M × K. We refer to such kernels as position-invariant. Position invariance implies that, with a fixed total batch size, an input element's output is independent of its position in the batch. Note that such guarantees do not hold for kernels that implement reductions via atomic operations. Fortunately, kernels used in the LLM forward pass do not use atomic reductions. Figure 7 shows an example of a position-invariant kernel, and Table 2 shows the invariance properties of common LLM operators.
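To make the distinction concrete, the following toy reduction picks its chunking from the batch shape, the way real kernels pick tilings and split-K factors from shape. It is a sketch: the chunk-size rule and the input values are illustrative, not any real kernel's policy.

```python
def chunked_sum(row, chunk):
    # Sum fixed-size chunks, then combine the chunk partials left to right.
    partials = [sum(row[i:i + chunk]) for i in range(0, len(row), chunk)]
    total = 0.0
    for p in partials:
        total += p
    return total

def toy_kernel(batch):
    # The reduction strategy depends only on the batch *shape*, never on values.
    chunk = 4 if len(batch) <= 4 else 2
    return [chunked_sum(row, chunk) for row in batch]

row = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]
pad = [0.0] * 8
# Position-invariant: with a fixed batch size, the row's result is bit-identical
# wherever it sits in the batch.
assert toy_kernel([row, pad])[0] == toy_kernel([pad, row])[1]
# Not batch-invariant: a different batch size changes the reduction order,
# and float rounding yields a different result for the same row.
print(toy_kernel([row, pad])[0], toy_kernel([row] + [pad] * 7)[0])  # 0.0 4.0
```

The 1e16 values are chosen so that adding 1.0 is absorbed by rounding: summing in chunks of four cancels the large terms to 0.0, while chunks of two preserve two of the small additions, giving 4.0 for the very same row.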
The motivation for batch-invariant computation stems from the fact that GPU kernels used in LLMs, while deterministic for a particular input, are not batch-invariant. We observe that position-invariance captures a strictly stronger property than determinism: determinism only requires that the same input produce the same output across runs, whereas position-invariance implies that the output of a given input remains consistent as long as the input size to the kernel remains the same. This allows us to reason about kernel behavior at the level of input shapes, rather than individual input values.
Observation-3 (O3). For deterministic inference, it is sufficient to ensure that a given token position goes through the same reduction strategy across all runs of a given request; reduction strategy for different token positions within or across sequences can be different.
Numerical differences in the output of a token arise from differences in how its own floating-point reductions are performed, not from the numerical values of other co-batched tokens. While batching affects how computations are scheduled and grouped, the computation for a given token position is functionally independent: it consumes the same inputs and executes the same sequence of operations. As a result, interactions across tokens occur only indirectly through execution choicesâsuch as which partial sums are reduced togetherânot through direct data dependencies. Consequently, as long as a token position is always reduced using the same strategy, its output remains consistent regardless of how other token positions are computed.
Observation-4 (O4). Current systems take an all-or-nothing approach: they either enforce determinism for every request or disable it entirely. Such a design is not a natural fit for LLM deployments.
This is because many LLM workloads neither require bit-wise reproducibility nor benefit from it. In fact, controlled stochasticity is often desirable, as it enhances the output diversity and creativity of LLMs [integrating_randomness_llm2024, diversity_bias_llm2025, det-inf-kills]. In contrast, requests such as evaluation runs, safety audits, or regression tests require bit-level reproducibility. These differing use cases mean that enforcing determinism for all requests is overkill.
It is also worth highlighting that this all-or-nothing behavior largely stems from the batch-invariant approach that ties determinism to the kernel design. Since all co-batched tokens go through the same set of kernels, determinism becomes a global property of the batch rather than being selective. While one could run different requests through separate deterministic and non-deterministic kernels, doing so would fragment batches, complicate scheduling, and likely hurt efficiency.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: KV Cache Workflow
### Overview
The image illustrates the workflow of a Key-Value (KV) cache during prefill, decode, and verification stages, culminating in a state where all tokens are accepted without rollback. The diagram shows how the KV cache evolves as tokens are processed, highlighting the transition from a deterministic request to the acceptance of individual tokens.
### Components/Axes
* **KV cache after prefill:** Represents the state of the KV cache after the initial prefill stage.
* **KV cache after decode:** Represents the state of the KV cache after the decode stage.
* **KV cache after accepting all tokens:** Represents the final state of the KV cache after all tokens have been accepted.
* **Prefill:** The initial stage where a deterministic request is processed.
* **Decode:** The stage where tokens are decoded and processed.
* **Verify:** The stage where tokens are verified.
* **No Rollback:** Indicates the final state where all tokens are accepted without the need for a rollback.
* **Deterministic request:** An initial request that is processed during the prefill stage.
* **Other requests:** Additional requests that are processed during the decode stage.
* **Accepted tokens:** Tokens that have been successfully verified and accepted.
* **Tokens:** T0, T1, T2, T3, T4 represent individual tokens. T0', T1', T2', T3' represent tokens after decode.
### Detailed Analysis
1. **Prefill Stage:**
* A "deterministic request" (blue rectangle) is processed.
* "Other requests" (grey rectangles) are shown below the initial request, indicating additional processing.
2. **Decode Stage:**
* Tokens T0', T1', T2', and T3' are introduced.
* The tokens are represented as yellow/green rectangles, with some portions being cross-hatched, indicating a state of partial processing or uncertainty.
3. **Verify Stage:**
* Tokens T0', T1', T2', and T3' are verified.
* Arrows point from the decoded tokens to corresponding verified tokens T1 (=T1'), T2 (=T2'), and T3 (=T3'), which are now green.
* Token T4 is also present, represented as cross-hatched.
* Checkmarks indicate that the tokens have been accepted.
4. **Final State (No Rollback):**
* The sequence shows tokens T0, T1, T2, T3, and T4.
* Tokens T0, T1, T2, and T3 are green, indicating they have been accepted.
* Token T4 is cross-hatched, indicating it has been accepted.
5. **KV Cache Evolution:**
* The KV cache transitions from a state after prefill (blue cylinder) to a state after decode (cylinder with blue and yellow/green sections) and finally to a state after accepting all tokens (cylinder with blue and green sections).
### Key Observations
* The diagram illustrates a sequential process of prefill, decode, and verify.
* The tokens transition from an initial state to a verified and accepted state.
* The KV cache reflects the state of token processing at each stage.
* The "No Rollback" state indicates a successful completion of the process.
### Interpretation
The diagram demonstrates the workflow of a KV cache in processing tokens. It shows how the cache evolves as tokens are decoded, verified, and accepted. The transition from a deterministic request to the acceptance of individual tokens highlights the dynamic nature of the KV cache. The "No Rollback" state signifies a successful operation, indicating that all tokens have been processed without errors or the need to revert to a previous state. The use of color and cross-hatching effectively communicates the state of each token at different stages of the process.
</details>
(a) DVR without rollbacks.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: KV Cache Workflow
### Overview
The image illustrates a workflow involving a Key-Value (KV) cache through stages of prefill, decode, verify, and rollback. It shows how the cache is populated, how requests are processed, and how the system handles verification and potential rollbacks.
### Components/Axes
* **Top Row:** Shows the state of the KV cache after each stage: prefill, decode, and verify-rollback. Each stage is represented by a cylinder.
* KV cache after prefill
* KV cache after decode
* KV cache after verify-rollback
* **Prefill:** Shows a "deterministic request" populating the cache, while "other requests" are also being processed.
* **Decode:** Shows tokens T0, T1', T2', and T3' being decoded.
* **Verify:** Shows the verification process, where tokens are compared. T1 is equal to T1', and T2 is not equal to T2'. Tokens T3 and T4 are also present.
* **Accepted/Rejected Tokens:** Shows the outcome of the verification process, with accepted tokens marked with a checkmark and rejected tokens marked with an "X".
* **Rollback:** Shows the state of the KV cache after a rollback, with the sequence and KV cache restored until the final accepted token (T2).
### Detailed Analysis
* **KV Cache States:**
* **Prefill:** The KV cache is partially filled with a light blue color.
* **Decode:** The KV cache is partially filled with a yellow color.
* **Verify-Rollback:** The KV cache is partially filled with a light green color.
* **Prefill Stage:**
* A "deterministic request" fills a horizontal bar with light blue.
* "Other requests" are represented by a gray block below the deterministic request.
* **Decode Stage:**
* Tokens T0, T1', T2', and T3' are represented by adjacent blocks.
* T0 is light green.
* T1' is light yellow.
* T2' is yellow.
* T3' is dark yellow.
* The gray block from the "other requests" in the prefill stage extends into this stage, with a cross-hatched pattern indicating a change.
* **Verify Stage:**
* Tokens T0, T1', T2', T3', and T4 are represented by vertical blocks.
* T0 is light green.
* T1' is light yellow.
* T2' is yellow with a cross-hatched pattern.
* T3' is light red.
* T4 is red.
* Arrows connect the decoded tokens to the verification tokens.
* **Accepted/Rejected Tokens:**
* T1 (=T1') is accepted (checkmark).
* T2 (!=T2') is accepted (checkmark).
* T3 and T4 are rejected (X marks).
* **Rollback Stage:**
* The sequence and KV cache are restored until the final accepted token (T2).
* The tokens T0, T1, and T2 are represented by adjacent blocks.
* T0 is light blue.
* T1 is light green.
* T2 is yellow with a cross-hatched pattern.
### Key Observations
* The diagram illustrates a process where requests are decoded, verified, and potentially rolled back based on token verification.
* The KV cache changes its state as it progresses through the workflow.
* The verification process determines which tokens are accepted and which are rejected.
* The rollback stage restores the system to a previous state based on the accepted tokens.
### Interpretation
The diagram depicts a mechanism for ensuring data integrity and consistency in a system utilizing a KV cache. The prefill stage initializes the cache, while the decode and verify stages process incoming requests and validate their authenticity or correctness. The rollback mechanism provides a way to revert to a known good state if verification fails, ensuring that the system does not operate on corrupted or invalid data. The use of tokens and their verification suggests a security or consistency protocol is in place. The diagram highlights the importance of verification in maintaining the integrity of the KV cache and the overall system.
</details>
(b) DVR with rollbacks.
Figure 8: An example of decode-verify-rollback. After generating a fixed number of tokens through regular decoding, LLM-42 verifies them in parallel via a separate forward pass. (a) All the tokens generated in the decode phase pass verification; LLM-42 accepts all these tokens along with the new verifier-generated token ($T_{4}$). (b) Some tokens do not match between decode and verification; LLM-42 accepts only the initial matching tokens (here, $T_{1}^{\prime}$) and the following verifier-generated token ($T_{2}$); all other tokens are recomputed. In both cases, the verifier replaces the KV cache entries generated by the decode phase with its own.
4 LLM-42
Since non-determinism in LLM inference comes from dynamic batching, disabling it would make inference deterministic. However, dynamic batching is arguably the most powerful performance optimization for LLM inference [orca, distserve2024, vllmsosp, sarathi2023]. Therefore, our goal is to enable determinism in the presence of dynamic batching. In this section, we introduce LLM-42, a speculative-decoding-inspired, scheduler-driven approach that achieves this goal.
4.1 Overall Design
LLM-42 exploits the observations presented in §3 as follows:
Leveraging O1. Tokens decoded from a consistent state are mostly consistent. Based on this observation, we repurpose a speculative-decoding-style mechanism to enforce determinism via a decode-verify-rollback (DVR) protocol. DVR optimistically decodes tokens using the fast path, and a verifier ensures determinism. Only tokens approved by the verifier are returned to the user, while the few that fail verification are discarded and recomputed. The key, however, is to ensure that the verifier's output itself is always consistent. We leverage O2 to achieve this.
Leveraging O2. Because GPU kernels rely on shape-consistent reductions, we make the verifier deterministic by always operating on a fixed number of tokens T (e.g., verifying T = 10 tokens at a time). The only corner case occurs at the end of a sequence, where fewer than T tokens may remain (for instance, when the 11th token is an end-of-sequence token). We handle this by padding with dummy tokens so that the verifier always processes exactly T tokens.
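The end-of-sequence padding rule can be sketched as follows. This is a sketch: `window` stands for the fixed verifier size T, and `pad_id` is a hypothetical dummy-token id.

```python
def make_verifier_window(tokens, window, pad_id):
    # Pad the candidate tokens so the verifier always processes exactly
    # `window` positions, keeping every reduction shape-consistent.
    assert len(tokens) <= window, "more candidates than the verifier window"
    return tokens + [pad_id] * (window - len(tokens))

print(make_verifier_window([17, 42], window=4, pad_id=0))  # [17, 42, 0, 0]
```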
Leveraging O3. Determinism only requires that each token position follow a consistent strategy across runs; different positions can follow different strategies. This observation lets us compute the prefill and decode phases using different strategies. Because prefill is massively parallel even within a single request, it can be made deterministic simply by avoiding arbitrary batching of prefill tokens, eliminating the need for a verifier in this phase. A verifier is required only for the tokens generated by the decode phase.
Leveraging O4. LLM-42 decouples determinism from token generation by moving it into a separate verification phase. This makes selective determinism straightforward: only requests that require deterministic inference incur verification, while all other traffic avoids it. We expose this control to users via a new API flag, is_deterministic=True|False, that allows them to explicitly request determinism on a per-request basis; the default is False.
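Client-side, the flag would simply ride along with the request; a hypothetical payload (every field name other than is_deterministic is illustrative, not a documented API):

```python
import json

# Only this request pays the verification cost; co-batched requests
# keep the fast non-deterministic path.
request = {
    "prompt": "Summarize the audit log.",
    "max_tokens": 512,
    "is_deterministic": True,  # opt in to verification; default is False
}
payload = json.dumps(request)
print(json.loads(payload)["is_deterministic"])  # True
```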
Figure 5 quantifies the performance benefit of selective determinism. When one out of 11 requests requires determinism, LLM-42 achieves a decode throughput of 911 tokens per second, which is 2.2× higher than SGLang's deterministic-mode throughput and within 3% of its non-deterministic-mode (best-case) throughput. We present a more detailed evaluation in §5.
4.2 Decode-verify-rollback (DVR)
DVR performs decoding optimistically by first generating tokens using high-throughput, non-deterministic execution and then correcting any inconsistencies through verification and recomputation. Rather than enforcing determinism upfront, it identifies inconsistencies in the generated sequence on the fly, and recomputes only those tokens that are not guaranteed to be consistent across all possible runs of a given request.
Figure 8 illustrates how DVR operates, assuming a verification window of four tokens. The blue request requires deterministic output, and its first output token $T_{0}$ is produced by the deterministic prefill phase, which avoids arbitrary inter-request batching. LLM-42 then generates candidate tokens $T_{1}^{\prime}$, $T_{2}^{\prime}$, and $T_{3}^{\prime}$ using the regular fast path with dynamic batching, where gray requests may be batched arbitrarily. The first input to the verifier must be consistent (here, $T_{0}$ is consistent since it comes from the prefill phase). The verifier replays the four tokens $T_{0}$ and $T_{1}^{\prime}$ through $T_{3}^{\prime}$ as input and produces output tokens $T_{1}$ through $T_{4}$. Verification has two possible outcomes: (1) all tokens pass verification, or (2) one or more tokens fail verification. We describe these two cases in detail below.
Case-1: When verification succeeds.
In the common case, verification reproduces the same tokens that the preceding decode iterations generated. For example, in Figure 8(a), $T_{1}=T_{1}^{\prime}$, $T_{2}=T_{2}^{\prime}$, and $T_{3}=T_{3}^{\prime}$. In this case, LLM-42 accepts all these tokens; in addition, it also accepts the token $T_{4}$, since $T_{4}$ was generated by the verifier from a consistent state and is therefore consistent.
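Both verification outcomes reduce to a longest-matching-prefix check. A minimal sketch, with token ids as integers; we assume, as in Figure 8, that the verifier emits exactly one more token than the candidate list it replays:

```python
def dvr_accept(candidates, verified):
    # `verified` replays the window and emits one extra next-token prediction.
    assert len(verified) == len(candidates) + 1
    j = 0
    while j < len(candidates) and candidates[j] == verified[j]:
        j += 1
    accepted = verified[:j + 1]          # matching prefix + next verifier token
    rolled_back = len(candidates) - j    # candidate tokens to recompute
    return accepted, rolled_back

print(dvr_accept([11, 22, 33], [11, 22, 33, 44]))  # ([11, 22, 33, 44], 0)
print(dvr_accept([11, 22, 33], [11, 99, 33, 44]))  # ([11, 99], 2)
```

The first call mirrors Figure 8(a): all candidates match, so the whole window plus the verifier's new token is committed. The second mirrors Figure 8(b): the mismatch at the second position commits only the matching prefix and the verifier's replacement token, rolling back the rest.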
<details>
<summary>x7.png Details</summary>

Line chart: per-token verification latency (ms) vs. number of tokens (16-512). Latency falls from roughly 0.76 ms at 16 tokens to about 0.04 ms at 512, with the curve flattening beyond 128-256 tokens.
</details>
(a) Verification latency
<details>
<summary>x8.png Details</summary>

CDF of rollbacks per verification window for window sizes 32, 64, 128, and 256. For all sizes, roughly 57-59% of windows incur zero rollbacks; smaller windows (32, 64) rise smoothly toward 1.0, while larger windows (128, 256) climb in coarser, step-like increments.
</details>
(b) Rollbacks
<details>
<summary>x9.png Details</summary>

CDF of recomputed tokens per request for window sizes 32, 64, 128, and 256. Smaller windows concentrate near zero (e.g., CDF of about 0.95 by roughly 100 tokens at size 32), whereas window size 256 has a long tail, reaching a CDF of about 0.9 only around 750 tokens.
</details>
(c) CDF of recomputed tokens
<details>
<summary>x10.png Details</summary>

Bar chart: total recompute cost (%) vs. window size: 6.81% at 32, 12.42% at 64, 24.92% at 128, and 46.41% at 256.
</details>
(d) Total recomputation overhead
Figure 9: The cost of verification and recomputation with varying window sizes.
Case-2: When verification fails.
Occasionally, a verification pass disagrees with one or more decoded tokens; e.g., only $T_{1}^{\prime}$ matches the verifier's output in 8(b). In this case, LLM-42 commits the output only up to the last matching token ($T_{1}^{\prime}$); all subsequent tokens ($T_{2}^{\prime}$ and $T_{3}^{\prime}$) are rejected. The verifier-generated token immediately after the last matching position ($T_{2}$) is also accepted, and the KV cache is truncated at this position. The next decode iteration resumes from this repaired, consistent state.
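In both cases, the commit rule reduces to a longest-matching-prefix check over the window. The following is a minimal sketch of that rule; the function name and signature are our illustration, not LLM-42's API:

```python
def commit_window(candidates, verified):
    """Compare fast-path candidate tokens against the verifier's replay.

    candidates: tokens T1'..T{n-1}' from non-deterministic decode iterations.
    verified:   tokens T1..Tn from the fixed-shape verification pass
                (one extra token, produced from the last consistent state).
    Returns the tokens committed as deterministic output.
    """
    committed = []
    for cand, ver in zip(candidates, verified):
        if cand != ver:
            break                  # first mismatch: reject the rest
        committed.append(cand)     # candidate confirmed consistent
    # The verifier's token at the first unmatched position was generated
    # from a consistent state, so it is always committed as well; this
    # also guarantees at least one token of forward progress.
    committed.append(verified[len(committed)])
    return committed
```

When every candidate matches (Case-1), all candidates plus the verifier's extra token are committed; on a mismatch (Case-2), the output is truncated at the last matching token and repaired with the verifier's next token.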
Making KV cache consistent.
During the decode phase, the KV cache is populated by fast-path iterations that execute under dynamic batching. Consequently, even when token verification succeeds, the KV cache corresponding to the verified window may still be inconsistent, since it was produced by non-deterministic execution. This inconsistency could affect tokens generated in future iterations despite the current window being verified. Token-level verification alone is therefore insufficient without repairing the KV cache. To address this, we overwrite the KV cache entries produced during decoding with the corresponding entries from the verification pass. This ensures that both the emitted tokens and the KV cache state are consistent for subsequent decode iterations, eliminating downstream divergence.
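The repair amounts to overwriting a slice of the cache. A minimal NumPy sketch, where the `[layers, positions, width]` layout and all names are our assumptions for illustration:

```python
import numpy as np

def repair_kv_window(kv_cache, verifier_kv, window_start):
    """Overwrite the KV entries that fast-path decoding wrote for the
    verified window with the entries recomputed by the fixed-shape
    verification pass, so subsequent decode steps attend over a
    consistent cache."""
    window = verifier_kv.shape[1]
    kv_cache[:, window_start:window_start + window, :] = verifier_kv
    return kv_cache
```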
Guaranteed forward progress.
Note that each verification pass produces at least one new consistent output token as shown in Figure 8. As a result, DVR guarantees forward progress even in contrived worst-case scenarios that might trigger rollbacks for every optimistically generated token. We have not observed any such scenario in our experiments.
Overall, DVR follows the high-level structure of speculative decoding, but differs in several important ways, as discussed in Table 1. Under DVR, decoded tokens result from a faithful execution of the model's forward pass, with deviations arising only from floating-point rounding. In contrast, speculative execution techniques generate candidate tokens using modified forward computations, for example by running a smaller or distilled draft model [specdecoding-icml2023, specinfer-2024] or by truncating or pruning attention/context [medusa, fu2023lookahead]. Because these approximations change the computation itself, speculation depth is typically limited to a few (2-8) tokens [medusa, fu2023lookahead, specdecoding-icml2023, specinfer-2024]. By contrast, leveraging O1, DVR can safely speculate over longer (32-64 token) sequences. For the same reason, verification in DVR succeeds with high probability, in contrast to the much lower acceptance rates observed in existing speculative decoding schemes. Finally, DVR performs verification using the same model, whereas most speculative decoding approaches require a separate model for verification.
4.3 Grouped Verification
The efficiency of DVR depends on the size of the verification window and involves a trade-off: smaller windows incur higher verification cost but low recomputation overhead, whereas larger windows reduce verification cost at the expense of increased recomputation. We empirically characterize this trade-off by profiling Llama-3.1-8B-Instruct on an H100-PCIe GPU. We measure the per-token verification cost based on the latency of the verification forward pass. To quantify the recomputation cost, we execute 4096 requests from the ShareGPT dataset in an online setting at 12 queries per second. For each window size, we measure the total number of tokens decoded versus the number of tokens returned to the user; the difference between the two is the recomputation overhead.
9(a) shows the per-token verification cost as a function of the verification window size. For smaller windows, the verification pass is memory-bound: each token performs very little computation, resulting in low arithmetic intensity and poor hardware utilization. Consequently, the verification cost is high: up to 0.75 ms per token. As the verification window grows, the kernel transitions to being compute-bound, and the per-token cost drops sharply, reaching 0.05 ms at a window size of 512, a reduction of $15\times$.
9(b) shows that more than half of the requests complete without any rollback (i.e., all their output tokens pass verification), while a small fraction of requests incur frequent rollbacks. Moreover, rollbacks become more common as the window size increases. For example, with a window size of 256, about 40% of requests observe a rollback ratio of 50% or higher, indicating that at least half of the verification steps detect one or more mismatched tokens. In contrast, with a window size of 32, only about 5% of requests reach this level. The number of recomputed tokens per request follows the same trend, as shown in 9(c). Further, 9(d) shows the resulting total recomputation overhead. Note that a verification failure requires recomputing all tokens following the mismatched position within the current window. As a result, larger windows incur higher recomputation overhead on average. Concretely, recomputation overhead is 6.81% at a window size of 32 and increases roughly linearly, reaching 46.41% at a window size of 256.
We propose grouped verification to address the inherent trade-off between the cost of verification and recomputation. Instead of verifying a large window of a single request (256 tokens), we verify small, fixed-size windows of multiple requests together in a single pass (e.g., 8 requests, 32 tokens each). This way each request retains the rollback properties of a small window, while the verification pass operates on a large effective batch, achieving high utilization and low verification cost.
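The grouping step itself is simple batching. A minimal sketch with a list-based queue; the names and the `(request_id, tokens)` representation are our illustration, not LLM-42's scheduler:

```python
def build_verification_batch(pending, group_size=8, window=32):
    """Collect small fixed-size verification windows from multiple
    requests into one pass, so each request keeps the rollback behavior
    of a small window while the kernel sees a large effective batch.

    pending: list of (request_id, candidate_tokens) awaiting verification.
    Returns up to group_size entries of (request_id, first `window` tokens).
    """
    batch = []
    for request_id, tokens in pending:
        if len(batch) == group_size:
            break                      # one pass covers group_size windows
        if len(tokens) >= window:      # only full windows are verified
            batch.append((request_id, tokens[:window]))
    return batch
```

With the defaults above, one verification pass covers 8 × 32 = 256 tokens, matching the utilization of a large window while each request rolls back at most 32 tokens.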
4.4 Discussion
Guaranteeing determinism requires every operator in the inference pipeline to behave consistently. This section highlights a few subtle sources of inconsistency in commonly used operators. We do not introduce new techniques here; instead, we adopt established approaches to enforce consistent behavior and summarize them for completeness.
Attention.
Attention involves reductions at multiple steps: along the sequence dimension (e.g., in FlashDecoding-style sequence parallelism [flashdecoding]) and inside the softmax computation. To make attention consistent, we use the FlashAttention-3 kernel and set the number of KV splits to one in the verification pass; fast-path decode iterations run as usual. Some further performance optimizations are possible, following the SGLang implementation (https://lmsys.org/blog/2025-09-22-sglang-deterministic/).
Communication collectives.
Classical ring-based AllReduce reduces tokens in different orders depending on their position in the batch. In contrast, multimem-based AllReduce, introduced in CUDA 13.0, follows consistent reduction schedules [nvls-deterministic] and should be preferred whenever supported. On older platforms where multimem/NVLS is unavailable, one can use tree-based AllReduce with fixed NCCL configuration (e.g., setting num_channels to one and using a fixed protocol) to achieve determinism, at the expense of higher communication cost than ring-based AllReduce.
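As a sketch, the fixed tree-based fallback can be requested through standard NCCL environment variables, which must be set before NCCL initializes; the specific values below are an illustrative assumption rather than LLM-42's exact configuration:

```python
import os

# Illustrative fallback for platforms without multimem/NVLS AllReduce:
# pin NCCL to a single channel and a fixed algorithm/protocol so the
# reduction schedule does not vary across runs. These are standard NCCL
# environment variables; the chosen values are our assumption, trading
# communication performance for determinism.
os.environ["NCCL_ALGO"] = "Tree"        # fixed reduction tree, not ring
os.environ["NCCL_PROTO"] = "Simple"     # pin one wire protocol
os.environ["NCCL_MIN_NCHANNELS"] = "1"  # a single channel implies a
os.environ["NCCL_MAX_NCHANNELS"] = "1"  # single reduction schedule
```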
Sampling.
We adopt the sampler from SGLang. When the temperature is zero, it selects the token with the highest logit (argmax); if there are multiple maximal values, it returns the index of the first. For non-greedy sampling, however, the process involves randomness, making outputs sensitive to how random numbers are generated and consumed. To address this, SGLang introduces a new sampling function, multinomial_with_seed, which replaces torch.multinomial, an inherently non-deterministic operator under batched execution. This function perturbs logits with Gumbel noise generated from a seeded hash function, ensuring that the same input-seed pair always produces the same sample [SGLangTeam2025].
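The construction can be modeled with the Gumbel-max trick. The following is a simplified NumPy sketch of the idea; the hash-based seeding and the function signature are our assumptions, not SGLang's implementation:

```python
import numpy as np

def seeded_sample(logits, temperature, seed, step):
    """Sample one token deterministically for a given (seed, step) pair.

    Gumbel-max trick: argmax(logits/T + Gumbel noise) is distributed
    like softmax sampling, but with noise derived from a seeded
    generator the result is independent of batch composition.
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))  # greedy: first maximal index
    # Hashes of int tuples are stable across Python runs (unlike string
    # hashes), so (seed, step) maps to a reproducible generator state.
    rng = np.random.default_rng(abs(hash((seed, step))))
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    return int(np.argmax(logits / temperature + gumbel))
```

Because the noise depends only on (seed, step) and not on which other requests share the batch, the same request replays to the same token sequence.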
Overall, enforcing determinism across the entire inference pipeline requires careful configuration of the attention kernel and selecting an appropriate AllReduce implementation. The sampling module does require a new implementation, but this is a one-time cost since the same sampler is used across all models. In contrast, the bulk of the performance and engineering effort in LLM systems lies in GEMM kernels, which LLM-42 reuses entirely, along with RMSNorm and FusedMoE kernels.
5 Evaluation
Our evaluation answers the following questions:
| Dataset | Test Split Size | Input Length | Output Length |
| --- | --- | --- | --- |
| ShareGPT | 92812 | 304 / 136 / 491 | 192 / 118 / 212 |
| ArXiv | 5941 | 7017 / 6435 / 3479 | 198 / 191 / 74 |
Table 3: Datasets and their input/output context lengths.
<details>
<summary>x11.png Details</summary>

Bar chart: offline throughput (tokens/s) for SGLang non-deterministic, SGLang deterministic, and LLM-42 at 2%, 5%, 10%, 20%, 50%, and 100% deterministic traffic, across ArXiv, ShareGPT, and six fixed input/output configurations. SGLang deterministic achieves roughly 0.64-0.76x the non-deterministic throughput; LLM-42 stays close to the non-deterministic baseline at low deterministic ratios and degrades gradually, reaching roughly 0.72-0.86x at 100%.
</details>
Figure 10: Throughput in offline inference. SGLang has only two modes: all requests run in either deterministic or non-deterministic mode whereas LLM-42 supports selective determinism (% values in the legend reflect the fraction of traffic that requires deterministic output).
- How does LLM-42 compare against the baseline's deterministic and non-deterministic modes?
- How does LLM-42 perform for different mixes of deterministic and non-deterministic traffic?
- How do configuration parameters affect the performance of grouped verification in LLM-42?
Models and environment: We evaluate LLM-42 with the Llama-3.1-8B-Instruct model. It has 32 query heads, 8 KV heads, and 32 layers. Experiments were conducted on a system equipped with four NVIDIA H100 PCIe GPUs with 80 GB HBM3 memory and 114 streaming multiprocessors per GPU. The host CPU has 64 physical cores (128 hardware threads) and approximately 1.65 TB of DRAM.
Workloads: We evaluate on synthetic inputs with varying context lengths, following common practice in prior work [nanoflow2024]. We additionally benchmark on the widely used ShareGPT [huggingfacesharegpt] and ArXiv [arxiv-summarization] datasets, whose characteristics are summarized in Table 3. All experiments use the meta-llama/Llama-3.1-8B-Instruct tokenizer. To evaluate LLM-42 under different workload conditions, we vary the fraction of requests that require deterministic output between 2%, 5%, 10%, 20%, 50%, and 100%. While the higher deterministic ratios serve as stress tests, we expect that in practical deployments only a small fraction of requests will require determinism. Finally, note that the deterministic request ratio applies only to LLM-42; approaches based on batch-invariant computation support only global determinism.
Serving system baselines: We implement LLM-42 on top of the SGLang serving framework, version 0.5.3rc0, and compare performance against SGLang with deterministic execution enabled. To understand the upper limit on performance, we also compare with the non-deterministic version of SGLang. We refer to these two baselines as SGLang-Deterministic and SGLang-Non-Deterministic. We use the FA-3 attention kernel in all experiments; we set num_splits = 1 in the verification step of LLM-42 and use the default settings for the decode phases.
Metrics: In offline inference, we evaluate throughput as the number of tokens processed per second. For online inference, we evaluate end-to-end request execution latencies and time-to-first-token. To better understand the overhead of our approach, we also report the number of rollbacks and recomputed tokens across different configurations of LLM-42.
5.1 Offline Inference
For offline inference, we evaluate eight workload configurations: two traces from the ArXiv and ShareGPT datasets and six other configurations with different input and output lengths. Each configuration executes 4096 requests to completion. Figure 10 reports the resulting throughput across systems measured in tokens per second; we summarize the main observations below.
Enabling deterministic inference in SGLang incurs a substantial throughput penalty, ranging from 24% (in=2048, out=512) to 36% (ShareGPT). This slowdown stems from the use of batch-invariant kernels, which are consistently slower than the standard optimized kernels, as shown in Figure 4. In contrast, LLM-42 mostly uses regular optimized kernels and therefore achieves significantly higher throughput. Even in its worst case, when 100% of requests require deterministic output, LLM-42 outperforms SGLang-Deterministic in all but one setting (ShareGPT), where it is only 6% slower.
| Dataset / Config | Deterministic Ratio: 2% | 5% | 10% | 20% | 50% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| Total number of rollbacks | | | | | | |
| ArXiv | 70 | 170 | 372 | 697 | 1733 | 3351 |
| ShareGPT | 4 | 5 | 10 | 15 | 44 | 96 |
| in=512, out=256 | 0 | 0 | 0 | 0 | 0 | 0 |
| in=1024, out=256 | 0 | 1 | 5 | 5 | 9 | 24 |
| in=1024, out=512 | 48 | 106 | 225 | 386 | 792 | 1536 |
| in=2048, out=256 | 7 | 22 | 79 | 134 | 378 | 724 |
| in=2048, out=512 | 0 | 0 | 0 | 2 | 2 | 4 |
| in=4096, out=512 | 31 | 79 | 98 | 220 | 538 | 1087 |
| Total number of recomputed tokens | | | | | | |
| ArXiv | 1805 | 4414 | 8737 | 13198 | 37687 | 89248 (10.97%) |
| ShareGPT | 139 | 158 | 164 | 396 | 1185 | 2691 (0.32%) |
| in=512, out=256 | 0 | 0 | 0 | 0 | 0 | 0 (0%) |
| in=1024, out=256 | 0 | 44 | 151 | 151 | 288 | 776 (0.07%) |
| in=1024, out=512 | 1724 | 3272 | 6615 | 12392 | 25967 | 50454 (2.41%) |
| in=2048, out=256 | 268 | 767 | 2470 | 4080 | 12173 | 21772 (2.08%) |
| in=2048, out=512 | 0 | 0 | 0 | 81 | 81 | 101 (<0.01%) |
| in=4096, out=512 | 1134 | 2961 | 3598 | 7208 | 17161 | 32430 (1.55%) |
Table 4: Rollback and recomputation statistics (grouped verification over 8 requests, 64 tokens each). Recompute cost is shown only for the column with 100% deterministic traffic.
Crucially, LLM-42's throughput improves monotonically as the fraction of deterministic requests decreases. On ArXiv, LLM-42 operates within 8%, 2%, 1%, and 1% of the best-case throughput (SGLang-Non-Deterministic) at 5%, 10%, 20%, and 50% deterministic traffic, respectively. This behavior is expected: LLM-42 imposes no computation overhead on requests that do not require determinism. As a result, LLM-42 is up to 41% faster than SGLang-Deterministic in these regimes. This trend holds across all workload configurations; for example, at 10% deterministic traffic, LLM-42 is 33% faster than SGLang-Deterministic on ShareGPT and up to 48% faster (in=2048, out=256) on other workloads.
Table 4 reports two complementary metrics that characterize the overhead of enforcing determinism in LLM-42: (1) the total number of rollbacks aggregated over all 4096 requests, and (2) the total number of recomputed tokens incurred due to these rollbacks. We find that some configurations incur zero rollbacks even under non-trivial deterministic ratios, and the average case remains low across datasets and inputâoutput lengths. Even in the worst case, ArXiv at 100% deterministic traffic, the system triggers 3351 rollbacks over 4096 requests, i.e., fewer than one rollback per request on average. A similar trend holds for recomputation overhead: the recomputed token fraction is consistently small, with the worst case at 100% deterministic traffic reaching at most 10.97%, while the average case across datasets and configurations is substantially lower. Overall, these results indicate that both rollback frequency and recomputation cost are modest in practice.
<details>
<summary>x12.png Details</summary>

CDF of end-to-end request latency (ms) for SGLang non-deterministic, SGLang deterministic, and LLM-42 at 2-100% deterministic traffic. SGLang non-deterministic completes all requests by roughly 10 s; LLM-42 at 2-20% and SGLang deterministic finish between roughly 12 and 15 s; LLM-42 at 50% reaches about 17 s, and LLM-42 at 100% has the longest tail, around 60 s.
</details>
(a) QPS=12
<details>
<summary>x13.png Details</summary>

CDF of request E2E latency (ms) at QPS=14 for the same systems. SGLang non-deterministic is lowest (CDF 1 by roughly 5,000 ms), followed closely by LLM-42 @2% and @5%; SGLang deterministic reaches CDF 1 by roughly 15,000 ms; LLM-42 @10% and @20% by roughly 20,000 ms; LLM-42 @50% and @100% have the longest tails, up to roughly 40,000 ms.
</details>
(b) QPS=14
<details>
<summary>x14.png Details</summary>

CDF of request E2E latency (ms) at QPS=16. SGLang non-deterministic reaches CDF 1 by roughly 10,000 ms and SGLang deterministic by roughly 15,000 ms; LLM-42 latency grows with the deterministic-traffic fraction, from roughly 20,000 ms at @2% to roughly 60,000 ms at @100%. All curves are fairly steep, indicating relatively consistent latency within each configuration.
</details>
(c) QPS=16
<details>
<summary>x15.png Details</summary>

CDF of request E2E latency (ms) at QPS=18. Both SGLang variants reach CDF 1 by roughly 10,000 ms; LLM-42 latency grows with the deterministic-traffic fraction, from roughly 20,000 ms at @2% to roughly 80,000 ms at @100%.
</details>
(d) QPS=18
Figure 11: CDF of request end-to-end latency in online inference with varying load for the ShareGPT dataset.
5.2 Online Inference
| QPS | Percentile | SGLang non-det. | SGLang det. | LLM-42 @2% | LLM-42 @5% | LLM-42 @10% | LLM-42 @20% | LLM-42 @50% | LLM-42 @100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | P50 | 27.0 | 53.4 | 27.4 | 27.6 | 27.9 | 29.3 | 38.5 | 45.4 |
| | P75 | 35.4 | 69.8 | 35.8 | 36.1 | 37.1 | 41.1 | 52.0 | 58.0 |
| | P90 | 50.3 | 107.8 | 52.5 | 54.1 | 54.0 | 57.8 | 68.1 | 73.9 |
| 14 | P50 | 27.9 | 60.1 | 27.9 | 28.2 | 28.9 | 31.1 | 42.5 | 48.2 |
| | P75 | 37.0 | 79.9 | 37.5 | 37.8 | 39.5 | 45.1 | 57.0 | 61.6 |
| | P90 | 55.8 | 127.0 | 57.0 | 56.5 | 58.6 | 63.8 | 75.8 | 82.0 |
| 16 | P50 | 28.8 | 62.5 | 28.8 | 29.2 | 29.8 | 35.2 | 46.1 | 51.2 |
| | P75 | 39.0 | 86.0 | 39.7 | 40.5 | 41.6 | 51.2 | 61.0 | 65.9 |
| | P90 | 60.5 | 138.6 | 61.7 | 60.9 | 63.3 | 74.0 | 83.6 | 87.8 |
| 18 | P50 | 29.8 | 76.2 | 30.2 | 30.6 | 32.2 | 40.7 | 50.7 | 57.1 |
| | P75 | 41.4 | 105.6 | 42.7 | 43.3 | 46.3 | 57.5 | 67.3 | 75.4 |
| | P90 | 65.2 | 171.6 | 65.9 | 67.6 | 70.2 | 79.3 | 90.5 | 101.2 |
Table 5: Time-to-first-token (TTFT) latency (ms) with varying load for the ShareGPT dataset.
Figure 11 reports the CDF of end-to-end latency for online inference under increasing load on the ShareGPT dataset, with each experiment running 4096 requests. Across all QPS (queries per second) values, SGLang-Deterministic exhibits a pronounced rightward shift with a long tail, reflecting substantially higher median and tail latencies. For example, at 12 QPS, SGLang-Deterministic's median (P50) latency is 4.64 seconds with a P99 of 28 seconds, compared to 2.15 seconds (P50) and 13.2 seconds (P99) for SGLang-Non-Deterministic. This gap widens further under higher load: at 18 QPS, SGLang-Deterministic reaches a P50 latency of 10.6 seconds and a P99 latency of 71.1 seconds, whereas SGLang-Non-Deterministic remains at 2.84 and 17.4 seconds, respectively. SGLang-Non-Deterministic consistently achieves the lowest latency and thus serves as a practical lower bound. In contrast, LLM-42 closely tracks the non-deterministic baseline, with only modest increases in latency as the fraction of deterministic traffic rises from 2% to 100%. At 12 QPS, LLM-42 at 2% deterministic traffic incurs only a 3% increase in median latency over SGLang-Non-Deterministic (2.21 vs. 2.15 seconds), and even at 50% deterministic traffic, its P99 latency is comparable to that of SGLang-Deterministic (at low load) or better (at higher load). This degradation is smooth and monotonic across all loads: at the highest QPS, LLM-42 maintains significantly tighter CDFs and substantially lower tail latency than SGLang-Deterministic. Only at lower QPS, and only when most requests require deterministic output (100%), does LLM-42 exhibit higher latency than the deterministic baseline. This behavior stems from two implementation artifacts: (1) verification currently induces a global pause that temporarily stalls all in-flight requests, and (2) prefill is not batched in our current prototype, reducing efficiency for short input prompts. We plan to address both as part of future work.
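The percentile figures quoted above are read off empirical latency CDFs like those in Figure 11. As a minimal, self-contained sketch (the sample values below are illustrative, not the paper's measurements), such a CDF and its percentile points can be computed from raw per-request latencies as follows:

```python
import math

def empirical_cdf(latencies_ms):
    """Return (x, y) points of the empirical CDF: y[i] = P(latency <= x[i])."""
    xs = sorted(latencies_ms)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def percentile(latencies_ms, p):
    """Nearest-rank p-th percentile (e.g. p=50 for P50, p=99 for P99)."""
    xs = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(xs)))
    return xs[rank - 1]

# Illustrative request latencies in ms -- NOT data from the paper.
sample = [2150, 2210, 2840, 3000, 4640, 5000, 10600, 13200, 17400, 28000]
x, y = empirical_cdf(sample)
p50, p99 = percentile(sample, 50), percentile(sample, 99)
```

A steeper CDF (y rising quickly over a narrow x range) corresponds to tightly clustered latencies, which is why the deterministic baseline's long right tail shows up directly as elevated P99 values.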
Across all QPS levels, time-to-first-token (TTFT) latency in LLM-42 also increases monotonically with the fraction of deterministic traffic, with modest overhead at low ratios (2–10%) and higher overhead once deterministic traffic exceeds 20–50%. However, even when all traffic is deterministic, LLM-42 still provides much lower tail TTFT than SGLang-Deterministic: at QPS 18, the P90 TTFT of LLM-42 is 101.2 milliseconds vs. 171.6 milliseconds for SGLang-Deterministic.
5.3 Ablation Study
<details>
<summary>x16.png Details</summary>

Heatmap of P99 E2E latency (s) over verification batch size (y-axis: 1–32) and per-request window size (x-axis: 16–512); untested combinations are left blank, giving a triangular layout:

| Batch Size | WS 16 | WS 32 | WS 64 | WS 128 | WS 256 | WS 512 |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 46.36 | | | | | |
| 16 | 34.30 | 45.90 | | | | |
| 8 | 36.94 | 34.18 | 46.17 | | | |
| 4 | 97.83 | 38.57 | 35.47 | 54.03 | | |
| 2 | 302.09 | 113.21 | 44.15 | 43.06 | 65.31 | |
| 1 | 615.86 | 316.14 | 155.69 | 56.18 | 65.59 | 99.94 |

The highest latency (615.86 s) occurs at batch size 1 with window size 16; the lowest values (34–35 s) occur where batch size times window size totals 256 tokens.
</details>
(a) P99 latency
<details>
<summary>x17.png Details</summary>

Heatmap of recompute cost (%) over the same batch-size/window-size grid:

| Batch Size | WS 16 | WS 32 | WS 64 | WS 128 | WS 256 | WS 512 |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 2.82 | | | | | |
| 16 | 3.09 | 6.40 | | | | |
| 8 | 3.11 | 6.49 | 13.83 | | | |
| 4 | 2.97 | 6.47 | 14.09 | 27.85 | | |
| 2 | 3.31 | 6.03 | 13.86 | 28.96 | 42.28 | |
| 1 | 3.44 | 6.81 | 12.42 | 24.92 | 46.41 | 42.33 |

Recompute cost grows primarily with window size (from roughly 3% at window 16 to over 40% at windows 256–512) and is largely insensitive to batch size.
</details>
(b) Recompute cost
Figure 12: The effect of different verification strategies on end-to-end latency (left) and recompute cost (right). Batch size denotes the number of requests verified together.
Grouped verification helps LLM-42 reduce both verification and recomputation cost. It is parameterized by (1) the verification window size per request and (2) the number of requests verified together in a single pass. To isolate the impact of each parameter, we run an ablation on the ShareGPT dataset, varying the per-request window size from 16 to 512 tokens and the number of requests verified together from one to 32. For each configuration, we execute 4096 requests in an online setting at 12 QPS; all requests require determinism. Figure 12 summarizes the results: 12(a) reports P99 request latency, while 12(b) shows recomputation overhead.
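The two parameters can be made concrete with a small scheduling sketch. The names and the greedy packing policy below are our own illustration of the idea, not LLM-42's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class VerificationGroup:
    """One joint verification pass: up to `batch_size` requests,
    each replaying `window` candidate tokens under the fixed-shape
    reduction schedule."""
    window: int       # per-request verification window (tokens)
    batch_size: int   # number of requests verified together
    requests: list = field(default_factory=list)

def plan_verification(pending, window, batch_size):
    """Greedily pack requests that have accumulated at least `window`
    unverified tokens into groups of `batch_size` (illustrative policy)."""
    ready = [r for r in pending if r["unverified"] >= window]
    return [
        VerificationGroup(window, batch_size, ready[i:i + batch_size])
        for i in range(0, len(ready), batch_size)
    ]

# Example: 5 in-flight requests, window=32, verify 2 requests per pass.
pending = [{"id": k, "unverified": n}
           for k, n in enumerate([40, 10, 35, 64, 32])]
groups = plan_verification(pending, window=32, batch_size=2)
```

Here request 1 (only 10 unverified tokens) is skipped until its window fills, while the other four requests are packed into two joint passes, so each pass replays a fixed total of `window * batch_size` tokens.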
Without grouped verification (batch size 1, first row), latency exhibits a non-monotonic dependence on window size. Increasing the window initially reduces latency, but latency rises again when the window becomes too large. Concretely, P99 latency drops from 615.86 seconds at a window size of 16 to 56.18 seconds at 128, then increases to 99.94 seconds at 512. This behavior reflects the fundamental trade-off between verification and recomputation: smaller windows incur higher verification overhead, while larger windows amplify recomputation cost. For example, recomputation cost is 42.33% at window size 512, compared to only 3.44% at window size 16. Without grouped verification, the best performance occurs at a window size of 128, where the latency is 56.18 seconds.
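The non-monotonic trade-off can be checked directly against the batch-size-1 row of Figure 12:

```python
# Batch-size-1 row of Figure 12: per-request window size, P99 E2E
# latency (s), and recompute cost (%).
windows = [16, 32, 64, 128, 256, 512]
p99_s   = [615.86, 316.14, 155.69, 56.18, 65.59, 99.94]
recomp  = [3.44, 6.81, 12.42, 24.92, 46.41, 42.33]

# Smaller windows -> more verification passes -> higher latency;
# larger windows -> more tokens rolled back on a mismatch -> higher
# recompute cost. The latency optimum sits in between.
best_window, best_p99 = min(zip(windows, p99_s), key=lambda wl: wl[1])
```

Running this confirms the sweet spot at window size 128, even though recompute cost by itself keeps growing well past that point.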
Grouped verification substantially lowers end-to-end latency. Even batching just two requests reduces P99 latency to 43.06 seconds (best case in the second row from the bottom). Increasing the batch size yields further gains, with the best overall configurations verifying a total of 256 tokens per step (distributed across 16, 8, or 4 requests), all achieving P99 latencies of 34–35 seconds.
It is worth noting that the recomputation cost in the offline experiments (typically below 2%, as shown in Table 4) is lower than in the online inference setting (typically more than 6%, as shown in Figure 12(b)). This difference arises because in offline inference all requests are available at the start, and the system tries to maximally utilize memory capacity, resulting in more stable batch sizes. In contrast, online inference experiences significantly greater batch-size variability due to fluctuating load and varying request arrival and departure times; naturally, higher recomputation rates correlate with this increased batch-size variability.
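One natural way to read the recompute-cost metric, sketched below, is as the share of fast-path tokens that fail verification and must be decoded again. This is our own formulation for illustration; the paper does not spell out the formula, so treat the function as an assumption rather than LLM-42's actual accounting:

```python
def recompute_cost_pct(fast_path_tokens, rolled_back_tokens):
    """Recompute cost (%): fraction of tokens emitted on the
    non-deterministic fast path that are rolled back by the verifier
    and re-decoded. Hypothetical definition, not code from LLM-42."""
    if fast_path_tokens == 0:
        return 0.0
    return 100.0 * rolled_back_tokens / fast_path_tokens

# Under this reading, an offline run rolling back 2,000 of 100,000
# fast-path tokens has a 2% cost, vs. 6% for 6,000 rollbacks online.
offline_cost = recompute_cost_pct(100_000, 2_000)
online_cost = recompute_cost_pct(100_000, 6_000)
```

The intuition matches the observation above: the more batch sizes fluctuate between the fast path and the verification replay, the more candidate tokens diverge and get rolled back, driving this ratio up.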
Discussion
Evaluation across models and multiple GPUs: While we report performance results only for Llama-3.1-8B-Instruct in this paper, we have verified the correctness of our approach across multiple additional models, including Qwen-4B-Instruct-2507, Qwen3-14B, and Qwen3-30B-A3B-Instruct-2507, and across 1–4 GPUs with tensor parallelism. Our current experimental setup does not support multimem/NVLS and is limited to pairwise NVLink connectivity; consequently, fairly evaluating multi-GPU performance would require a different platform, which we leave to future work. Supporting these models required no additional code changes in LLM-42, which highlights the simplicity of our approach.
Limitations: In our current prototype, the verification pass introduces latency overhead for all requests, even when determinism is enabled selectively. This overhead could be mitigated by confining verification to a subset of GPU SMs or by deferring verification batches, for example by prioritizing regular prefills and decodes over verification. A second limitation is that prefill and decode use different reduction strategies in LLM-42, so the system is not prefill-decode invariant. As a result, LLM-42 currently does not support sharing prefix caches across multiple turns of the same request or sharing across requests. Our implementation also does not currently integrate with speculative-decoding-based LLM inference. Addressing these limitations is interesting future work.
6 Related Work
Recent advances in LLM inference systems have focused on optimizing throughput, latency, and resource utilization [orca, vllmsosp, sarathiserve2024, tetriinfer, distserve2024, vattention2024, dynamollm2024, pod-attn]. Orca [orca] pioneered continuous batching, enabling low-latency serving by dynamically scheduling requests at the granularity of individual iterations. vLLM [vllmsosp] introduced PagedAttention, a memory management abstraction that allows higher batch sizes and reuse of KV cache across requests, dramatically improving system throughput. Sarathi-Serve [sarathiserve2024] further improved the throughput-latency trade-off through chunked prefills to better exploit GPU parallelism. DistServe [distserve2024] and Splitwise [patel2023splitwise] proposed disaggregated inference architectures that separate prefill and decode workloads across specialized servers, mitigating interference. Complementary lines of work such as speculative decoding [leviathan2022fast, chen2023accelerating, mamou2024dynamic] and FlexGen [flexgen] pursue orthogonal optimizations by reducing the number of autoregressive steps or offloading memory to CPU and SSDs. While all these systems achieve impressive performance gains, they typically assume non-deterministic execution, leaving the trade-offs between determinism and efficiency unexplored.
Enabling determinism in LLM inference has recently drawn increasing attention. Song et al. [song2024greedy] argued that such non-determinism can distort evaluation results, calling for reproducibility-aware benchmarking. Yuan et al. [yuan2025fp32death] further linked reproducibility failures to mixed-precision arithmetic and fused kernel inconsistencies, showing their impact on multi-step reasoning accuracy. Rainbird AI [rainbird2025deterministic] emphasized deterministic inference as a requirement for traceability and compliance in enterprise and safety-critical domains.
Recent work by Zhang et al. proposes tensor-parallel-invariant kernels that eliminate training-inference mismatch by ensuring bitwise deterministic results across different tensor-parallel sizes for LLM inference [zhang2025-deterministic-tp]. In contrast, existing LLM serving systems typically achieve determinism through batch-invariant computation. We examine the performance and engineering costs of this approach and propose an alternative mechanism for enabling determinism in LLM inference. By leveraging properties of GPU kernels together with characteristics of LLM inference workloads, our approach seeks to reconcile reproducibility with efficiency, a design space that remains largely unexplored in prior work.
7 Conclusion
Enabling determinism in LLM inference remains tedious today. Existing systems achieve determinism by enforcing batch-invariant computation, an approach that is cumbersome in practice: it requires rewriting kernels at a time when both hardware and model architectures are evolving rapidly. Moreover, batch-invariant kernels are inherently suboptimal, as they prevent kernels from adapting their parallelism strategies at runtime. We present LLM-42, a simpler alternative that enables determinism in LLM inference by repurposing speculative decoding. LLM-42 minimizes the need to write new kernels and supports selective enforcement of determinism, incurring runtime overhead only for the fraction of traffic that actually requires it.
References