## Bar Chart: Throughput Comparison of SGLang and LLM-42 Systems
### Overview
This is a grouped bar chart comparing the throughput performance, measured in tokens per second, of two SGLang configurations (non-deterministic and deterministic) against an LLM-42 system operating at various percentage levels (2%, 5%, 10%, 20%, 50%, 100%). The comparison is made across eight different workload scenarios defined by dataset (ArXiv, ShareGPT) and input/output sequence lengths.
### Components/Axes
* **Y-Axis:** Labeled "Throughput (tokens/s)". The scale runs from 0 to 20,000, with major gridlines at intervals of 2,500.
* **X-Axis:** Represents eight distinct workload categories. From left to right:
1. `ArXiv`
2. `ShareGPT`
3. `in=1024 out=256`
4. `in=1024 out=512`
5. `in=2048 out=256`
6. `in=2048 out=512`
7. `in=4096 out=512`
8. `in=512 out=256`
* **Legend:** Located at the top of the chart. It defines eight data series with distinct colors and patterns:
* **SGLang non-deterministic:** Green bar with diagonal stripes (top-left to bottom-right).
* **SGLang deterministic:** Red bar with diagonal stripes (top-left to bottom-right).
* **LLM-42 @2%:** Light purple bar with a dense dot pattern.
* **LLM-42 @5%:** Medium purple bar with a sparse dot pattern.
* **LLM-42 @10%:** Darker purple bar with a diagonal cross-hatch pattern.
* **LLM-42 @20%:** Purple bar with a horizontal line pattern.
* **LLM-42 @50%:** Purple bar with a vertical line pattern.
* **LLM-42 @100%:** Dark purple bar with a solid fill.
* **Data Labels:** Each bar has a numerical label on top indicating its value in tokens/s and a multiplier (e.g., "1.00x") showing its performance relative to the "SGLang non-deterministic" baseline for that specific workload category.
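These multiplier labels can be reproduced by dividing each bar's value by the SGLang non-deterministic value in the same group. A minimal sketch, using the approximate ArXiv readings from the Detailed Analysis section (the dictionary keys are illustrative series names):

```python
# Compute relative-performance labels ("1.00x", "0.67x", ...) for one
# workload group. Values are approximate readings from the ArXiv bars.
arxiv = {
    "SGLang non-deterministic": 16000,  # baseline for this group
    "SGLang deterministic": 10700,
    "LLM-42 @2%": 19000,
}

baseline = arxiv["SGLang non-deterministic"]
labels = {name: f"{value / baseline:.2f}x" for name, value in arxiv.items()}
# e.g. labels["SGLang deterministic"] -> "0.67x"
```

The baseline bar always labels itself "1.00x", since it is divided by its own value.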
### Detailed Analysis
The chart presents a consistent pattern across all eight workload categories. For each category, there are eight bars grouped together.
**Trend Verification & Data Points (Approximate values from labels):**
1. **ArXiv:**
* SGLang non-deterministic (Green): ~16,000 tokens/s (1.00x baseline).
* SGLang deterministic (Red): ~10,700 tokens/s (0.67x).
* LLM-42 throughput **decreases** as the percentage increases: @2% (~19,000, 1.19x) > @5% (~18,500, 1.16x) > @10% (~18,000, 1.13x) > @20% (~17,500, 1.09x) > @50% (~17,000, 1.06x) > @100% (~16,500, 1.03x).
2. **ShareGPT:**
* SGLang non-deterministic: ~15,000 tokens/s (1.00x).
* SGLang deterministic: ~10,000 tokens/s (0.67x).
* LLM-42 trend: @2% (~18,000, 1.20x) > @5% (~17,500, 1.17x) > @10% (~17,000, 1.13x) > @20% (~16,000, 1.07x) > @50% (~15,000, 1.00x) > @100% (~14,000, 0.93x).
3. **in=1024 out=256:**
* SGLang non-deterministic: ~17,000 tokens/s (1.00x).
* SGLang deterministic: ~12,000 tokens/s (0.71x).
* LLM-42 trend: @2% (~19,000, 1.12x) > @5% (~18,500, 1.09x) > @10% (~18,000, 1.06x) > @20% (~17,500, 1.03x) > @50% (~16,500, 0.97x) > @100% (~15,500, 0.91x).
4. **in=1024 out=512:**
* SGLang non-deterministic: ~13,000 tokens/s (1.00x).
* SGLang deterministic: ~9,500 tokens/s (0.73x).
* LLM-42 trend: @2% (~15,000, 1.15x) > @5% (~14,500, 1.12x) > @10% (~14,000, 1.08x) > @20% (~13,500, 1.04x) > @50% (~12,500, 0.96x) > @100% (~11,500, 0.88x).
5. **in=2048 out=256:**
* SGLang non-deterministic: ~16,500 tokens/s (1.00x).
* SGLang deterministic: ~11,500 tokens/s (0.70x).
* LLM-42 trend: @2% (~18,500, 1.12x) > @5% (~18,000, 1.09x) > @10% (~17,500, 1.06x) > @20% (~17,000, 1.03x) > @50% (~16,000, 0.97x) > @100% (~15,000, 0.91x).
6. **in=2048 out=512:**
* SGLang non-deterministic: ~12,500 tokens/s (1.00x).
* SGLang deterministic: ~9,000 tokens/s (0.72x).
* LLM-42 trend: @2% (~14,500, 1.16x) > @5% (~14,000, 1.12x) > @10% (~13,500, 1.08x) > @20% (~13,000, 1.04x) > @50% (~12,000, 0.96x) > @100% (~11,000, 0.88x).
7. **in=4096 out=512:**
* SGLang non-deterministic: ~12,000 tokens/s (1.00x).
* SGLang deterministic: ~8,500 tokens/s (0.71x).
* LLM-42 trend: @2% (~14,000, 1.17x) > @5% (~13,500, 1.13x) > @10% (~13,000, 1.08x) > @20% (~12,500, 1.04x) > @50% (~11,500, 0.96x) > @100% (~10,500, 0.88x).
8. **in=512 out=256:**
* SGLang non-deterministic: ~17,500 tokens/s (1.00x).
* SGLang deterministic: ~12,000 tokens/s (0.69x).
* LLM-42 trend: @2% (~19,500, 1.11x) > @5% (~19,000, 1.09x) > @10% (~18,500, 1.06x) > @20% (~17,500, 1.00x) > @50% (~16,500, 0.94x) > @100% (~15,500, 0.89x).
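The monotonic decline in LLM-42 throughput can be checked mechanically against the approximate readings above; a small sketch over three representative groups:

```python
# Approximate LLM-42 throughput readings (tokens/s), ordered
# @2%, @5%, @10%, @20%, @50%, @100%, as listed above.
llm42 = {
    "ArXiv":          [19000, 18500, 18000, 17500, 17000, 16500],
    "ShareGPT":       [18000, 17500, 17000, 16000, 15000, 14000],
    "in=512 out=256": [19500, 19000, 18500, 17500, 16500, 15500],
}

def strictly_decreasing(values):
    """True if each reading is lower than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))

# Every group shows a strictly decreasing trend from @2% to @100%.
assert all(strictly_decreasing(v) for v in llm42.values())
```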
### Key Observations
1. **Consistent Hierarchy:** Across the workload categories, the order from highest to lowest throughput is broadly: LLM-42 @2% > LLM-42 @5% > LLM-42 @10% > LLM-42 @20% > SGLang non-deterministic > LLM-42 @50% > LLM-42 @100% > SGLang deterministic. ArXiv is the main exception: there even LLM-42 @50% (1.06x) and @100% (1.03x) stay above the non-deterministic baseline, and on ShareGPT the @50% bar merely matches it (1.00x).
2. **LLM-42 Inverse Scaling:** There is a clear, monotonic **downward trend** in LLM-42 throughput as its operational percentage increases from 2% to 100%. The 2% configuration is consistently the fastest system overall.
3. **SGLang Deterministic Penalty:** The deterministic mode of SGLang consistently incurs a significant performance penalty (approximately 27-33% lower throughput) compared to its non-deterministic mode across all tests.
4. **Workload Impact:** Throughput is generally highest for the shorter-sequence workloads (`in=512 out=256` and `in=1024 out=256`) and lowest for `in=4096 out=512`, suggesting that longer input sequences, and especially longer output lengths, reduce processing speed (e.g., doubling the output from 256 to 512 tokens at in=1024 drops the baseline from ~17,000 to ~13,000 tokens/s).
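The deterministic-penalty band in observation 3 can be recomputed from the deterministic/non-deterministic pairs listed in the Detailed Analysis; a quick sketch using those approximate readings:

```python
# (deterministic, non-deterministic) SGLang throughput (tokens/s) per
# workload, approximate readings from the chart labels.
pairs = {
    "ArXiv":           (10700, 16000),
    "ShareGPT":        (10000, 15000),
    "in=1024 out=256": (12000, 17000),
    "in=1024 out=512": (9500, 13000),
    "in=2048 out=256": (11500, 16500),
    "in=2048 out=512": (9000, 12500),
    "in=4096 out=512": (8500, 12000),
    "in=512 out=256":  (12000, 17500),
}

# Penalty = fraction of throughput lost by switching to deterministic mode.
penalties = {w: 1 - det / base for w, (det, base) in pairs.items()}
# All penalties fall roughly in the 27-33% band.
```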
### Interpretation
The data strongly suggests that the LLM-42 system's throughput is highly sensitive to the "percentage" parameter, which likely represents a resource allocation, sampling rate, or confidence threshold. Lower percentages (2%, 5%) yield superior performance, outperforming even the optimized SGLang non-deterministic baseline. This indicates a potential trade-off where allocating fewer resources or accepting a lower confidence level per token results in higher overall system throughput.
The consistent underperformance of SGLang's deterministic mode highlights the computational overhead required to guarantee reproducible outputs. The choice between SGLang modes would thus depend on whether reproducibility is a critical requirement for the application, justifying the ~30% speed cost.
The workload-specific variations (e.g., lower throughput for longer input sequences) provide practical guidance for system deployment, indicating that performance will degrade as the context length of the task increases. The chart effectively demonstrates that for maximizing raw token generation speed, a lightly-loaded LLM-42 instance is the optimal choice among the tested configurations.