Image fd0c5d0a29cf...

## CDF Graph: Latency Distribution Comparison

### Overview
The image shows a cumulative distribution function (CDF) graph comparing latency distributions across different configurations. The x-axis represents end-to-end (E2E) latency in milliseconds (0–140,000 ms), and the y-axis represents the CDF (0–1.0). Eight data series are plotted, with distinct trends indicating performance differences between configurations.
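As a reminder of how such a curve is built, the following is a minimal sketch of an empirical CDF computed from raw latency samples. The sample values and the helper names (`ecdf`, `fraction_within`, `latency_at`) are hypothetical and are not taken from the figure.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return (sorted latencies, cumulative fractions) for a list of samples."""
    xs = sorted(samples)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def fraction_within(samples, threshold_ms):
    """CDF value at threshold_ms: share of requests with latency <= threshold."""
    xs = sorted(samples)
    return bisect_right(xs, threshold_ms) / len(xs)


def latency_at(samples, level):
    """Smallest observed latency at which the empirical CDF reaches `level`."""
    xs, ys = ecdf(samples)
    for x, y in zip(xs, ys):
        if y >= level:
            return x
    return xs[-1]


# Hypothetical E2E latencies in milliseconds (illustration only).
latencies_ms = [1200, 3400, 5600, 8000, 9500, 11000, 15000, 18000, 19500, 20000]

print(fraction_within(latencies_ms, 10_000))  # share of requests done by 10 s
print(latency_at(latencies_ms, 0.8))          # latency where the CDF hits 0.8
print(latency_at(latencies_ms, 1.0))          # latency where the CDF hits 1.0
```

Reading "0.8 CDF at X ms" off the graph corresponds to `latency_at(samples, 0.8)`, and "1.0 CDF at X ms" corresponds to the largest observed latency.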

### Components/Axes
- **X-axis**: E2E Latency (ms) – Linear scale from 0 to 140,000 ms.
- **Y-axis**: CDF – Linear scale from 0.0 to 1.0.
- **Legend**: Located in the bottom-right corner, mapping colors to configurations:
  - Green: SGLang non-deterministic
  - Red: SGLang deterministic
  - Blue: LLM-42 @2%
  - Orange: LLM-42 @5%
  - Purple: LLM-42 @10%
  - Brown: LLM-42 @20%
  - Pink: LLM-42 @50%
  - Cyan: LLM-42 @100%

### Detailed Analysis
1. **SGLang non-deterministic (Green)**:
   - Sharp rise to 1.0 CDF by ~20,000 ms.
   - Reaches 0.8 CDF at ~10,000 ms.
   - Minimal tail latency beyond 20,000 ms.

2. **SGLang deterministic (Red)**:
   - Similar to non-deterministic but slightly slower.
   - Reaches 1.0 CDF by ~25,000 ms.
   - 0.8 CDF at ~12,000 ms.

3. **LLM-42 @2% (Blue)**:
   - Slower than SGLang; 1.0 CDF at ~40,000 ms.
   - 0.8 CDF at ~25,000 ms.
   - Longer tail latency extending to 80,000 ms.

4. **LLM-42 @5% (Orange)**:
   - 1.0 CDF at ~50,000 ms.
   - 0.8 CDF at ~35,000 ms.
   - Tail latency up to 100,000 ms.

5. **LLM-42 @10% (Purple)**:
   - 1.0 CDF at ~60,000 ms.
   - 0.8 CDF at ~45,000 ms.
   - Tail latency to 120,000 ms.

6. **LLM-42 @20% (Brown)**:
   - 1.0 CDF at ~70,000 ms.
   - 0.8 CDF at ~55,000 ms.
   - Tail latency to 130,000 ms.

7. **LLM-42 @50% (Pink)**:
   - 1.0 CDF at ~80,000 ms.
   - 0.8 CDF at ~65,000 ms.
   - Tail latency to 140,000 ms.

8. **LLM-42 @100% (Cyan)**:
   - Slowest distribution; 1.0 CDF at ~90,000 ms.
   - 0.8 CDF at ~75,000 ms.
   - Flat plateau beyond 90,000 ms.
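The approximate readings listed above can be collected into a small table to compare configurations against the SGLang non-deterministic baseline. This is only a sketch: the numbers are the visual estimates from this description, not measured data, and the dictionary layout and the `slowdown` helper are hypothetical.

```python
# Approximate latencies (ms) at which each curve reaches 0.8 and 1.0 CDF,
# transcribed from the description above. These are visual estimates.
readings = {
    "SGLang non-deterministic": {"cdf_0.8": 10_000, "cdf_1.0": 20_000},
    "SGLang deterministic":     {"cdf_0.8": 12_000, "cdf_1.0": 25_000},
    "LLM-42 @2%":               {"cdf_0.8": 25_000, "cdf_1.0": 40_000},
    "LLM-42 @5%":               {"cdf_0.8": 35_000, "cdf_1.0": 50_000},
    "LLM-42 @10%":              {"cdf_0.8": 45_000, "cdf_1.0": 60_000},
    "LLM-42 @20%":              {"cdf_0.8": 55_000, "cdf_1.0": 70_000},
    "LLM-42 @50%":              {"cdf_0.8": 65_000, "cdf_1.0": 80_000},
    "LLM-42 @100%":             {"cdf_0.8": 75_000, "cdf_1.0": 90_000},
}

BASELINE = "SGLang non-deterministic"


def slowdown(config, level):
    """Latency of `config` at a CDF level, relative to the baseline."""
    return readings[config][level] / readings[BASELINE][level]


for name in readings:
    print(
        f"{name:26s} "
        f"0.8 CDF: {slowdown(name, 'cdf_0.8'):.2f}x  "
        f"1.0 CDF: {slowdown(name, 'cdf_1.0'):.2f}x"
    )
```

With these estimates, LLM-42 @2% reaches 1.0 CDF at 2x the baseline latency, and SGLang deterministic at 1.25x.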

### Key Observations
- **SGLang Dominance**: Both SGLang configurations (green/red) reach 1.0 CDF at ~20,000–25,000 ms, roughly 1.6–2x lower than the ~40,000 ms of LLM-42 @2%.
- **Determinism Impact**: SGLang deterministic (red) is marginally slower than non-deterministic (green), suggesting minimal overhead from determinism.
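- **LLM-42 Scaling**: Higher percentages in the LLM-42 configuration label (e.g., @50%, @100%) correlate with progressively longer latencies. These percentages are configuration settings, not latency percentiles.
- **Cyan Line Plateau**: LLM-42 @100% is flat beyond ~90,000 ms because its CDF has reached 1.0 there, meaning all observed requests completed by that point. This does not by itself indicate a hard cap on latency.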

### Interpretation
The data show that both SGLang configurations (non-deterministic and deterministic) outperform LLM-42 at every CDF level described above. The deterministic variant of SGLang introduces only a modest latency penalty (~5,000 ms at the point where the CDF reaches 1.0), suggesting that determinism does not significantly degrade performance in this context.

LLM-42's latency increases as the percentage in its configuration label rises from 2% to 100%, so each successive curve sits further to the right. The @100% configuration (cyan) is flat beyond ~90,000 ms, which means that 100% of its requests complete within ~90,000 ms, with latencies spread widely below that point. This could reflect resource contention or inefficiencies in handling edge cases, though the graph alone does not establish the cause.

For latency-sensitive applications, SGLang is preferable, while LLM-42 may be suitable for scenarios where higher latencies are acceptable. The latency cost of determinism in SGLang appears small here, so the deterministic mode provides consistency at little expense in speed.