Image 2e8857636e9f...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
## Bar Chart: Decode Throughput Comparison by Batch Size and System

### Overview
The chart compares decode throughput (tokens/sec) for three serving configurations (SGLang non-deterministic, SGLang deterministic, and LLM-42) at two batch sizes, 10 and 11. The y-axis shows throughput from 0 to 1200 tokens/sec and the x-axis shows batch size. The legend in the top-left corner maps colors to configurations: blue (SGLang non-deterministic), red (SGLang deterministic), and green (LLM-42).

### Components/Axes
- **X-axis (Batch Size)**: Labeled "Batch Size" with ticks at 10 and 11.
- **Y-axis (Decode Throughput)**: Labeled "Decode Throughput (tokens/sec)" with ticks every 200 from 0 to 1200.
- **Legend**: Positioned in the top-left corner, with three entries:
  - Blue: SGLang non-deterministic
  - Red: SGLang deterministic
  - Green: LLM-42

### Detailed Analysis
- **Batch Size 10**:
  - **SGLang non-deterministic (blue)**: ~800 tokens/sec (the only visible bar).
  - **SGLang deterministic (red)**: Not visible (no bar; either no data or ~0 tokens/sec).
  - **LLM-42 (green)**: Not visible (no bar; either no data or ~0 tokens/sec).
- **Batch Size 11**:
  - **SGLang non-deterministic (blue)**: Minimal throughput (~10 tokens/sec).
  - **SGLang deterministic (red)**: ~400 tokens/sec.
  - **LLM-42 (green)**: ~800 tokens/sec.

### Key Observations
1. **SGLang non-deterministic** shows a drastic drop in throughput when batch size increases from 10 to 11 (~800 → ~10 tokens/sec).
2. **SGLang deterministic** and **LLM-42** reach ~400 and ~800 tokens/sec, respectively, at batch size 11. With no visible bars at batch size 10, their trend across batch sizes cannot be read from the chart.
3. No data is reported for SGLang deterministic and LLM-42 at batch size 10.
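
The size of these differences can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are the approximate bar heights quoted above, not measured values, and the `readings` dictionary and `percent_change` helper are hypothetical names introduced for this example.

```python
# Approximate bar heights from the description above (tokens/sec).
# None marks a bar that is not visible in the chart.
readings = {
    "SGLang non-deterministic": {10: 800, 11: 10},
    "SGLang deterministic": {10: None, 11: 400},
    "LLM-42": {10: None, 11: 800},
}


def percent_change(before, after):
    """Relative change from `before` to `after` in percent, or None if a reading is missing."""
    if before is None or after is None:
        return None
    return (after - before) / before * 100.0


# Change in throughput going from batch size 10 to 11, per configuration.
changes = {name: percent_change(v[10], v[11]) for name, v in readings.items()}

# Ratio of the two visible bars at batch size 11.
ratio_llm42_to_det = readings["LLM-42"][11] / readings["SGLang deterministic"][11]

for name, change in changes.items():
    label = "n/a (missing reading)" if change is None else f"{change:.2f}%"
    print(f"{name}: {label}")
print(f"LLM-42 / SGLang deterministic at batch size 11: {ratio_llm42_to_det:.1f}x")
```

With these readings the non-deterministic drop works out to about 98.75%, and LLM-42 sits at roughly twice the deterministic SGLang figure at batch size 11. Both results inherit the uncertainty of the approximate bar heights.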

### Interpretation
As described, the data suggest that **SGLang non-deterministic** is highly sensitive to the step from batch size 10 to 11, with throughput nearly collapsing at batch size 11. In contrast, **SGLang deterministic** and **LLM-42** both sustain substantial throughput at batch size 11, with LLM-42 at roughly twice the deterministic SGLang figure. This may point to scalability problems for the non-deterministic configuration and better batch handling in the other two, but the chart shows only two adjacent batch sizes, so it does not by itself support conclusions about larger batches. The absence of bars for SGLang deterministic and LLM-42 at batch size 10 also leaves open whether these configurations were tested at that batch size or whether the results were omitted.
TECHNICAL ASSET FINGERPRINT

2e8857636e9f30f27276e743
