## [Combination Chart]: Performance vs. KV Budget for R1-Llama and R1-Qwen Models
### Overview
The image displays two side-by-side combination charts (bar and line) comparing the performance of two models, "R1-Llama" and "R1-Qwen," across different Key-Value (KV) Cache Budgets. Each chart plots two metrics: "Pass@1" (a performance score, represented by blue bars) and "Throughput" in Tokens Per Second (TPS, represented by an orange line with circular markers). The charts illustrate a trade-off between model performance and computational efficiency as the KV budget increases.
### Components/Axes
**Common Elements (Both Charts):**
* **X-Axis:** Labeled "KV Budget". It has six discrete, evenly spaced categories: `2500`, `3000`, `3500`, `4000`, `4500`, `5000`.
* **Primary Y-Axis (Left):** Labeled "Pass@1". Scale ranges from 30 to 80.
* **Secondary Y-Axis (Right):** Labeled "Throughput (TPS)". The scale differs between the two charts.
* **Legend:** Positioned in the top-right corner of each chart's plot area. It contains two entries:
    * A blue square labeled "Pass@1".
    * An orange line with a circle marker labeled "Throughput".
**Chart-Specific Details:**
* **Left Chart Title:** "R1-Llama" (centered at the top).
* **Right Chart Title:** "R1-Qwen" (centered at the top).
* **R1-Llama Secondary Y-Axis Scale:** Ranges from 400 to 600 TPS.
* **R1-Qwen Secondary Y-Axis Scale:** Ranges from 600 to 800 TPS.
### Detailed Analysis
**1. R1-Llama Chart (Left):**
* **Pass@1 (Blue Bars):** The values show a general upward trend with increasing KV Budget, with a small decline at 4000 and 4500 before recovering to the maximum at 5000.
    * KV 2500: 44.2
    * KV 3000: 50.4
    * KV 3500: 51.0
    * KV 4000: 50.8
    * KV 4500: 49.9
    * KV 5000: 53.0
* **Throughput (Orange Line):** The line slopes downward at every step from left to right. Values are read from the axis and are approximate.
    * KV 2500: ~580 TPS (point is near the top of the axis, between 550 and 600).
    * KV 3000: ~540 TPS.
    * KV 3500: ~510 TPS.
    * KV 4000: ~490 TPS.
    * KV 4500: ~460 TPS.
    * KV 5000: ~430 TPS (point is near the bottom of the axis, between 400 and 450).
**2. R1-Qwen Chart (Right):**
* **Pass@1 (Blue Bars):** The values never decrease as KV Budget increases, but the gains are uneven: the score is flat between 4000 and 4500.
    * KV 2500: 49.8
    * KV 3000: 52.6
    * KV 3500: 54.1
    * KV 4000: 54.3
    * KV 4500: 54.3
    * KV 5000: 56.3
* **Throughput (Orange Line):** The line slopes downward at every step from left to right. Values are read from the axis and are approximate.
    * KV 2500: ~770 TPS (point is near the top of the axis, between 750 and 800).
    * KV 3000: ~750 TPS.
    * KV 3500: ~730 TPS.
    * KV 4000: ~710 TPS.
    * KV 4500: ~680 TPS.
    * KV 5000: ~650 TPS.
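The trend statements above can be checked mechanically. The following is a minimal sketch, assuming the values transcribed from the chart (throughput figures are approximate axis readings); the names `KV_BUDGETS`, `READINGS`, `is_non_decreasing`, `is_strictly_decreasing`, and `dips` are illustrative and do not come from the source.

```python
# Values transcribed from the chart description; throughput is approximate.
KV_BUDGETS = [2500, 3000, 3500, 4000, 4500, 5000]

READINGS = {
    "R1-Llama": {
        "pass_at_1": [44.2, 50.4, 51.0, 50.8, 49.9, 53.0],
        "throughput_tps": [580, 540, 510, 490, 460, 430],
    },
    "R1-Qwen": {
        "pass_at_1": [49.8, 52.6, 54.1, 54.3, 54.3, 56.3],
        "throughput_tps": [770, 750, 730, 710, 680, 650],
    },
}


def is_non_decreasing(values):
    """True if no value is lower than the one before it (plateaus allowed)."""
    return all(a <= b for a, b in zip(values, values[1:]))


def is_strictly_decreasing(values):
    """True if every value is lower than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


def dips(budgets, values):
    """Budgets at which the value is lower than at the previous budget."""
    return [b for b, prev, cur in zip(budgets[1:], values, values[1:]) if cur < prev]


if __name__ == "__main__":
    for model, series in READINGS.items():
        print(
            model,
            "| Pass@1 non-decreasing:", is_non_decreasing(series["pass_at_1"]),
            "| Pass@1 dips at:", dips(KV_BUDGETS, series["pass_at_1"]),
            "| throughput strictly decreasing:",
            is_strictly_decreasing(series["throughput_tps"]),
        )
```

Because the throughput values are eyeballed from the axis, the strict-decrease check reflects the transcription rather than exact measurements.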
### Key Observations
1. **Inverse Relationship:** In both models, throughput falls at every step as the KV budget increases: from roughly 580 to 430 TPS for R1-Llama and from roughly 770 to 650 TPS for R1-Qwen.
2. **Performance Trend:** Pass@1 generally improves with a larger KV budget for both models, but not at every step. R1-Llama declines slightly at 4000 and 4500 before reaching its maximum at 5000, and R1-Qwen is flat between 4000 and 4500.
3. **Model Comparison:** R1-Qwen operates in a higher throughput range (roughly 650-770 TPS) than R1-Llama (roughly 430-580 TPS) at the same KV budgets. Its Pass@1 is also higher at every budget and never decreases as the budget grows.
4. **Trade-off Point:** The charts visually highlight the engineering trade-off: allocating more KV cache (budget) improves model accuracy (Pass@1) but reduces the speed at which the model can generate tokens (Throughput).
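To put rough numbers on the trade-off, one can compare the first and last readings for each model. This is a sketch only, assuming the transcribed values (throughput is approximate); `endpoint_tradeoff` is a hypothetical helper name, not something defined in the source.

```python
def endpoint_tradeoff(pass_at_1, throughput_tps):
    """Compare the smallest and largest KV budget.

    Returns (Pass@1 gain in points, throughput change in percent).
    """
    gain = pass_at_1[-1] - pass_at_1[0]
    tps_change_pct = 100.0 * (throughput_tps[-1] - throughput_tps[0]) / throughput_tps[0]
    return gain, tps_change_pct


# Values transcribed from the chart; throughput readings are approximate.
llama = endpoint_tradeoff(
    [44.2, 50.4, 51.0, 50.8, 49.9, 53.0],
    [580, 540, 510, 490, 460, 430],
)
qwen = endpoint_tradeoff(
    [49.8, 52.6, 54.1, 54.3, 54.3, 56.3],
    [770, 750, 730, 710, 680, 650],
)

if __name__ == "__main__":
    for name, (gain, pct) in (("R1-Llama", llama), ("R1-Qwen", qwen)):
        print(f"{name}: Pass@1 {gain:+.1f} points, throughput {pct:+.1f}%")
```

On these readings, going from a budget of 2500 to 5000 buys R1-Llama about 8.8 Pass@1 points for a throughput loss of about 26%, and R1-Qwen about 6.5 points for a loss of about 16%.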
### Interpretation
The data is consistent with a familiar constraint in serving large language models: the memory and computational cost of the KV cache. A larger KV budget lets the model attend to more retained context, which typically improves task performance (higher Pass@1). However, attending over a larger cache requires more memory traffic and computation per generated token, which reduces throughput.
The comparison between R1-Llama and R1-Qwen suggests architectural or optimization differences, although the charts do not state model sizes or hardware. R1-Qwen achieves higher throughput at all measured points, indicating it may be the more efficient model for inference in this setup. Its Pass@1 also scales more predictably with KV budget. The decline in R1-Llama's Pass@1 at budgets of 4000 and 4500 is small (about one point below the value at 3500) and could be an experimental artifact, measurement noise, or a sign of diminishing returns in that range.
For a system designer, these charts provide useful data for provisioning. If the application needs high throughput, a lower KV budget might be chosen, accepting a potential drop in accuracy. If accuracy is paramount (e.g., for complex reasoning tasks), a higher KV budget is justified despite the speed penalty. The optimal operating point depends on the specific requirements of the application.
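One way to make the choice of operating point concrete is to fix a throughput floor and pick the most accurate budget that still meets it. The sketch below assumes the transcribed chart values (throughput is approximate); `best_budget` and the floor values used are hypothetical, chosen only for illustration.

```python
def best_budget(budgets, pass_at_1, throughput_tps, min_tps):
    """Return the KV budget with the highest Pass@1 whose throughput is at
    least min_tps, or None if no budget meets the floor.

    Ties on Pass@1 are broken in favour of the higher throughput.
    """
    candidates = [
        (score, tps, budget)
        for budget, score, tps in zip(budgets, pass_at_1, throughput_tps)
        if tps >= min_tps
    ]
    if not candidates:
        return None
    # Tuples compare element by element: Pass@1 first, then throughput.
    return max(candidates)[2]


# Values transcribed from the chart; throughput readings are approximate.
BUDGETS = [2500, 3000, 3500, 4000, 4500, 5000]
LLAMA_PASS = [44.2, 50.4, 51.0, 50.8, 49.9, 53.0]
LLAMA_TPS = [580, 540, 510, 490, 460, 430]
QWEN_PASS = [49.8, 52.6, 54.1, 54.3, 54.3, 56.3]
QWEN_TPS = [770, 750, 730, 710, 680, 650]

if __name__ == "__main__":
    print("R1-Llama, floor 500 TPS:", best_budget(BUDGETS, LLAMA_PASS, LLAMA_TPS, 500))
    print("R1-Qwen, floor 680 TPS:", best_budget(BUDGETS, QWEN_PASS, QWEN_TPS, 680))
```

The tie-break matters for R1-Qwen, where budgets of 4000 and 4500 have the same Pass@1: the lower budget is preferred because it delivers the same score at higher throughput.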