# Technical Document Extraction: Roofline Model Analysis for Llama 33B
## 1. Document Header
* **Title:** Llama 33B, A100 80GB PCIe
* **Subject:** Performance analysis of a Large Language Model (Llama 33B) on specific hardware (NVIDIA A100 80GB PCIe) using a Roofline Model.
## 2. Chart Metadata and Axes
The image is a **Roofline Plot**, which relates computational performance to operational intensity. Both axes use a logarithmic scale.
* **Y-Axis (Performance):**
    * **Label:** Performance (FLOP/s)
    * **Scale:** Logarithmic, ranging from 10G to 100T+.
    * **Major Markers:** 10G, 100G, 1T, 10T, 100T.
* **X-Axis (Operational Intensity):**
    * **Label:** Operational Intensity (FLOP/Byte)
    * **Scale:** Logarithmic, ranging from ~0.6 to 10k.
    * **Major Markers:** 1, 10, 100, 1k, 10k.
## 3. Legend and Component Isolation
The legend is located in the lower-right quadrant of the main chart area.
### Hardware Limits (Lines)
* **Blue Dashed Diagonal Line:** Represents the memory bandwidth limit.
    * **Label:** 1,935 GB/s
    * **Trend:** Slopes upward from left to right, indicating that at low operational intensity, performance is bound by memory transfer speeds.
* **Red Dashed Horizontal Line:** Represents the peak computational throughput.
    * **Label:** 312 TFLOP/s
    * **Trend:** Constant horizontal line at the top of the chart, indicating the hardware's maximum theoretical performance.
* **Green Dashed Vertical Line:** Represents the "Ridge Point" where the memory limit meets the compute limit. This occurs at an operational intensity of approximately 161 FLOP/Byte ($\frac{312 \times 10^{12}}{1935 \times 10^{9}} \approx 161$).
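The two hardware limits and the ridge point can be reproduced with a few lines of Python. This is a minimal sketch of the standard roofline formula, using only the two values labeled on the chart; the function names `ridge_point` and `attainable_flops` are illustrative and do not come from the source.

```python
# Hardware limits as labeled on the chart (A100 80GB PCIe).
PEAK_FLOPS = 312e12   # red dashed line: 312 TFLOP/s
MEM_BW = 1935e9       # blue dashed line: 1,935 GB/s


def ridge_point(peak_flops: float = PEAK_FLOPS, mem_bw: float = MEM_BW) -> float:
    """Operational intensity (FLOP/Byte) where the two limits intersect."""
    return peak_flops / mem_bw


def attainable_flops(oi: float,
                     peak_flops: float = PEAK_FLOPS,
                     mem_bw: float = MEM_BW) -> float:
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_flops, mem_bw * oi)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} FLOP/Byte")
    for oi in (1, 60, 1000):
        print(f"OI = {oi:>5} FLOP/Byte -> bound = {attainable_flops(oi) / 1e12:.1f} TFLOP/s")
```

Any operational intensity below the ridge point is limited by the diagonal memory line, and anything above it is limited by the horizontal compute line.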
### Data Series (Scatter Points)
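The data points represent different configurations of the "qk/pv" operations, most likely the two attention matrix multiplications $QK^\top$ and $PV$ (attention probabilities times values), for standard autoregressive (ar) decoding versus Medusa decoding with varying candidate counts.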
| Color | Label | Operational Intensity Range (Approx) | Performance Range (Approx) |
| :--- | :--- | :--- | :--- |
| **Grey** | qk/pv ar | ~1 FLOP/Byte | 400G - 1.5T FLOP/s |
| **Orange** | qk/pv Medusa (# cand.: 16) | ~15 FLOP/Byte | 5T - 20T FLOP/s |
| **Light Coral** | qk/pv Medusa (# cand.: 32) | ~25 FLOP/Byte | 10T - 35T FLOP/s |
| **Red-Pink** | qk/pv Medusa (# cand.: 48) | ~35 FLOP/Byte | 15T - 50T FLOP/s |
| **Deep Pink** | qk/pv Medusa (# cand.: 64) | ~45 FLOP/Byte | 20T - 65T FLOP/s |
| **Magenta** | qk/pv Medusa (# cand.: 80) | ~50 FLOP/Byte | 22T - 75T FLOP/s |
| **Purple** | qk/pv Medusa (# cand.: 96) | ~55 FLOP/Byte | 25T - 85T FLOP/s |
| **Dark Violet** | qk/pv Medusa (# cand.: 112) | ~60 FLOP/Byte | 30T - 90T FLOP/s |
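To put the table in context, each series can be compared with the roofline bound at its operational intensity. The sketch below assumes the approximate intensities and the upper end of each performance range read from the table, so the resulting fractions are rough estimates rather than measurements; `SERIES` and `roofline_summary` are hypothetical names introduced for illustration.

```python
PEAK_FLOPS = 312e12   # red dashed line: 312 TFLOP/s
MEM_BW = 1935e9       # blue dashed line: 1,935 GB/s

# (label, approx. operational intensity in FLOP/Byte,
#  upper end of the approx. performance range in FLOP/s).
# Approximate chart readings from the table, not measured data.
SERIES = [
    ("qk/pv ar", 1, 1.5e12),
    ("qk/pv Medusa (# cand.: 16)", 15, 20e12),
    ("qk/pv Medusa (# cand.: 32)", 25, 35e12),
    ("qk/pv Medusa (# cand.: 48)", 35, 50e12),
    ("qk/pv Medusa (# cand.: 64)", 45, 65e12),
    ("qk/pv Medusa (# cand.: 80)", 50, 75e12),
    ("qk/pv Medusa (# cand.: 96)", 55, 85e12),
    ("qk/pv Medusa (# cand.: 112)", 60, 90e12),
]


def roofline_summary(series=SERIES):
    """Return (label, roofline bound in FLOP/s, fraction of bound reached)."""
    summary = []
    for label, oi, perf in series:
        roof = min(PEAK_FLOPS, MEM_BW * oi)  # standard roofline bound
        summary.append((label, roof, perf / roof))
    return summary


if __name__ == "__main__":
    for label, roof, frac in roofline_summary():
        print(f"{label:<30} roof = {roof / 1e12:6.1f} TFLOP/s, reached = {frac:.0%}")
```

With these readings, every series stays below its memory-bound roof, which is consistent with the points lying under the blue dashed line.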
## 4. Key Trends and Observations
1. **Memory Bound Regime:** All plotted data points fall well to the left of the green vertical ridge point (~161 FLOP/Byte); the highest operational intensity shown is ~60 FLOP/Byte. The plotted qk/pv operations of Llama 33B are therefore **memory-bandwidth bound**, not compute-bound, on this hardware.
2. **Medusa Efficiency:** The standard autoregressive (ar) method (grey dots) has the lowest operational intensity (~1 FLOP/Byte) and the lowest performance (400G - 1.5T FLOP/s). Every Medusa configuration sits higher on both axes.
3. **Scaling with Candidates:** As the number of Medusa candidates increases (from 16 to 112):
    * The **Operational Intensity** increases (shifts right on the X-axis).
    * The **Performance** increases (shifts up on the Y-axis).
    * The data points roughly follow the slope of the blue dashed line (1,935 GB/s) while staying below it. In this regime attainable performance scales with operational intensity, so the gains come from doing more work per byte fetched rather than from any increase in available bandwidth.
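4. **Performance Gap:** Even the highest-performing Medusa configuration (~90 TFLOP/s) remains well below the theoretical peak of 312 TFLOP/s because it is still constrained by the memory ceiling, which at ~60 FLOP/Byte is about 116 TFLOP/s ($1935 \times 10^{9} \times 60 \approx 1.16 \times 10^{14}$).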