# Technical Document Extraction: Attention Forward Speed Benchmark
## 1. Document Header
* **Title:** Attention forward speed, head dim 256 (H100 80GB SXM5)
* **Hardware Context:** NVIDIA H100 80GB SXM5 GPU.
* **Operation:** Attention forward pass with a head dimension of 256.
## 2. Chart Metadata
* **Chart Type:** Grouped Bar Chart.
* **X-Axis Label:** Sequence length
* **X-Axis Categories:** 512, 1k, 2k, 4k, 8k, 16k.
* **Y-Axis Label:** Speed (TFLOPS/s)
* **Y-Axis Scale:** Linear, ranging from 0 to 1200 with major ticks at 400, 800, and 1200.
* **Legend Categories:**
* **Triton:** Green bar
* **cuDNN:** Red bar
* **FlashAttention-3:** Purple bar
## 3. Component Isolation & Trend Analysis
### Region: Main Chart Area
The chart compares three software implementations (Triton, cuDNN, and FlashAttention-3) across increasing sequence lengths.
* **Triton (Green):** Shows a steady upward trend as sequence length increases, starting at 299 TFLOPS/s and reaching 663 TFLOPS/s. It is consistently the lowest-performing of the three implementations at every sequence length.
* **cuDNN (Red):** Shows a sharp upward trend, particularly between 1k and 4k sequence lengths. It overtakes FlashAttention-3 at the 2k mark and remains the highest-performing implementation for all subsequent sequence lengths, peaking at 1099 TFLOPS/s.
* **FlashAttention-3 (Purple):** Shows a strong upward trend. It is the fastest implementation at the smallest sequence lengths (512 and 1k). While its performance continues to grow to 1024 TFLOPS/s, it is surpassed by cuDNN from sequence length 2k onwards.
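The throughput values in this chart can be related to raw kernel timings via an attention FLOP count. A minimal sketch follows, assuming the matmul-only convention commonly used in FlashAttention-style benchmarks (4·S²·d FLOPs per batch element and head, counting only the Q·Kᵀ and P·V matmuls); the function name and parameters are illustrative, not taken from the chart:

```python
def attn_fwd_tflops_per_s(batch: int, heads: int, seqlen: int,
                          head_dim: int, seconds: float) -> float:
    """Convert a measured forward-pass time into TFLOPS/s.

    Assumes the conventional matmul-only FLOP count: the two dominant
    matmuls (Q @ K^T and P @ V) each cost ~2 * seqlen^2 * head_dim FLOPs
    per (batch, head), i.e. 4 * seqlen^2 * head_dim in total. Softmax and
    masking FLOPs are ignored, as is typical in these benchmarks.
    """
    flops = 4.0 * batch * heads * seqlen * seqlen * head_dim
    return flops / seconds / 1e12
```

Under this convention, the quadratic growth of FLOPs in sequence length explains why larger sequences keep the GPU better utilized and push all three implementations toward higher TFLOPS/s.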
## 4. Data Table Reconstruction
| Sequence Length | Triton (Green) [TFLOPS/s] | cuDNN (Red) [TFLOPS/s] | FlashAttention-3 (Purple) [TFLOPS/s] |
| :--- | :--- | :--- | :--- |
| **512** | 299 | 304 | 329 |
| **1k** | 425 | 449 | 521 |
| **2k** | 520 | 768 | 703 |
| **4k** | 591 | 1015 | 856 |
| **8k** | 628 | 1056 | 960 |
| **16k** | 663 | 1099 | 1024 |
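The table above can be reconstructed programmatically to verify the crossover claim; a small sketch (the dict layout and variable names are my own, the numbers are taken directly from the table):

```python
# Throughput in TFLOPS/s, keyed by implementation and sequence length.
data = {
    "Triton":           {512: 299, 1024: 425, 2048: 520, 4096: 591, 8192: 628, 16384: 663},
    "cuDNN":            {512: 304, 1024: 449, 2048: 768, 4096: 1015, 8192: 1056, 16384: 1099},
    "FlashAttention-3": {512: 329, 1024: 521, 2048: 703, 4096: 856, 8192: 960, 16384: 1024},
}

# First sequence length at which cuDNN exceeds FlashAttention-3.
crossover = next(s for s in sorted(data["cuDNN"])
                 if data["cuDNN"][s] > data["FlashAttention-3"][s])
print(crossover)  # 2048

# Fastest implementation at each sequence length.
fastest_at = {s: max(data, key=lambda impl: data[impl][s])
              for s in sorted(data["cuDNN"])}
```

Running this confirms the narrative in Section 3: FlashAttention-3 leads at 512 and 1k, and cuDNN leads from 2k onward.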
## 5. Key Findings
* **Maximum Performance:** cuDNN achieves the highest recorded throughput of **1099 TFLOPS/s** at a sequence length of 16k.
* **Crossover Point:** FlashAttention-3 is fastest for short sequences (up to 1k). At sequence length 2k and above, cuDNN becomes the fastest implementation.
* **Scaling:** All three methods benefit from larger sequence lengths, though Triton scales more shallowly, while cuDNN and FlashAttention-3 show much larger jumps, most notably between 1k and 4k.
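The scaling claim can be quantified as the throughput gain from the smallest to the largest sequence length per implementation; a quick sketch using the endpoint values from the table (variable names are illustrative):

```python
# (TFLOPS/s at seqlen 512, TFLOPS/s at seqlen 16k) per implementation.
perf = {
    "Triton":           (299, 663),
    "cuDNN":            (304, 1099),
    "FlashAttention-3": (329, 1024),
}

# Ratio of peak to baseline throughput, rounded to two decimals.
gains = {impl: round(hi / lo, 2) for impl, (lo, hi) in perf.items()}
print(gains)  # {'Triton': 2.22, 'cuDNN': 3.62, 'FlashAttention-3': 3.11}
```

Triton's roughly 2.2x gain versus the ~3.6x and ~3.1x gains of cuDNN and FlashAttention-3 is consistent with its shallower scaling curve.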