Image e27582d2db38...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

INTEL_VERIFIED

# Technical Document Extraction: Attention Forward Speed Benchmark

## 1. Header Information
*   **Title:** Attention forward speed, head dim 128 (H100 80GB SXM5)
*   **Hardware Context:** NVIDIA H100 80GB SXM5 GPU.
*   **Parameter Context:** Head dimension is fixed at 128.

## 2. Chart Structure and Metadata
*   **Chart Type:** Grouped Bar Chart.
*   **X-Axis (Independent Variable):** Sequence length.
    *   **Categories:** 512, 1k, 2k, 4k, 8k, 16k.
*   **Y-Axis (Dependent Variable):** Speed (TFLOPS/s).
    *   **Scale:** Linear, ranging from 0 to 600+ (markers at 200, 400, 600).
*   **Legend:**
    1.  **Standard attention** (Blue)
    2.  **FlashAttention-2** (Orange)
    3.  **Triton** (Green)
    4.  **cuDNN** (Red)
    5.  **FlashAttention-3** (Purple)

## 3. Data Table Reconstruction
The following table extracts the numerical values (TFLOPS/s) labeled above each bar in the chart.

| Sequence Length | Standard attention (Blue) | FlashAttention-2 (Orange) | Triton (Green) | cuDNN (Red) | FlashAttention-3 (Purple) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **512** | 74 | 309 | 323 | 497 | 467 |
| **1k** | 100 | 350 | 372 | 574 | 565 |
| **2k** | 119 | 362 | 389 | 617 | 625 |
| **4k** | 133 | 368 | 389 | 609 | 638 |
| **8k** | 139 | 370 | 392 | 600 | 646 |
| **16k** | OOM* | 370 | 395 | 595 | 648 |

*\*OOM: Out of Memory*

## 4. Trend Analysis and Component Isolation

### Series Trends
*   **Standard attention (Blue):** Shows a slow upward slope from 74 to 139 TFLOPS/s as sequence length increases, but fails at 16k due to memory constraints (OOM). It is consistently the lowest performing method.
*   **FlashAttention-2 (Orange):** Slopes upward initially and plateaus quickly around 370 TFLOPS/s from 8k onwards.
*   **Triton (Green):** Follows a similar trend to FlashAttention-2 but maintains a consistent lead of ~20-25 TFLOPS/s over it, plateauing near 395 TFLOPS/s.
*   **cuDNN (Red):** Shows a sharp upward trend, peaking at 2k (617 TFLOPS/s), then exhibits a slight downward trend as sequence length increases to 16k (595 TFLOPS/s).
*   **FlashAttention-3 (Purple):** Shows the strongest scaling performance. While it starts slightly behind cuDNN at 512 and 1k, it overtakes all other methods at 2k and continues to slope upward, reaching the highest recorded value of 648 TFLOPS/s at 16k.

### Comparative Observations
*   **Efficiency Gap:** There is a massive performance gap between "Standard attention" and all optimized kernels. At 8k, FlashAttention-3 is approximately 4.6x faster than Standard attention.
*   **Crossover Point:** FlashAttention-3 overtakes cuDNN as the fastest implementation at the 2k sequence length mark.
*   **Scalability:** FlashAttention-3 is the only implementation that shows continuous (albeit slowing) growth in TFLOPS/s across the entire tested range without regressing at higher sequence lengths.

DECODING INTELLIGENCE...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

INTEL_VERIFIED

# Technical Document Extraction: Attention Forward Speed Analysis

## Chart Title
**Attention forward speed, head dim 128 (H100 80GB SXM5)**

---

### Axis Labels
- **X-axis**: Sequence length (categories: 512, 1k, 2k, 4k, 8k, 16k)
- **Y-axis**: Speed (TFLOPs/s)

---

### Legend
| Color       | Method               |
|-------------|----------------------|
| Blue        | Standard attention   |
| Orange      | FlashAttention-2     |
| Green       | Triton               |
| Red         | cuDNN                |
| Purple      | FlashAttention-3     |

---

### Data Points by Sequence Length
#### 512
- Standard attention: 74 TFLOPs/s
- FlashAttention-2: 309 TFLOPs/s
- Triton: 323 TFLOPs/s
- cuDNN: 467 TFLOPs/s
- FlashAttention-3: 497 TFLOPs/s

#### 1k
- Standard attention: 100 TFLOPs/s
- FlashAttention-2: 350 TFLOPs/s
- Triton: 372 TFLOPs/s
- cuDNN: 574 TFLOPs/s
- FlashAttention-3: 565 TFLOPs/s

#### 2k
- Standard attention: 119 TFLOPs/s
- FlashAttention-2: 362 TFLOPs/s
- Triton: 389 TFLOPs/s
- cuDNN: 617 TFLOPs/s
- FlashAttention-3: 625 TFLOPs/s

#### 4k
- Standard attention: 133 TFLOPs/s
- FlashAttention-2: 368 TFLOPs/s
- Triton: 389 TFLOPs/s
- cuDNN: 609 TFLOPs/s
- FlashAttention-3: 638 TFLOPs/s

#### 8k
- Standard attention: 139 TFLOPs/s
- FlashAttention-2: 370 TFLOPs/s
- Triton: 392 TFLOPs/s
- cuDNN: 600 TFLOPs/s
- FlashAttention-3: 646 TFLOPs/s

#### 16k
- Standard attention: **OOM** (Out of Memory)
- FlashAttention-2: 395 TFLOPs/s
- Triton: 395 TFLOPs/s
- cuDNN: 595 TFLOPs/s
- FlashAttention-3: 648 TFLOPs/s

---

### Key Trends
1. **Performance Scaling**: All methods show increased speed with longer sequence lengths, except Standard attention at 16k (OOM).
2. **FlashAttention-3 Dominance**: Consistently achieves highest TFLOPs/s across all sequence lengths (up to 648 TFLOPs/s at 16k).
3. **cuDNN Performance**: Second-highest performance, with a peak of 617 TFLOPs/s at 2k.
4. **Standard Attention Limitations**: Significantly lower performance and fails at 16k due to OOM.
5. **Triton vs. FlashAttention-2**: Triton slightly outperforms FlashAttention-2 in most cases (e.g., 389 vs. 368 TFLOPs/s at 4k).

---

### Critical Observations
- **OOM at 16k**: Standard attention cannot handle 16k sequence length on H100 80GB SXM5.
- **Efficiency Gaps**: FlashAttention-3 achieves ~30-40% higher speed than cuDNN at 16k (648 vs. 595 TFLOPs/s).
- **Consistency**: Triton and FlashAttention-2 show minimal variation across sequence lengths (368-395 TFLOPs/s range).

DECODING INTELLIGENCE...

TECHNICAL ASSET FINGERPRINT

e27582d2db385c9bac87ebc9

FOUND IN PAPERS

EXPERT: gemini-3-flash-free VERSION 1

EXPERT: nemotron-free VERSION 1