# Technical Document Extraction: Attention Forward Speed Benchmark

## 1. Header Information
*   **Title:** Attention forward speed, head dim 64 (H100 80GB SXM5)
*   **Hardware Context:** NVIDIA H100 80GB SXM5 GPU.
*   **Parameter Context:** Head dimension is fixed at 64.

## 2. Chart Metadata and Structure
*   **Chart Type:** Grouped Bar Chart.
*   **X-Axis (Independent Variable):** Sequence length.
    *   **Scale:** Categorical/Logarithmic progression (512, 1k, 2k, 4k, 8k, 16k).
*   **Y-Axis (Dependent Variable):** Speed (TFLOPs/s).
    *   **Scale:** Linear, ranging from 0 to 1200, with major ticks at intervals of 400.
*   **Legend:**
    *   **Green Bar:** Triton
    *   **Red Bar:** cuDNN
    *   **Purple Bar:** FlashAttention-3

## 3. Component Analysis and Trend Verification

### Triton (Green Bars)
*   **Visual Trend:** Shows a consistent upward slope as sequence length increases, beginning to plateau slightly between 8k and 16k.
*   **Performance:** It is the fastest implementation at the shorter sequence lengths (512 through 2k) but is overtaken by FlashAttention-3 at 4k and by cuDNN, narrowly, at 8k.

### cuDNN (Red Bars)
*   **Visual Trend:** Rises overall but unevenly: it is nearly flat between 2k and 4k (317 to 324 TFLOPs/s), then jumps sharply at 8k. It trails Triton through 4k and edges slightly ahead of it at 8k and 16k.
*   **Performance:** The slowest implementation at 2k and 4k, and the middle performer at every other sequence length.

### FlashAttention-3 (Purple Bars)
*   **Visual Trend:** Shows the steepest upward slope. While it starts as the slowest implementation at 512, it scales the most efficiently, overtaking cuDNN at 2k and Triton at 4k.
*   **Performance:** Becomes the dominant implementation for long sequences (4k and above).

## 4. Data Table Reconstruction
The following table transcribes the numerical values labeled above each bar in the chart.

| Sequence Length | Triton (Green) [TFLOPs/s] | cuDNN (Red) [TFLOPs/s] | FlashAttention-3 (Purple) [TFLOPs/s] |
| :--- | :--- | :--- | :--- |
| **512** | 234 | 194 | 164 |
| **1k** | 325 | 258 | 244 |
| **2k** | 393 | 317 | 369 |
| **4k** | 440 | 324 | 475 |
| **8k** | 459 | 464 | 533 |
| **16k** | 481 | 483 | 572 |
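
The rankings and crossover points described in Section 3 can be checked directly against the table. The following is a minimal sketch in Python; the names `SEQ_LENS`, `SPEEDS`, `fastest_per_length`, and `first_overtake` are illustrative helpers introduced here, not part of the original benchmark, and the numbers are simply the transcribed values.

```python
# Values transcribed from the table above (TFLOPs/s, head dim 64, H100 80GB SXM5).
SEQ_LENS = ["512", "1k", "2k", "4k", "8k", "16k"]
SPEEDS = {
    "Triton": [234, 325, 393, 440, 459, 481],
    "cuDNN": [194, 258, 317, 324, 464, 483],
    "FlashAttention-3": [164, 244, 369, 475, 533, 572],
}


def fastest_per_length(speeds, seq_lens):
    """Map each sequence length to the name of the fastest implementation."""
    return {
        seq: max(speeds, key=lambda name: speeds[name][idx])
        for idx, seq in enumerate(seq_lens)
    }


def first_overtake(speeds, seq_lens, challenger, incumbent):
    """Return the first sequence length where challenger beats incumbent, else None."""
    for seq, c_val, i_val in zip(seq_lens, speeds[challenger], speeds[incumbent]):
        if c_val > i_val:
            return seq
    return None


leaders = fastest_per_length(SPEEDS, SEQ_LENS)
for seq in SEQ_LENS:
    print(f"{seq:>4}: {leaders[seq]}")

print("FA3 passes cuDNN at:", first_overtake(SPEEDS, SEQ_LENS, "FlashAttention-3", "cuDNN"))
print("FA3 passes Triton at:", first_overtake(SPEEDS, SEQ_LENS, "FlashAttention-3", "Triton"))
```

Keeping the data in one dictionary keyed by implementation makes it easy to add another series or rerun the comparison if the transcription is corrected.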

## 5. Summary of Findings
The benchmark shows that **Triton** is the fastest implementation at shorter sequence lengths (512 through 2k), while **FlashAttention-3** scales best on H100 hardware and leads from 4k onward. At the 16k sequence length, FlashAttention-3 reaches **572 TFLOPs/s**, outperforming Triton (481 TFLOPs/s) and cuDNN (483 TFLOPs/s) by approximately 18-19%.
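
The 18-19% figure follows from the 16k values by simple division. As a quick check, here is a short Python sketch; `percent_faster` is a hypothetical helper name used only for illustration.

```python
def percent_faster(a, b):
    """Return how much faster a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


# 16k sequence length values from the data table (TFLOPs/s).
fa3, triton, cudnn = 572, 481, 483

vs_triton = round(percent_faster(fa3, triton), 1)
vs_cudnn = round(percent_faster(fa3, cudnn), 1)

print(f"FlashAttention-3 vs Triton: {vs_triton}%")  # 18.9%
print(f"FlashAttention-3 vs cuDNN:  {vs_cudnn}%")   # 18.4%
```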