Image 73664a1a8a8a...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

INTEL_VERIFIED

# Technical Document Extraction: Attention Forward Speed Benchmark

## 1. Header Information
*   **Title:** Attention forward speed, head dim 128 (H100 80GB SXM5)
*   **Hardware Context:** NVIDIA H100 80GB SXM5 GPU.
*   **Operation:** Attention forward pass.
*   **Parameter:** Head dimension fixed at 128.

## 2. Chart Metadata
*   **Type:** Grouped Bar Chart.
*   **X-Axis Label:** Sequence length (Categorical/Logarithmic scale: 512 to 16k).
*   **Y-Axis Label:** Speed (TFLOPS/s).
*   **Y-Axis Scale:** 0 to 1200, with major ticks at 400, 800, and 1200.
*   **Legend Placement:** Top-left [x ≈ 0.05, y ≈ 0.95].

## 3. Legend and Series Identification
The chart compares three software implementations/kernels:
1.  **Triton** (Green): Represented by the leftmost bar in each group.
2.  **cuDNN** (Red): Represented by the middle bar in each group.
3.  **FlashAttention-3** (Purple): Represented by the rightmost bar in each group.

## 4. Data Table Extraction
The following table reconstructs the numerical values displayed above each bar in the chart.

| Sequence Length | Triton (Green) | cuDNN (Red) | FlashAttention-3 (Purple) |
| :--- | :---: | :---: | :---: |
| **512** | 241 | 253 | 241 |
| **1k** | 340 | 384 | 367 |
| **2k** | 413 | 528 | 553 |
| **4k** | 464 | 719 | 716 |
| **8k** | 483 | 883 | 815 |
| **16k** | 510 | 922 | 881 |

## 5. Trend Analysis and Observations

### Component Isolation: Trend Verification
*   **Triton (Green):** Shows a steady, linear-like upward slope. It starts as the lowest performer at 512 (tied with FlashAttention-3) and remains the slowest implementation as sequence length increases, plateauing significantly compared to the others at higher sequence lengths.
*   **cuDNN (Red):** Shows a strong upward slope. It is the fastest implementation at the smallest scale (512) and the largest scales (8k and 16k). It exhibits the highest peak performance at 922 TFLOPS/s.
*   **FlashAttention-3 (Purple):** Shows a strong upward slope, closely tracking cuDNN. It surpasses cuDNN briefly at the 2k sequence length mark but falls slightly behind cuDNN at the 8k and 16k marks.

### Key Findings
*   **Scaling:** All three methods show improved TFLOPS/s as sequence length increases, indicating better hardware utilization at larger scales.
*   **Performance Leaders:** At sequence lengths of 4k and above, cuDNN and FlashAttention-3 significantly outperform Triton, achieving nearly double the throughput at 16k (922 and 881 TFLOPS/s vs. 510 TFLOPS/s).
*   **Peak Performance:** The maximum recorded speed is **922 TFLOPS/s** achieved by **cuDNN** at a sequence length of 16k.

DECODING INTELLIGENCE...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

INTEL_VERIFIED

DECODING INTELLIGENCE...

TECHNICAL ASSET FINGERPRINT

73664a1a8a8a87161019c426

FOUND IN PAPERS

EXPERT: gemini-3-flash-free VERSION 1

EXPERT: nemotron-free VERSION 1