Image f60142e2221e...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
# Technical Document: Attention Backward Speed Analysis

## Chart Title
**Attention backward speed, head dim 128 (H100 80GB SXM5)**

---

### Axis Labels
- **X-axis**: Sequence length (categories: 512, 1k, 2k, 4k, 8k, 16k)
- **Y-axis**: Speed (TFLOPs/s)

---

### Legend
| Color       | Method               |
|-------------|----------------------|
| Blue        | Standard attention   |
| Orange      | FlashAttention-2     |
| Red         | cuDNN                |
| Purple      | FlashAttention-3     |

---

### Data Points (Speed in TFLOPs/s)
| Sequence Length | Standard attention | FlashAttention-2 | cuDNN | FlashAttention-3 |
|-----------------|--------------------|------------------|-------|------------------|
| 512             | 104                | 214              | 305   | 316              |
| 1k              | 131                | 260              | 408   | 424              |
| 2k              | 159                | 291              | 465   | 501              |
| 4k              | 174                | 310              | 499   | 542              |
| 8k              | 181                | 318              | 518   | 559              |
| 16k             | OOM                | 322              | 516   | 561              |

---

### Key Observations
1. **Performance Trends**:
   - **FlashAttention-3** consistently achieves the highest speed across all sequence lengths (316–561 TFLOPs/s).
   - **cuDNN** outperforms **FlashAttention-2** and **Standard attention** at all sequence lengths (305–518 TFLOPs/s).
   - **Standard attention** shows diminishing returns and becomes **out of memory (OOM)** at 16k sequence length.

2. **Scalability**:
   - All methods exhibit increased speed with longer sequence lengths, except Standard attention at 16k.
   - FlashAttention-3 demonstrates the most significant performance improvement (e.g., +245 TFLOPs/s from 512 to 16k).

3. **Hardware Context**:
   - Benchmarked on **H100 80GB SXM5** GPU with head dimension 128.

---

### Notes
- **OOM**: Indicates "out of memory" error for Standard attention at 16k sequence length.
- **Head dim 128**: Refers to the dimensionality of attention heads in the model architecture.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

f60142e2221e7ea9f80b7017

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1