Image 55f47a106c25...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
# Technical Document: Attention Forward Speed Analysis

## Chart Title
Attention forward speed, head dim 64 (H100 80GB SXM5)

## Axes
- **X-axis**: Sequence length  
  Categories: `512`, `1k`, `2k`, `4k`, `8k`, `16k`  
- **Y-axis**: Speed (TFLOPs/s)  
  Range: 0–600 (discrete increments)

## Legend
| Color       | Method               |
|-------------|----------------------|
| Blue        | Standard attention   |
| Orange      | FlashAttention-2     |
| Green       | Triton               |
| Red         | cuDNN                |
| Purple      | FlashAttention-3     |

## Data Points
### Sequence Length: 512
- Standard attention: 52 TFLOPs/s  
- FlashAttention-2: 282 TFLOPs/s  
- Triton: 340 TFLOPs/s  
- cuDNN: 335 TFLOPs/s  
- FlashAttention-3: 333 TFLOPs/s  

### Sequence Length: 1k
- Standard attention: 63 TFLOPs/s  
- FlashAttention-2: 306 TFLOPs/s  
- Triton: 382 TFLOPs/s  
- cuDNN: 373 TFLOPs/s  
- FlashAttention-3: 392 TFLOPs/s  

### Sequence Length: 2k
- Standard attention: 67 TFLOPs/s  
- FlashAttention-2: 318 TFLOPs/s  
- Triton: 396 TFLOPs/s  
- cuDNN: 395 TFLOPs/s  
- FlashAttention-3: 460 TFLOPs/s  

### Sequence Length: 4k
- Standard attention: 72 TFLOPs/s  
- FlashAttention-2: 321 TFLOPs/s  
- Triton: 400 TFLOPs/s  
- cuDNN: 408 TFLOPs/s  
- FlashAttention-3: 476 TFLOPs/s  

### Sequence Length: 8k
- Standard attention: 73 TFLOPs/s  
- FlashAttention-2: 322 TFLOPs/s  
- Triton: 401 TFLOPs/s  
- cuDNN: 412 TFLOPs/s  
- FlashAttention-3: 496 TFLOPs/s  

### Sequence Length: 16k
- Standard attention: **OOM** (Out of Memory)  
- FlashAttention-2: 324 TFLOPs/s  
- Triton: 403 TFLOPs/s  
- cuDNN: 413 TFLOPs/s  
- FlashAttention-3: 497 TFLOPs/s  

## Key Trends
1. **Performance Scaling**:  
   - All methods show increased speed with longer sequence lengths, except Standard attention (OOM at 16k).  
   - FlashAttention-3 consistently outperforms other methods across all sequence lengths.  

2. **Standard Attention Limitations**:  
   - Significantly lower performance than other methods.  
   - Fails at 16k sequence length (OOM).  

3. **Relative Efficiency**:  
   - Triton and cuDNN exhibit similar performance, trailing FlashAttention-3 but outperforming FlashAttention-2.  
   - FlashAttention-2 lags behind Triton and cuDNN but remains viable for shorter sequences.  

## Notes
- **Hardware**: H100 80GB SXM5 GPU  
- **Head Dimension**: 64  
- **OOM**: Indicates out-of-memory errors for Standard attention at 16k sequence length.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

55f47a106c25f22983b2741a

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1