# Technical Document Extraction

## (a) Different shapes of GEMMs in LLM

### Table Structure
| Operation                | M                | N                | K                |
|--------------------------|------------------|------------------|------------------|
| **Prefill phase**        |                  |                  |                  |
| K, Q, V projection       | SeqLen*B         | HD*3             | HD               |
| O projection             | SeqLen*B         | HD               | HD               |
| FFN1                     | SeqLen*B         | FD               | HD               |
| FFN2                     | SeqLen*B         | HD               | FD               |
| **Decode phase**         |                  |                  |                  |
| K, Q, V projection       | B                | HD*3             | HD               |
| O projection             | B                | HD               | HD               |
| FFN1                     | B                | FD               | HD               |
| FFN2                     | B                | HD               | FD               |

### Key Notes
- **Color Coding**: 
  - Blue: Prefill phase
  - Red: Decode phase
- **Footnotes**:
  - HD: Hidden dimension size
  - FD: Dimension size after first FFN
  - B: Batch size
  - SeqLen: Input sequence length
- **Highlight**: "Only 4 shapes!" (red text)
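The table can be read as a small shape calculator. The sketch below is illustrative only: `gemm_shapes` is a hypothetical helper, and `hd=4096`, `fd=11008`, `batch=1`, `seq_len=1024` are assumed values. The figure defines the symbols but not concrete sizes for panel (a), although these two dimension values happen to reproduce the `[N, K]` pairs listed under (c).

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int]  # (M, N, K)


def gemm_shapes(hd: int, fd: int, batch: int, seq_len: int) -> Dict[str, Dict[str, Shape]]:
    """Return (M, N, K) for each linear layer in the prefill and decode phases."""
    # (N, K) per operation, as listed in the table; identical in both phases.
    nk = {
        "qkv_proj": (3 * hd, hd),
        "o_proj": (hd, hd),
        "ffn1": (fd, hd),
        "ffn2": (hd, fd),
    }
    # Only M differs between the phases.
    m_by_phase = {"prefill": seq_len * batch, "decode": batch}
    return {
        phase: {op: (m, n, k) for op, (n, k) in nk.items()}
        for phase, m in m_by_phase.items()
    }


# Assumed example sizes (not taken from panel (a) itself).
shapes = gemm_shapes(hd=4096, fd=11008, batch=1, seq_len=1024)

# Collect the distinct [N, K] pairs across both phases.
distinct_nk = {(n, k) for ops in shapes.values() for (_, n, k) in ops.values()}
```

Because N and K do not depend on the phase, `distinct_nk` contains exactly four pairs, which is the point of the "Only 4 shapes!" highlight.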

## (b) Decision flow

### Flowchart Description
1. **Start**: "For a certain LLM, traverse four [N, K] selections"
2. **First Decision**:
   - `ImplB > ImplA?`
     - **Yes**: `M++` (increment M)
     - **No**: Proceed to next decision
3. **Second Decision**:
   - `ImplC > ImplB?`
     - **Yes**: `M++` (increment M)
     - **No**: Find `M₂` (final M value)
4. **Termination**: "End"

### Abbreviations
- `ImplA`: FastGEMV
- `ImplB`: Our flat GEMM
- `ImplC`: CUTLASS
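One plausible reading of the flowchart is a threshold search: for a fixed `[N, K]`, M is incremented until the next implementation becomes faster, which yields two crossover values. The sketch below assumes that reading, and it assumes `X > Y` means "X is faster than Y"; the branch labels in the extracted description are ambiguous on this point. `find_inflection_points`, the timer callables, and the toy cost models are all hypothetical; real timers would profile the actual kernels.

```python
from typing import Callable, Tuple

Timer = Callable[[int], float]  # maps M to a measured latency (lower is better)


def find_inflection_points(
    time_a: Timer, time_b: Timer, time_c: Timer, max_m: int = 64
) -> Tuple[int, int]:
    """Return (m1, m2) for one [N, K] shape.

    m1: smallest M at which ImplB is strictly faster than ImplA.
    m2: smallest M >= m1 at which ImplC is strictly faster than ImplB.
    A value of max_m + 1 means no crossover was found within the search range.
    """
    m = 1
    # Increment M until ImplB beats ImplA.
    while m <= max_m and time_b(m) >= time_a(m):
        m += 1
    m1 = m
    # Keep incrementing M until ImplC beats ImplB.
    while m <= max_m and time_c(m) >= time_b(m):
        m += 1
    m2 = m
    return m1, m2


# Toy cost models (assumptions, in arbitrary time units):
def toy_gemv(m: int) -> float:       # ImplA: cost grows quickly with M
    return float(10 * m)

def toy_flat_gemm(m: int) -> float:  # ImplB: higher fixed cost, gentle growth
    return float(15 + m)

def toy_cutlass(m: int) -> float:    # ImplC: flat cost for small M
    return 22.0


m1, m2 = find_inflection_points(toy_gemv, toy_flat_gemm, toy_cutlass)
```

With these toy models the search returns `m1 = 2` and `m2 = 8`. In the flow described above, the same search would be repeated for each of the four `[N, K]` selections.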

## (c) Example of heuristic dataflow with hardware resource adaptation

### Table Structure
| M       | Pattern          | Label                                      |
|---------|------------------|--------------------------------------------|
| M=17    | Striped (blue)   | Using cuBLAS/CUTLASS...                    |
| M=16    | Striped (blue)   | Using cuBLAS/CUTLASS...                    |
| M=9     | Striped (blue)   | Using cuBLAS/CUTLASS...                    |
| M=8     | Striped (blue)   | Using cuBLAS/CUTLASS...                    |
| M=3     | Dotted (red)     | Using our flat GEMM optimization           |
| M=2     | Dotted (red)     | Using our flat GEMM optimization           |
| M=1     | Solid (blue)     | Using GEMV on CUDA Core (e.g., FastGEMV)   |

### Footnotes
- `[N, K] = [12288, 4096]` (M=17)
- `[N, K] = [4096, 4096]` (M=16)
- `[N, K] = [11008, 4096]` (M=9)
- `[N, K] = [4096, 11008]` (M=1)

### Color Legend
- **Blue (striped)**: cuBLAS/CUTLASS usage in (c); blue also marks the prefill phase in (a)
- **Red (dotted)**: Flat GEMM optimization in (c); red also marks the decode phase in (a)
- **Solid Blue**: GEMV on CUDA Core
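At run time, a table like (c) amounts to a lookup from `(M, N, K)` to an implementation. The following is a minimal sketch under stated assumptions: the thresholds `(2, 8)` are hypothetical values chosen only to reproduce the rows listed above (the figure does not show where the switch happens between M=3 and M=8), and `select_impl` and `THRESHOLDS` are illustrative names, not part of any library.

```python
from bisect import bisect_right
from typing import Dict, Tuple

# Implementation names in order of increasing M range.
IMPLS = ("FastGEMV", "flat GEMM", "CUTLASS")

# Hypothetical (m1, m2) per [N, K]; real values would come from offline profiling
# and could differ per shape and per GPU.
THRESHOLDS: Dict[Tuple[int, int], Tuple[int, int]] = {
    (12288, 4096): (2, 8),
    (4096, 4096): (2, 8),
    (11008, 4096): (2, 8),
    (4096, 11008): (2, 8),
}


def select_impl(m: int, n: int, k: int) -> str:
    """Pick an implementation for an (M, N, K) GEMM from the profiled thresholds."""
    m1, m2 = THRESHOLDS[(n, k)]
    # bisect_right counts the thresholds that are <= m:
    # 0 -> FastGEMV (m < m1), 1 -> flat GEMM (m1 <= m < m2), 2 -> CUTLASS (m >= m2).
    return IMPLS[bisect_right((m1, m2), m)]


choices = {m: select_impl(m, 4096, 4096) for m in (1, 2, 3, 8, 9, 16, 17)}
```

Keeping the thresholds in a per-shape table means the dispatch itself stays trivial, while the hardware-specific tuning lives entirely in the profiled numbers.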

## Spatial Grounding & Trend Verification
1. **Table (a)**:
   - All entries follow `[Operation, M, N, K]` format
   - Color coding matches phase labels
   - No numerical trends (categorical data)

2. **Flowchart (b)**:
   - Linear decision tree with two branching points
   - No numerical data, only logical conditions

3. **Table (c)**:
   - M values decrease from 17 to 1
   - Pattern changes from striped → dotted → solid
   - [N, K] takes four distinct values rather than following a numeric trend

## Component Isolation
1. **Header**: 
   - Title: "Different shapes of GEMMs in LLM"
   - Subtitle: "Only 4 shapes!" (highlighted)

2. **Main Chart**:
   - Table (a) with phase-specific operations
   - Flowchart (b) with decision logic

3. **Footer**:
   - Table (c) with hardware adaptation examples
   - Footnotes explaining abbreviations

## Critical Observations
1. **Hardware Optimization**:
   - Each GEMM implementation (FastGEMV, flat GEMM, CUTLASS) covers a specific range of M values
   - Resource adaptation is shown per [N, K] shape: the figure lists four shapes and selects the implementation by M

2. **Phase-Specific Operations**:
   - Prefill phase: M = SeqLen*B
   - Decode phase: M = B; N and K are the same as in prefill, so only four [N, K] shapes occur

3. **Decision Logic**:
   - The M thresholds come from pairwise implementation comparisons (ImplB vs. ImplA, then ImplC vs. ImplB)
   - M is incremented between comparisons until `M₂` is found

## Missing Information
- No explicit numerical trends (all data categorical)
- No explicit axis titles beyond table headers
- No explicit legend placement coordinates

## Language Notes
- All text in English
- No non-English content detected