# Technical Document Extraction: Pipeline Execution Diagram
## 1. Overview
This image is a technical timing diagram illustrating a pipelined execution flow across three distinct processing units or stages. It demonstrates how tasks are staggered over time to maximize hardware utilization. The diagram uses a horizontal "time" axis and vertical labels for the execution units.
## 2. Component Isolation
### Header/Labels (Left Side)
The vertical axis identifies three processing stages:
1. **WGMMA0**: Likely the first warpgroup-level matrix multiply-accumulate (WGMMA) operation.
2. **Softmax**: The stage that computes the softmax of the WGMMA0 output.
3. **WGMMA1**: Likely the second warpgroup-level matrix multiply-accumulate (WGMMA) operation.
### Main Chart Area (Center)
The chart uses colored blocks to represent discrete work units (iterations or batches). Each block contains a numerical index (0, 1, 2, 3, 4, ..., N-3, N-2, N-1).
* **Vertical Dashed Lines**: These mark the boundaries between time-steps and likely correspond to synchronization points between stages.
* **Horizontal Arrow**: Labeled **"time"**, indicating the progression of execution from left to right.
## 3. Data Flow and Pipelining Logic
The diagram illustrates a **three-stage pipeline**. A single task (e.g., Task 0) must pass through WGMMA0, then Softmax, then WGMMA1.
### Trend Verification
* **Staggered Start**: Each subsequent stage starts exactly one time-step after the previous stage begins.
* **Parallelism**: Once the pipeline is full (from the third time-step onward), all three units work simultaneously on different data indices, until the pipeline begins to drain after WGMMA0 finishes index $N-1$.
* **Dependency**: Softmax index $i$ cannot start until WGMMA0 index $i$ is complete. WGMMA1 index $i$ cannot start until Softmax index $i$ is complete.
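The staggering and dependency rules above can be sketched in a few lines of Python. This is a minimal model, assuming every stage takes exactly one time-step as the diagram suggests; the names `start_step`, `dependencies_hold`, and `active_stages` are illustrative and do not come from the diagram.

```python
STAGES = ["WGMMA0", "Softmax", "WGMMA1"]


def start_step(stage: int, index: int) -> int:
    """Return the 1-based time-step in which `stage` processes item `index`.

    Assumption: each stage takes one time-step, so stage s lags stage 0 by s steps.
    """
    return index + stage + 1


def dependencies_hold(n_items: int) -> bool:
    """Check that every stage handles item i strictly after the previous stage did."""
    for i in range(n_items):
        for s in range(1, len(STAGES)):
            if start_step(s, i) <= start_step(s - 1, i):
                return False
    return True


def active_stages(t: int, n_items: int) -> list[str]:
    """List the stages that are busy during 1-based time-step t."""
    busy = []
    for s, name in enumerate(STAGES):
        index = t - 1 - s  # item this stage would be working on at step t
        if 0 <= index < n_items:
            busy.append(name)
    return busy
```

For example, `active_stages(3, 8)` returns all three stage names, which matches the point where the pipeline first becomes full, while `active_stages(1, 8)` returns only `WGMMA0`.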
### Execution Sequence Table
The following table reconstructs the spatial placement of the blocks relative to the time-steps (columns):
| Stage | T1 | T2 | T3 | T4 | T5 | ... | TN | TN+1 | TN+2 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **WGMMA0** | 0 | 1 | 2 | 3 | 4 | ... | N-1 | | |
| **Softmax** | | 0 | 1 | 2 | 3 | ... | N-2 | N-1 | |
| **WGMMA1** | | | 0 | 1 | 2 | ... | N-3 | N-2 | N-1 |
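To make the pattern in the table concrete, the sketch below regenerates the same layout for a small, fixed number of work items. It assumes one time-step per stage; `schedule_table` and `render_markdown` are hypothetical helper names introduced only for this illustration.

```python
def schedule_table(n_items: int, stages=("WGMMA0", "Softmax", "WGMMA1")) -> dict:
    """Map each stage name to its row of cells, one cell per time-step.

    A cell holds the item index as a string, or "" when the stage is idle.
    """
    n_steps = n_items + len(stages) - 1  # ramp-up and drain add len(stages) - 1 steps
    rows = {}
    for s, name in enumerate(stages):
        row = []
        for t in range(1, n_steps + 1):
            index = t - 1 - s  # stage s lags stage 0 by s time-steps
            row.append(str(index) if 0 <= index < n_items else "")
        rows[name] = row
    return rows


def render_markdown(rows: dict) -> str:
    """Render the schedule as a markdown table with columns T1, T2, ..."""
    n_steps = len(next(iter(rows.values())))
    header = "| Stage | " + " | ".join(f"T{t}" for t in range(1, n_steps + 1)) + " |"
    separator = "| :--- |" + " :---: |" * n_steps
    lines = [header, separator]
    for name, row in rows.items():
        lines.append(f"| **{name}** | " + " | ".join(row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_markdown(schedule_table(5)))
```

With five items the table has seven columns: WGMMA0 occupies T1 through T5, Softmax T2 through T6, and WGMMA1 T3 through T7.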
## 4. Color Coding and Indexing
The blocks are color-coded to track specific data indices across the stages:
* **Index 0 (Pink/Red)**: Starts in WGMMA0 at T1, moves to Softmax at T2, moves to WGMMA1 at T3.
* **Index 1 (Orange/Tan)**: Starts in WGMMA0 at T2, moves to Softmax at T3, moves to WGMMA1 at T4.
* **Index 2 (Green)**: Starts in WGMMA0 at T3, moves to Softmax at T4, moves to WGMMA1 at T5.
* **Index 3 (Blue)**: Starts in WGMMA0 at T4, moves to Softmax at T5.
* **Index 4 (Grey)**: Starts in WGMMA0 at T5.
* **Index N-1 (Purple)**: The final index, shown completing the sequence at the end of the timeline.
## 5. Summary of Information
* **Total Stages**: 3
* **Total Tasks**: N (from 0 to N-1)
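* **Concurrency**: High. Between the initial ramp-up and the final drain, all three units are busy in every time-step, assuming each stage takes one time-step as drawn.
* **Latency**: Each individual task takes 3 time units to complete, while the steady-state throughput is 1 task per time unit. Processing all N tasks therefore takes N + 2 time units.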
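The latency and throughput figures above follow from simple arithmetic on the schedule. The sketch below computes them under the assumption that every stage takes one time-step; `pipeline_metrics` is an illustrative helper, not something named in the diagram.

```python
def pipeline_metrics(n_items: int, n_stages: int = 3) -> dict:
    """Summarize an idealized pipeline where every stage takes one time-step."""
    total_steps = n_items + n_stages - 1  # fill + steady state + drain
    return {
        "latency": n_stages,  # time-steps for a single item to pass all stages
        "total_steps": total_steps,
        # Steps in which every stage is busy at once (zero if too few items).
        "full_overlap_steps": max(0, n_items - n_stages + 1),
        # Fraction of the timeline during which any one stage is busy.
        "utilization": n_items / total_steps,
    }
```

Per-unit utilization over the whole timeline is N / (N + 2), so it approaches 1 as N grows; the idle time comes only from the two ramp-up and two drain steps at the edges.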