# Technical Document Extraction: Distributed Training System Diagram
## Diagram Overview
The image depicts a distributed training system workflow with three devices (Device 1, Device 2, Device 3) and a central Controller. The system illustrates parameter updates, failure handling, and synchronization mechanisms across training steps.
---
### **Key Components & Flow**
1. **Legend**
- **Location**: Top-left corner
- **Colors**:
- `Dark Green`: Parameters not updated
- `Light Green`: Parameters updated
2. **Device Workflow**
- **Device 1**:
- `Forward` (blue) → `Backward` (green) → `Optimizer Step` (orange) → `Forward` (blue)
- **Device 2**:
- `Forward` (blue) with **Failure** (red lightning bolt) → `Backward` (green)
- **Device 3**:
- `Forward` (blue) → `Backward` (green) → `Optimizer Step` (orange) → `Forward` (blue)
3. **Controller Actions**
- Triggers `Stop+Clean+Reset` (red arrow) when a failure occurs (Device 2).
- Manages `Gradient Synchronization` (orange barrier) between training steps.
4. **Failure Handling**
- `Restoration+Rollback` (red arrow) initiated after Device 2 failure.
5. **Training Steps**
- `iᵗʰ step` (current step) and `(i+1)ᵗʰ step` (next step) demarcated by the `Gradient Synchronization` barrier.
---
### **Spatial Grounding & Color Verification**
- **Legend Colors**:
- Dark Green (`#2E8B57`) matches `Backward` phases and non-updated parameters.
- Light Green (`#90EE90`) matches updated parameters (not explicitly labeled but implied).
- **Device Phases**:
- `Forward`: Blue (`#00BFFF`)
- `Backward`: Dark Green (`#2E8B57`)
- `Optimizer Step`: Orange (`#FFA500`)
- `Failure`: Red (`#FF0000`)
---
### **Critical Observations**
1. **Failure Propagation**:
- Device 2’s failure during `Forward` halts its workflow, requiring `Restoration+Rollback`.
- Other devices proceed independently but synchronize gradients at the `Gradient Synchronization` barrier.
2. **Parameter Updates**:
- Parameters are updated (`Light Green`) only after `Optimizer Step` phases.
- Non-updated parameters (`Dark Green`) persist during `Forward` and `Backward` phases.
3. **Controller Role**:
- Acts as a central coordinator for synchronization and failure recovery.
---
### **Textual Elements**
- **Labels**:
- `Forward`, `Backward`, `Optimizer Step`, `Failure`, `Restoration+Rollback`, `Stop+Clean+Reset`, `Gradient Synchronization`, `Barrier`.
- **Axis Titles**:
- `iᵗʰ step` (current step) and `(i+1)ᵗʰ step` (next step) on the x-axis.
---
### **Conclusion**
The diagram emphasizes fault tolerance and synchronization in distributed training. Device failures trigger rollback mechanisms, while the Controller ensures gradient consistency across steps. Parameter updates occur only after optimizer steps, with color coding clarifying update states.