# Technical Document Extraction: Reinforcement Learning Training Pipeline
This image shows the architecture of a machine learning training pipeline that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) in a single hybrid objective.
The diagram is segmented into three primary functional regions: **Explorer**, **Buffer**, and **Trainer**.
---
## 1. Component Isolation
### Region A: Explorer (Left, Yellow Background)
This region represents the data generation phase.
* **Icon:** A robot head.
* **Components:**
    * **Rollout engine:** A grey rounded rectangular block that executes tasks to generate data.
* **Internal Flow:**
    * Receives a **Task** (represented by a clipboard icon) from the **Taskset** in the Buffer region.
    * The process of selecting a task is labeled **Sampling**.
    * Outputs **Experience** (represented by a document icon) to the Buffer region.
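
The Explorer flow above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the diagram: the `Task` and `Experience` fields, the `explore` function, and the `rollout` callback signature are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str  # hypothetical field; the diagram only shows a clipboard icon


@dataclass(frozen=True)
class Experience:
    task: Task
    response: str
    reward: float  # assumed: the rollout also yields a scalar reward


def explore(
    taskset: List[Task],
    rollout: Callable[[Task], Tuple[str, float]],
    rng: random.Random,
) -> Experience:
    """Sample one task, run the rollout engine on it, return the experience."""
    task = rng.choice(taskset)          # the "Sampling" arrow from Taskset
    response, reward = rollout(task)    # the "Rollout engine" block
    return Experience(task=task, response=response, reward=reward)
```

Passing the rollout engine in as a callable keeps the sketch independent of any particular inference backend.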
### Region B: Buffer (Center, Blue Background)
This region acts as the data storage and retrieval layer.
* **Icon:** A database/server stack.
* **Components:**
    * **Usual Experiences:** A purple cylinder (database icon).
    * **Expert Experiences:** A blue cylinder (database icon).
    * **Taskset:** A white cylinder (database icon) located at the bottom.
* **Internal Flow:**
    * Receives "Experience" from the Explorer into the **Usual Experiences** storage.
    * Data is retrieved from both "Usual Experiences" and "Expert Experiences" via a process labeled **Sampling**.
    * The sampled data is represented by two document icons (one purple, one blue) being sent to the Trainer.
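
As a rough illustration of the Buffer region, the three stores can be modeled as plain lists with one method per arrow in the diagram. The class name, method names, and batch-size parameters below are assumptions made for this sketch; the diagram does not specify a storage backend or a sampling strategy, so uniform sampling without replacement is used here.

```python
import random
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Buffer:
    taskset: List[Any] = field(default_factory=list)
    usual_experiences: List[Any] = field(default_factory=list)
    expert_experiences: List[Any] = field(default_factory=list)

    def sample_task(self, rng: random.Random) -> Any:
        """Taskset -> Explorer ("Sampling")."""
        return rng.choice(self.taskset)

    def add_experience(self, experience: Any) -> None:
        """Explorer -> Usual Experiences ("Experience")."""
        self.usual_experiences.append(experience)

    def sample_batch(
        self, n_usual: int, n_expert: int, rng: random.Random
    ) -> Tuple[List[Any], List[Any]]:
        """Both experience stores -> Trainer ("Sampling")."""
        # Clamp the request so a small store never raises ValueError.
        usual = rng.sample(
            self.usual_experiences, min(n_usual, len(self.usual_experiences))
        )
        expert = rng.sample(
            self.expert_experiences, min(n_expert, len(self.expert_experiences))
        )
        return usual, expert
```

Returning the two batches separately mirrors the two document icons that leave the Buffer, so the Trainer can route each one to its own loss.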
### Region C: Trainer (Right, Green Background)
This region represents the model optimization and update phase.
* **Icon:** A human head with gears inside.
* **Components:**
    * **SFT Loss:** A purple rounded rectangular block.
    * **GRPO Loss:** A blue rounded rectangular block.
    * **Summation Operator (+):** A blue circle with a plus sign.
    * **MIX Loss:** A grey rounded rectangular block.
    * **Update model:** A grey rounded rectangular block (final output).
* **Internal Flow:**
    * The purple document icon feeds into the **SFT Loss** calculation (the diagram does not label whether this icon carries Usual or Expert data).
    * The blue document icon feeds into the **GRPO Loss** calculation.
    * Both loss values are combined at the **(+)** operator.
    * The combined result flows into **MIX Loss**.
    * The **MIX Loss** then drives the final **Update model** step.
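
The **(+)** operator can be read as a sum of the two scalar losses. The sketch below assumes exactly that; the `sft_weight` parameter is a hypothetical addition (the diagram shows no weighting), and its default of `1.0` reduces the function to the plain sum drawn in the figure.

```python
def mix_loss(sft_loss: float, grpo_loss: float, sft_weight: float = 1.0) -> float:
    """Combine the two losses into a single MIX Loss value.

    sft_weight is an assumption for illustration; with the default of 1.0
    the result is the unweighted sum shown by the (+) operator.
    """
    return grpo_loss + sft_weight * sft_loss


# Example: both losses contribute equally to the value used for the update.
combined = mix_loss(sft_loss=0.5, grpo_loss=0.25)
```

In a real trainer the inputs would be framework tensors rather than floats, so that gradients flow through both terms during the **Update model** step.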
---
## 2. Data Flow and Logic Summary
The three regions form a pipeline that runs in the following order:
1. **Task Selection:** The **Explorer** samples a **Task** from the **Taskset** in the **Buffer**.
2. **Experience Generation:** The **Rollout engine** processes the task to generate new **Experience**, which is stored in the **Usual Experiences** buffer.
3. **Data Sampling:** The **Trainer** samples data from both the **Usual Experiences** and **Expert Experiences** buffers.
4. **Loss Calculation:**
    * **SFT Loss** (Supervised Fine-Tuning) is calculated, most likely on the Expert Experiences data, since SFT trains on demonstrations.
    * **GRPO Loss** (Group Relative Policy Optimization) is calculated from rollout rewards; GRPO scores each rollout relative to the other rollouts sampled for the same task.
5. **Optimization:** The two losses are aggregated into a **MIX Loss**, which is then used to **Update model**.
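
To make the "group relative" part of step 4 concrete, the sketch below computes GRPO-style advantages for one group of rollouts of the same task: each reward is centered on the group mean and scaled by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this example; the diagram itself gives no formula.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one, which is why this scheme needs several rollouts per task but no separate value model.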
---
## 3. Textual Transcription
| Category | Transcribed Text |
| :--- | :--- |
| **Headers** | Explorer, Buffer, Trainer |
| **Process Labels** | Sampling (appears twice) |
| **Data Objects** | Experience, Task, Usual Experiences, Expert Experiences, Taskset |
| **Functional Blocks** | Rollout engine, SFT Loss, GRPO Loss, MIX Loss, Update model |
| **Symbols** | + (Addition/Aggregation) |
**Language Declaration:** All text in this image is in **English**.