Image 5d6ed2db8d21...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
# Technical Document Extraction: GRPO System Architectures

## Diagram Overview
The image contains two comparative diagrams of Group Relative Policy Optimization (GRPO) system architectures:
1. **Vanilla GRPO** (Top diagram)
2. **Training-free GRPO** (Bottom diagram)

---

## (a) Vanilla GRPO Architecture

### Component Breakdown
1. **Policy Model**
   - Input: `q` (query/input)
   - Output: `O₁, O₂, ..., O_G` (group outputs)
   - Visual: Brain icon with neural network lines

2. **Reference Model**
   - Input: `O₁, O₂, ..., O_G`
   - Output: `r₁, r₂, ..., r_G` (reference rewards)
   - Visual: Brain icon with snowflake

3. **Reward Model**
   - Input: `O₁, O₂, ..., O_G`
   - Output: `r₁, r₂, ..., r_G` (reward signals)
   - Visual: Robot icon with checklist

4. **Group Computation**
   - Input: `r₁, r₂, ..., r_G`
   - Output: `A₁, A₂, ..., A_G` (action outputs)
   - Visual: Gray rectangle

### Flow Diagram
```
q → [Policy Model] → [Reference Model] → [Reward Model] → [Group Computation] → A₁...A_G
```

### Key Annotations
- Orange "Update" arrow connecting Policy Model to Reference/Reward Models
- Fire icon on Policy Model (possibly indicating optimization process)
- No explicit numerical data present

---

## (b) Training-free GRPO Architecture

### Component Breakdown
1. **Experiences**
   - Input: External data source
   - Output: Feeds into Policy Model
   - Visual: Filing cabinet icon

2. **Controller**
   - Input: Policy Model outputs
   - Actions: ADD, DELETE, KEEP, MODIFY
   - Visual: Robot icon with control buttons

3. **Reward Model**
   - Input: `O₁, O₂, ..., O_G`
   - Output: `r₁, r₂, ..., r_G` (with X/✓ annotations)
   - Visual: Robot icon with checklist

4. **LLM (Large Language Model)**
   - Input: `s₁, s₂, ..., s_G` (summarized experiences)
   - Output: `A_text₁, A_text₂, ..., A_text_G` (text actions)
   - Functions: Summarize, Extract Experience
   - Visual: Brain icon with neural network

### Flow Diagram
```
q → [Policy Model] → [Controller] → [Reward Model] → [LLM] → A_text₁...A_text_G
```

### Key Annotations
- Orange "Update" arrow connecting Experiences to Policy Model
- Fire icon on Experiences component
- X marks on `r₁` and `r_G` (rejected rewards)
- ✓ mark on `r₂` (accepted reward)
- LLM processes summarized experiences (`s₁...s_G`) before generating text actions

---

## Comparative Analysis

| Feature                | Vanilla GRPO                | Training-free GRPO          |
|------------------------|-----------------------------|-----------------------------|
| **Learning Mechanism** | Supervised reference model  | Experience-based controller |
| **Output Type**        | Action vectors (A₁...A_G)   | Text actions (A_text₁...A_text_G) |
| **Experience Handling**| No                          | Explicit experience module  |
| **Reward Processing**  | Direct computation          | Filtered with X/✓ annotations |
| **LLM Integration**    | No                          | Central component           |

---

## Spatial Grounding & Trend Verification
- **Legend Confirmation**: No explicit legend present; component icons serve as visual identifiers
- **Flow Direction**: All diagrams follow left-to-right processing flow
- **Data Flow Validation**: 
  - Vanilla: Numerical rewards → Group computation → Action vectors
  - Training-free: Textual experiences → Controller → Filtered rewards → LLM → Text actions

---

## Missing Elements
- No numerical data points or statistical values present
- No explicit time axis or performance metrics
- No comparative scale between diagrams

---

## Conclusion
The diagrams illustrate two GRPO variants with distinct processing pipelines. The training-free version introduces experience management and LLM integration, while the vanilla version follows a more traditional reference/reward model architecture. Both systems maintain group-based processing (`O_G`, `r_G`, `A_G`) but differ fundamentally in their learning mechanisms and output modalities.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

5d6ed2db8d217e673d88f667

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1