# Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
**Authors**: Alibaba Group
Abstract
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
Contact: {chenyanxi.cyx,panxuchen.pxc,daoyuanchen.cdy,yaliang.li,bolin.ding}@alibaba-inc.com
GitHub: https://github.com/modelscope/Trinity-RFT
Documentation: https://modelscope.github.io/Trinity-RFT
Note: Trinity-RFT is currently under active development. This technical report corresponds to commit id 63d4920 (July 14, 2025) of the GitHub repository, and will be continuously updated as the codebase evolves. Comments, suggestions and contributions are welcome!
1 Introduction
Reinforcement learning (RL) has achieved remarkable success in the development of large language models (LLMs). Examples include aligning LLMs with human preferences via reinforcement learning from human feedback (RLHF) [24], and training long-CoT reasoning models via RL with rule-based or verifiable rewards (RLVR) [5, 35]. However, such approaches remain limited in their ability to handle dynamic, agentic, and continuous learning in the real world.
We envision a future where AI agents learn by interacting directly with environments, collecting lagged or complex reward signals, and continuously refining their behavior through RL based on the collected experiences [32]. For example, imagine an AI scientist who designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards and feedback when the experiment is finally finished.
This vision motivates us to develop Trinity-RFT, a reinforcement fine-tuning (RFT) framework that aims to offer a path into this future. The modular, decoupled and trinity design of Trinity-RFT illustrated in Figure 1, along with its various features, makes it a promising solution for realizing such a vision.
<details>
<summary>x2.png Details</summary>

Architecture diagram showing the high-level design of Trinity-RFT: the Explorer, Trainer, Buffer, and RFT-core built on top of the LLM infrastructure (training, inference, model synchronization). Arrows indicate agent-environment interaction, rollout experiences and training data flowing through the Buffer, data processing pipelines (clean/filter/prioritize/synthesize), additional feedback from environments and humans, and model weight synchronization between the Trainer and the LLM infrastructure.
</details>
Figure 1: The high-level design of Trinity-RFT.
1.1 Key Features of Trinity-RFT
Trinity-RFT is a general-purpose, unified, scalable and user-friendly RL framework that can be easily adapted for diverse experimental or real-world scenarios. It integrates both macroscopic and microscopic RL methodologies in one place; roughly speaking, the former deals with natural language and plain text, while the latter handles torch.Tensor objects (such as token probabilities, gradients, and model weights of LLMs). Many prior RL works for games/control/LLMs focus mainly on the microscopic aspect, e.g., designing policy loss functions or optimization techniques for updating the policy model. On the other hand, pre-trained LLMs, as generative models with rich prior knowledge of natural language and the world, open up numerous opportunities at the macroscopic level, e.g., experience synthesis by reflection or reasoning with environmental feedback [4], and leveraging existing text processing methods like deduplication and quality filtering [2], among others. The key features of Trinity-RFT are presented below, and will be further elaborated in Section 2 and exemplified in Section 3.
An RFT-core that unifies and generalizes diverse RL modes.
Trinity-RFT implements diverse RL methodologies in a unified manner, supporting synchronous/asynchronous, on-policy/off-policy, and online/offline training. These RL modes can be flexibly generalized, e.g., a hybrid mode that incorporates expert trajectories to accelerate an online RL process [21, 46]. This unification is made possible partly by our decoupled design (which will soon be introduced in Section 2.1) that allows rollout and training to be executed separately and scaled up independently on different devices, while having access to the same stand-alone experience buffer. The efficacy of various RL modes has been validated empirically by our experiments in Section 3.3, which particularly highlight the efficiency gains by off-policy or asynchronous methods.
Agent-environment interaction as a first-class citizen.
Trinity-RFT allows delayed rewards and environmental feedback in multi-step or time-lagged feedback loops, handles long-tailed latencies and the straggler effect via asynchronous and streaming LLM inference, and deals with environment/agent failures gracefully via dedicated timeout/retry/skip mechanisms. These together ensure efficiency and robustness of continuous agent-environment interaction in complex real-world scenarios.
Systematic data pipelines optimized for RFT.
Figure 2 illustrates the high-level design of data pipelines in Trinity-RFT, which regard rollout tasks and experiences as dynamic assets to be actively managed throughout the RFT lifecycle. Trinity-RFT empowers users to: (1) curate tasks for curriculum learning, e.g., by prioritizing easier tasks at the beginning of training to stabilize and accelerate the learning process; (2) actively manipulate experience by cleaning, filtering, or synthesizing new experiences, such as repairing failed trajectories or amplifying successful ones; (3) perform online reward shaping by augmenting sparse environmental rewards with dense, computed signals, such as quality or diversity scores; (4) customize interfaces for human-in-the-loop curation and utilize an agentic paradigm for RFT data processing that translates high-level natural language commands (e.g., “improve response diversity and safety for coding scenarios”) into complex data pipelines, powered by established community tools like Data-Juicer [2]. For instance, Section 3.4 presents experiments that demonstrate the efficacy of task prioritization and reward shaping empowered by Trinity-RFT.
<details>
<summary>x3.png Details</summary>

Flowchart of the data pipelines: a raw dataset is converted (with synthesis and prioritization) into a task set; tasks are sampled for rollout by the Explorer; rollout experiences are stored in the Buffer; experiences are shaped, cleaned, or synthesized and then sampled in batches for RL training by the Trainer, with a local experience replay loop from the Trainer back to the Buffer.
</details>
Figure 2: The high-level design of data pipelines in Trinity-RFT.
User-friendliness as a top priority.
For development and research, the modular and decoupled design of Trinity-RFT allows the user to develop new RFT methodologies by adding one or a few small, plug-and-play classes (modified from built-in templates) that implement the essential functionalities of interest, with minimal code duplication or intrusive changes to the codebase. An example can be found in Section 3.2, which shows that three compact Python classes (with around 200 lines of code in total) suffice for implementing a hybrid RL process that leverages samples from multiple data sources and updates the policy model with a customized loss function. For applications, the user can adapt Trinity-RFT to a new scenario by simply implementing a single Workflow class that specifies the logic of agent-environment interaction, as will be exemplified in Section 3.1. To further enhance usability, Trinity-RFT incorporates various graphical user interfaces to support low-code usage and development, enhance transparency of the RFT process, and facilitate easy monitoring and tracking.
1.2 Related Works
There exist numerous open-source RLHF frameworks, such as veRL [30], OpenRLHF [13], TRL [40], ChatLearn [1], Asynchronous RLHF [23], among others. Some of them have been further adapted for training long-CoT reasoning models or for agentic RL more recently.
Concurrent to Trinity-RFT, some recent works on LLM reinforcement learning also advocate a decoupled and/or asynchronous design; examples include StreamRL [50], MiMo [44], AReaL [9], ROLL [37], LlamaRL [43], Magistral [18], AsyncFlow [12], among others.
Complementary to these numerous related works, Trinity-RFT provides the community with a new solution that is powerful, easy to use, and unique in several aspects. In a nutshell, Trinity-RFT aims to be general-purpose and applicable to diverse application scenarios, while unifying various RFT modes, RFT methodologies at macroscopic and microscopic levels, and RFT-core/agent-environment interaction/data pipelines. Such a system-engineering perspective makes Trinity-RFT particularly useful for handling the whole RFT pipeline in one place. We also hope that some specific features of Trinity-RFT, such as data persistence in the experience buffer, and distributed deployment of multiple independent explorers, will open up new opportunities for LLM reinforcement fine-tuning.
2 Design and Implementations
The overall design of Trinity-RFT exhibits a trinity consisting of (1) RFT-core, (2) agent-environment interaction, and (3) data pipelines, which are illustrated in Figure 1 and elaborated in this section.
2.1 RFT-Core
RFT-core, highlighted at the center of Figure 1, is the component of Trinity-RFT where the core RFT process takes place. Its design also exhibits a trinity, consisting of the explorer, buffer, and trainer.
- The explorer, powered by a rollout model, takes a task as input and solves it by executing a workflow that specifies the logic of agent-environment interaction, thereby collecting experiences (including rollout trajectories, rewards, and other useful information) to be stored in the buffer.
- The buffer stores experiences that can be generated by the explorer or by other sources, such as human experts. It can be realized in various forms, such as a non-persistent ray.Queue or a persistent SQLite database. It also assists with fetching training samples for the trainer, and can be integrated with advanced sampling strategies and post-processing operations.
- The trainer, backed by a policy model, samples batches of experiences from the buffer and updates the policy model via RL algorithms.
Our implementations allow the explorer and trainer to be deployed on separate machines and act independently. They are only connected via (1) access to the same experience buffer with a customizable sampling strategy, and (2) model weight synchronization by a customizable schedule. See Figure 3 for an illustration. This decoupled design of RFT-core offers support for diverse RFT modes with great flexibility.
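To make this decoupled execution concrete, the following minimal sketch shows how an explorer loop and a trainer loop could interact solely through a shared buffer and a periodic weight-synchronization schedule. It is illustrative only; `ExperienceBuffer`, `explore_step`, `train_step`, `load_latest_weights`, and `publish_weights` are hypothetical names, not the actual Trinity-RFT API.

```python
import queue

class ExperienceBuffer:
    """Stand-in for Trinity-RFT's buffer (e.g., a ray.Queue or a SQLite-backed store)."""
    def __init__(self):
        self._q = queue.Queue()

    def put_batch(self, experiences):
        self._q.put(experiences)

    def get_batch(self):
        return self._q.get()  # blocks until a batch of experiences is available


def run_explorer(explorer, buffer, num_steps, sync_interval):
    for step in range(num_steps):
        experiences = explorer.explore_step()   # rollout via agent-environment interaction
        buffer.put_batch(experiences)           # write experiences to the shared buffer
        if (step + 1) % sync_interval == 0:
            explorer.load_latest_weights()      # pull newest policy weights (NCCL or checkpoint)


def run_trainer(trainer, buffer, num_steps, sync_interval):
    for step in range(num_steps):
        batch = buffer.get_batch()              # fetch a batch of experiences from the buffer
        trainer.train_step(batch)               # update the policy model with an RL algorithm
        if (step + 1) % sync_interval == 0:
            trainer.publish_weights()           # expose weights for the explorer to sync
```

Because the two loops share nothing except the buffer and the synchronization schedule, they can be placed on different machines and scaled independently, which is the basis for the modes described next.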
<details>
<summary>x4.png Details</summary>

Architecture of RFT-core: a taskset is prepared by a task data processor and fed to workflow runners, where an agent interacts with an environment using the rollout model and a reward model; raw experiences are processed by an experience data processor into verified experiences stored in the buffer; the trainer (actor, critic, and reference models) consumes experiences from the buffer and synchronizes model weights with the workflow runners.
</details>
Figure 3: The architecture of RFT-core in Trinity-RFT.
2.1.1 Unified Support for Diverse RFT Modes
We present the RFT modes supported by Trinity-RFT, some of which are demonstrated in Figure 4.
<details>
<summary>x5.png Details</summary>

Timelines of four RFT modes, each showing rollout steps (explorer), training steps (trainer), and weight synchronization: (a) synchronous mode, where the trainer waits for experiences and weights are synchronized via NCCL; (b) one-step off-policy mode, with a one-batch offset between explorer and trainer and NCCL-based synchronization; (c) fully asynchronous mode, with weight synchronization via checkpoints; (d) multi-explorer asynchronous mode, where multiple explorers feed the same buffer and synchronize via checkpoints.
</details>
Figure 4: A visualization of diverse RFT modes supported by Trinity-RFT, including: (a) synchronous mode, with sync_interval=2; (b) one-step off-policy mode, with sync_interval=1 and sync_offset=1; (c) fully asynchronous mode, with sync_interval=2; (d) multi-explorer asynchronous mode, with sync_interval=2. The buffer supports, in principle, arbitrary management and sampling strategies for experiences.
Synchronous mode.
In the synchronous mode shown in Figure 4 (a), the explorer and trainer get launched simultaneously, work in close coordination, and synchronize their model weights once every sync_interval training steps. Within each synchronization period, the explorer continuously generates sync_interval batches of rollout experiences and stores them in the buffer, which are then retrieved and utilized by the trainer for updating the policy model. If sync_interval=1, this is a strictly on-policy RL process, whereas if sync_interval>1, it becomes off-policy (akin to the mode adopted in [35]) and can be accelerated by pipeline parallelism between the explorer and trainer. This mode can be activated by setting the configuration parameter mode=both.
One-step off-policy mode.
This mode, demonstrated in Figure 4 (b), closely resembles the synchronous mode, except for an offset of one batch between the explorer and trainer. This allows the trainer to sample experiences from the buffer immediately after model weight synchronization, thereby streamlining the execution of explorer and trainer with smaller pipeline bubbles, at the cost of slight off-policyness. The visualization in Figure 4 (b) corresponds to configuration parameters sync_interval=1 and sync_offset=1, both of which can take more general values in Trinity-RFT.
Asynchronous mode.
In the fully asynchronous mode shown in Figure 4 (c), the explorer and trainer act almost independently. The explorer continuously generates rollout experiences and stores them in the buffer, while the trainer continuously samples experiences from the buffer and uses them for training the policy model. External experiences, e.g., those generated by expert models or humans, can be continuously incorporated into the buffer as well. The explorer and trainer independently load or save model weights from the checkpoint directory every sync_interval steps, keeping the distribution of rollout experiences up to date. This mode can be activated by setting mode=explore for the explorer and mode=train for the trainer, and launching the two separately on different GPUs.
Multi-explorer asynchronous mode.
One benefit brought by the decoupled design is that explorers and trainers can scale up independently on separate devices. As a proof-of-concept, Trinity-RFT offers support for a multi-explorer asynchronous mode, as demonstrated in Figure 4 (d), where multiple explorers send the generated rollout experiences to the same buffer. Scaling up the number of independent and distributed explorers can be particularly useful for resolving data scarcity and speeding up the generation of experiences in real-world scenarios where rollout trajectories have to be sampled via interaction with the physical world, or in an environment with sparse and lagged feedback. Another by-product of this multi-explorer mode is uninterrupted, around-the-clock service for real-world online serving scenarios: since the explorers can pause and update model weights at different moments, it can be guaranteed that there is always one explorer ready to serve an incoming request immediately whenever it arrives. This is in contrast to a single-explorer mode, where online service has to be paused when the explorer is updating its model weights.
Benchmark mode.
Trinity-RFT supports a benchmark mode that allows the user to evaluate one or multiple checkpoints on arbitrary benchmarks, after the RFT training process has finished. To activate this mode, the user simply needs to set mode=bench and specify the paths for the evaluation datasets in the configurations. This mode can be particularly useful for experimental purposes; for example, the user might want to try out different RFT techniques or configurations quickly (with limited evaluation on hold-out data) during training, identify which RFT trials have achieved stable convergence and high rewards, and then conduct more thorough evaluations only for the checkpoints of these successful trials.
Train-only mode.
In certain scenarios, the user would like to train the policy model without further exploration, using experiences that have already been collected and stored in the buffer. This train-only mode can be activated by setting the configuration parameter mode=train and launching the trainer alone. Offline methods like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) [25] can be regarded as special cases of such scenarios, both of which are natively supported by Trinity-RFT. For another example, consider an online RFT process that extends over a long period, where the explorer alone is launched during the daytime for serving human users and collecting experiences, while the trainer alone is launched at night for updating the policy model (which will be thoroughly validated and evaluated before it can be actually deployed as the rollout model for the next day).
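For quick reference, the modes described above can be summarized as follows. This is an illustrative Python mapping, not Trinity-RFT's actual configuration schema; only mode, sync_interval, and sync_offset are parameter names taken from the text.

```python
# Illustrative summary of the RFT modes described above (cf. Figure 4).
# Only `mode`, `sync_interval`, and `sync_offset` are parameter names from the text;
# the dictionary layout itself is hypothetical.
rft_modes = {
    "synchronous":          {"mode": "both", "sync_interval": 2},                       # Figure 4(a)
    "one_step_off_policy":  {"mode": "both", "sync_interval": 1, "sync_offset": 1},     # Figure 4(b)
    "fully_asynchronous":   {"mode": "explore", "sync_interval": 2},  # Figure 4(c); trainer launched separately with mode="train"
    "multi_explorer_async": {"mode": "explore", "sync_interval": 2},  # Figure 4(d); several explorers share one buffer
    "benchmark":            {"mode": "bench"},   # evaluate saved checkpoints after training
    "train_only":           {"mode": "train"},   # e.g., SFT or DPO on pre-collected experiences
}
```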
Discussions.
We conclude this subsection with two remarks. (1) Given the unified implementation of various RFT modes, it is easy to design and implement a hybrid mode with Trinity-RFT that combines multiple modes into a single learning process. One example is learning with both online rollout data and offline-collected expert data, via jointly optimizing two loss terms corresponding to these two data sources. Section 3.2 illustrates how to implement this conveniently in Trinity-RFT. (2) We take a system-algorithm co-design perspective in the development of Trinity-RFT, aiming to unify and generalize diverse RFT methodologies in this framework. RFT-core provides the necessary infrastructure for achieving this goal. This technical report focuses on the system perspective, and we refer interested readers to the literature for recent algorithmic developments in off-policy / asynchronous RL for LLMs [21, 6, 26, 35, 7, 23, 42, 45, 46, 47].
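As a concrete illustration of remark (1), a hybrid objective could combine a policy-gradient loss on online rollout experiences with a supervised (SFT) loss on expert data. The sketch below assumes pre-computed log-probabilities and advantages and a mixing weight mu; it is a simplified illustration, not the loss implementation shipped with Trinity-RFT.

```python
import torch

def hybrid_loss(rollout_logprobs, advantages, expert_logprobs, mu=0.1):
    """Illustrative hybrid objective: a REINFORCE-style loss on rollout data plus a
    negative log-likelihood (SFT) loss on expert data, weighted by mu."""
    rl_loss = -(rollout_logprobs * advantages).mean()   # policy-gradient term on online data
    sft_loss = -expert_logprobs.mean()                  # imitation term on expert tokens
    return rl_loss + mu * sft_loss
```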
2.1.2 Implementations of RFT-Core
We present some implementation details of RFT-core in the following.
Inference and training engines.
The current version of Trinity-RFT leverages vLLM [15] as the inference engine for the explorer, which offers features including paged attention, continuous batching [49], asynchronous and concurrent inference for multiple rollout trajectories, among others. Trinity-RFT also leverages verl [30] as the training engine for the trainer, which gracefully handles model placement (for the policy, critic and reference models) and incorporates various performance optimizations for training (such as dynamic batching, management of padding and unpadding, etc.). Trinity-RFT stands on the shoulders of these excellent open-source projects, and will continue to benefit from their future development.
Experience buffer.
Trinity-RFT supports multiple types of experience buffers, ranging from a non-persistent ray.Queue to persistent SQLite or Redis databases. While using a basic first-in-first-out queue is the most straightforward approach, data persistence with a database opens up many new opportunities (e.g., advanced sampling strategies), as discussed throughout this report. Trinity-RFT provides dedicated read/write control to prevent conflicts when accessing the buffer.
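A minimal sketch of a persistent buffer is given below, assuming a SQLite backend and a ready flag for experiences whose rewards have not yet arrived (see Section 2.2). The schema and function names are illustrative, not Trinity-RFT's actual implementation.

```python
import json
import sqlite3

def init_buffer(path="buffer.db"):
    """Create a persistent experience buffer with a `ready` flag for lagged rewards."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS experiences (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               trajectory TEXT,          -- serialized rollout trajectory
               reward REAL,              -- may arrive later than the trajectory
               ready INTEGER DEFAULT 0   -- 0 = not ready for training, 1 = ready
           )"""
    )
    conn.commit()
    return conn

def write_trajectory(conn, trajectory):
    """Explorer side: store a trajectory immediately, before its reward is known."""
    cur = conn.execute(
        "INSERT INTO experiences (trajectory, ready) VALUES (?, 0)",
        (json.dumps(trajectory),),
    )
    conn.commit()
    return cur.lastrowid  # used to attach the reward once it arrives

def attach_reward(conn, exp_id, reward):
    """Called when the (possibly lagged) reward signal finally arrives."""
    conn.execute("UPDATE experiences SET reward = ?, ready = 1 WHERE id = ?", (reward, exp_id))
    conn.commit()

def sample_ready_batch(conn, batch_size):
    """Trainer side: fetch only experiences that are marked ready for training."""
    rows = conn.execute(
        "SELECT trajectory, reward FROM experiences WHERE ready = 1 LIMIT ?", (batch_size,)
    ).fetchall()
    return [(json.loads(t), r) for t, r in rows]
```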
Model weight synchronization.
Trinity-RFT supports model weight synchronization between the explorer and trainer by NCCL [22], or by checkpoint saving and loading. The former is faster (when available), while the latter is generally more flexible and widely applicable, especially for asynchronous RFT modes.
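The checkpoint-based variant can be sketched roughly as follows, assuming standard PyTorch models on both sides; the directory layout and function names are illustrative rather than Trinity-RFT's actual implementation.

```python
import os
import torch

def save_checkpoint(model, ckpt_dir, step):
    """Trainer side: persist the latest policy weights, e.g., every sync_interval steps."""
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(ckpt_dir, f"step_{step}.pt"))

def load_latest_checkpoint(model, ckpt_dir):
    """Explorer side: refresh the rollout model from the most recent checkpoint, if any."""
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("step_") and f.endswith(".pt")]
    if not ckpts:
        return
    latest = max(ckpts, key=lambda f: int(f[len("step_"):-len(".pt")]))
    state = torch.load(os.path.join(ckpt_dir, latest), map_location="cpu")
    model.load_state_dict(state)
```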
2.2 Agent-Environment Interaction
To adapt Trinity-RFT to a new downstream scenario, the user mainly needs to define and register a customized workflow (by inheriting the base class Workflow or MultiTurnWorkflow) where the logic of agent-environment interaction for this particular scenario is implemented. Advanced methods for experience synthesis with environmental feedback [4] can be implemented in the same way as well. See Section 3.1 for detailed examples. The workflow will then be executed by workflow runners within the explorer for generating experiences, as shown in Figure 3.
Numerous challenges arise when one tries to build an RFT framework that can efficiently and robustly handle real-world interaction between the LLM-powered agent and the environment. These include long-tailed latencies, agent/environment failures, and lagged reward signals, among others. Trinity-RFT regards agent-environment interaction as a first-class citizen and incorporates various solutions to tackle these challenges, for example:
- The workflow runners in Trinity-RFT support asynchronous and streaming generation of rollout trajectories for multiple tasks. This helps mitigate the straggler effect caused by the long-tailed latencies in rollout generation and agent-environment interaction, thereby accelerating the RFT process. Load balancing among multiple LLM inference engines within one RFT training course is also taken care of, and would be one direction for further optimizing the utilization of computational resources.
- Trinity-RFT incorporates various timeout/retry/skip mechanisms for fault tolerance and robustness, which ensure that continuous rollout generation would not be interrupted or blocked by individual failures in certain rounds of agent-environment interaction. This is crucial for stable and efficient learning in real-world scenarios, e.g., when the agent interacts with a large number of MCP services [17] that differ vastly in quality and availability.
- Trinity-RFT is built to provide native support for asynchronous RFT modes, which allow great flexibility in the paces of the explorer and trainer. This can boost the overall efficiency of the RFT process, compared to synchronous modes where the slower one among the explorer and trainer can block the progress of the other and cause waste of computational resources.
- For lagged reward signals, the trinity design of RFT-core offers a natural solution. As soon as the rollout trajectory (without reward values) has been generated, it is saved into the experience buffer, but marked as “not ready for training”. The explorer is now free from this task and may continue to collect experiences for other tasks. When the reward signals from the environment finally arrive, they are written to the buffer, and the corresponding experience is now marked as “ready for training”.
- For multi-turn conversations and ReAct-style workflows [48], Trinity-RFT supports concatenating multiple rounds of agent-environment interaction compactly into a single sequence, with proper masking that indicates which tokens need to be incorporated into the training objective of RL algorithms. This avoids unnecessary recomputation and thus improves training efficiency, compared to a vanilla approach that represents a $K$-turn rollout trajectory with $K$ separate samples (a minimal sketch of this masking scheme appears after this list).
- As another performance optimization, the implementation of Trinity-RFT allows resetting the environment in a workflow, rather than re-initializing it every time. This is especially useful for scenarios where setting up the environment is costly.
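The sketch below illustrates the concatenation-with-masking idea from the multi-turn item above, assuming a HuggingFace-style tokenizer with an encode method; it is a simplified illustration rather than Trinity-RFT's actual implementation.

```python
def concat_multi_turn(turns, tokenizer):
    """Concatenate a K-turn trajectory into one token sequence, with a loss mask that is
    1 only on assistant tokens (the tokens the RL objective should be computed on)."""
    input_ids, loss_mask = [], []
    for turn in turns:  # each turn: {"role": "user" | "assistant" | "tool", "content": str}
        ids = tokenizer.encode(turn["content"], add_special_tokens=False)
        flag = 1 if turn["role"] == "assistant" else 0
        input_ids.extend(ids)
        loss_mask.extend([flag] * len(ids))
    return input_ids, loss_mask
```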
2.3 Data Pipelines
The data pipelines in Trinity-RFT aim to address fundamental challenges in RFT scenarios, such as managing heterogeneous data dynamics across interaction workflows, enabling delayed reward integration, and facilitating continuous data curation. Our solutions center on four core aspects: end-to-end data transformation, task curation, active experience shaping, and human-in-the-loop curation, each corresponding to key requirements identified in our development of RFT-core (Section 2.1).
2.3.1 End-to-end Data Transformation
To support the diverse RFT modes (e.g., synchronous or asynchronous) in Trinity-RFT, we establish a service-oriented data pipeline architecture as illustrated in Figure 5. It decouples data pipeline logic from procedure control to enable flexible RL-oriented data transformations with two key modules:
- The Formatter Module unifies disparate data sources into RFT-compatible formats, providing convenient conversion between raw inputs (e.g., meta-prompts, domain-specific corpora, and QA pairs with tagged rewards) and structured RFT representations. For efficient RFT workloads, we utilize buffer-based persistent storage (Section 2.1) to support different data models, such as ExperienceModel for prioritized rollout trajectories and DPODataModel for preference pairs. The conversion logic and data models are highly customizable to meet diverse requirements for managing experience data. This flexibility enables robust metadata recording and field normalization, which is essential for advanced scenarios such as asynchronous RFT in trainer-explorer environments, agent self-evolution from a cold start using meta-prompts, and knowledge injection from structurally complex domain-specific corpora.
- The Controller Module manages the complete data pipeline lifecycle through distributed server initialization, declarative configuration, and automated dataset persistence. It implements dynamic control mechanisms for asynchronous scenarios and protection against resource exhaustion, with configurable termination conditions based on compute quota or data quantity. This modular design enables Trinity-RFT to handle data transformations flexibly while maintaining consistency across different RFT modes.
The Formatter-Controller duality mirrors the explorer-trainer decoupling in RFT-core, enabling parallel data ingestion and model updating. This design also allows Trinity-RFT to handle delayed rewards through version-controlled experience updates while maintaining low-latency sampling for the trainer.
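For illustration, the two data models named above might look roughly like the following dataclasses; the fields shown are plausible but hypothetical and do not reflect the exact schema used by Trinity-RFT.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExperienceModel:
    """One rollout experience; the reward may be filled in later (lagged feedback)."""
    prompt_tokens: List[int] = field(default_factory=list)
    response_tokens: List[int] = field(default_factory=list)
    reward: Optional[float] = None
    priority: float = 0.0                              # utility score for prioritized sampling
    metadata: dict = field(default_factory=dict)       # e.g., task id, model version, timestamps

@dataclass
class DPODataModel:
    """One preference pair for Direct Preference Optimization."""
    prompt: str = ""
    chosen: str = ""
    rejected: str = ""
    metadata: dict = field(default_factory=dict)
```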
<details>
<summary>x6.png Details</summary>

Two-stage diagram: on the left, Task Curation & Prioritization, where a data processor (format conversion, cleaning and augmentation, online scoring) turns raw data into a taskset stored in a buffer for the explorer; on the right, Experience Shaping, where a data processor (dense rewards, human-in-the-loop, counterfactual and dynamic synthesis) turns raw experiences into shaped experiences for the trainer, with feedback loops between the explorer, the environment, and the trainer.
</details>
Figure 5: The interaction of data processor and data buffers in Trinity-RFT, divided into two key stages. Left: Task Curation & Prioritization prepares the initial tasks for the explorer. Right: Experience Shaping processes the collected trajectories from the explorer before they are used by the trainer. The data processor is a central component that operates on different buffers at different stages.
2.3.2 Task Curation and Prioritization
Before the RFT loop begins, it is crucial to prepare a high-quality set of initial tasks. This stage, depicted on the left side of Figure 5, transforms raw data into an optimized task set for the explorer.
The process begins with raw data sources (e.g., prompts, domain corpora), which are ingested into a buffer. The Data Processor, powered by over 100 operators from Data-Juicer [2], reads from this buffer to perform various curation tasks. It provides composable building blocks for experience cleaning (e.g., length filters, duplication removal), safety alignment (e.g., toxicity detection, ethics checks), and preference data synthesis (e.g., critique-conditioned augmentation). By treating Data-Juicer as a modular data processing operator pool rather than a central dependency, Trinity-RFT provides RL-specific abstractions and coherence, while benefiting from well-established data tools.
The processed data is then organized into a structured task buffer. This stage effectively implements a form of curriculum learning by allowing users to prioritize tasks (e.g., from easy to hard), guiding the explorer towards a more efficient and stable learning trajectory from the outset. This entire workflow is managed by a service-oriented architecture that decouples data logic from procedural control, ensuring flexibility and scalability, especially in asynchronous and distributed settings.
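A minimal sketch of such easy-to-hard prioritization is given below, assuming each task carries a pre-computed difficulty score (e.g., produced by a scoring operator); the function and field names are illustrative.

```python
def build_curriculum(tasks, difficulty_key="difficulty"):
    """Order tasks from easy to hard so early training batches are dominated by easier tasks."""
    return sorted(tasks, key=lambda t: t[difficulty_key])

# Example usage with a toy task set:
tasks = [
    {"prompt": "Prove a bound on ...", "difficulty": 0.9},
    {"prompt": "Add two integers ...", "difficulty": 0.1},
    {"prompt": "Solve a quadratic ...", "difficulty": 0.4},
]
taskset = build_curriculum(tasks)   # easiest tasks come first
```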
2.3.3 Active Experience Shaping
Once the explorer begins interacting with the environment, it generates a continuous stream of experience data. To maximize learning efficiency, this raw experience must be actively shaped before it reaches the trainer. This stage is shown on the right side of Figure 5.
Generated experiences are first collected in a buffer. The Data Processor is applied again with a series of transformations to clean, augment, or synthesize these experiences. This is where the core of RFT data intelligence lies. Key capabilities include:
- Agent-Driven Data Processing: Trinity-RFT introduces a powerful agentic paradigm for data manipulation. Users can define complex processing pipelines through high-level objectives, specified as either natural language commands (e.g., “improve safety” or “increase response diversity”) or explicit Data-Juicer configurations. The framework automatically translates these commands into executable workflows backed by its modular components like DataCleaner and DataSynthesizer. This design provides a user-friendly abstraction layer over the underlying Data-Juicer operators, making advanced processing functionalities accessible to both RFT users familiar with Data-Juicer and those who are not. It also facilitates the flexible injection of user-defined inductive biases into the learning process, unlocking new research directions for self-evolving agents, as we will discuss later in Section 2.3.5.
- Online Reward Shaping: The data processor can dynamically augment the reward signal. Instead of relying on a single, often sparse, task-completion reward, users can add dense rewards based on quality, diversity, or safety scores computed on the fly. This enriched feedback provides a much stronger learning signal for the trainer.
- Prioritized Experience Replay: Experiences are not treated equally. Trinity-RFT allows for flexible, multi-dimensional utility scoring to prioritize the most valuable samples for training. The DataActiveIterator supports version-controlled experience reuse and cross-task data lineage tracking, ensuring that the trainer always learns from the most informative data available. This mechanism is also critical for handling delayed rewards, as experience utilities can be updated asynchronously as new feedback arrives.
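As a minimal illustration of the prioritized-replay item above, per-experience utilities could be combined from several signals and turned into sampling weights; the scoring formula and names below are illustrative and do not describe the actual behavior of DataActiveIterator.

```python
import math
import random

def utility(exp, w_reward=1.0, w_novelty=0.5, w_recency=0.2):
    """Combine multiple signals into a single utility score for one experience."""
    return (w_reward * exp["reward"]
            + w_novelty * exp["novelty"]
            + w_recency * exp["recency"])

def sample_prioritized(experiences, batch_size, temperature=1.0):
    """Sample a training batch with probability proportional to exp(utility / temperature)."""
    weights = [math.exp(utility(e) / temperature) for e in experiences]
    return random.choices(experiences, weights=weights, k=batch_size)
```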
2.3.4 Human-AI Collaboration
In scenarios where human feedback is irreplaceable, Trinity-RFT establishes a bi-directional human-AI collaboration loop that provides first-class support for human annotations, based on Label Studio [39] and Data-Juicer’s HumanOPs.
- Multi-stage annotation. Trinity-RFT implements configurable procedures combining automatic pre-screening and human verification. Typical stages include preference annotation (comparative assessment of model responses), quality auditing (human verification of automated cleaning/synthesis results), and cold-start bootstrapping (initial dataset curation through expert demonstrations).
- Native asynchronism support. As the collection of human feedback is generally slower than AI/model feedback, we provide dedicated capabilities to handle both synchronous and asynchronous feedback modes, with configurable timeout and polling parameters. The feedback collaboration is based on an event-driven design, with automatic task creation upon data state changes, configurable notifications via email/Slack/webhook, and an atomic transaction model for annotation batches.
- Customization. Different applications may involve humans in heterogeneous ways. We thus prioritize flexibility in both the interaction-interface and service levels. Examples include rich built-in interfaces that can be extended in a visualized style with XML-like tags provided by Label Studio, fine-grained quality scoring for reward shaping, free-form feedback attachment for dataset shaping, among others. Moreover, for easy deployment, we provide local Label Studio instance management with automatic environment setup via Docker/pip; optimized SDK interactions with batch request coalescing; unified logging across annotation tools and ML services; and concurrent annotation campaigns through priority-based task routing, while maintaining full data lineage preserved via LineageTracker.
The decoupled design of Trinity-RFT, and the presence of a standalone experience buffer in particular, enable human feedback to participate in RL loops without breaking the asynchronous execution model. For instance, human-verified samples can be prioritized for training while fresh experiences are being collected, which is a critical capability for real-world deployment scenarios with mixed feedback sources. Further details for human-AI collaboration in Trinity-RFT will be illustrated in Section 3.5.
2.3.5 Discussion: Unlocking New Research & Development Directions
The modular design of our data pipelines and the powerful data processor open up promising research and development avenues to be further explored.
One direction is about effective management of experience data. While prior RFT works often treat the experience as a static log, Trinity-RFT enables a more sophisticated, full-lifecycle approach to data, from selective acquisition to efficient representation:
- Intelligent Perception and Collection: In an open-ended environment, what experience is “worth” recording? Storing everything creates a low signal-to-noise ratio and burdens the trainer. Trinity-RFT’s architecture allows researchers to implement active collection strategies. For instance, one could design a data processor operator that evaluates incoming experiences from the explorer based on metrics like surprise, uncertainty, or information gain, and only commits the most salient trajectories to the replay buffer. This transforms data collection from passive logging into a targeted, intelligent process.
- Adaptive Representation: Raw experience is often high-dimensional and redundant (e.g., long dialogues, complex code generation traces). How can this be distilled into a format that an agent can efficiently learn from? The data processor in Trinity-RFT acts as a powerful transformation engine. Researchers can use it to explore various representation learning techniques, such as automatically summarizing trajectories, extracting causal chains from tool usage, or converting a multi-turn dialogue into a structured preference pair. This not only makes training more efficient but also opens the door to building meta-experience (more abstract and reusable knowledge) from raw interaction data.
- Agentic Workflows: Trinity-RFT’s agent-driven processing enables the research and development of self-improving agents, e.g., by configuring the policy agent to also serve as the “processing agent” for LLM-based Data-Juicer operators. Such an agent could perform its own critique and dynamically curate its own training data, creating a truly autonomous learning and data management loop.
Another direction is about synthetic and counterfactual experience processing. The integration of synthesis operators enables research into creating “better-than-real” data. Instead of relying solely on the agent’s own trial-and-error, our framework facilitates exploring questions like:
- Dynamic and Composable Rewarding: With our framework, researchers can move beyond static, hand-crafted rewards. It is now possible to investigate dynamic reward shaping, where auxiliary signals like novelty, complexity, or alignment scores are automatically extracted from trajectories and composed into a dense reward function. How to define “good” experience and how can we learn the optimal combination of these reward components as the agent’s policy evolves?
- Experience Reorganization: Can successful sub-trajectories from different tasks be “spliced” together to solve a novel, composite task? For example, can an agent that has learned to “open a door” and “pick up a cup” synthesize a new trajectory to “enter the room and fetch the cup”?
- Failure Repair: Can the data processor identify where errors occur in a failed trajectory, and synthesize a corrected version for the trainer to learn from, effectively turning failures into valuable lessons?
- Success Amplification: Can a single successful experience be augmented into multiple diverse yet successful variants, thereby improving the generalization and robustness of the learned policy?
By providing dedicated capabilities for such advanced data and reward manipulation, Trinity-RFT aims to facilitate flexible processing of “experience data” for the next generation of self-evolving LLMs.
2.4 User-Friendliness
Trinity-RFT has been designed with user-friendliness as a top priority.
For development and research:
The modular and decoupled design of Trinity-RFT allows users to develop a new algorithm for a specific aspect of RFT by adding one or a few new classes that implement the essential functionalities of interest, without worrying about other aspects of RFT or making intrusive modifications to the original codebase. In addition, we include a monitor (built upon Wandb [41] and TensorBoard [38]) that makes it easy to track the progress of an RFT process, both quantitatively (e.g., via learning curves for rewards and other metrics) and qualitatively (e.g., via concrete examples of rollout trajectories generated at different RL steps). See Figure 6 for an example snapshot of the monitor.
For RFT applications:
Trinity-RFT offers extensive graphical user interfaces to support low-code usage of the framework, and to maximize transparency of the RFT process. For example, we implement a configuration manager, as shown in Figure 7, that allows the user to create configuration files conveniently via a front-end interface. We also provide Trinity-Studio, an all-in-one unified UI (including the aforementioned monitor and configuration manager) that allows the user to configure and run data inspection, data processing, RFT learning process, etc., all by clicking the mouse and filling forms, without writing any code. An example for using Trinity-Studio will be introduced in Section 3.6. Such functionalities, of course, can be useful not only for applications but also for development and research.
<details>
<summary>figs/wandb_screencut.png Details</summary>

Screenshot of the monitor: a table of rollout steps with rewards, prompts, and responses, alongside learning curves for eval/accuracy/mean and critic/rewards/mean over the training steps.
</details>
Figure 6: A snapshot of the monitor implemented in Trinity-RFT.
<details>
<summary>figs/config_manager_beginner.jpg Details</summary>

Screenshot of the configuration manager in beginner mode: essential fields (project, experiment name, model/checkpoint/taskset paths) plus a few advanced settings (algorithm type, SFT warmup steps, monitor type).
</details>
(a) The “beginner” mode.
<details>
<summary>figs/config_manager_expert.jpg Details</summary>

### Visual Description
## Screenshot: Trinity-RFT Config Generator UI
### Overview
The image displays a configuration interface for the Trinity-RFT model generator. The UI is in "Expert Mode" with the "Model" tab active. Key configuration parameters include project details, model paths, resource allocation, and training constraints.
### Components/Axes
1. **Tabs**:
- Top-level tabs: "Beginner Mode" (unselected) and "Expert Mode" (selected, highlighted in red).
- Secondary tabs under "Expert Mode": "Model" (active), "Buffer," "Explorer and Synchronizer," "Trainer."
2. **Input Fields**:
- **Project**: "Trinity-RFT" (text field).
- **Experiment Name**: "qwen2.5-1.5B" (text field).
- **Model Path**: Empty with placeholder "Please input model path."
- **Critic Model Path**: Defaults to "model_path" (text field).
- **Checkpoint Path**: Empty with placeholder "Please input checkpoint path."
3. **Dropdowns/Selectors**:
- **Monitor Type**: "tensorboard" (selected from dropdown).
4. **Resource Allocation**:
- **Node Num**: 1 (numeric input with +/- controls).
- **GPU Per Node**: 8 (numeric input with +/- controls).
5. **Token Constraints**:
- **Max Prompt Tokens**: 1024 (numeric input with +/- controls).
- **Max Response Tokens**: 1024 (numeric input with +/- controls).
### Detailed Analysis
- **Required Fields**: "Model Path" and "Checkpoint Path" are mandatory, indicated by placeholder text in yellow-highlighted fields.
- **Default Values**: "Critic Model Path" defaults to "model_path," suggesting a fallback or placeholder value.
- **Resource Configuration**: "Node Num" and "GPU Per Node" define computational resources, with values 1 and 8 respectively.
- **Token Limits**: Both prompt and response tokens are capped at 1024, likely to manage model input/output size.
### Key Observations
- The interface enforces mandatory inputs for model and checkpoint paths, critical for training workflows.
- Default values (e.g., "model_path") may indicate incomplete configuration or placeholder text.
- Token limits suggest optimization for medium-sized language models, balancing context length and computational efficiency.
### Interpretation
This UI is designed for advanced users configuring a large language model (LLM) training pipeline. The "Model Path" and "Checkpoint Path" fields are essential for locating the initial model weights and the directory where training checkpoints are stored. The integration with TensorBoard implies real-time monitoring capabilities. Resource allocation (1 node, 8 GPUs) suggests a multi-GPU setup on a single node, while token limits reflect constraints on sequence processing. The absence of filled paths indicates the configuration is incomplete, requiring user input to proceed. The design prioritizes flexibility (dropdowns, numeric controls) while enforcing critical dependencies (required fields).
</details>
(b) The “expert” mode.
Figure 7: Snapshots of the configuration manager.
3 Examples, Applications, and Experiments
This section demonstrates the utilities and user-friendliness of Trinity-RFT and exemplifies some concepts introduced in previous sections, through a diverse range of examples, applications and experiments. Additional step-by-step tutorials can be found on the documentation website https://modelscope.github.io/Trinity-RFT, or the examples folder of the GitHub repository https://github.com/modelscope/Trinity-RFT/tree/main/examples.
3.1 Customizing Agent-Environment Interaction
With a modular design, Trinity-RFT can be easily adapted to a new downstream scenario by implementing the logic of agent-environment interaction in a single workflow class, without modifications to other components of the codebase. This approach is also sufficient for macroscopic RL algorithm design that targets high-quality experience synthesis with environmental feedback [4]. We provide some concrete examples in the rest of this subsection.
3.1.1 Single-turn Scenarios
In a simple yet common scenario, a user of Trinity-RFT would like to train an LLM for completing single-turn tasks, where the LLM generates one response to each input query. For this purpose, the user mainly needs to (1) define and register a single-turn workflow class (by inheriting the base class Workflow) tailored to the targeted tasks, and (2) specify the tasksets (for training and/or evaluation) and the initial LLM, both of which are compatible with HuggingFace [14] and ModelScope [19] formats.
Listing 1 gives a minimal example for implementing a single-turn workflow. Suppose that each task is specified by a <question, answer> tuple. The run() method of ExampleWorkflow calls the LLM once to generate a response for the question, calculates its reward, and returns an Experience instance that consists of the response itself, the reward value, and the log-probabilities of response tokens predicted by the rollout model (which is necessary for certain RL algorithms, such as PPO [28] and GRPO [29]). Some built-in workflows and reward functions have been implemented in Trinity-RFT, e.g., the MathWorkflow class for math-related tasks.
In some cases, the user wants to utilize auxiliary LLMs in the workflow, e.g., for computing rewards via LLM-as-a-judge, or for playing the roles of other agents in a multi-agent scenario. For these purposes, the user can specify auxiliary_models via APIs when initializing the workflow.
```python
@WORKFLOWS.register_module("example_workflow")
class ExampleWorkflow(Workflow):

    def __init__(
        self,
        model: ModelWrapper,
        task: Task,
        auxiliary_models: Optional[List[openai.OpenAI]] = None,
    ):
        super().__init__(model, task, auxiliary_models)
        self.question = task.raw_task.get("question")
        self.answer = task.raw_task.get("answer")

    def calculate_reward_by_rule(self, response: str, truth: str) -> float:
        return 1.0 if response == truth else 0.0

    def calculate_reward_by_llm_judge(self, response: str, truth: str) -> float:
        judge_model = self.auxiliary_models[0]
        PROMPT_FOR_JUDGE = """Please evaluate..."""
        completion = judge_model.chat.completions.create(
            model="gpt-4",  # Or another suitable judge model
            messages=[{"role": "user", "content": PROMPT_FOR_JUDGE}],
        )
        reward_str = completion.choices[0].message.content.strip()
        reward = float(reward_str)
        return reward

    def run(self) -> List[Experience]:
        response = self.model.chat(
            [
                {
                    "role": "user",
                    "content": f"Question:\n{self.question}",
                }
            ],
            **self.rollout_args,
        )
        reward: float = self.calculate_reward_by_rule(response.response_text, self.answer)
        # reward: float = self.calculate_reward_by_llm_judge(response.response_text, self.answer)
        return [
            Experience(
                tokens=response.tokens,
                prompt_length=response.prompt_length,
                reward=reward,
                logprobs=response.logprobs,
            )
        ]
```
Listing 1: A minimal example for implementing a customized workflow.
3.1.2 Multi-turn Scenarios
In more advanced cases, the user would like to train an LLM-powered agent that solves multi-turn tasks by repeatedly interacting with the environment. In Trinity-RFT, achieving this is mostly as simple as in the single-turn case, except that the user needs to define and register a multi-turn workflow class by inheriting the base class MultiTurnWorkflow. Listing 2 provides one such example using the ALFWorld dataset [31]. For training efficiency, the process_messages_to_experience() method concatenates multiple rounds of agent-environment interactions compactly into an Experience instance consisting of a single token sequence with proper masking, which can readily be consumed by standard RL algorithms like PPO and GRPO.
For more detailed examples of multi-turn cases, please refer to the documentation https://modelscope.github.io/Trinity-RFT/tutorial/example_multi_turn.html.
```python
@WORKFLOWS.register_module("alfworld_workflow")
class AlfworldWorkflow(MultiTurnWorkflow):
    """A workflow for the ALFWorld task."""

    def generate_env_inference_samples(self, env, rollout_num) -> List[Experience]:
        print("Generating env inference samples...")
        experience_list = []
        for i in range(rollout_num):
            observation, info = env.reset()
            final_reward = -0.1
            memory = []
            memory.append({"role": "system", "content": AlfWORLD_SYSTEM_PROMPT})
            for r in range(self.max_env_steps):
                format_obs = format_observation(observation)
                memory = memory + [{"role": "user", "content": format_obs}]
                response_text = self.model.chat(memory, n=1)[0].response_text
                memory.append({"role": "assistant", "content": response_text})
                action = parse_action(response_text)
                observation, reward, done, info = env.step(action)
                if done:
                    final_reward = reward
                    break
            experience = self.process_messages_to_experience(
                memory, final_reward, {"env_rounds": r, "env_done": 1 if done else 0}
            )
            experience_list.append(experience)
        # Close the env to save CPU memory
        env.close()
        return experience_list

    def run(self) -> List[Experience]:
        # ...
        game_file_path = self.task_desc
        rollout_n = self.repeat_times
        # ...
        env = create_environment(game_file_path)
        return self.generate_env_inference_samples(env, rollout_n)
```
Listing 2: An implementation of a multi-turn workflow for ALFWorld [31].
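To make the compaction step concrete, below is a minimal sketch (not Trinity-RFT's actual implementation) of how a multi-turn conversation could be flattened into a single token sequence together with an action mask that restricts the loss to assistant tokens; the naive serialization and the PackedExperience container are simplifications introduced here for illustration.
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PackedExperience:
    """Simplified stand-in for Trinity-RFT's Experience class (illustration only)."""
    tokens: List[int]         # token ids for the full conversation
    action_mask: List[bool]   # True only at assistant tokens (trainable positions)
    reward: float             # final episode reward assigned to the whole sequence


def pack_messages(messages: List[Dict[str, str]], reward: float, tokenizer) -> PackedExperience:
    """Flatten a multi-turn conversation into one token sequence plus a loss mask.

    Only assistant tokens are marked as actions, so a PPO/GRPO-style loss ignores
    system prompts, user turns, and environment observations. A real implementation
    would apply the model's chat template instead of the naive serialization below.
    """
    tokens: List[int] = []
    action_mask: List[bool] = []
    for msg in messages:
        piece = tokenizer.encode(f"{msg['role']}: {msg['content']}\n")
        tokens.extend(piece)
        action_mask.extend([msg["role"] == "assistant"] * len(piece))
    return PackedExperience(tokens=tokens, action_mask=action_mask, reward=reward)
```
With such a mask, a standard RL trainer can consume multi-turn episodes exactly like single-turn ones, skipping non-assistant tokens when computing the policy loss.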
3.1.3 Experience Synthesis in Workflows
As mentioned in Section 1.1, Trinity-RFT has been designed to streamline RL algorithm design and development at both macroscopic and microscopic levels. An example of the former is experience synthesis: at each RL step, the agent (backed by the rollout LLM) iteratively generates refined responses to a query by incorporating feedback or guidance from the environment, which can be in the form of plain text rather than numerical reward values. The resulting data is then used to update the policy model, e.g., by a standard SFT or RL loss. Such a macroscopic RL approach is made possible by pre-trained LLMs’ generative nature and rich prior knowledge about natural language. Closely related to this idea is Agent-RLVR [4], a contemporary work that applies such an approach to software engineering scenarios.
Within Trinity-RFT, this process of experience synthesis can be regarded as a particular way of agent-environment interaction, and thus can be realized by simply implementing a Workflow class. As a minimal demonstration, suppose that we want to implement this approach for a math reasoning scenario, where the agent generates multiple rollout responses to an input query, receives feedback from the environment regarding correctness of the responses, reflects on the gathered information, and generates a final response to the query. Listing 3 presents an implementation of this approach within Trinity-RFT.
```python
@WORKFLOWS.register_module("reflect_once_workflow")
class ReflectOnceWorkflow(Workflow):

    def run(self) -> List[Experience]:
        experiences = []

        # Stage 1: K-rollout generation
        rollout_messages = self.create_rollout_messages()
        responses = self.model.chat(
            rollout_messages,
            n=self.k_rollouts,
            temperature=self.temperature,
            logprobs=self.logprobs,
            max_tokens=self.task.rollout_args.max_tokens,
        )
        rollout_responses = [response.response_text.strip() for response in responses]

        # Stage 2: Verification
        verification_results = []
        for rollout_response in rollout_responses:
            is_correct = self.verify_answer(rollout_response, self.ground_truth)
            verification_results.append(is_correct)

        # Stage 3: Reflection
        reflection_messages = self.create_reflection_messages(
            rollout_responses,
            verification_results,
        )
        reflection_responses = self.model.chat(
            reflection_messages,
            n=1,
            temperature=self.temperature,
            logprobs=self.logprobs,
            max_tokens=self.task.rollout_args.max_tokens,
        )
        reflection_response = reflection_responses[0]

        # Verify the reflection response
        reflection_text = reflection_response.response_text.strip()
        reflection_is_correct = self.verify_answer(reflection_text, self.ground_truth)

        if reflection_is_correct:
            sharegpt_message = [
                {
                    "role": "system",
                    "content": self.task.format_args.system_prompt
                },
                {
                    "role": "user",
                    "content": self.question
                },
                {
                    "role": "assistant",
                    "content": reflection_text
                }
            ]
            experience = self.process_messages_to_experience(sharegpt_message)
            experiences.append(experience)

            # Save experience to file
            if self.exp_file and sharegpt_message is not None:
                exp_data = sharegpt_message
                self.exp_file.write(json.dumps(exp_data, ensure_ascii=False) + "\n")
                self.exp_file.flush()
        return experiences
```
Listing 3: A toy implementation of experience synthesis with environmental feedback.
3.2 RL Algorithm Development with Trinity-RFT
To support RL algorithm development, Trinity-RFT allows researchers and developers to focus on designing and implementing the essential logic of a new RL algorithm, without needing to worry about the internal engineering details of Trinity-RFT.
As an example, suppose that we want to implement a MIX algorithm that seamlessly integrates online RL and offline SFT into a single learning process. In its most basic form, the MIX algorithm requires that (1) the trainer samples from two sources of experiences, i.e., the rollout experiences collected online and the high-quality expert trajectories collected offline; and (2) the trainer updates its policy model with a loss function that handles both sources of experiences properly, e.g., a weighted sum of GRPO loss for the on-policy rollout experiences and SFT loss for the expert trajectories.
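In its most basic instantiation, the combined objective is a convex combination of the two losses (this matches the weighting used later in Listing 4):

$$\mathcal{L}_{\mathrm{MIX}}\;=\;(1-\mu)\,\mathcal{L}_{\mathrm{GRPO}}(\text{rollout experiences})\;+\;\mu\,\mathcal{L}_{\mathrm{SFT}}(\text{expert trajectories}),$$

where $\mu\in[0,1]$ controls the weight of the offline SFT term.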
Variants of this MIX algorithm include adaptive weighting of multiple loss terms [10], alternating between RL and SFT [16], incorporating expert trajectories into the RL loss [21, 34, 46], or incorporating an SFT loss for high-reward rollout trajectories generated by older versions of the rollout model [27]. Such approaches have proven effective at accelerating the online RL process with only a small amount of expert experience, or at enhancing stability and plasticity in continual learning.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: Reinforcement Learning System Architecture
### Overview
The diagram illustrates a three-stage pipeline for a reinforcement learning (RL) system: **Explorer**, **Buffer**, and **Trainer**. Arrows indicate data flow and interactions between components, with distinct color-coded sections for each stage.
### Components/Axes
#### Explorer (Left Section, Peach Background)
- **Rollout engine**: Generates experiences via interaction with a task.
- **Sampling**: Arrows indicate data extraction from the rollout engine.
- **Task**: Represents the environment or problem domain.
#### Buffer (Middle Section, Light Blue Background)
- **Usual Experiences**: Stored in a pink cylinder, representing standard data.
- **Expert Experiences**: Stored in a blue cylinder, representing high-quality or pre-trained data.
- **Sampling**: Arrows indicate data extraction for training.
- **Taskset**: A gray cylinder representing a collection of tasks.
#### Trainer (Right Section, Light Green Background)
- **SFT Loss**: Supervised Fine-Tuning loss, represented in pink.
- **GRPO Loss**: Group Relative Policy Optimization loss, represented in blue.
- **MIX Loss**: Combined loss function (SFT + GRPO), represented in gray.
- **Update model**: Final step to refine the model using MIX Loss.
### Detailed Analysis
1. **Explorer**:
- The rollout engine interacts with a task to generate experiences.
- Sampling extracts these experiences for storage in the Buffer.
2. **Buffer**:
- Experiences are categorized into "Usual" (pink) and "Expert" (blue) and stored separately.
- Sampling from both categories feeds data into the Trainer.
3. **Trainer**:
- SFT Loss and GRPO Loss are computed independently and combined via a summation node (+) to form MIX Loss.
- MIX Loss drives the model update process.
### Key Observations
- **Flow Direction**: Data moves unidirectionally from Explorer → Buffer → Trainer.
- **Loss Function Design**: The Trainer integrates SFT (supervised learning) and GRPO (RL-specific) losses, suggesting a hybrid optimization strategy.
- **Buffer Segmentation**: Separating "Usual" and "Expert" experiences implies a focus on balancing exploration and leveraging prior knowledge.
### Interpretation
This architecture represents a hybrid online-RL / offline-SFT training pipeline (the MIX setup described below) where:
- The **Explorer** collects diverse experiences across tasks.
- The **Buffer** acts as a memory bank, preserving both standard and expert trajectories to mitigate catastrophic forgetting.
- The **Trainer** uses a mixed loss function to balance supervised learning (SFT) and RL objectives (GRPO), enabling efficient adaptation to new tasks while retaining expertise.
The system emphasizes **sample efficiency** (via expert experiences) and **generalization** (via mixed loss), critical for real-world RL applications. The absence of explicit numerical values suggests a conceptual framework rather than empirical results.
</details>
Figure 8: A visualization of the MIX algorithm.
The MIX algorithm is visualized in Figure 8, where we integrate GRPO loss for usual experiences generated by the rollout model and SFT loss for expert experiences into a unified training pipeline. It requires dealing with different sources of experiences and two types of loss functions; fortunately, to implement such an algorithm in Trinity-RFT, we only need to define three new classes — MixSampleStrategy, MIXPolicyLossFn, and MIXAlgorithm — as demonstrated in Listing 4. With these components, Trinity-RFT enables a seamless integration of online RL and offline SFT within the same training loop. More details of the MIX algorithm can be found in the documentation https://modelscope.github.io/Trinity-RFT/tutorial/example_mix_algo.html.
```python
@SAMPLE_STRATEGY.register_module("mix")
class MixSampleStrategy(SampleStrategy):
    def __init__(self, buffer_config: BufferConfig, trainer_type: str, **kwargs):
        # ...
        self.usual_exp_buffer = get_buffer_reader(
            buffer_config.trainer_input.experience_buffer, usual_buffer_config
        )
        self.expert_exp_buffer = get_buffer_reader(
            buffer_config.trainer_input.sft_warmup_dataset, expert_buffer_config
        )
        # ...

    def sample(self, step: int) -> Tuple[Any, Dict, List]:
        """Sample a batch composed of rollout experiences and expert trajectories"""
        usual_exp_list = self.usual_exp_buffer.read()
        expert_exp_list = self.expert_exp_buffer.read()
        exp_list = usual_exp_list + expert_exp_list
        exps = Experiences.gather_experiences(exp_list, self.pad_token_id)
        # ...


@POLICY_LOSS_FN.register_module("mix")
class MIXPolicyLossFn(PolicyLossFn):
    def __init__(self, mu: float = 0.1, ...):
        # ...
        self.mu = mu
        self.grpo_loss_fn = PPOPolicyLossFn(...)
        self.sft_loss_fn = SFTLossFn(...)

    def __call__(
        self,
        logprob: torch.Tensor,
        old_logprob: torch.Tensor,
        action_mask: torch.Tensor,
        advantages: torch.Tensor,
        is_expert_mask: torch.Tensor,
        **kwargs,
    ) -> Tuple[torch.Tensor, Dict]:
        """Calculate a weighted sum of GRPO loss and SFT loss"""
        # ...
        grpo_loss, grpo_metrics = self.grpo_loss_fn(
            logprob[~is_expert_mask],
            old_logprob[~is_expert_mask],
            action_mask[~is_expert_mask],
            advantages[~is_expert_mask],
            **kwargs,
        )
        sft_loss, sft_metrics = self.sft_loss_fn(
            logprob[is_expert_mask],
            action_mask[is_expert_mask],
        )
        loss = (1 - self.mu) * grpo_loss + self.mu * sft_loss
        # ...
        return loss, metrics


@ALGORITHM_TYPE.register_module("mix")
class MIXAlgorithm(AlgorithmType):
    """MIX algorithm."""

    use_critic: bool = False
    use_reference: bool = True
    use_advantage: bool = True
    can_balance_batch: bool = True
    schema: type = ExperienceModel

    @classmethod
    def default_config(cls) -> Dict:
        return {
            "repeat_times": 8,
            "policy_loss_fn": "mix",  # Specify the MIX loss function
            "advantage_fn": "grpo",
            "sample_strategy": "mix",  # Specify the MIX sampling strategy
        }
```
Listing 4: An implementation of the MIX algorithm with Trinity-RFT.
3.3 Unified Support for Diverse RL Modes
As explained previously in Section 2.1.1, Trinity-RFT offers support for synchronous/asynchronous, on-policy/off-policy, and online/offline RL, controlled by a few configuration parameters. In this subsection, we conduct experiments for comparing the following RL modes:
- The synchronous mode: mode=both, sync_interval={1,2,10}, sync_offset=0;
- The one-step off-policy mode: mode=both, sync_interval=1, sync_offset=1;
- The fully asynchronous mode: the explorer and trainer are launched with mode=explore and train respectively, with sync_interval=10.
Our experiments include dummy learning processes (which will soon be explained) for performance profiling, as well as real learning processes with vanilla GRPO [29] in different modes.
3.3.1 Experiments: Performance Profiling
Settings.
We aim to measure and compare the efficiency of different RL modes under controlled settings. It is noteworthy that, even with all other variables controlled, different RL modes can still result in different trained models — and thus different rollout response lengths — during the learning processes, which have direct impacts on performance metrics like wall-clock time and GPU utilization rate.
To mitigate this, we conduct performance profiling with dummy learning processes, where the learning rate is set to zero. A dummy learning process closely resembles a real learning process, in that all necessary computation and communication (e.g., rollout generation, gradient computation, model weight synchronization) are executed; the only difference is that the rollout model (and thus the distribution of rollout trajectories) remains fixed throughout and identical across different RL modes.
To show the performance of Trinity-RFT in diverse scenarios, we consider both a math reasoning task (GSM8k [3]) and a multi-turn agentic task (ALFWorld [31]). In the experiments, we use the Qwen2.5-Instruct [36] model series of different sizes (1.5B, 3B, and 7B), and run the GRPO [5] algorithm (with 8 rollout trajectories per task) in all modes. We choose a 100-step training trace to evaluate performance and report the following metrics: (1) end-to-end wall-clock time and time speedup: the wall-clock time from the start of running the command to the end of finishing 100 training steps; (2) GPU utilization: the GPU utilization in percent for each GPU; (3) GPU power usage: the GPU power usage as a percentage of its power capacity for each GPU. Metrics for GPU utilization and power usage were extracted from WandB https://docs.wandb.ai/guides/models/app/settings-page/system-metrics/ and averaged over all GPUs. We run each experiment for three random trials and report the mean and standard deviation. Unless specified otherwise, each experiment uses 8 NVIDIA A100-80G GPUs.
Results for GSM8k.
We use the 2/6 GPU partition for explorer/trainer. While this configuration is not optimal for all experiments, it is sufficient to show the differences among the RL modes. In our GSM8k experiments, the batch size is 96 tasks, and the temperature is 1.0. The results for both Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models are shown in Table 1.
We observe that in the synchronous mode (with sync_offset=0), a less frequent synchronization (i.e., a larger sync_interval) effectively improves the efficiency, GPU utilization, and GPU power usage. This is mainly because, as shown in Figure 4 (a), the impacts of pipeline bubbles in this mode can be effectively reduced by using a lower synchronization frequency. Besides, Table 1 shows that one-step off-policy and fully asynchronous modes also accelerate the training process with higher GPU utilization, compared to a strictly on-policy mode. In one-step off-policy mode, the trainer leverages the one-step off-policy data stored in the buffer without needing to wait for new experiences generated by the explorer after weight synchronization, which significantly reduces the GPU idle ratio. In fully asynchronous mode, the explorer and trainer operate almost independently while fully leveraging GPU resources, except when loading or saving model checkpoints.
Table 1: Performance profiling for GSM8k with 2/6 GPU partition for explorer/trainer.
Qwen2.5-1.5B-Instruct:

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $38.70± 0.34$ | $33.64± 2.15$ | $35.85± 1.83$ |
| Sync. (sync_interval=2) | $1.24×$ | $31.19± 0.08$ | $36.05± 0.49$ | $38.74± 1.47$ |
| Sync. (sync_interval=10) | $1.59×$ | $24.28± 0.16$ | $\mathbf{38.27}± 0.98$ | $\mathbf{44.41}± 0.81$ |
| One-step off-policy | $1.25×$ | $30.84± 0.20$ | $32.39± 1.17$ | $39.70± 0.78$ |
| Fully async. | $\mathbf{1.61}×$ | $\mathbf{23.97}± 0.03$ | $36.04± 0.61$ | $43.91± 0.48$ |

Qwen2.5-7B-Instruct:

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $68.71± 0.54$ | $55.61± 0.80$ | $52.88± 0.29$ |
| Sync. (sync_interval=2) | $1.31×$ | $52.44± 0.41$ | $64.88± 1.35$ | $61.90± 1.32$ |
| Sync. (sync_interval=10) | $\mathbf{1.85}×$ | $\mathbf{37.17}± 0.15$ | $\mathbf{78.44}± 1.03$ | $\mathbf{77.77}± 0.96$ |
| One-step off-policy | $1.69×$ | $40.73± 0.57$ | $77.19± 2.26$ | $76.17± 1.56$ |
| Fully async. | $1.63×$ | $42.17± 1.06$ | $73.90± 2.00$ | $72.74± 1.82$ |
Table 2: Performance profiling for ALFWorld with 4/4 GPU partition for explorer/trainer.
Batch size = 4:

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $333.68± 0.06$ | $17.19± 0.58$ | $28.44± 0.37$ |
| Sync. (sync_interval=2) | $1.70×$ | $196.64± 0.59$ | $21.69± 0.18$ | $31.35± 0.06$ |
| Sync. (sync_interval=10) | $5.21×$ | $64.09± 0.39$ | $32.85± 0.18$ | $40.86± 0.58$ |
| One-step off-policy | $0.98×$ | $340.12± 3.99$ | $14.63± 1.17$ | $28.21± 0.48$ |
| Fully async. | $\mathbf{5.45}×$ | $\mathbf{61.27}± 0.35$ | $\mathbf{36.46}± 0.10$ | $\mathbf{42.51}± 0.72$ |

Batch size = 32:

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $561.21± 2.04$ | $39.37± 0.89$ | $39.93± 0.22$ |
| Sync. (sync_interval=2) | $1.13×$ | $496.80± 0.36$ | $37.74± 0.39$ | $41.90± 0.44$ |
| Sync. (sync_interval=10) | $1.59×$ | $352.11± 0.49$ | $44.50± 0.58$ | $49.95± 0.61$ |
| One-step off-policy | $1.14×$ | $494.13± 0.28$ | $34.89± 0.75$ | $43.05± 0.81$ |
| Fully async. | $\mathbf{1.65}×$ | $\mathbf{339.51}± 0.24$ | $\mathbf{45.55}± 0.20$ | $\mathbf{50.77}± 0.45$ |
Results for ALFWorld.
A particular feature of ALFWorld is the long-horizon multi-turn interaction with the environment. To accommodate the heavy computational demands in rollout, we use the 4/4 GPU partition for explorer/trainer. In our ALFWorld experiments, we use the Qwen2.5-3B-Instruct model and set the rollout temperature to 1.0.
The results with different batch sizes are shown in Table 2. One observation is that, when the batch size is 4 tasks, the one-step off-policy mode exhibits no efficiency advantage over the synchronous mode with sync_interval=1. This phenomenon can be attributed to the computational imbalance between the explorer and trainer. In ALFWorld, the explorer's larger computational latency stems primarily from (1) the inherent complexity of multi-turn environment interactions, and (2) the long-tailed latency distribution that arises when certain tasks require extended rollout durations, an effect further exacerbated by a small batch size. The one-step off-policy mode cannot eliminate pipeline bubbles caused by long-tailed latencies in the explorer, whereas these can be mitigated by the synchronous mode with a large sync_interval as well as by the asynchronous mode, thanks to the implementation of streaming rollout generation in Trinity-RFT. For the same reason, when scaling the batch size from 4 to 32, all modes incur only a modest increase (much smaller than $8×$) in wall-clock time for the same number of training steps, thanks to better GPU usage.
3.3.2 Experiments: Real Learning with Vanilla GRPO
Settings.
We aim to compare the real learning processes under different RL modes. For simplicity and controlled variability, we use the vanilla GRPO [29] algorithm for all RL modes, without specific algorithm design for asynchronous or off-policy cases. GRPO mainly relies on clipping the probability ratio (by default to the range $1± 0.2$) to handle the off-policyness of experiences. In future work, we will investigate more advanced off-policy or asynchronous RL algorithms and develop new ones to accommodate diverse RL modes. To encourage exploration by the rollout model, we disable the Kullback-Leibler (KL) penalty or loss in our experiments.
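For reference, the clipping mechanism mentioned above is the standard clipped surrogate objective; written in a generic per-token form (the exact aggregation and normalization used by Trinity-RFT may differ), it reads

$$\mathcal{L}(\theta)\;=\;-\,\mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\mathrm{old}}(a_t\mid s_t)},$$

with $\epsilon=0.2$ by default. Experiences produced by an older rollout model enter the objective only through $\pi_{\mathrm{old}}$ and the clipped ratio, which is how the off-policyness discussed above is bounded.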
Training.
For each RL mode, we fine-tune the Qwen2.5-7B-Instruct model for one epoch on the OpenR1-Math-46k [46] https://huggingface.co/datasets/Elliott/Openr1-Math-46k-8192 dataset, a filtered version of the OpenR1-Math-220k https://huggingface.co/datasets/open-r1/OpenR1-Math-220k dataset. The allocated GPU ratio for the explorer and trainer is 4/4. We set the batch size to 120 tasks, the rollout number per task to 32, and the learning rate to 1e-6.
Evaluation.
For each RL mode, we save a checkpoint of the rollout model once every 100 steps, evaluate the checkpoints using the bench mode, and report the best results among the checkpoints. The models are evaluated on several math benchmarks, including AIME2024 https://huggingface.co/datasets/math-ai/aime24, AIME2025 https://huggingface.co/datasets/math-ai/aime25, AMC https://huggingface.co/datasets/math-ai/amc23, and MATH500 https://huggingface.co/datasets/HuggingFaceH4/MATH-500. For AIME2024, AIME2025, and AMC, we generate 32 responses (with temperature 0.6) per task and report the average accuracy (Avg@32) to reduce evaluation variance; for MATH500, we report Avg@4 as the dataset is relatively large. For more detailed comparisons, we also plot several training metrics, including reward, response length, gradient norm, and KL distance from the initial LLM, with wall-clock time as the X-axis.
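Concretely, Avg@$k$ denotes the per-task accuracy averaged over the $k$ sampled responses and over all tasks in a benchmark $\mathcal{D}$:

$$\text{Avg@}k\;=\;\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\frac{1}{k}\sum_{i=1}^{k}\mathbb{1}\big[\hat{y}_i(x)=y(x)\big],$$

where $\hat{y}_i(x)$ is the $i$-th sampled answer for task $x$ and $y(x)$ is its ground-truth answer.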
Results.
Figure 9 presents the training curves. It is observed that several RL modes show increasing trends in terms of rewards and response lengths. The synchronous mode with sync_interval=1 exhibits longer responses and larger KL divergence than other RL modes, likely because it updates the rollout model most frequently and leverages on-policy data in each step.
Table 3 presents the evaluation results. We observe that, for the synchronous mode with sync_offset=0, increasing sync_interval reduces the total training time for one epoch, at the cost of slightly compromising the average evaluation performance. In contrast, the one-step off-policy mode with sync_interval=1 achieves comparable or better performance than the other modes on several benchmarks, while achieving around $2.66×$ speedup in wall-clock time over the strictly on-policy mode.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Graphs: Multi-Metric Performance Over Time
### Overview
The image contains four line graphs tracking different performance metrics over 120 hours. Each graph compares three synchronization strategies (sync_interval=1, sync_interval=2, sync_interval=10) and a one-step off-policy baseline. Metrics include reward, response length, gradient norm, and KL divergence.
### Components/Axes
- **X-axis**: Time (hours), ranging from 0 to 120 in all graphs.
- **Y-axes**:
- Reward: 0.40–0.55
- Response Length: 1,000–2,500
- Gradient Norm: 0.08–0.16
- KL Divergence: 0.0–0.5
- **Legends**: Positioned at the top of each graph, with colors:
- Blue: Sync (sync_interval=1)
- Green: Sync (sync_interval=2)
- Red: Sync (sync_interval=10)
- Purple: One-Step Off-Policy
### Detailed Analysis
1. **Reward Graph**:
- Sync_interval=1 (blue): Starts at ~0.45, peaks at ~0.52 (60h), fluctuates between 0.48–0.53.
- Sync_interval=2 (green): Starts at ~0.43, peaks at ~0.51 (60h), fluctuates between 0.47–0.52.
- Sync_interval=10 (red): Starts at ~0.43, peaks at ~0.49 (60h), fluctuates between 0.46–0.50.
- Off-Policy (purple): Starts at ~0.44, peaks at ~0.50 (60h), fluctuates between 0.47–0.51.
2. **Response Length Graph**:
- Sync_interval=1 (blue): Starts at ~1,000, peaks at ~2,500 (60h), drops to ~2,200 (120h).
- Sync_interval=2 (green): Starts at ~1,200, peaks at ~2,000 (60h), drops to ~1,800 (120h).
- Sync_interval=10 (red): Starts at ~1,100, peaks at ~1,800 (60h), drops to ~1,600 (120h).
- Off-Policy (purple): Starts at ~1,300, peaks at ~2,200 (60h), drops to ~2,000 (120h).
3. **Gradient Norm Graph**:
- Sync_interval=1 (blue): Starts at ~0.16, drops to ~0.08 (120h), with spikes at 20h (~0.14) and 60h (~0.12).
- Sync_interval=2 (green): Starts at ~0.14, drops to ~0.09 (120h), with spikes at 20h (~0.13) and 60h (~0.11).
- Sync_interval=10 (red): Starts at ~0.12, drops to ~0.08 (120h), with spikes at 20h (~0.11) and 60h (~0.10).
- Off-Policy (purple): Starts at ~0.10, drops to ~0.08 (120h), with spikes at 20h (~0.10) and 60h (~0.09).
4. **KL Divergence Graph**:
- Sync_interval=1 (blue): Starts at ~0.0, peaks at ~0.5 (40h), drops to ~0.2 (120h).
- Sync_interval=2 (green): Starts at ~0.0, peaks at ~0.2 (40h), drops to ~0.1 (120h).
- Sync_interval=10 (red): Starts at ~0.0, peaks at ~0.1 (40h), drops to ~0.05 (120h).
- Off-Policy (purple): Starts at ~0.0, peaks at ~0.3 (40h), drops to ~0.2 (120h).
### Key Observations
- **Reward**: Sync_interval=1 and off-policy methods achieve higher rewards, with sync_interval=1 showing the most volatility.
- **Response Length**: Sync_interval=1 consistently outperforms others, with the largest peak at 60h.
- **Gradient Norm**: All methods show a general decline over time, with sync_interval=1 having the highest initial values.
- **KL Divergence**: Sync_interval=1 exhibits the sharpest divergence spike at 40h, suggesting significant policy mismatch.
### Interpretation
The data indicates that shorter synchronization intervals (sync_interval=1) improve reward and response length but increase KL divergence, implying greater deviation from the target policy. Longer intervals (sync_interval=10) reduce divergence but sacrifice performance. The off-policy baseline balances these trade-offs. Gradient norm trends suggest stabilizing training dynamics across all methods, with sync_interval=1 maintaining higher computational intensity. The KL divergence spikes highlight critical moments of policy misalignment, particularly for sync_interval=1.
</details>
Figure 9: Results of training for one epoch by vanilla GRPO in different RL modes. The results are smoothed using a 40-step moving average for clarity.
Table 3: Performance comparison of different RL modes.
| Model / RL Mode | AIME2024 | AIME2025 | AMC | MATH500 | Average | Runtime (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 11.15 | 6.95 | 51.35 | 70.96 | 35.10 | N/A |
| Sync. (sync_interval=1) | 14.58 | 14.06 | 61.25 | 76.25 | 41.54 | 130.33 |
| Sync. (sync_interval=2) | 15.52 | 11.67 | 57.97 | 75.15 | 40.08 | 73.57 |
| Sync. (sync_interval=10) | 14.38 | 12.71 | 57.66 | 75.05 | 39.95 | 44.43 |
| One-Step Off-Policy | 16.88 | 12.19 | 59.92 | 74.55 | 40.89 | 48.98 |
3.4 Data Processors for Tasks and Experiences
We present practical examples to demonstrate how the data pipeline concepts described in Section 2.3 are applied in Trinity-RFT. These use cases highlight how manipulating data at both the task and experience level directly improves RFT performance and provides granular control over the agent’s learning process.
3.4.1 Static Task Prioritization for Curriculum Learning
A common strategy for effective training is to present tasks in increasing order of difficulty. This use case demonstrates how Trinity-RFT facilitates curriculum learning by prioritizing tasks before exploration begins. This is particularly crucial for RFT, as it helps stabilize the initial learning phase of the explorer and prevents it from getting stuck on overly complex tasks, leading to a more efficient exploration path.
As shown in Listing 5, a user can configure this pipeline with a simple YAML file (the full configuration files can be accessed at baseline_run and priority_run). Here, we use the GSM8K mathematical reasoning dataset. The user provides a natural language instruction via dj_process_desc: “Please compute difficulty scores for these math questions.” Trinity-RFT’s data service then orchestrates a three-phase process:
1. The data processor invokes an LLM (Qwen-Max) to score the difficulty of each math problem.
2. The system prioritizes samples with lower difficulty scores, creating an easy-to-hard ordering (by setting priority_weights["difficulty"] to -1.0).
3. The curated and prioritized data is formatted into an RL-ready task set for the explorer.
As shown in Figure 10, this simple curation strategy yields more stable performance gains. This pattern is highly extensible: users can easily customize the difficulty metric, apply it to their own datasets, or even make the prioritization dynamic by re-ranking tasks periodically based on the agent’s current performance.
```yaml
# Core dataset configuration
data_processor:
  data_workflow_url: "http://127.0.0.1:5005/data_workflow"
  task_pipeline:
    # I/O buffers
    input_buffers:
      - name: "raw_input"
        path: "openai/gsm8k"
        storage_type: "file"
        raw: true
    output_buffer:
      name: "raw_output"
      path: "outputs/task_pipeline_output/prioritized_gsm8k.jsonl"
      storage_type: "file"
    format:
      prompt_key: "question"
      response_key: "answer"
    # data active iterator related
    dj_process_desc: "Please compute difficulty scores for these math questions."
    agent_model_name: "qwen-max"
    clean_strategy: "iterative"
    priority_weights:
      difficulty: -1.0  # easy-to-hard
```
<details>
<summary>figs/data-pipelines/data-flow-static-priority-res.jpg Details</summary>

### Visual Description
## Line Charts: Model Performance Metrics
### Overview
The image contains three line charts comparing performance metrics across three categories:
1. `eval/math-eval/accuracy/mean`
2. `actor/entropy_loss`
3. `actor/kl_loss`
Each chart uses dual-colored lines (red and blue) to represent different data series, with x-axis values ranging from 0 to 30 and y-axis values varying by chart.
---
### Components/Axes
#### Common Elements
- **X-axis**: Labeled "x" in all charts, scaled from 0 to 30 in increments of 5.
- **Y-axis**: Labeled "y" in all charts, with chart-specific ranges:
- `eval/math-eval/accuracy/mean`: 0.2 to 0.35
- `actor/entropy_loss`: 0.05 to 0.2
- `actor/kl_loss`: 0 to 1
- **Legend**: Positioned on the right side of each chart, featuring a pin icon and two entries:
- Red line: "Series A"
- Blue line: "Series B"
- **Gridlines**: Present in all charts for reference.
#### Chart-Specific Details
1. **`eval/math-eval/accuracy/mean`**
- Y-axis: Accuracy metric (0.2–0.35).
- Red line: Starts at 0.24 (x=0), peaks at 0.35 (x=15), then declines to 0.32 (x=30).
- Blue line: Starts at 0.2 (x=0), rises to 0.35 (x=10), then declines to 0.31 (x=30).
2. **`actor/entropy_loss`**
- Y-axis: Entropy loss (0.05–0.2).
- Red line: Starts at 0.2 (x=0), drops to 0.05 (x=10), then fluctuates between 0.05–0.1.
- Blue line: Starts at 0.2 (x=0), drops to 0.02 (x=30), with minor fluctuations.
3. **`actor/kl_loss`**
- Y-axis: Kullback-Leibler divergence (0–1).
- Red line: Starts at 0 (x=0), peaks at 0.6 (x=10), then fluctuates between 0.4–0.5.
- Blue line: Starts at 0 (x=0), peaks at 1.0 (x=20), then declines to 0.55 (x=30).
---
### Detailed Analysis
#### `eval/math-eval/accuracy/mean`
- **Red Line**:
- Initial rise from 0.24 (x=0) to 0.35 (x=15), followed by a gradual decline.
- Key data points:
- x=5: 0.32
- x=10: 0.33
- x=15: 0.35
- x=20: 0.35
- x=25: 0.35
- x=30: 0.32
- **Blue Line**:
- Rapid rise to 0.35 (x=10), then steady decline.
- Key data points:
- x=5: 0.33
- x=10: 0.35
- x=15: 0.34
- x=20: 0.32
- x=25: 0.32
- x=30: 0.31
#### `actor/entropy_loss`
- **Red Line**:
- Sharp decline from 0.2 (x=0) to 0.05 (x=10), followed by minor oscillations.
- Key data points:
- x=5: 0.1
- x=10: 0.05
- x=15: 0.07
- x=20: 0.12
- x=25: 0.09
- x=30: 0.06
- **Blue Line**:
- Steady decline from 0.2 (x=0) to 0.02 (x=30).
- Key data points:
- x=5: 0.15
- x=10: 0.08
- x=15: 0.09
- x=20: 0.07
- x=25: 0.05
- x=30: 0.02
#### `actor/kl_loss`
- **Red Line**:
- Initial rise to 0.6 (x=10), followed by fluctuations between 0.4–0.5.
- Key data points:
- x=5: 0.3
- x=10: 0.6
- x=15: 0.45
- x=20: 0.4
- x=25: 0.42
- x=30: 0.38
- **Blue Line**:
- Sharp rise to 1.0 (x=20), then decline to 0.55 (x=30).
- Key data points:
- x=5: 0.2
- x=10: 0.5
- x=15: 0.8
- x=20: 1.0
- x=25: 0.7
- x=30: 0.55
---
### Key Observations
1. **`eval/math-eval/accuracy/mean`**:
- Both lines peak around x=10–15, suggesting optimal performance midway through the observed range.
- Red line maintains higher stability after x=15 compared to the blue line.
2. **`actor/entropy_loss`**:
- Red line exhibits higher volatility, with a sharp drop followed by oscillations.
- Blue line shows a consistent, smooth decline, indicating stable entropy reduction.
3. **`actor/kl_loss`**:
- Blue line demonstrates a significant divergence (KL loss = 1.0) at x=20, suggesting a critical point of divergence between models.
- Red line remains relatively stable after x=10, contrasting with the blue line's volatility.
---
### Interpretation
1. **Performance Trends**:
- The `eval/math-eval/accuracy/mean` chart indicates that both models achieve peak accuracy midway through the evaluation period, with the red line (Series A) maintaining higher stability in later stages.
- The `actor/entropy_loss` chart reveals that Series B (blue) achieves a more consistent reduction in entropy loss, while Series A (red) experiences fluctuations, possibly due to model instability.
2. **Divergence in `kl_loss`**:
- The blue line's sharp rise to 1.0 at x=20 in the `actor/kl_loss` chart suggests a critical divergence between the models at this point, potentially indicating a failure mode or architectural mismatch.
- The red line's stability after x=10 implies that Series A may be more robust to changes in the input space beyond this threshold.
3. **Anomalies**:
- The blue line in `actor/kl_loss` exhibits an abrupt drop from 1.0 (x=20) to 0.55 (x=30), which could indicate a recovery phase or a reset in the model's behavior.
- The red line in `actor/entropy_loss` shows an unexpected spike at x=20 (0.12), deviating from its otherwise declining trend.
---
### Conclusion
The charts collectively highlight trade-offs between model stability and performance. Series A (red) demonstrates higher accuracy and stability in later stages, while Series B (blue) shows rapid initial improvement but greater volatility in entropy and KL loss. The divergence in KL loss at x=20 warrants further investigation into potential model failures or optimization challenges.
</details>
Figure 10: Performance on a math reasoning task. Prioritizing tasks from easy to hard (red line) leads to faster and better convergence compared to the default setting (blue line).
3.4.2 Dynamic Experience and Reward Shaping
While task curation primes the model before exploration, experience shaping refines the learning signal after each agent-environment interaction. This is critical for RFT algorithms that rely on rich feedback, as standard rewards (e.g., binary pass/fail) are often too sparse to guide learning effectively. We demonstrate how to augment rewards with metrics for quality and diversity, transforming a sparse signal into a dense, multi-faceted one that provides clearer guidance to the trainer.
Use Case 1: Quality Reward Augmentation.
To encourage the model to generate high-quality responses, we can augment the base reward with a quality score. As illustrated in Figure 11, during each RFT step, we use the data processor to evaluate the quality of each generated rollout. For our experiment, we trained a Qwen2.5-1.5B model and used a more powerful Qwen3-32B as the scorer LLM. Specifically, we invoked the llm_quality_filter from Data-Juicer, which normalized the quality scores to the range [-0.5, 0.5] and added them to the original reward.
Crucially, this processing is applied to the experience buffer at each RFT step. This allows the reward signal to adapt dynamically to the policy model’s evolving capabilities, a more responsive approach than one-time static processing. With a sync_interval of 3 over 36 steps on the Math-500 validation set, the results in Figure 12 show that: (1) The model with quality reward augmentation (red line) achieves higher accuracy. (2) The introduced quality reward itself improves over time, confirming it is a learnable signal. (3) We observe a slight increase in response length, which likely reflects an inductive bias from the larger scorer model being implicitly distilled into the smaller policy model.
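The shaping step itself is simple; the sketch below illustrates the idea under stated assumptions (a scorer that returns quality scores on a [0, 1] scale, shifted into [-0.5, 0.5] and added to the base reward). It is an illustration, not the exact Data-Juicer operator invocation used in the experiment.
```python
from typing import List


def augment_with_quality(base_rewards: List[float],
                         quality_scores: List[float],
                         low: float = 0.0, high: float = 1.0) -> List[float]:
    """Shift raw quality scores onto [-0.5, 0.5] and add them to the base rewards.

    `quality_scores` are assumed to come from a scorer LLM (e.g., via Data-Juicer's
    llm_quality_filter) on a [low, high] scale; the mapping below is illustrative.
    """
    shaped = []
    for r, q in zip(base_rewards, quality_scores):
        q_norm = (q - low) / (high - low) - 0.5  # map [low, high] -> [-0.5, 0.5]
        shaped.append(r + q_norm)
    return shaped


# Example: binary correctness rewards augmented by quality scores in [0, 1].
print([round(r, 3) for r in augment_with_quality([1.0, 0.0, 1.0], [0.9, 0.2, 0.5])])
# -> [1.4, -0.3, 1.0]
```
Because this shaping runs over the experience buffer at every RFT step, the augmented reward tracks the evolving policy rather than a fixed, one-time data pass.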
<details>
<summary>x9.png Details</summary>

### Visual Description
## Flowchart Diagram: GSM8K Response Evaluation Pipeline
### Overview
The diagram illustrates a technical workflow for evaluating responses generated from GSM8K samples using a rollout process. It shows the flow of data from initial samples through multiple response generations, reward calculations, and final scoring by an LLM scorer. Key components include response generation, reward aggregation, and model scoring.
### Components/Axes
1. **Input**:
- "GSM8K Sample" (leftmost box)
- "Rollout" arrow connecting to response generation
2. **Response Generation**:
- Multiple response boxes labeled "Response 1", "Response 2", ..., "Response n"
3. **Reward Calculation**:
- Dashed box containing:
- "Math Acc Reward" (Math Accuracy Reward)
- "Format Reward"
- "DJ-Quality Reward" (highlighted in red)
- All rewards are summed together
4. **Output**:
- "LLM Scorer" (Qwen3 32B) receiving aggregated rewards
5. **Additional Elements**:
- "Qwen2.5 1.5B" model mentioned near the rollout arrow
- "GRPO" label at the bottom-left corner
### Detailed Analysis
- **Flow Direction**:
- Left-to-right flow from GSM8K Sample → Rollout → Response Generation → Reward Calculation → LLM Scoring
- **Key Connections**:
- All responses (1 to n) feed into the same reward calculation box
- Reward components are connected via "+" operators
- Final reward sum connects to Qwen3 32B scorer
- **Color Coding**:
- DJ-Quality Reward box is highlighted in red (no legend present)
- Qwen2.5 and Qwen3 models use blue icons with white text
### Key Observations
1. The pipeline emphasizes iterative response generation ("Response n" suggests multiple attempts)
2. DJ-Quality Reward is visually emphasized (red box) suggesting it's a critical evaluation metric
3. Qwen3 32B is positioned as the final scorer, while Qwen2.5 1.5B appears in the rollout phase
4. GRPO (possibly a training method) is labeled but not connected to any specific component
### Interpretation
This diagram represents a multi-stage evaluation system for mathematical reasoning models. The GSM8K samples (likely math problems) undergo a rollout process generating multiple responses, which are then evaluated across three dimensions: mathematical accuracy, formatting quality, and domain-specific quality (DJ-Quality). The red highlighting of DJ-Quality Reward implies this metric may carry special importance in the evaluation process. The use of different Qwen model sizes (1.5B vs 32B) suggests a hierarchy where smaller models handle initial response generation while larger models perform final scoring. The GRPO label hints at potential reinforcement learning optimization in the pipeline, though its exact role isn't specified in the diagram.
</details>
Figure 11: The enhanced math workflow with quality-reward shaping from the data processor, where DJ indicates Data-Juicer [2], from which more operators can be utilized to extend this workflow.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Charts: Comparative Analysis of Metrics Over Time
### Overview
The image contains three line charts comparing two data series (red and blue) across three distinct metrics: accuracy, response length, and quality. Each chart tracks changes over a time axis (x-axis) from 5 to 35 units. The red line consistently outperforms the blue line in all metrics, with notable trends in stability and growth.
---
### Components/Axes
1. **Chart 1: `eval/math-eval/accuracy/mean`**
- **X-axis**: Time (5–35 units, increments of 5)
- **Y-axis**: Accuracy (0.25–0.45, increments of 0.05)
- **Legend**: Red = "Series A", Blue = "Series B"
2. **Chart 2: `response_length/mean`**
- **X-axis**: Time (5–35 units, increments of 5)
- **Y-axis**: Response Length (140–220, increments of 20)
- **Legend**: Red = "Series A", Blue = "Series B"
3. **Chart 3: `eval/math-eval/quality/mean`**
- **X-axis**: Time (5–35 units, increments of 5)
- **Y-axis**: Quality (0.1–0.25, increments of 0.05)
- **Legend**: Red = "Series A", Blue = "Series B"
---
### Detailed Analysis
#### Chart 1: Accuracy
- **Red Line**: Starts at ~0.30, rises steadily to ~0.42 by time 35. Minor plateau between 20–25 units.
- **Blue Line**: Begins at ~0.25, increases gradually to ~0.35 by time 35. Slight dip at time 15.
- **Trend**: Red line maintains a 0.05–0.10 advantage over blue throughout.
#### Chart 2: Response Length
- **Red Line**: Fluctuates between ~180–200, peaking at ~210 at time 10. Stabilizes after time 20.
- **Blue Line**: Oscillates between ~140–160, with a sharp spike to ~180 at time 30. More volatile than red.
- **Trend**: Red line remains 20–40 units higher than blue, despite higher variability.
#### Chart 3: Quality
- **Red Line**: Starts at ~0.10, rises sharply to ~0.22 by time 30. Sustains growth after time 25.
- **Blue Line**: Begins at ~0.12, increases to ~0.18 by time 35. Slower growth rate than red.
- **Trend**: Red line achieves 0.04–0.10 higher values consistently.
---
### Key Observations
1. **Consistent Superiority**: Red line outperforms blue in all metrics across all time points.
2. **Volatility**: Blue line in Chart 2 shows irregular fluctuations, while red lines in Charts 1 and 3 exhibit smoother growth.
3. **Asymptotic Behavior**: In Chart 1, red line’s growth slows after time 25, suggesting diminishing returns.
4. **Outlier**: Blue line in Chart 2 spikes to ~180 at time 30, briefly closing the gap with red (~200).
---
### Interpretation
The data suggests **Series A (red)** demonstrates superior performance in accuracy, response length, and quality metrics compared to **Series B (blue)**. The stability of red lines implies optimized processes or models, while blue’s volatility (especially in response length) may indicate inefficiencies or external noise. The quality metric’s steep rise for red after time 25 hints at a critical intervention or model update. Notably, blue’s late-stage spike in response length (time 30) could reflect a temporary anomaly or resource-intensive adjustment. Overall, the trends emphasize the importance of metric-specific optimizations for Series A.
</details>
Figure 12: Experimental results for quality-reward shaping. Augmenting the reward with a quality score (red line) improves final accuracy and provides a learnable reward signal.
Use Case 2: Diversity Reward Augmentation.
A common failure mode in RFT is policy collapse, where the agent repeatedly generates similar, suboptimal responses. To counteract this, we introduce a diversity reward that encourages the explorer to explore different solution paths. As shown in Figure 13, we used the GTE-Qwen2-1.5B model to convert rollouts into semantic embeddings. The diversity reward was calculated based on the cosine similarity of a rollout’s embedding to the mean embedding of its group, with lower similarity (i.e., higher diversity) yielding a higher reward.
To prevent exploration from becoming chaotic, we applied a simple decay schedule to the diversity reward’s weight, starting at 0.5 and decaying to 0.3 over the training steps. The experiment, using the same setting as before, yielded compelling results (Figure 14): (1) The diversity-augmented model (red line) shows a significant performance improvement over the baseline. (2) The response length is consistently longer, indicating the reward encourages more elaborate answers. (3) Most importantly, the actor entropy loss remains consistently higher, providing strong evidence that the model is maintaining a healthier, more diverse exploration strategy, which helps it escape local optima.
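The following sketch illustrates the diversity-reward computation under the assumptions stated above (embeddings from an encoder such as GTE-Qwen2-1.5B, cosine similarity to the group-mean embedding, and a weight decaying from 0.5 to 0.3, assumed linear here). It is an illustration, not the exact operator used in the experiments.
```python
import numpy as np


def diversity_rewards(embeddings: np.ndarray) -> np.ndarray:
    """Reward each rollout for being dissimilar to the mean embedding of its group.

    `embeddings` has shape (group_size, dim). Returns 1 - cosine similarity to the
    (normalized) group-mean embedding, so lower similarity yields a higher reward.
    """
    mean = embeddings.mean(axis=0)
    mean = mean / (np.linalg.norm(mean) + 1e-8)
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    return 1.0 - normed @ mean


def diversity_weight(step: int, total_steps: int, start: float = 0.5, end: float = 0.3) -> float:
    """Decay the diversity-reward weight over training (linear decay assumed here)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac


# Illustrative usage for one group of rollouts at a given training step:
# shaped_rewards = base_rewards + diversity_weight(step, total_steps) * diversity_rewards(group_embeddings)
```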
<details>
<summary>x11.png Details</summary>

### Visual Description
## Flowchart: GSM8K Response Generation and Reward System
### Overview
The diagram illustrates a technical workflow for processing GSM8K (Grade School Math 8K) samples through a response generation pipeline, embedding analysis, and reward calculation system. It combines elements of natural language processing (NLP) and reinforcement learning concepts.
### Components/Axes
1. **Left Section (Input/Processing):**
- **GSM8K Sample**: Starting point for math problem input
- **Rollout**: Process generating multiple responses (Res 1 to Res n)
- **Embeddings (Ebd 1 to Ebd n)**: Vector representations of responses
- **Embedding Average**: Aggregated representation of all response embeddings
- **Cosine Similarity**: Measures diversity between response vectors
2. **Right Section (Output/Rewards):**
- **Reward Calculation Block**: Contains three weighted reward components:
- Format Reward (+0.5 weight)
- Math Accuracy Reward (+0.1 weight)
- Diversity Reward (+0.3 weight, highlighted in red)
3. **Model Components:**
- **Qwen2.5 1.5B**: Model architecture used for response generation
- **GTE-Qwen2**: Embedding model for response vectorization
### Detailed Analysis
- **Response Generation Flow**:
GSM8K samples → Rollout process → Multiple responses (Res 1 to Res n) → Embeddings (Ebd 1 to Ebd n)
- **Embedding Analysis**:
- Embeddings are averaged to create a composite representation
- Cosine similarity calculations determine response diversity (values shown as +0.5, +0.1, +0.3)
- **Reward System**:
- Format Reward (0.5 weight): Likely evaluates response structure/clarity
- Math Accuracy Reward (0.1 weight): Assesses correctness of mathematical solutions
- Diversity Reward (0.3 weight): Prioritizes varied response generation (highlighted in red)
### Key Observations
1. The Diversity Reward (0.3) is explicitly highlighted even though its weight is lower than that of the Format Reward (0.5), suggesting a balance between solution variety and presentation quality
2. Cosine similarity values (+0.5, +0.1, +0.3) indicate moderate to high similarity between response embeddings
3. The red highlighting of Diversity Reward emphasizes its importance in the optimization process
4. Multiple response generation (Res 1 to Res n) suggests a beam search or sampling approach
### Interpretation
This system appears designed to optimize educational response generation by:
1. Balancing solution accuracy with response diversity
2. Using embedding similarity to quantify response variation
3. Implementing a weighted reward system that values diverse solutions (0.3) more than mathematical accuracy alone (0.1)
4. Prioritizing format quality (0.5) while maintaining diversity
The architecture suggests a reinforcement learning approach where responses are evaluated through both direct metrics (format, accuracy) and indirect measures (embedding diversity). The emphasis on diversity reward indicates an intent to prevent model collapse toward single solution patterns, which is particularly important in educational contexts where multiple valid solution paths exist.
</details>
Figure 13: The enhanced math workflow with diversity-reward shaping from the data processor.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Charts: Model Performance Metrics
### Overview
The image contains three line charts comparing performance metrics of two models (Model A in red, Model B in blue) across different evaluation dimensions. Each chart tracks a distinct metric over a shared x-axis range (5–35), with distinct y-axis scales.
---
### Components/Axes
1. **Chart 1: `eval/math-eval/accuracy/mean`**
- **X-axis**: Iteration/Step (5–35)
- **Y-axis**: Accuracy (0.25–0.45)
- **Legend**:
- Red: Model A
- Blue: Model B
2. **Chart 2: `response_length/mean`**
- **X-axis**: Iteration/Step (5–35)
- **Y-axis**: Response Length (200–400)
- **Legend**:
- Red: Model A
- Blue: Model B
3. **Chart 3: `actor/entropy_loss`**
- **X-axis**: Iteration/Step (5–35)
- **Y-axis**: Entropy Loss (0.5–1.5)
- **Legend**:
- Red: Model A
- Blue: Model B
---
### Detailed Analysis
#### Chart 1: Accuracy
- **Model A (Red)**:
- Starts at ~0.33, peaks at ~0.4 (x=20), dips to ~0.35 (x=30), then rises to ~0.4 (x=35).
- Shows volatility with two local maxima.
- **Model B (Blue)**:
- Starts at ~0.25, steadily increases to ~0.36 (x=35).
- Smooth upward trend with no fluctuations.
#### Chart 2: Response Length
- **Model A (Red)**:
- Oscillates between ~200–300, peaking at ~350 (x=35).
- High variability with frequent local maxima.
- **Model B (Blue)**:
- Remains flat between ~150–200.
- Minimal deviation throughout.
#### Chart 3: Entropy Loss
- **Model A (Red)**:
- Begins at ~0.5, dips to ~0.4 (x=10), then surges to ~1.5 (x=35).
- Sharp exponential growth in later steps.
- **Model B (Blue)**:
- Starts at ~0.5, peaks at ~0.7 (x=5), then declines to ~0.5 (x=35).
- Initial spike followed by stabilization.
---
### Key Observations
1. **Accuracy vs. Entropy**: Model A achieves higher accuracy and maintains higher entropy throughout training rather than letting it collapse.
2. **Response Length**: Model A’s responses grow longer and more variable over time, while Model B maintains consistency.
3. **Model B’s Stability**: Model B shows smoother trends across all metrics, indicating robustness but lower peak performance.
---
### Interpretation
- **Model A** attains higher accuracy at the cost of longer and more variable responses, and its entropy stays high rather than decaying, consistent with sustained exploration during training.
- **Model B** balances simplicity and consistency, with stable entropy and response lengths but lower accuracy. This could make it preferable for applications requiring reliability over peak performance.
- The divergence in entropy trends (Model A’s spike vs. Model B’s decline) highlights a trade-off between model complexity and generalization. Further investigation into training data or regularization techniques might clarify these dynamics.
</details>
Figure 14: Experimental results for diversity-reward shaping. Rewarding diverse responses (red line) significantly improves task accuracy and maintains higher entropy.
3.5 RFT with Human in the Loop
This example demonstrates the human-in-the-loop capability of Trinity-RFT for preference modeling. As illustrated in Listing 6 and Figure 15, our framework integrates Label Studio’s annotation interface with asynchronous data pipelines through four coordinated stages: (1) task generation: auto-creating annotation batches from model rollouts; (2) interactive labeling: providing a UI for side-by-side response comparison; (3) quality control: enforcing inter-annotator agreement thresholds; and (4) versioned storage: tracking preference lineage in pre-defined fields like those in DPODataModel.
This pipeline reflects Trinity-RFT’s bi-directional collaboration feature (Section 2.3.4), supported by timeout-aware task polling and atomic batch commits. It enables hybrid procedures in which initial AI pre-screening reduces human workload in production deployments, and annotation activities can scale across distributed teams through event-driven task routing. The system’s flexibility facilitates rapid adaptation to diverse annotation protocols, allowing developers to implement custom labeling interfaces through XML-based templates or integrate third-party annotation services via unified SDK endpoints. This capability underpins advanced use cases such as building safety red-teaming datasets and online instruction tuning, where human judgment remains irreplaceable for quality-critical decisions, particularly in human-centric sociocultural contexts where data quality, difficulty, and reward signals are difficult to verify logically.
# Human annotation configuration
from typing import Dict, List


class HumanAnnotationConfig:
    """Preference annotation pipeline configuration."""

    def __init__(self):
        self.process = [
            {
                "human_preference_annotation_mapper": {
                    "wait_for_annotations": True,  # Block until annotations complete
                    "timeout": 3600,  # Maximum wait time in seconds
                    "prompt_key": "prompt",  # Source field for prompts
                    "answer1_key": "answer1",  # First candidate response
                    "chosen_key": "chosen",  # Selected response key
                }
            }
        ]

    def get_pipeline(self) -> List[Dict]:
        """Get the annotation processing pipeline."""
        return self.process
Listing 6: Configuration for human preference annotation.
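For reference, the sketch below illustrates the kind of preference record such a pipeline might emit once annotations are committed. The field names mirror the keys configured in Listing 6, while the `rejected` and `annotator_id` fields and the `DPORecord` class name are illustrative assumptions rather than Trinity-RFT’s actual DPODataModel definition.

```python
# Illustrative preference record; field names beyond prompt/chosen are assumptions,
# not the actual DPODataModel schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DPORecord:
    prompt: str                          # Source prompt shown to annotators
    chosen: str                          # Response selected by the annotator
    rejected: str                        # The non-selected candidate response
    annotator_id: Optional[str] = None   # For tracking preference lineage


# Example record an annotation batch might produce (values taken from Figure 15)
record = DPORecord(
    prompt="What is the capital of France?",
    chosen="Paris",
    rejected="Lyon",
    annotator_id="AD",
)
```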
<details>
<summary>x13.png Details</summary>

### Visual Description
## Screenshot: Label Studio Interface
### Overview
The image depicts a user interface from Label Studio, a data labeling tool. The layout includes a left sidebar with a list of questions, a central area displaying a question and answer options, and a right panel with metadata tabs. The interface is designed for annotating data, likely for machine learning tasks.
### Components/Axes
- **Left Sidebar**:
- A vertical list of questions, each prefixed with a checkbox and a count of "0" (e.g., "What is the capital of France?", "Which planet is known as the Red Planet?").
- The first question ("What is the capital of France?") is highlighted in blue, indicating selection.
- Questions are numbered sequentially (e.g., "0", "1", "2") but lack explicit axis labels.
- **Central Area**:
- A question displayed in a blue banner: "What is the capital of France?"
- Two answer options in dark gray banners: "Paris" (left) and "Lyon" (right).
- No explicit axis titles or legends, but the layout suggests a multiple-choice format.
- **Right Panel**:
- Tabs labeled "Info," "History," and "Selection Details."
- "Selection Details" is empty, with no data or annotations.
- A "Regions" tab is partially visible but lacks content.
### Detailed Analysis
- **Left Sidebar**:
- Questions are static text with no numerical values or trends.
- The count "0" next to each question may indicate unanswered or unselected items.
- **Central Area**:
- The question "What is the capital of France?" is the active focus.
- Answer options "Paris" and "Lyon" are presented as selectable choices.
- **Right Panel**:
- Tabs suggest metadata tracking (e.g., "Info" for general details, "History" for versioning).
- "Selection Details" is empty, implying no data has been saved or processed yet.
### Key Observations
- The interface is in a **pre-labeling state**, as no answers are selected or saved.
- The color scheme uses **blue** for the active question and **dark gray** for answer options, likely to distinguish interactive elements.
- The absence of data in "Selection Details" suggests the user has not yet finalized annotations.
### Interpretation
This interface is designed for **iterative data annotation**, where users select answers from predefined options. The empty "Selection Details" indicates the workflow is in its initial phase, with no committed annotations. The structured layout implies a focus on **efficiency and clarity**, with questions and answers organized for easy navigation. The lack of numerical data or trends suggests the tool prioritizes qualitative labeling over quantitative analysis.
No numerical data or trends are present; the image focuses on UI structure and textual content for data labeling.
</details>
Figure 15: An interactive interface for human preference annotation.
3.6 Low-Code Usage and Development with Trinity-Studio
<details>
<summary>figs/studio-showcase/dashboard.png Details</summary>

### Visual Description
## Screenshot: RFT Portal Dashboard Interface
### Overview
The image depicts a user interface for an RFT (reinforcement fine-tuning) Portal Dashboard. The layout is structured as a responsive grid with three primary functional sections aligned horizontally. A dark navigation bar spans the top, containing tabs for navigation, while the main content area features interactive buttons and descriptive text blocks.
### Components/Axes
1. **Navigation Bar (Top)**
- Tabs: `Dashboard` (highlighted), `pgAdmin`, `Label Studio`, `Training Portal`, `Settings`
- Background: Dark blue (#2c3e50)
- Text Color: White
2. **Main Content Area**
- **Title**: "RFT Portal Dashboard" (Centered, bold, dark gray)
- **Three Functional Sections** (Equal-width columns):
- **Training Portal** (Left)
- Icon: Blue circle with "TP" (White text)
- Button: "Open Training Portal" (Blue, rounded rectangle)
- Description: "Access the training portal to manage your training data and models."
- **pgAdmin** (Center)
- Icon: Green circle with "DB" (White text)
- Button: "Open pgAdmin" (Green, rounded rectangle)
- Description: "Manage your PostgreSQL databases with pgAdmin."
- **Label Studio** (Right)
- Icon: Red circle with "LS" (White text)
- Button: "Open Label Studio" (Red, rounded rectangle)
- Description: "Label and annotate your data with Label Studio."
### Detailed Analysis
- **Navigation Bar**: The `Dashboard` tab is visually distinguished by a darker shade of blue, indicating it is the active view. Other tabs are uniformly styled with white text on a dark blue background.
- **Functional Sections**:
- **Training Portal**: Positioned leftmost, uses blue branding (icon and button) to signify its association with data/model management.
- **pgAdmin**: Central section, green branding aligns with database management themes.
- **Label Studio**: Rightmost section, red branding emphasizes data annotation tasks.
- **Buttons**: Each section includes a prominently styled button with matching color to its icon, ensuring visual consistency and intuitive navigation.
### Key Observations
- **Color Coding**: Each functional section uses a distinct primary color (blue, green, red) to differentiate its purpose, aiding user recognition.
- **Button Placement**: Buttons are centrally aligned within their respective sections, drawing immediate attention.
- **Text Hierarchy**: Descriptions are concise and positioned below buttons, ensuring clarity without visual clutter.
### Interpretation
The dashboard is designed for administrative and data management tasks, with a clear separation of responsibilities across the three sections. The use of color coding and consistent button styling suggests an emphasis on usability, allowing users to quickly identify and access tools for training data management (Training Portal), database administration (pgAdmin), and data labeling (Label Studio). The absence of numerical data or dynamic visualizations implies this is a static interface for tool access rather than data analysis. The structure prioritizes simplicity, likely targeting users who require direct interaction with backend systems rather than exploratory data analysis.
</details>
(a) Trinity-Studio dashboard.
<details>
<summary>figs/studio-showcase/training-portal-click-run.jpg Details</summary>

### Visual Description
## Screenshot: Training Portal Interface
### Overview
The image depicts a web-based training portal interface with a dark-themed navigation bar, configuration settings, and a generated configuration file display. The interface includes input fields for hyperparameters, a job submission status, and a link to a Ray Dashboard for progress tracking.
### Components/Axes
1. **Navigation Bar (Top)**
- Tabs: Dashboard, pgAdmin, Label Studio, Training Portal (highlighted), Settings
- "Training Portal" tab is active, indicated by a blue banner below it
- Blue banner contains:
- Hamburger menu icon (☰)
- "Training Portal" text
- Gear icon labeled "TOOLS"
- "Deploy" button with three-dot menu
2. **Configuration Settings**
- **Micro Batch Size Per GPU**:
- Label: "Micro Batch Size Per GPU :blue-badge"
- Input field: `8` (with +/- adjustment controls)
- **Learning Rate**:
- Label: "Learning Rate :blue-badge"
- Input field: `1.0e-6` (with +/- adjustment controls)
- **Generate Config Button**:
- Icon: Folder with plus sign (+)
- Text: "Generate Config"
3. **Generated Config File**
- Header: "Generated Config File"
- Actions:
- "Save" button (with download icon)
- "Run" button (with play icon)
- Status Notification:
- Green banner with checkmark: "Job submitted successfully!"
- Link: "View progress in the Ray Dashboard: http://127.0.0.1:8265"
- Config Details:
- `mode: both`
- `data:`
- `total_epochs: 20` (orange text)
- `batch_size: 96` (orange text)
### Key Observations
- The "Training Portal" tab is emphasized through highlighting and a dedicated blue banner
- Hyperparameter values are explicitly set (batch size = 8, learning rate = 1e-6)
- The generated config file includes both training modes and specific data parameters
- A direct link to the Ray Dashboard is provided for real-time monitoring
- The interface uses color coding (blue badges, green success notification) for visual hierarchy
### Interpretation
This interface appears to be a machine learning training configuration tool. The presence of both batch size and learning rate parameters suggests it's used for neural network training. The generated config file indicates the system is designed for reproducibility, with explicit parameter values and a direct link to monitoring tools. The "Deploy" button implies this interface connects to a production environment, while the Ray Dashboard link suggests integration with a distributed computing framework for tracking training progress. The use of scientific notation for the learning rate (1.0e-6) indicates precision requirements typical in deep learning applications.
</details>
(b) Start training on the “Training Portal” page.
<details>
<summary>figs/studio-showcase/pgadmin-select-table.jpg Details</summary>

### Visual Description
## Screenshot: pgAdmin Database Management Interface
### Overview
This image depicts a database management interface for PostgreSQL within the pgAdmin tool. The interface shows a table structure view for the `experience_buffer` table, along with navigation elements and an SQL query input area. The environment is labeled as "Local Development," indicating a non-production setup.
### Components/Axes
1. **Navigation Bar** (Top):
- Tabs: Dashboard, pgAdmin (active), Label Studio, Training Portal, Settings
- Blue banner with "pgAdmin" branding on the left
- "Local Development" dropdown menu on the right
2. **Left Sidebar**:
- Database dropdown labeled "testdb"
- Table list with icons:
- xxx
- sft_data_buffer
- rft_dataset
- task_buffer
- experience_buffer (highlighted)
- dpo_data_buffer
3. **Main Content Area**:
- Table structure header: "Table Structure: experience_buffer"
- Column metadata table with three columns:
- **Column**: Column names
- **Type**: Data types
- **Nullable**: YES/NO indicators
4. **SQL Query Section**:
- Text input box labeled "Enter SQL query..."
- Blue "EXECUTE QUERY" button with play icon
### Detailed Analysis
**Table Structure Details**:
| Column | Type | Nullable |
|-----------------|-------------------|----------|
| consumed | integer | YES |
| priority | double precision | YES |
| serialized_exp | bytea | YES |
| id | integer | NO |
| reward | double precision | YES |
| response | character varying | YES |
| prompt | character varying | YES |
**Key UI Elements**:
- All tables use identical vertical bar icons (5 horizontal bars)
- Nullable column shows "YES" for all except `id` (NO)
- Data types include integer, double precision, bytea, and character varying
- SQL input area has placeholder text "Enter SQL query..."
### Key Observations
1. The `id` column is the only non-nullable field in the table
2. All other columns allow NULL values
3. The interface uses consistent styling for data types:
- Numeric: integer/double precision
- Binary: bytea
- Text: character varying
4. The highlighted `experience_buffer` table suggests active user focus
5. "Local Development" environment implies testing/staging context
### Interpretation
This interface demonstrates a typical PostgreSQL database management workflow:
1. **Navigation**: Users switch between tools via the top tabs
2. **Database Selection**: The "testdb" database is currently active
3. **Table Inspection**: The `experience_buffer` table's structure reveals:
- A mix of numeric, binary, and text data types
- An auto-incrementing primary key (`id`)
- Potential machine learning/experiment tracking use case (buffer, reward, response)
4. **Query Execution**: The SQL input area enables direct database interaction
The "Local Development" label suggests this is a development environment, possibly for testing database schemas or running analytical queries before deployment to production. The highlighted table and SQL input area indicate active database exploration or modification.
</details>
(c) Manage data on the “pgAdmin” page.
<details>
<summary>figs/studio-showcase/label-studio-enter.jpg Details</summary>

### Visual Description
## Screenshot: Label Studio Dashboard Interface
### Overview
The image shows a web application dashboard for "Label Studio," a data labeling platform. The interface includes navigation controls, project management tools, and resource links. Key elements include a welcome message, project creation options, recent project tracking, and documentation resources.
### Components/Axes
1. **Navigation Bar (Top)**
- Tabs: Dashboard, pgAdmin, Label Studio (active), Training Portal, Settings
- Active tab: "Label Studio" (highlighted with darker background)
2. **Main Content Area**
- **Header Section**
- Logo: "Label Studio" with orange square icon
- Navigation: Hamburger menu icon (three horizontal lines)
- User indicator: Circular icon with "AD" initials and blue dot (unread notification)
- **Welcome Section**
- Text: "Welcome 👋 Let's get you started."
- Action buttons:
- "Create Project" (folder icon with "+")
- "Invite People" (person icon with "+")
- **Recent Projects Section**
- Header: "Recent Projects"
- "View All" link (blue text)
- Project entries:
1. `Human_Preference_Annotation_Demo_acc038`
- Status: "10 of 10 Tasks (100%)"
- Progress bar: Full (teal color)
2. `Human_Preference_Annotation_Demo_8a87e7`
- Status: "10 of 10 Tasks (100%)"
- Progress bar: Full (teal color)
- **Resources Section (Right Panel)**
- Header: "Resources"
- Subheader: "Learn, explore and get help"
- Links with icons:
- Documentation (📄 icon)
- API Documentation (📄 icon)
- Release Notes (📄 icon)
- LabelStudi.io Blog (📄 icon)
- Slack Community (📄 icon)
3. **Footer**
- Version note: "Label Studio Version: Community"
- Logo: Orange square icon (matches header)
### Detailed Analysis
- **Navigation**: Standard dashboard layout with persistent top navigation. Active tab styling indicates current context.
- **Project Management**:
- Two identical project names suggest template usage or duplicate entries.
- 100% completion for both projects implies successful task execution.
- **Resource Accessibility**:
- All resources have identical iconography (document symbol), suggesting unified documentation structure.
- Slack Community link indicates active user support channels.
### Key Observations
1. **Duplicate Projects**: Two projects share identical names but different UUIDs (`acc038` vs `8a87e7`), possibly indicating template variations.
2. **Completion Status**: Both projects show 100% completion, suggesting either:
- Immediate task finalization
- Test data with pre-filled annotations
3. **Resource Uniformity**: Consistent iconography across documentation links implies standardized documentation format.
### Interpretation
This dashboard serves as a centralized hub for data labeling workflows. The presence of identical 100% completed projects suggests either:
1. A demonstration environment with pre-configured test data
2. A production setup where tasks are completed rapidly
3. A potential UI inconsistency showing duplicate project entries
The resources section emphasizes community support and documentation accessibility, indicating the platform's focus on collaborative annotation work. The "AD" user initials suggest administrative access, with the blue dot indicating pending notifications that could relate to project updates or user activity.
The interface design prioritizes quick access to project management tools and documentation, with clear visual indicators for task completion status. The duplicate project entries warrant further investigation to determine if this represents intentional template usage or a potential UI bug.
</details>
(d) Process data on the “Label Studio” page.
Figure 16: Snapshots of Trinity-Studio.
Trinity-Studio provides a visual interface to the core capabilities of Trinity-RFT, designed to bridge the gap between system complexity and user accessibility. As shown in Figure 16(a), its three integrated modules — “Training Portal”, “pgAdmin”, and “Label Studio” — form a cohesive interface that supports low-code usage and development with Trinity-RFT and makes it easy to monitor and track the full RFT pipeline with transparency.
- “Training Portal” (Figure 16(b)) implements configuration-to-execution procedures through declarative YAML editing with auto-completion and live validation that prevents misconfigurations. Furthermore, the integration of runtime metrics with tools like WandB/TensorBoard directly supports the active data optimization feature by surfacing signals such as the difficulty distribution drifts and diversity metrics mentioned in Section 3.4. This transparency ensures that users can monitor how data curation strategies impact RFT performance in real time.
- “pgAdmin” (Figure 16(c)) reflects Trinity-RFT’s end-to-end data transformation capabilities by providing a visual panel for PostgreSQL-based storage. This design supports the versioned data lineage requirements of RFT, particularly for scenarios involving asynchronous training (Section 2.3.3). With intuitive SQL query builders, users can easily adjust schemas, audit training experiences, and inspect human annotation batches with fine-grained precision; a minimal query sketch follows this list. This capability is valuable for rapid validation of active learning policies by cross-referencing training outcomes with metadata (e.g., difficulty scores and staleness in asynchronous mode).
- “Label Studio” (Figure 16(d)) operationalizes Trinity-RFT’s bi-directional human-AI collaboration capability (Section 2.3.4). Using the provided task polling and atomic batch commit mechanisms, users can annotate data or experiences directly, providing an asynchronous way to incorporate human feedback and dynamically influence data curation.
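As a concrete example of the kind of audit described above, the snippet below sketches a query against the `experience_buffer` table from Figure 16(c). The column names follow the table structure shown there; the connection parameters are placeholders, and using `psycopg2` directly is an illustrative assumption rather than Trinity-RFT’s prescribed data-access path.

```python
# Illustrative audit query against the experience_buffer table (Figure 16(c)).
# Connection parameters are placeholders; psycopg2 usage is an assumption,
# not Trinity-RFT's prescribed data-access API.
import psycopg2

conn = psycopg2.connect(dbname="testdb", user="postgres", password="postgres", host="localhost")
try:
    with conn.cursor() as cur:
        # Inspect high-reward experiences that have not yet been consumed by the trainer
        cur.execute(
            """
            SELECT id, reward, priority, consumed
            FROM experience_buffer
            WHERE consumed = 0 AND reward > %s
            ORDER BY priority DESC
            LIMIT 20
            """,
            (0.8,),
        )
        for row_id, reward, priority, consumed in cur.fetchall():
            print(f"id={row_id} reward={reward:.3f} priority={priority:.3f} consumed={consumed}")
finally:
    conn.close()
```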
By unifying these capabilities in a single UI, Trinity-Studio reduces the cognitive load of managing complex RFT procedures. For example, a researcher tuning a math reasoning task could use the Training Portal to adjust difficulty scoring parameters, view the resulting distribution shifts in the pgAdmin module, and then validate human annotators’ preferences in the Label Studio page. This end-to-end visibility can be useful for debugging and iterating RFT strategies, and complements the programmatic APIs of Trinity-RFT while maintaining full compatibility with CLI procedures.
We implement Trinity-Studio with the single-spa framework [33]. The modular architecture enables custom view development through JavaScript plugins and flexible extensions for general-purpose usage.
4 Conclusion and Next Steps
We have presented Trinity-RFT, a general-purpose, flexible, scalable and user-friendly framework for reinforcement fine-tuning of large language models. Trinity-RFT offers a path into “the era of experience” [32], by supporting applications in diverse scenarios with complex agent-environment interaction, and serving as a unified platform for exploring advanced methodologies in each stage of the complete RFT pipeline, at both macroscopic and microscopic levels.
Further development of Trinity-RFT includes incorporating more advanced RL algorithms (especially off-policy or asynchronous ones), making the choices of hyperparameters more adaptive and less reliant on manual tuning, augmenting data pipelines with smarter sampling strategies and data processing operations, and conducting more thorough experiments and evaluations with Trinity-RFT.
Acknowledgements
Trinity-RFT is built upon many excellent open-source projects, including but not limited to: verl [30] and PyTorch’s FSDP [8] for LLM training; vLLM [15] for LLM inference; Data-Juicer [2] for data-related functionalities; AgentScope [11] for agentic workflow; and Ray [20] for distributed runtime.
References
- [1] ChatLearn. https://github.com/alibaba/ChatLearn.
- [2] Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data-juicer: A one-stop data processing system for large language models. In International Conference on Management of Data, 2024.
- [3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [4] Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. Agent-rlvr: Training software engineering agents via guidance and environment rewards. arXiv, 2025.
- [5] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, 2025.
- [6] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023.
- [7] Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Mohammad Gheshlaghi Azar, Olivier Pietquin, and Matthieu Geist. Contrastive policy gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion. In EMNLP, 2024.
- [8] Pytorch FSDP. https://pytorch.org/docs/stable/fsdp.html.
- [9] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv, 2025.
- [10] Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv, 2025.
- [11] Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Ze Yu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. Agentscope: A flexible yet robust multi-agent platform. arXiv, 2024.
- [12] Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv, 2025.
- [13] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv, 2024.
- [14] Huggingface. https://huggingface.co/.
- [15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [16] Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions. arXiv, 2025.
- [17] Model context protocol servers. https://github.com/modelcontextprotocol/servers.
- [18] Mistral-AI. Magistral. arXiv, 2025.
- [19] Modelscope. https://www.modelscope.cn/home.
- [20] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging ai applications. arXiv, 2018.
- [21] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In NIPS, 2017.
- [22] Nccl. https://github.com/NVIDIA/nccl.
- [23] Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models. In The Thirteenth International Conference on Learning Representations, 2025.
- [24] Long Ouyang, Pamela Mishkin, Jeff Wu, Carroll L. Wainwright, Jacob Hilton, Amanda Askell, and Paul Christiano. Training language models to follow instructions with human feedback. arXiv, 2022.
- [25] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- [26] Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment. arXiv, 2024.
- [27] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
- [29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024.
- [30] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient rlhf framework. arXiv, 2024.
- [31] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021.
- [32] David Silver and Richard S. Sutton. Welcome to the era of experience. https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf, 2025.
- [33] single-spa: A JavaScript framework for front-end microservices. https://single-spa.js.org/, 2025.
- [34] Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023.
- [35] Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv, 2025.
- [36] Qwen Team. Qwen2.5 technical report, 2025.
- [37] ROLL Team and Other ROLL Contributors. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv, 2025.
- [38] TensorBoard. https://www.tensorflow.org/tensorboard.
- [39] Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020-2025. Open source software available from https://github.com/HumanSignal/label-studio.
- [40] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- [41] Weights & Biases. https://wandb.ai/home.
- [42] Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye HAO, Jun Wang, and Kun Shao. DistRL: An asynchronous distributed reinforcement learning framework for on-device control agent. In The Thirteenth International Conference on Learning Representations, 2025.
- [43] Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training. arXiv, 2025.
- [44] LLM-Core-Team Xiaomi. Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining. arXiv, 2025.
- [45] Tianbing Xu. Training large language models to reason via EM policy gradient. arXiv, 2025.
- [46] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv, 2025.
- [47] Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, and Bolin Ding. Group-relative reinforce is secretly an off-policy algorithm: Demystifying some myths about grpo and its friends. arXiv, 2025.
- [48] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- [49] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
- [50] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv, 2025.