# Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
**Authors**: Alibaba Group
Abstract
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.

*Equal contribution. †Corresponding author. Contact: {chenyanxi.cyx,panxuchen.pxc,daoyuanchen.cdy,yaliang.li,bolin.ding}@alibaba-inc.com
GitHub: https://github.com/modelscope/Trinity-RFT
Documentation: https://modelscope.github.io/Trinity-RFT
Note: Trinity-RFT is currently under active development. This technical report corresponds to commit id 63d4920 (July 14, 2025) of the GitHub repository, and will be continuously updated as the codebase evolves. Comments, suggestions and contributions are welcome!
1 Introduction
Reinforcement learning (RL) has achieved remarkable success in the development of large language models (LLMs). Examples include aligning LLMs with human preferences via reinforcement learning from human feedback (RLHF) [24], and training long-CoT reasoning models via RL with rule-based or verifiable rewards (RLVR) [5, 35]. However, such approaches remain limited in their ability to handle dynamic, agentic, and continual learning in the real world.
We envision a future where AI agents learn by interacting directly with environments, collecting lagged or complex reward signals, and continuously refining their behavior through RL based on the collected experiences [32]. For example, imagine an AI scientist who designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards and feedback when the experiment is finally finished.
This vision motivates us to develop Trinity-RFT, a reinforcement fine-tuning (RFT) framework that aims to offer a path into this future. The modular, decoupled, and trinity design of Trinity-RFT, illustrated in Figure 1, along with its various features, makes it a promising solution for realizing such a vision.
<details>
<summary>x2.png Details</summary>

### Visual Description
The diagram shows the RFT-core (containing the Explorer, Buffer, and Trainer) at its center, resting on the LLM Infra layer (training, inference, model synchronization). The Explorer interacts with the environment and humans, writing rollout experiences into the Buffer; the Buffer supplies training data to the Trainer, which processes training batches on the LLM Infra; Data Pipelines clean, filter, prioritize, and synthesize the buffered data; and a bidirectional link between the Explorer and Trainer synchronizes model weights. The components are arranged in three levels: high (environment & human, data pipelines), middle (buffer, explorer, trainer), and low (LLM infrastructure), emphasizing a modular design with continuous feedback loops.
</details>
Figure 1: The high-level design of Trinity-RFT.
1.1 Key Features of Trinity-RFT
Trinity-RFT is a general-purpose, unified, scalable and user-friendly RL framework that can be easily adapted for diverse experimental or real-world scenarios. It integrates both macroscopic and microscopic RL methodologies in one place; roughly speaking, the former deals with natural language and plain text, while the latter handles torch.Tensor objects (such as token probabilities, gradients, and model weights of LLMs). Many prior RL works for games/control/LLMs focus mainly on the microscopic aspect, e.g., designing policy loss functions or optimization techniques for updating the policy model. On the other hand, pre-trained LLMs, as generative models with rich prior knowledge of natural language and the world, open up numerous opportunities at the macroscopic level, e.g., experience synthesis by reflection or reasoning with environmental feedback [4], and leveraging existing text-processing methods like deduplication and quality filtering [2]. The key features of Trinity-RFT are presented below; they will be further elaborated in Section 2 and exemplified in Section 3.
An RFT-core that unifies and generalizes diverse RL modes.
Trinity-RFT implements diverse RL methodologies in a unified manner, supporting synchronous/asynchronous, on-policy/off-policy, and online/offline training. These RL modes can be flexibly generalized, e.g., into a hybrid mode that incorporates expert trajectories to accelerate an online RL process [21, 46]. This unification is made possible partly by our decoupled design (introduced in Section 2.1), which allows rollout and training to be executed separately and scaled up independently on different devices, while having access to the same stand-alone experience buffer. The efficacy of various RL modes has been validated empirically by our experiments in Section 3.3, which particularly highlight the efficiency gains from off-policy and asynchronous methods.
Agent-environment interaction as a first-class citizen.
Trinity-RFT allows delayed rewards and environmental feedback in multi-step or time-lagged feedback loops, handles long-tailed latencies and the straggler effect via asynchronous and streaming LLM inference, and deals with environment/agent failures gracefully via dedicated timeout/retry/skip mechanisms. Together, these ensure the efficiency and robustness of continuous agent-environment interaction in complex real-world scenarios.
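As a rough sketch of the retry/skip behavior described above (the function name and return convention are illustrative, not Trinity-RFT's actual API), a failed environment call can be retried a bounded number of times and then skipped rather than allowed to crash the explorer:

```python
import time

def run_with_retries(step_fn, max_retries=3, backoff_s=0.0):
    """Retry a flaky environment/agent call; skip (return None) when exhausted.

    Hypothetical sketch: `step_fn` stands in for one agent-environment
    interaction that may raise on failure.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return step_fn()
        except Exception:
            if attempt == max_retries:
                return None  # skip this trajectory instead of crashing the explorer
            time.sleep(backoff_s * attempt)  # simple linear backoff between retries
```

Timeouts would be handled analogously, by cancelling a call that exceeds its budget and treating it as one failed attempt.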
Systematic data pipelines optimized for RFT.
Figure 2 illustrates the high-level design of data pipelines in Trinity-RFT, which regard rollout tasks and experiences as dynamic assets to be actively managed throughout the RFT lifecycle. Trinity-RFT empowers users to: (1) curate tasks for curriculum learning, e.g., by prioritizing easier tasks at the beginning of training to stabilize and accelerate the learning process; (2) actively manipulate experiences by cleaning, filtering, or synthesizing new ones, such as repairing failed trajectories or amplifying successful ones; (3) perform online reward shaping by augmenting sparse environmental rewards with dense, computed signals, such as quality or diversity scores; and (4) customize interfaces for human-in-the-loop curation and utilize an agentic paradigm for RFT data processing that translates high-level natural language commands (e.g., “improve response diversity and safety for coding scenarios”) into complex data pipelines, powered by established community tools like Data-Juicer [2]. For instance, Section 3.4 presents experiments demonstrating the efficacy of task prioritization and reward shaping empowered by Trinity-RFT.
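To make point (1) concrete, a minimal curriculum-style task filter might admit harder tasks only as training progresses. Everything below (function name, difficulty scores, threshold schedule) is a hypothetical illustration, not Trinity-RFT's built-in pipeline:

```python
def curriculum_order(tasks, difficulty_fn, step, total_steps):
    """Toy curriculum sketch: early in training, keep only easy tasks;
    admit harder ones as training progresses.

    `difficulty_fn` maps a task to a score in [0, 1]; the 0.3 -> 1.0
    threshold schedule is an arbitrary illustrative choice.
    """
    progress = step / max(total_steps, 1)
    threshold = 0.3 + 0.7 * progress  # moving difficulty cutoff
    return [t for t in sorted(tasks, key=difficulty_fn)
            if difficulty_fn(t) <= threshold]
```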
<details>
<summary>x3.png Details</summary>

### Visual Description
The diagram traces the flow of data from left to right: a Raw Dataset is converted into a Task Set, which can be augmented (synthesis, prioritization, etc.); tasks are sampled from the Task Set for rollout by the Explorer; the resulting rollout experiences undergo experience shaping, cleaning, or synthesis before entering the Buffer; and the Trainer samples experience batches from the Buffer for RL training, with a local experience replay branch. A Data Pipelines node at the top connects to all of these stages, representing the overarching orchestration.
</details>
Figure 2: The high-level design of data pipelines in Trinity-RFT.
User-friendliness as a top priority.
For development and research, the modular and decoupled design of Trinity-RFT allows the user to develop new RFT methodologies by adding one or a few small, plug-and-play classes (modified from built-in templates) that implement the essential functionalities of interest, with minimal code duplication and no intrusive changes to the codebase. An example can be found in Section 3.2, which shows that three compact Python classes (around 200 lines of code in total) suffice for implementing a hybrid RL process that leverages samples from multiple data sources and updates the policy model with a customized loss function. For applications, the user can adapt Trinity-RFT to a new scenario by simply implementing a single Workflow class that specifies the logic of agent-environment interaction, as will be exemplified in Section 3.1. To further enhance usability, Trinity-RFT incorporates various graphical user interfaces that support low-code usage and development, enhance the transparency of the RFT process, and facilitate easy monitoring and tracking.
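As an illustrative sketch of this idea (the class and method names are hypothetical, not Trinity-RFT's exact Workflow interface), a user-defined workflow only needs to specify how one task is rolled out and rewarded:

```python
class EchoWorkflow:
    """Hypothetical sketch of the single class a user supplies: it runs one
    episode of agent-environment interaction and returns the collected
    experiences (trajectory plus reward)."""

    def __init__(self, task, model):
        self.task = task      # e.g. a dict with "question" and "answer" fields
        self.model = model    # the rollout model, modeled here as a callable

    def run(self):
        prompt = self.task["question"]
        response = self.model(prompt)
        # toy rule-based reward: 1.0 if the reference answer appears verbatim
        reward = 1.0 if self.task["answer"] in response else 0.0
        return [{"prompt": prompt, "response": response, "reward": reward}]
```

A real workflow would typically involve multiple turns of tool use or environment feedback, but the contract stays the same: take a task, return experiences.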
1.2 Related Works
There exist numerous open-source RLHF frameworks, such as veRL [30], OpenRLHF [13], TRL [40], ChatLearn [1], Asynchronous RLHF [23], among others. Some of them have been further adapted for training long-CoT reasoning models or for agentic RL more recently.
Concurrent to Trinity-RFT, some recent works on LLM reinforcement learning also advocate a decoupled and/or asynchronous design; examples include StreamRL [50], MiMo [44], AReaL [9], ROLL [37], LlamaRL [43], Magistral [18], AsyncFlow [12], among others.
Complementing this large body of related work, Trinity-RFT provides the community with a new solution that is powerful, easy to use, and unique in certain aspects. In a nutshell, Trinity-RFT aims to be general-purpose and applicable to diverse application scenarios, while unifying various RFT modes, RFT methodologies at macroscopic and microscopic levels, and RFT-core/agent-environment interaction/data pipelines. Such a system-engineering perspective makes Trinity-RFT particularly useful for handling the whole RFT pipeline in one place. We also hope that some specific features of Trinity-RFT, such as data persistence in the experience buffer and distributed deployment of multiple independent explorers, will open up new opportunities for LLM reinforcement fine-tuning.
2 Design and Implementations
The overall design of Trinity-RFT exhibits a trinity consisting of (1) RFT-core, (2) agent-environment interaction, and (3) data pipelines, which are illustrated in Figure 1 and elaborated in this section.
2.1 RFT-Core
RFT-core is the central component of Trinity-RFT, highlighted at the center of Figure 1, where the core RFT process takes place. Its design also exhibits a trinity, consisting of the explorer, buffer, and trainer.
- The explorer, powered by a rollout model, takes a task as input and solves it by executing a workflow that specifies the logic of agent-environment interaction, thereby collecting experiences (including rollout trajectories, rewards, and other useful information) to be stored in the buffer.
- The buffer stores experiences that can be generated by the explorer or by other sources, such as human experts. It can be realized in various forms, such as a non-persistent ray.Queue or a persistent SQLite database. It also assists with fetching training samples for the trainer, and can be integrated with advanced sampling strategies and post-processing operations.
- The trainer, backed by a policy model, samples batches of experiences from the buffer and updates the policy model via RL algorithms.
Our implementations allow the explorer and trainer to be deployed on separate machines and act independently. They are only connected via (1) access to the same experience buffer with a customizable sampling strategy, and (2) model weight synchronization by a customizable schedule. See Figure 3 for an illustration. This decoupled design of RFT-core offers support for diverse RFT modes with great flexibility.
<details>
<summary>x4.png Details</summary>

### Visual Description
The diagram shows a Task Data Processor drawing tasks from a Taskset and feeding them to the Workflow Runner, inside which the Agent and Environment interact in a cyclical action-reward loop. The Explorer (containing the rollout model and reward model) emits raw experiences, which an Experience Data Processor transforms into verified experiences stored in the Buffer. The Trainer (containing the reference, actor, and critic models) samples experience batches from the Buffer for training, and a "Sync. Weights" link periodically synchronizes the Trainer's model weights back to the Explorer. The separation of the Explorer and Trainer, connected only through the Buffer and weight synchronization, reflects the decoupled design described in the text.
</details>
Figure 3: The architecture of RFT-core in Trinity-RFT.
2.1.1 Unified Support for Diverse RFT Modes
We present the RFT modes supported by Trinity-RFT, some of which are demonstrated in Figure 4.
<details>
<summary>x5.png Details</summary>

### Visual Description
Four timeline sub-diagrams compare the Explorer, Buffer, and Trainer lanes under (a) synchronous, (b) one-step off-policy, (c) fully asynchronous, and (d) multi-explorer asynchronous modes. A legend color-codes the phases: Rollout (yellow), Train (green), weight synchronization via NCCL (red), and weight synchronization via checkpoint (pink). In (a), the trainer waits for experiences before each training phase; in (b), a one-step offset between explorer and trainer reduces waiting at the cost of slight off-policyness; in (c), the explorer and trainer run independently and synchronize weights through checkpoints; in (d), two explorer lanes feed the same buffer, with NCCL-based weight synchronization. Moving from (a) to (d), synchronization overhead decreases while parallelism and off-policyness increase.
</details>
Figure 4: A visualization of diverse RFT modes supported by Trinity-RFT, including: (a) synchronous mode, with sync_interval=2; (b) one-step off-policy mode, with sync_interval=1 and sync_offset=1; (c) fully asynchronous mode, with sync_interval=2; (d) multi-explorer asynchronous mode, with sync_interval=2. The buffer supports, in principle, arbitrary management and sampling strategies for experiences.
Synchronous mode.
In the synchronous mode shown in Figure 4 (a), the explorer and trainer are launched simultaneously, work in close coordination, and synchronize their model weights once every sync_interval training steps. Within each synchronization period, the explorer continuously generates sync_interval batches of rollout experiences and stores them in the buffer, which are then retrieved and utilized by the trainer for updating the policy model. If sync_interval=1, this is a strictly on-policy RL process, whereas if sync_interval>1, it becomes off-policy (akin to the mode adopted in [35]) and can be accelerated by pipeline parallelism between the explorer and trainer. This mode can be activated by setting the configuration parameter mode=both.
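The schedule above can be made concrete with a toy trace (purely for intuition, not Trinity-RFT code): one explore-and-train step per batch, with a weight synchronization every sync_interval steps:

```python
def sync_schedule(total_steps, sync_interval):
    """Toy trace of the synchronous mode: one batch of rollout experiences is
    produced and consumed per step, and model weights are synchronized every
    `sync_interval` steps."""
    events = []
    for step in range(1, total_steps + 1):
        events.append(f"explore+train step {step}")
        if step % sync_interval == 0:
            events.append(f"sync weights after step {step}")
    return events

# sync_interval=1 reproduces strictly on-policy training (a sync after every
# step); sync_interval>1 makes the rollouts mildly off-policy in exchange for
# pipeline parallelism.
```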
One-step off-policy mode.
This mode, demonstrated in Figure 4 (b), closely resembles the synchronous mode, except for an offset of one batch between the explorer and trainer. This allows the trainer to sample experiences from the buffer immediately after model weight synchronization, thereby streamlining the execution of explorer and trainer with smaller pipeline bubbles, at the cost of slight off-policyness. The visualization in Figure 4 (b) corresponds to configuration parameters sync_interval=1 and sync_offset=1, both of which can take more general values in Trinity-RFT.
Asynchronous mode.
In the fully asynchronous mode shown in Figure 4 (c), the explorer and trainer act almost independently. The explorer continuously generates rollout experiences and stores them in the buffer, while the trainer continuously samples experiences from the buffer and uses them for training the policy model. External experiences, e.g., those generated by expert models or humans, can be continuously incorporated into the buffer as well. The explorer and trainer independently load or save model weights from the checkpoint directory every sync_interval steps, keeping the distribution of rollout experiences up to date. This mode can be activated by setting mode=explore/train and launching the explorer and trainer separately on different GPUs.
Multi-explorer asynchronous mode.
One benefit brought by the decoupled design is that explorers and trainers can scale up independently on separate devices. As a proof of concept, Trinity-RFT offers support for a multi-explorer asynchronous mode, as demonstrated in Figure 4 (d), where multiple explorers send their generated rollout experiences to the same buffer. Scaling up the number of independent and distributed explorers can be particularly useful for resolving data scarcity and speeding up the generation of experiences in real-world scenarios where rollout trajectories have to be sampled via interaction with the physical world, or in an environment with sparse and lagged feedback. Another by-product of this multi-explorer mode is uninterrupted round-the-clock service in real-world online serving situations: since the explorers can pause and update model weights at different moments, it is guaranteed that there is always one explorer ready to serve an incoming request immediately whenever it arrives. This is in contrast to a single-explorer mode, where online service has to be paused while the explorer is updating its model weights.
Benchmark mode.
Trinity-RFT supports a benchmark mode that allows the user to evaluate one or multiple checkpoints on arbitrary benchmarks, after the RFT training process has finished. To activate this mode, the user simply needs to set mode=bench and specify the paths for the evaluation datasets in the configurations. This mode can be particularly useful for experimental purposes; for example, the user might want to try out different RFT techniques or configurations quickly (with limited evaluation on hold-out data) during training, identify which RFT trials have achieved stable convergence and high rewards, and then conduct more thorough evaluations only for the checkpoints of these successful trials.
Train-only mode.
In certain scenarios, the user would like to train the policy model without further exploration, using experiences that have already been collected and stored in the buffer. This train-only mode can be activated by setting the configuration parameter mode=train and launching the trainer alone. Offline methods like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) [25] can be regarded as special cases of such scenarios, both of which are natively supported by Trinity-RFT. For another example, consider an online RFT process that extends over a long period, where the explorer alone is launched during the daytime for serving human users and collecting experiences, while the trainer alone is launched at night for updating the policy model (which will be thoroughly validated and evaluated before actually being deployed as the rollout model for the next day).
Discussions.
We conclude this subsection with two remarks. (1) Given the unified implementation of various RFT modes, it is easy to design and implement a hybrid mode with Trinity-RFT that combines multiple modes into a single learning process. One example is learning with both online rollout data and offline-collected expert data, via jointly optimizing two loss terms corresponding to these two data sources. Section 3.2 illustrates how to implement this conveniently in Trinity-RFT. (2) We take a system-algorithm co-design perspective in the development of Trinity-RFT, aiming to unify and generalize diverse RFT methodologies in this framework. RFT-core provides the necessary infrastructure for achieving this goal. This technical report focuses on the system perspective, and we refer interested readers to the literature for recent algorithmic developments in off-policy / asynchronous RL for LLMs [21, 6, 26, 35, 7, 23, 42, 45, 46, 47].
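The hybrid mode mentioned in remark (1) can be sketched as a single training step that draws one batch from each of two buffers and sums the two loss terms. All names here are illustrative stand-ins, not Trinity-RFT's actual API; the scalar "losses" in the usage example stand in for real objectives.

```python
from collections import deque

def hybrid_step(online_buffer, expert_buffer,
                rl_loss_fn, sft_loss_fn, sft_weight=0.5):
    """One hybrid training step (illustrative sketch): jointly optimize an
    RL loss on fresh rollout experiences and an SFT loss on offline expert
    data, combined as a single weighted sum of the two loss terms."""
    rl_batch = online_buffer.popleft()   # rollout experiences with rewards
    sft_batch = expert_buffer.popleft()  # offline expert (prompt, response) pairs
    return rl_loss_fn(rl_batch) + sft_weight * sft_loss_fn(sft_batch)

# Toy usage with scalar losses standing in for the real objectives.
online = deque([{"loss": 2.0}])
expert = deque([{"loss": 1.0}])
total = hybrid_step(online, expert,
                    lambda b: b["loss"], lambda b: b["loss"], sft_weight=0.5)
# total == 2.0 + 0.5 * 1.0 == 2.5
```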
2.1.2 Implementations of RFT-Core
We present some implementation details of RFT-core in the following.
Inference and training engines.
The current version of Trinity-RFT leverages vLLM [15] as the inference engine for the explorer, which offers features including paged attention, continuous batching [49], asynchronous and concurrent inference for multiple rollout trajectories, among others. Trinity-RFT also leverages verl [30] as the training engine for the trainer, which gracefully handles model placement (for the policy, critic and reference models) and incorporates various performance optimizations for training (such as dynamic batching, management of padding and unpadding, etc.). Trinity-RFT stands on the shoulders of these excellent open-source projects, and will continue to benefit from their future development.
Experience buffer.
Trinity-RFT supports multiple types of experience buffers, ranging from a non-persistent ray.Queue to persistent SQLite or Redis databases. While a basic first-in-first-out queue is the most straightforward choice, data persistence with a database opens up many new opportunities (e.g., advanced sampling strategies), as discussed throughout this report. Trinity-RFT provides dedicated read/write control to prevent conflicts in accessing the buffer.
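To make the FIFO-with-persistence idea concrete, the following is a minimal sketch of a SQLite-backed buffer; the table schema and method names are illustrative, not Trinity-RFT's actual data model. Persistence means experiences survive restarts, and the same table could later be read with non-FIFO (e.g., prioritized) queries.

```python
import sqlite3

class SQLiteExperienceBuffer:
    """Minimal sketch of a persistent FIFO experience buffer
    (illustrative schema, not Trinity-RFT's actual implementation)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS experience ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
        )

    def write(self, payload: str):
        # `with conn` wraps the insert in a transaction, which is the
        # basic building block for conflict-free concurrent access.
        with self.conn:
            self.conn.execute(
                "INSERT INTO experience (payload) VALUES (?)", (payload,))

    def read_fifo(self, n: int):
        rows = self.conn.execute(
            "SELECT id, payload FROM experience ORDER BY id LIMIT ?", (n,)
        ).fetchall()
        with self.conn:
            self.conn.executemany("DELETE FROM experience WHERE id = ?",
                                  [(r[0],) for r in rows])
        return [r[1] for r in rows]

buf = SQLiteExperienceBuffer()
buf.write("traj-1"); buf.write("traj-2"); buf.write("traj-3")
batch = buf.read_fifo(2)   # oldest two trajectories, in insertion order
```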
Model weight synchronization.
Trinity-RFT supports model weight synchronization between the explorer and trainer by NCCL [22], or by checkpoint saving and loading. The former is faster (when available), while the latter is generally more flexible and widely applicable, especially for asynchronous RFT modes.
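The checkpoint-based route can be pictured as the explorer periodically scanning a checkpoint directory and reloading the newest weights. The `step-<n>` directory layout below is a hypothetical convention for illustration, not Trinity-RFT's actual checkpoint format.

```python
import os
import tempfile

def latest_checkpoint(ckpt_dir):
    """Sketch of checkpoint-based weight synchronization: the trainer saves
    checkpoints as step-<n> subdirectories (assumed layout), and the explorer
    periodically picks the newest one to reload its rollout weights from."""
    steps = [int(d.split("-")[1])
             for d in os.listdir(ckpt_dir) if d.startswith("step-")]
    return os.path.join(ckpt_dir, f"step-{max(steps)}") if steps else None

# Simulate a trainer that has saved three checkpoints out of order.
root = tempfile.mkdtemp()
for step in (10, 20, 15):
    os.makedirs(os.path.join(root, f"step-{step}"))
newest = latest_checkpoint(root)   # the explorer would reload from step-20
```

Because this path goes through the filesystem rather than a collective communication group, it works even when the explorer and trainer run at different paces, which is why it suits asynchronous RFT modes.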
2.2 Agent-Environment Interaction
To adapt Trinity-RFT to a new downstream scenario, the user mainly needs to define and register a customized workflow (by inheriting the base class Workflow or MultiTurnWorkflow) where the logic of agent-environment interaction for this particular scenario is implemented. Advanced methods for experience synthesis with environmental feedback [4] can be implemented in the same way as well. See Section 3.1 for detailed examples. The workflow will then be executed by workflow runners within the explorer for generating experiences, as shown in Figure 3.
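The shape of such a customized workflow can be sketched as below. The `Workflow` stand-in, the `run()` signature, and the `Experience` fields are simplified assumptions for illustration; the actual base classes in Trinity-RFT should be consulted for exact interfaces.

```python
from dataclasses import dataclass

@dataclass
class Experience:            # simplified stand-in for a rollout experience
    prompt: str
    response: str
    reward: float

class Workflow:              # stand-in for the Trinity-RFT base class
    def run(self):
        raise NotImplementedError

class MathQAWorkflow(Workflow):
    """Hypothetical single-turn workflow: query the model, score the
    answer against a reference, and emit experiences for the buffer."""
    def __init__(self, task, model):
        self.task, self.model = task, model

    def run(self):
        response = self.model(self.task["question"])
        reward = 1.0 if response.strip() == self.task["answer"] else 0.0
        return [Experience(self.task["question"], response, reward)]

# Toy usage with a mock model in place of the vLLM-backed inference engine.
wf = MathQAWorkflow({"question": "1+1=?", "answer": "2"}, model=lambda q: "2")
experiences = wf.run()
```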
Numerous challenges arise when one tries to build an RFT framework that can efficiently and robustly handle real-world interaction between the LLM-powered agent and the environment. These include long-tailed latencies, agent/environment failures, and lagged reward signals, among others. Trinity-RFT regards agent-environment interaction as a first-class citizen and incorporates various solutions to tackle these challenges, for example:
- The workflow runners in Trinity-RFT support asynchronous and streaming generation of rollout trajectories for multiple tasks. This mitigates the straggler effect caused by long-tailed latencies in rollout generation and agent-environment interaction, thereby accelerating the RFT process. Load balancing among multiple LLM inference engines within one RFT training course is also handled, and remains a direction for further optimizing the utilization of computational resources.
- Trinity-RFT incorporates various timeout/retry/skip mechanisms for fault tolerance and robustness, which ensure that continuous rollout generation would not be interrupted or blocked by individual failures in certain rounds of agent-environment interaction. This is crucial for stable and efficient learning in real-world scenarios, e.g., when the agent interacts with a large number of MCP services [17] that differ vastly in quality and availability.
- Trinity-RFT is built to provide native support for asynchronous RFT modes, which allow great flexibility in the paces of the explorer and trainer. This can boost the overall efficiency of the RFT process, compared to synchronous modes where the slower one among the explorer and trainer can block the progress of the other and cause waste of computational resources.
- For lagged reward signals, the trinity design of RFT-core offers a natural solution. As soon as the rollout trajectory (without reward values) has been generated, it is saved into the experience buffer, but marked as “not ready for training”. The explorer is now free from this task and may continue to collect experiences for other tasks. When the reward signals from the environment finally arrive, they are written to the buffer, and the corresponding experience is now marked as “ready for training”.
- For multi-turn conversations and ReAct-style workflows [48], Trinity-RFT supports concatenating multiple rounds of agent-environment interaction compactly into a single sequence, with proper masking that indicates which tokens need to be incorporated into the training objective of RL algorithms. This avoids unnecessary recomputation and thus improves training efficiency, compared to a vanilla approach that represents a $K$-turn rollout trajectory with $K$ separate samples.
- As another performance optimization, the implementation of Trinity-RFT allows resetting the environment in a workflow, rather than re-initializing it every time. This is especially useful for scenarios where setting up the environment is costly.
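The compact multi-turn packing described above can be sketched as follows, using a toy one-token-per-word tokenization for illustration. All turns are concatenated into a single sequence, and the mask is 1 only on assistant tokens, which are the ones that contribute to the RL training objective.

```python
def concat_with_mask(turns):
    """Pack a multi-turn conversation into one token sequence plus a loss
    mask (illustrative sketch; real tokenization and chat templates differ)."""
    tokens, mask = [], []
    for role, text in turns:
        words = text.split()             # toy tokenizer: one token per word
        tokens.extend(words)
        mask.extend([1 if role == "assistant" else 0] * len(words))
    return tokens, mask

turns = [("user", "what is 2 + 2"),
         ("assistant", "it is 4"),
         ("user", "double it"),
         ("assistant", "8")]
tokens, mask = concat_with_mask(turns)
# One sequence instead of K separate samples;
# mask == [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
```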
2.3 Data Pipelines
The data pipelines in Trinity-RFT aim to address fundamental challenges in RFT scenarios, such as managing heterogeneous data dynamics across interaction workflows, enabling delayed reward integration, and facilitating continuous data curation. Our solutions center on four core aspects: end-to-end data transformation, task curation, active experience shaping, and human-in-the-loop curation, each corresponding to key requirements identified in our development of RFT-core (Section 2.1).
2.3.1 End-to-end Data Transformation
To support the diverse RFT modes (e.g., synchronous or asynchronous) in Trinity-RFT, we establish a service-oriented data pipeline architecture as illustrated in Figure 5. It decouples data pipeline logic from procedure control to enable flexible RL-oriented data transformations with two key modules:
- The Formatter Module unifies disparate data sources into RFT-compatible formats, providing convenient conversion between raw inputs (e.g., meta-prompts, domain-specific corpora, and QA pairs with tagged rewards) and structured RFT representations. For efficient RFT workloads, we utilize buffer-based persistent storage (Section 2.1) to support different data models, such as ExperienceModel for prioritized rollout trajectories and DPODataModel for preference pairs. The conversion logic and data models are highly customizable to meet diverse requirements for managing experience data. This flexibility enables robust metadata recording and field normalization, which is essential for advanced scenarios such as asynchronous RFT in trainer-explorer environments, agent self-evolution from a cold start using meta-prompts, and knowledge injection from structurally complex domain-specific corpora.
- The Controller Module manages the complete data pipeline lifecycle through distributed server initialization, declarative configuration, and automated dataset persistence. It implements dynamic control mechanisms for asynchronous scenarios and protection against resource exhaustion, with configurable termination conditions based on compute quota or data quantity. This modular design enables Trinity-RFT to handle data transformations flexibly while maintaining consistency across different RFT modes.
The Formatter-Controller duality mirrors the explorer-trainer decoupling in RFT-core, enabling parallel data ingestion and model updating. This design also allows Trinity-RFT to handle delayed rewards through version-controlled experience updates while maintaining low-latency sampling for the trainer.
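The Formatter's role can be pictured as normalizing raw inputs into a structured record whose readiness flag supports delayed rewards. The field names below are illustrative assumptions, not the actual schema of ExperienceModel in Trinity-RFT.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperienceRecord:          # simplified stand-in for ExperienceModel
    prompt: str
    response: str
    reward: float
    ready_for_training: bool     # False until a lagged reward arrives

def format_qa(raw: dict) -> ExperienceRecord:
    """Formatter sketch: convert a raw QA pair with an optionally tagged
    reward into a structured RFT representation (hypothetical fields)."""
    reward: Optional[float] = raw.get("reward")
    return ExperienceRecord(
        prompt=raw["question"],
        response=raw["answer"],
        reward=reward if reward is not None else 0.0,
        # When the reward is lagged, the record is stored immediately but
        # only marked ready once environmental feedback is written back.
        ready_for_training=reward is not None,
    )

exp = format_qa({"question": "2+2?", "answer": "4", "reward": 1.0})
pending = format_qa({"question": "open the door", "answer": "done"})
```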
Figure 5: The interaction of data processor and data buffers in Trinity-RFT, divided into two key stages. Left: Task Curation & Prioritization prepares the initial tasks for the explorer. Right: Experience Shaping processes the collected trajectories from the explorer before they are used by the trainer. The data processor is a central component that operates on different buffers at different stages.
2.3.2 Task Curation and Prioritization
Before the RFT loop begins, it is crucial to prepare a high-quality set of initial tasks. This stage, depicted on the left side of Figure 5, transforms raw data into an optimized task set for the explorer.
The process begins with raw data sources (e.g., prompts, domain corpora), which are ingested into a buffer. The Data Processor, powered by over 100 operators from Data-Juicer [2], reads from this buffer to perform various curation tasks. It provides composable building blocks for experience cleaning (e.g., length filters, duplication removal), safety alignment (e.g., toxicity detection, ethics checks), and preference data synthesis (e.g., critique-conditioned augmentation). By treating Data-Juicer as a modular data processing operator pool rather than a central dependency, Trinity-RFT provides RL-specific abstractions and coherence, while benefiting from well-established data tools.
The processed data is then organized into a structured task buffer. This stage effectively implements a form of curriculum learning by allowing users to prioritize tasks (e.g., from easy to hard), guiding the explorer towards a more efficient and stable learning trajectory from the outset. This entire workflow is managed by a service-oriented architecture that decouples data logic from procedural control, ensuring flexibility and scalability, especially in asynchronous and distributed settings.
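A minimal form of the easy-to-hard prioritization described above is a sort over per-task difficulty scores. The `difficulty` annotation here is an assumed field that a data processor operator might compute (e.g., via online scoring), not a fixed Trinity-RFT schema.

```python
def curriculum_order(tasks):
    """Curriculum sketch: order the task set from easy to hard using a
    per-task difficulty score (hypothetical annotation)."""
    return sorted(tasks, key=lambda t: t["difficulty"])

tasks = [{"id": "proof", "difficulty": 0.9},
         {"id": "arithmetic", "difficulty": 0.1},
         {"id": "algebra", "difficulty": 0.5}]
ordered = [t["id"] for t in curriculum_order(tasks)]
# easy-to-hard: ["arithmetic", "algebra", "proof"]
```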
2.3.3 Active Experience Shaping
Once the explorer begins interacting with the environment, it generates a continuous stream of experience data. To maximize learning efficiency, this raw experience must be actively shaped before it reaches the trainer. This stage is shown on the right side of Figure 5.
Generated experiences are first collected in a buffer. The Data Processor is applied again with a series of transformations to clean, augment, or synthesize these experiences. This is where the core of RFT data intelligence lies. Key capabilities include:
- Agent-Driven Data Processing: Trinity-RFT introduces a powerful agentic paradigm for data manipulation. Users can define complex processing pipelines through high-level objectives, specified as either natural language commands (e.g., “improve safety” or “increase response diversity”) or explicit Data-Juicer configurations. The framework automatically translates these commands into executable workflows backed by its modular components like DataCleaner and DataSynthesizer. This design provides a user-friendly abstraction layer over the underlying Data-Juicer operators, making advanced processing functionalities accessible to both RFT users familiar with Data-Juicer and those who are not. It also facilitates the flexible injection of user-defined inductive biases into the learning process, unlocking new research directions for self-evolving agents, as we will discuss later in Section 2.3.5.
- Online Reward Shaping: The data processor can dynamically augment the reward signal. Instead of relying on a single, often sparse, task-completion reward, users can add dense rewards based on quality, diversity, or safety scores computed on the fly. This enriched feedback provides a much stronger learning signal for the trainer.
- Prioritized Experience Replay: Experiences are not treated equally. Trinity-RFT allows for flexible, multi-dimensional utility scoring to prioritize the most valuable samples for training. The DataActiveIterator supports version-controlled experience reuse and cross-task data lineage tracking, ensuring that the trainer always learns from the most informative data available. This mechanism is also critical for handling delayed rewards, as experience utilities can be updated asynchronously as new feedback arrives.
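The multi-dimensional utility scoring behind prioritized replay can be sketched as combining several per-experience signals into one priority and sampling proportionally to it. The `reward` and `novelty` fields and the linear combination are illustrative assumptions, not the DataActiveIterator's actual scoring scheme.

```python
import random

def sample_prioritized(experiences, k, weights=(1.0, 0.5), seed=0):
    """Prioritized-replay sketch: fold multiple utility dimensions
    (here, reward and novelty) into a single priority, then sample
    experiences with probability proportional to that priority."""
    w_reward, w_novelty = weights
    priorities = [w_reward * e["reward"] + w_novelty * e["novelty"] + 1e-6
                  for e in experiences]          # epsilon keeps all samplable
    rng = random.Random(seed)                    # seeded for reproducibility
    return rng.choices(experiences, weights=priorities, k=k)

pool = [{"id": 1, "reward": 0.0, "novelty": 0.1},
        {"id": 2, "reward": 1.0, "novelty": 0.9}]
batch = sample_prioritized(pool, k=4)
# Experience 2 has roughly 29x higher priority, so it dominates the batch.
```

Because the priorities are recomputed from stored fields at sampling time, updating an experience's utility when a delayed reward arrives automatically changes how often it is replayed.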
2.3.4 Human-AI Collaboration
In scenarios where human feedback is irreplaceable, Trinity-RFT establishes a bi-directional human-AI collaboration loop that provides first-class support for human annotations, based on Label Studio [39] and Data-Juicer’s HumanOPs.
- Multi-stage annotation. Trinity-RFT implements configurable procedures combining automatic pre-screening and human verification. Typical stages include preference annotation (comparative assessment of model responses), quality auditing (human verification of automated cleaning/synthesis results), and cold-start bootstrapping (initial dataset curation through expert demonstrations).
- Native asynchronism support. As the collection of human feedback is generally slower than AI/model feedback, we provide dedicated capabilities to handle both synchronous and asynchronous feedback modes, with configurable timeout and polling parameters. The feedback collaboration is based on an event-driven design, with automatic task creation upon data state changes, configurable notifications via email/Slack/webhook, and an atomic transaction model for annotation batches.
- Customization. Different applications may involve humans in heterogeneous ways. We thus prioritize flexibility in both the interaction-interface and service levels. Examples include rich built-in interfaces that can be extended in a visualized style with XML-like tags provided by Label Studio, fine-grained quality scoring for reward shaping, free-form feedback attachment for dataset shaping, among others. Moreover, for easy deployment, we provide local Label Studio instance management with automatic environment setup via Docker/pip; optimized SDK interactions with batch request coalescing; unified logging across annotation tools and ML services; and concurrent annotation campaigns through priority-based task routing, while maintaining full data lineage preserved via LineageTracker.
The decoupled design of Trinity-RFT, and the presence of a standalone experience buffer in particular, enable human feedback to participate in RL loops without breaking the asynchronous execution model. For instance, human-verified samples can be prioritized for training while fresh experiences are being collected, which is a critical capability for real-world deployment scenarios with mixed feedback sources. Further details for human-AI collaboration in Trinity-RFT will be illustrated in Section 3.5.
2.3.5 Discussion: Unlocking New Research & Development Directions
The modular design of our data pipelines and the powerful data processor open up promising research and development avenues to be further explored.
One direction is about effective management of experience data. While prior RFT works often treat the experience as a static log, Trinity-RFT enables a more sophisticated, full-lifecycle approach to data, from selective acquisition to efficient representation:
- Intelligent Perception and Collection: In an open-ended environment, what experience is “worth” recording? Storing everything creates a low signal-to-noise ratio and burdens the trainer. Trinity-RFT’s architecture allows researchers to implement active collection strategies. For instance, one could design a data processor operator that evaluates incoming experiences from the explorer based on metrics like surprise, uncertainty, or information gain, and only commits the most salient trajectories to the replay buffer. This transforms data collection from passive logging into a targeted, intelligent process.
- Adaptive Representation: Raw experience is often high-dimensional and redundant (e.g., long dialogues, complex code generation traces). How can this be distilled into a format that an agent can efficiently learn from? The data processor in Trinity-RFT acts as a powerful transformation engine. Researchers can use it to explore various representation learning techniques, such as automatically summarizing trajectories, extracting causal chains from tool usage, or converting a multi-turn dialogue into a structured preference pair. This not only makes training more efficient but also opens the door to building meta-experience (more abstract and reusable knowledge) from raw interaction data.
- Agentic Workflows: Trinity-RFT’s agent-driven processing enables the research and development of self-improving agents, e.g., by configuring the policy agent to also serve as the “processing agent” for LLM-based Data-Juicer operators. Such an agent could perform its own critique and dynamically curate its own training data, creating a truly autonomous learning and data management loop.
Another direction is about synthetic and counterfactual experience processing. The integration of synthesis operators enables research into creating “better-than-real” data. Instead of relying solely on the agent’s own trial-and-error, our framework facilitates exploring questions like:
- Dynamic and Composable Rewarding: With our framework, researchers can move beyond static, hand-crafted rewards. It is now possible to investigate dynamic reward shaping, where auxiliary signals like novelty, complexity, or alignment scores are automatically extracted from trajectories and composed into a dense reward function. How to define “good” experience and how can we learn the optimal combination of these reward components as the agent’s policy evolves?
- Experience Reorganization: Can successful sub-trajectories from different tasks be “spliced” together to solve a novel, composite task? For example, can an agent that has learned to “open a door” and “pick up a cup” synthesize a new trajectory to "enter the room and fetch the cup"?
- Failure Repair: Can the data processor identify where errors occur in a failed trajectory, and synthesize a corrected version for the trainer to learn from, effectively turning failures into valuable lessons?
- Success Amplification: Can a single successful experience be augmented into multiple diverse yet successful variants, thereby improving the generalization and robustness of the learned policy?
By providing dedicated capabilities for such advanced data and reward manipulation, Trinity-RFT aims to facilitate flexible processing of “experience data” for the next generation of self-evolving LLMs.
2.4 User-Friendliness
Trinity-RFT has been designed with user-friendliness as a top priority.
For development and research:
The modular and decoupled design of Trinity-RFT allows users to develop a new algorithm for a specific aspect of RFT by adding one or a few new classes that implement the essential functionalities of interest, without concerning themselves with other aspects of RFT or making intrusive modifications to the original codebase. In addition, we include a monitor (built upon Wandb [41] and TensorBoard [38]) that makes it easy to track the progress of an RFT process, both quantitatively (e.g., via learning curves for rewards and other metrics) and qualitatively (e.g., via concrete examples of rollout trajectories generated at different RL steps). See Figure 6 for an example snapshot of the monitor.
For RFT applications:
Trinity-RFT offers extensive graphical user interfaces to support low-code usage of the framework, and to maximize transparency of the RFT process. For example, we implement a configuration manager, as shown in Figure 7, that allows the user to create configuration files conveniently via a front-end interface. We also provide Trinity-Studio, an all-in-one unified UI (including the aforementioned monitor and configuration manager) that allows the user to configure and run data inspection, data processing, RFT learning process, etc., all by clicking the mouse and filling forms, without writing any code. An example for using Trinity-Studio will be introduced in Section 3.6. Such functionalities, of course, can be useful not only for applications but also for development and research.
Figure 6: A snapshot of the monitor implemented in Trinity-RFT.
(a) The “beginner” mode.
<details>
<summary>figs/config_manager_expert.jpg Details</summary>

### Visual Description
\n
## Screenshot: Trinity-RFT Config Generator
### Overview
This is a screenshot of a web-based configuration generator for "Trinity-RFT". The interface allows users to set parameters for a machine learning project, likely related to large language models. The interface is divided into sections for different aspects of the configuration. The current view is in "Expert Mode".
### Components/Axes
The interface includes the following elements:
* **Title:** "Trinity-RFT Config Generator" (top-center)
* **Mode Selection:** Buttons for "Beginner Mode" and "Expert Mode" (top-left). "Expert Mode" is currently selected (orange background).
* **Tab Navigation:** Tabs for "Model", "Buffer", "Explorer and Synchronizer", and "Trainer" (below mode selection). "Model" is currently selected.
* **Project Field:** Labelled "Project", with a pre-filled value of "Trinity-RFT".
* **Experiment Name Field:** Labelled "Experiment Name", with a pre-filled value of "qwen2.5-1.5B".
* **Model Path Field:** Labelled "Model Path", with placeholder text "Please input model path.".
* **Critic Model Path Field:** Labelled "Critic Model Path (defaults to model_path )".
* **Checkpoint Path Field:** Labelled "Checkpoint Path", with placeholder text "Please input checkpoint path.".
* **Monitor Type Dropdown:** Labelled "Monitor Type", with a current selection of "tensorboard".
* **Node Num:** Labelled "Node Num", with a value of "1". Plus and minus buttons are present for incrementing/decrementing.
* **GPU Per Node:** Labelled "GPU Per Node", with a value of "8". Plus and minus buttons are present for incrementing/decrementing.
* **Max Prompt Tokens:** Labelled "Max Prompt Tokens", with a value of "1024". Plus and minus buttons are present for incrementing/decrementing.
* **Max Response Tokens:** Labelled "Max Response Tokens", with a value of "1024". Plus and minus buttons are present for incrementing/decrementing.
### Detailed Analysis or Content Details
The screenshot shows a configuration interface pre-populated with some default values.
* **Project:** Trinity-RFT
* **Experiment Name:** qwen2.5-1.5B
* **Model Path:** Empty (placeholder text present)
* **Critic Model Path:** Empty (defaults to model\_path)
* **Checkpoint Path:** Empty (placeholder text present)
* **Monitor Type:** tensorboard
* **Node Num:** 1
* **GPU Per Node:** 8
* **Max Prompt Tokens:** 1024
* **Max Response Tokens:** 1024
The plus and minus buttons next to "Node Num", "GPU Per Node", "Max Prompt Tokens", and "Max Response Tokens" suggest these values can be adjusted. The dropdown for "Monitor Type" allows selection of different monitoring tools.
### Key Observations
The interface is designed for configuring a machine learning training or inference process. The pre-filled values suggest a specific model ("qwen2.5-1.5B") is being used as a starting point. The presence of fields for "Model Path" and "Checkpoint Path" indicates the user needs to specify where the model and its saved states are located. The "Monitor Type" setting allows for tracking the training process.
### Interpretation
This configuration generator is likely part of a larger system for training or deploying large language models. The "Expert Mode" label indicates that more advanced configuration options are available beyond what is shown in this screenshot. The pre-filled values provide a reasonable default configuration for the specified model, while the plus/minus buttons and dropdown menus let the user adjust parameters to their needs and resources without typing values manually.
</details>
(b) The “expert” mode.
Figure 7: Snapshots of the configuration manager.
3 Examples, Applications, and Experiments
This section demonstrates the utilities and user-friendliness of Trinity-RFT and exemplifies some concepts introduced in previous sections, through a diverse range of examples, applications and experiments. Additional step-by-step tutorials can be found on the documentation website https://modelscope.github.io/Trinity-RFT, or the examples folder of the GitHub repository https://github.com/modelscope/Trinity-RFT/tree/main/examples.
3.1 Customizing Agent-Environment Interaction
With a modular design, Trinity-RFT can be easily adapted to a new downstream scenario by implementing the logic of agent-environment interaction in a single workflow class, without modifications to other components of the codebase. This approach is also sufficient for macroscopic RL algorithm design that targets high-quality experience synthesis with environmental feedback [4]. We provide some concrete examples in the rest of this subsection.
3.1.1 Single-turn Scenarios
In a simple yet common scenario, a user of Trinity-RFT would like to train an LLM for completing single-turn tasks, where the LLM generates one response to each input query. For this purpose, the user mainly needs to (1) define and register a single-turn workflow class (by inheriting the base class Workflow) tailored to the targeted tasks, and (2) specify the tasksets (for training and/or evaluation) and the initial LLM, both of which are compatible with HuggingFace [14] and ModelScope [19] formats.
Listing 1 gives a minimal example of implementing a single-turn workflow. Suppose that each task is specified by a <question, answer> tuple. The run() method of ExampleWorkflow calls the LLM once to generate a response to the question, calculates its reward, and returns an Experience instance consisting of the response itself, the reward value, and the log-probabilities of the response tokens predicted by the rollout model (which are necessary for certain RL algorithms, such as PPO [28] and GRPO [29]). Several built-in workflows and reward functions are already implemented in Trinity-RFT, e.g., the MathWorkflow class for math-related tasks.
In some cases, the user may want to utilize auxiliary LLMs in the workflow, e.g., for computing rewards via LLM-as-a-judge, or for playing the roles of other agents in a multi-agent scenario. For these purposes, the user can specify auxiliary_models (accessed via OpenAI-compatible APIs) when initializing the workflow.
```python
@WORKFLOWS.register_module("example_workflow")
class ExampleWorkflow(Workflow):

    def __init__(
        self,
        model: ModelWrapper,
        task: Task,
        auxiliary_models: Optional[List[openai.OpenAI]] = None,
    ):
        super().__init__(model, task, auxiliary_models)
        self.question = task.raw_task.get("question")
        self.answer = task.raw_task.get("answer")

    def calculate_reward_by_rule(self, response: str, truth: str) -> float:
        return 1.0 if response == truth else 0.0

    def calculate_reward_by_llm_judge(self, response: str, truth: str) -> float:
        judge_model = self.auxiliary_models[0]
        PROMPT_FOR_JUDGE = """Please evaluate..."""
        completion = judge_model.chat.completions.create(
            model="gpt-4",  # or another suitable judge model
            messages=[{"role": "user", "content": PROMPT_FOR_JUDGE}],
        )
        reward_str = completion.choices[0].message.content.strip()
        return float(reward_str)

    def run(self) -> List[Experience]:
        response = self.model.chat(
            [
                {
                    "role": "user",
                    "content": f"Question:\n{self.question}",
                }
            ],
            **self.rollout_args,
        )
        reward: float = self.calculate_reward_by_rule(response.response_text, self.answer)
        # reward: float = self.calculate_reward_by_llm_judge(response.response_text, self.answer)
        return [
            Experience(
                tokens=response.tokens,
                prompt_length=response.prompt_length,
                reward=reward,
                logprobs=response.logprobs,
            )
        ]
```
Listing 1: A minimal example for implementing a customized workflow.
3.1.2 Multi-turn Scenarios
In more advanced cases, the user would like to train an LLM-powered agent that solves multi-turn tasks by repeatedly interacting with the environment. In Trinity-RFT, achieving this is mostly as simple as in the single-turn case, except that the user needs to define and register a multi-turn workflow class by inheriting the base class MultiTurnWorkflow. Listing 2 provides one such example using the ALFWorld dataset [31]. For training efficiency, the process_messages_to_experience() method concatenates multiple rounds of agent-environment interactions compactly into an Experience instance consisting of a single token sequence with proper masking, which can readily be consumed by standard RL algorithms like PPO and GRPO.
For more detailed examples of multi-turn cases, please refer to the documentation https://modelscope.github.io/Trinity-RFT/tutorial/example_multi_turn.html.
```python
@WORKFLOWS.register_module("alfworld_workflow")
class AlfworldWorkflow(MultiTurnWorkflow):
    """A workflow for the ALFWorld task."""

    def generate_env_inference_samples(self, env, rollout_num) -> List[Experience]:
        print("Generating env inference samples...")
        experience_list = []
        for i in range(rollout_num):
            observation, info = env.reset()
            final_reward = -0.1
            memory = []
            memory.append({"role": "system", "content": AlfWORLD_SYSTEM_PROMPT})
            for r in range(self.max_env_steps):
                format_obs = format_observation(observation)
                memory = memory + [{"role": "user", "content": format_obs}]
                response_text = self.model.chat(memory, n=1)[0].response_text
                memory.append({"role": "assistant", "content": response_text})
                action = parse_action(response_text)
                observation, reward, done, info = env.step(action)
                if done:
                    final_reward = reward
                    break
            experience = self.process_messages_to_experience(
                memory, final_reward, {"env_rounds": r, "env_done": 1 if done else 0}
            )
            experience_list.append(experience)
        # Close the env to save CPU memory
        env.close()
        return experience_list

    def run(self) -> List[Experience]:
        # ...
        game_file_path = self.task_desc
        rollout_n = self.repeat_times
        # ...
        env = create_environment(game_file_path)
        return self.generate_env_inference_samples(env, rollout_n)
```
Listing 2: An implementation of a multi-turn workflow for ALFWorld [31].
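To make the compact packing described above concrete, here is a minimal sketch of the idea behind process_messages_to_experience. The helper pack_messages and the simplified role tags are hypothetical, not the framework's actual implementation: multi-turn messages are concatenated into a single token sequence, and an action mask is set to 1 only on assistant tokens, so that the RL loss applies solely to the agent's own actions.

```python
from typing import Dict, List, Tuple


def pack_messages(
    messages: List[Dict[str, str]],
    encode,  # callable: str -> List[int], e.g. a tokenizer's encode method
) -> Tuple[List[int], List[int]]:
    """Concatenate a multi-turn conversation into one token sequence.

    Returns (tokens, action_mask), where action_mask[i] == 1 iff token i
    was produced by the assistant, so that a standard RL loss can be
    applied to the agent's actions only.
    """
    tokens: List[int] = []
    action_mask: List[int] = []
    for msg in messages:
        # Simplified role tags; real chat templates differ.
        ids = encode(f"<{msg['role']}>{msg['content']}</{msg['role']}>")
        tokens.extend(ids)
        # Only assistant tokens contribute to the policy loss.
        action_mask.extend([1 if msg["role"] == "assistant" else 0] * len(ids))
    return tokens, action_mask


# Toy "tokenizer": one token per character.
toy_encode = lambda s: [ord(c) for c in s]
msgs = [
    {"role": "user", "content": "go north"},
    {"role": "assistant", "content": "ok"},
]
tokens, mask = pack_messages(msgs, toy_encode)
assert len(tokens) == len(mask)
assert sum(mask) == len("<assistant>ok</assistant>")
```

The packed sequence can then be consumed by standard RL algorithms such as PPO or GRPO, with the mask zeroing out loss terms on environment-produced tokens.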
3.1.3 Experience Synthesis in Workflows
As mentioned in Section 1.1, Trinity-RFT has been designed to streamline RL algorithm design and development at both the macroscopic and microscopic levels. One example of the former is experience synthesis: at each RL step, the agent (backed by the rollout LLM) iteratively generates refined responses to a query by incorporating feedback or guidance from the environment, which can take the form of plain text rather than numerical reward values. The resulting data is then utilized for updating the policy model, e.g., via a standard SFT or RL loss. Such a macroscopic RL approach is made possible by pre-trained LLMs’ generative nature and rich prior knowledge of natural language. Closely related to this idea is Agent-RLVR [4], a contemporary work that applies such an approach to software engineering scenarios.
Within Trinity-RFT, this process of experience synthesis can be regarded as a particular way of agent-environment interaction, and thus can be realized by simply implementing a Workflow class. As a minimal demonstration, suppose that we want to implement this approach for a math reasoning scenario, where the agent generates multiple rollout responses to an input query, receives feedback from the environment regarding correctness of the responses, reflects on the gathered information, and generates a final response to the query. Listing 3 presents an implementation of this approach within Trinity-RFT.
```python
@WORKFLOWS.register_module("reflect_once_workflow")
class ReflectOnceWorkflow(Workflow):

    def run(self) -> List[Experience]:
        experiences = []

        # Stage 1: K-rollout generation
        rollout_messages = self.create_rollout_messages()
        responses = self.model.chat(
            rollout_messages,
            n=self.k_rollouts,
            temperature=self.temperature,
            logprobs=self.logprobs,
            max_tokens=self.task.rollout_args.max_tokens,
        )
        rollout_responses = [response.response_text.strip() for response in responses]

        # Stage 2: Verification
        verification_results = []
        for rollout_response in rollout_responses:
            is_correct = self.verify_answer(rollout_response, self.ground_truth)
            verification_results.append(is_correct)

        # Stage 3: Reflection
        reflection_messages = self.create_reflection_messages(
            rollout_responses,
            verification_results,
        )
        reflection_responses = self.model.chat(
            reflection_messages,
            n=1,
            temperature=self.temperature,
            logprobs=self.logprobs,
            max_tokens=self.task.rollout_args.max_tokens,
        )
        reflection_response = reflection_responses[0]

        # Verify the reflection response
        reflection_text = reflection_response.response_text.strip()
        reflection_is_correct = self.verify_answer(reflection_text, self.ground_truth)

        if reflection_is_correct:
            sharegpt_message = [
                {"role": "system", "content": self.task.format_args.system_prompt},
                {"role": "user", "content": self.question},
                {"role": "assistant", "content": reflection_text},
            ]
            experience = self.process_messages_to_experience(sharegpt_message)
            experiences.append(experience)

            # Save experience to file
            if self.exp_file and sharegpt_message is not None:
                exp_data = sharegpt_message
                self.exp_file.write(json.dumps(exp_data, ensure_ascii=False) + "\n")
                self.exp_file.flush()
        return experiences
```
Listing 3: A toy implementation of experience synthesis with environmental feedback.
3.2 RL Algorithm Development with Trinity-RFT
To support RL algorithm development, Trinity-RFT allows researchers and developers to focus on designing and implementing the essential logic of a new RL algorithm, without having to worry about the internal engineering details of Trinity-RFT.
As an example, suppose that we want to implement a MIX algorithm that seamlessly integrates online RL and offline SFT into a single learning process. In its most basic form, the MIX algorithm requires that (1) the trainer samples from two sources of experiences, i.e., the rollout experiences collected online and the high-quality expert trajectories collected offline; and (2) the trainer updates its policy model with a loss function that handles both sources of experiences properly, e.g., a weighted sum of GRPO loss for the on-policy rollout experiences and SFT loss for the expert trajectories.
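In symbols, with a mixing weight $\mu \in [0,1]$ (matching the mu parameter in the implementation), the basic MIX objective is a convex combination of the two loss terms:

```latex
\mathcal{L}_{\mathrm{MIX}}
  = (1 - \mu)\,\mathcal{L}_{\mathrm{GRPO}}(\text{rollout experiences})
  + \mu\,\mathcal{L}_{\mathrm{SFT}}(\text{expert trajectories}),
  \qquad \mu \in [0, 1].
```

Setting $\mu = 0$ recovers pure online GRPO, while $\mu = 1$ reduces to offline SFT on expert trajectories.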
Variants of this MIX algorithm include adaptive weighting of multiple loss terms [10], alternating between RL and SFT [16], incorporating expert trajectories into RL loss [21, 34, 46], or incorporating SFT loss for high-reward rollout trajectories generated by older versions of the rollout model [27]. Such approaches have proved to be effective in accelerating the online RL process with only a small amount of expert experiences, or to enhance stability and plasticity in continual learning.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: Reinforcement Learning System Architecture
### Overview
The image depicts a diagram illustrating the architecture of a reinforcement learning system, divided into three main components: an Explorer, a Buffer, and a Trainer. The diagram shows the flow of data and interactions between these components, highlighting the process of experience collection, storage, and model updating.
### Components/Axes
The diagram is segmented into three main areas, each with a distinct background color:
* **Explorer (Yellow):** Contains a "Rollout engine" and receives "Task" input.
* **Buffer (Blue):** Contains "Usual Experiences", "Expert Experiences", and "Taskset".
* **Trainer (Green):** Contains "SFT Loss", "GRPO Loss", "MIX Loss", and "Update model".
Arrows indicate the direction of data flow and interactions between components. Text labels are used to identify each component and the data being processed.
### Detailed Analysis or Content Details
The diagram illustrates the following flow:
1. **Task Input:** A "Task" is sampled and fed into the "Rollout engine" within the Explorer.
2. **Experience Generation:** The "Rollout engine" generates "Experience" based on the task.
3. **Buffer Storage:** The "Experience" is stored in the "Buffer", specifically within "Usual Experiences" and "Expert Experiences" data stores. The "Taskset" is also present within the Buffer.
4. **Sampling from Buffer:** Data is sampled from the "Buffer".
5. **Loss Calculation:** The sampled data is fed into the "Trainer", where "SFT Loss" and "GRPO Loss" are calculated.
6. **Loss Mixing:** The "SFT Loss" and "GRPO Loss" are combined using an addition operation (represented by a plus sign) to produce "MIX Loss".
7. **Model Update:** The "MIX Loss" is used to "Update model" within the Trainer.
The diagram does not contain numerical data or axes. It is a conceptual representation of a system architecture.
### Key Observations
The diagram emphasizes the cyclical nature of reinforcement learning: exploration, experience collection, and model improvement. The separation of "Usual Experiences" and "Expert Experiences" suggests a potential for learning from both standard interactions and demonstrations. The use of separate loss functions ("SFT Loss" and "GRPO Loss") indicates a potentially complex training objective.
### Interpretation
This diagram illustrates the MIX training architecture, which incorporates elements of learning from demonstrations. The "Explorer" represents the agent interacting with the environment, the "Buffer" serves as a memory for past experiences, and the "Trainer" utilizes these experiences to refine the agent's policy. The distinction between "Usual Experiences" and "Expert Experiences" reflects a hybrid approach where the agent learns both through trial-and-error and from expert demonstrations. The "SFT Loss" (supervised fine-tuning on expert data) and "GRPO Loss" (RL on rollout data) are combined into the "MIX Loss" that guides the model update.
</details>
Figure 8: A visualization of the MIX algorithm.
The MIX algorithm is visualized in Figure 8: GRPO loss for usual experiences generated by the rollout model and SFT loss for expert experiences are integrated into a unified training pipeline. Although the algorithm must handle two sources of experiences and two types of loss functions, implementing it in Trinity-RFT only requires defining three new classes (MixSampleStrategy, MIXPolicyLossFn, and MIXAlgorithm), as demonstrated in Listing 4. With these components, Trinity-RFT enables a seamless integration of online RL and offline SFT within the same training loop. For more details of the MIX algorithm, please refer to the documentation https://modelscope.github.io/Trinity-RFT/tutorial/example_mix_algo.html.
```python
@SAMPLE_STRATEGY.register_module("mix")
class MixSampleStrategy(SampleStrategy):
    def __init__(self, buffer_config: BufferConfig, trainer_type: str, **kwargs):
        # ...
        self.usual_exp_buffer = get_buffer_reader(
            buffer_config.trainer_input.experience_buffer, usual_buffer_config
        )
        self.expert_exp_buffer = get_buffer_reader(
            buffer_config.trainer_input.sft_warmup_dataset, expert_buffer_config
        )
        # ...

    def sample(self, step: int) -> Tuple[Any, Dict, List]:
        """Sample a batch composed of rollout experiences and expert trajectories"""
        usual_exp_list = self.usual_exp_buffer.read()
        expert_exp_list = self.expert_exp_buffer.read()
        exp_list = usual_exp_list + expert_exp_list
        exps = Experiences.gather_experiences(exp_list, self.pad_token_id)
        # ...


@POLICY_LOSS_FN.register_module("mix")
class MIXPolicyLossFn(PolicyLossFn):
    def __init__(self, mu: float = 0.1, ...):
        # ...
        self.mu = mu
        self.grpo_loss_fn = PPOPolicyLossFn(...)
        self.sft_loss_fn = SFTLossFn(...)

    def __call__(
        self,
        logprob: torch.Tensor,
        old_logprob: torch.Tensor,
        action_mask: torch.Tensor,
        advantages: torch.Tensor,
        is_expert_mask: torch.Tensor,
        **kwargs,
    ) -> Tuple[torch.Tensor, Dict]:
        """Calculate a weighted sum of GRPO loss and SFT loss"""
        # ...
        grpo_loss, grpo_metrics = self.grpo_loss_fn(
            logprob[~is_expert_mask],
            old_logprob[~is_expert_mask],
            action_mask[~is_expert_mask],
            advantages[~is_expert_mask],
            **kwargs,
        )
        sft_loss, sft_metrics = self.sft_loss_fn(
            logprob[is_expert_mask],
            action_mask[is_expert_mask],
        )
        loss = (1 - self.mu) * grpo_loss + self.mu * sft_loss
        # ...
        return loss, metrics


@ALGORITHM_TYPE.register_module("mix")
class MIXAlgorithm(AlgorithmType):
    """MIX algorithm."""

    use_critic: bool = False
    use_reference: bool = True
    use_advantage: bool = True
    can_balance_batch: bool = True
    schema: type = ExperienceModel

    @classmethod
    def default_config(cls) -> Dict:
        return {
            "repeat_times": 8,
            "policy_loss_fn": "mix",  # specify the MIX loss function
            "advantage_fn": "grpo",
            "sample_strategy": "mix",  # specify the MIX sampling strategy
        }
```
Listing 4: An implementation of the MIX algorithm with Trinity-RFT.
3.3 Unified Support for Diverse RL Modes
As explained previously in Section 2.1.1, Trinity-RFT offers support for synchronous/asynchronous, on-policy/off-policy, and online/offline RL, controlled by a few configuration parameters. In this subsection, we conduct experiments for comparing the following RL modes:
- The synchronous mode: mode=both, sync_interval={1,2,10}, sync_offset=0;
- The one-step off-policy mode: mode=both, sync_interval=1, sync_offset=1;
- The fully asynchronous mode: the explorer and trainer are launched separately with mode=explore and mode=train, respectively, and sync_interval=10.
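To illustrate how these two parameters interact, here is a hypothetical sketch (not Trinity-RFT's actual scheduler or API) of which trainer-step weights the explorer uses at each of its own steps:

```python
def explorer_schedule(total_steps: int, sync_interval: int, sync_offset: int):
    """Hypothetical sketch: return, for each explorer step, the trainer
    step whose weights the explorer uses.

    With sync_offset=0 and sync_interval=1, the explorer is strictly
    on-policy; sync_offset=1 makes it one-step off-policy; a larger
    sync_interval keeps the weights fixed for several steps between syncs.
    """
    schedule = []
    for step in range(total_steps):
        # Weights come from the last sync point, lagging by sync_offset.
        last_sync = max(0, (step - sync_offset) // sync_interval * sync_interval)
        schedule.append(last_sync)
    return schedule


# Strictly on-policy: every step uses the latest weights.
assert explorer_schedule(4, sync_interval=1, sync_offset=0) == [0, 1, 2, 3]
# One-step off-policy: weights lag one step behind.
assert explorer_schedule(4, sync_interval=1, sync_offset=1) == [0, 0, 1, 2]
# Infrequent sync: weights refresh only every 10 steps.
assert explorer_schedule(12, sync_interval=10, sync_offset=0)[9:] == [0, 10, 10]
```

The staleness of the explorer's weights is what the trainer must tolerate as off-policyness, which motivates the comparison below.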
Our experiments include dummy learning processes (explained below) for performance profiling, as well as real learning processes with vanilla GRPO [29] in different modes.
3.3.1 Experiments: Performance Profiling
Settings.
We aim to measure and compare the efficiency of different RL modes under controlled settings. It is noteworthy that, even with all other variables controlled, different RL modes can still result in different trained models — and thus different rollout response lengths — during the learning processes, which have direct impacts on performance metrics like wall-clock time and GPU utilization rate.
To mitigate this, we conduct performance profiling with dummy learning processes, where the learning rate is set to zero. A dummy learning process closely resembles a real learning process, in that all necessary computation and communication (e.g., rollout generation, gradient computation, model weight synchronization) are executed; the only difference is that the rollout model (and thus the distribution of rollout trajectories) remains fixed throughout and same across different RL modes.
To show the performance of Trinity-RFT in diverse scenarios, we consider both a math reasoning task (GSM8k [3]) and a multi-turn agentic task (ALFWorld [31]). In the experiments, we use the Qwen2.5-Instruct [36] model series of different sizes (1.5B, 3B, and 7B), and run the GRPO [29] algorithm (with 8 rollout trajectories per task) in all modes. We choose a 100-step training trace for evaluating performance and report the following metrics: (1) end-to-end wall-clock time and speedup: the wall-clock time from the start of the run to the completion of 100 training steps; (2) GPU utilization: the percentage utilization of each GPU; (3) GPU power usage: the power usage of each GPU as a percentage of its capacity. The GPU utilization and power usage metrics were extracted from WandB (https://docs.wandb.ai/guides/models/app/settings-page/system-metrics/) and averaged over all GPUs. We run each experiment for three random trials and report the mean and standard deviation. Unless specified otherwise, each experiment uses 8 NVIDIA A100-80G GPUs.
Results for GSM8k.
We use a 2/6 GPU partition for the explorer/trainer. While this configuration is not necessarily optimal for every experiment, it suffices to show the differences between the RL modes. In our GSM8k experiments, the batch size is 96 tasks and the temperature is 1.0. The results for the Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models are shown in Table 1.
We observe that in the synchronous mode (with sync_offset=0), less frequent synchronization (i.e., a larger sync_interval) effectively improves efficiency, GPU utilization, and GPU power usage. This is mainly because, as shown in Figure 4 (a), the impact of pipeline bubbles in this mode can be effectively reduced by lowering the synchronization frequency. In addition, Table 1 shows that the one-step off-policy and fully asynchronous modes also accelerate training with higher GPU utilization compared to the strictly on-policy mode. In the one-step off-policy mode, the trainer leverages the one-step off-policy data stored in the buffer without waiting for new experiences generated by the explorer after weight synchronization, which significantly reduces the GPU idle ratio. In the fully asynchronous mode, the explorer and trainer operate almost independently while fully leveraging GPU resources, except when loading or saving model checkpoints.
Table 1: Performance profiling for GSM8k with 2/6 GPU partition for explorer/trainer.
**Qwen2.5-1.5B-Instruct**

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $38.70± 0.34$ | $33.64± 2.15$ | $35.85± 1.83$ |
| Sync. (sync_interval=2) | $1.24×$ | $31.19± 0.08$ | $36.05± 0.49$ | $38.74± 1.47$ |
| Sync. (sync_interval=10) | $1.59×$ | $24.28± 0.16$ | $\mathbf{38.27}± 0.98$ | $\mathbf{44.41}± 0.81$ |
| One-step off-policy | $1.25×$ | $30.84± 0.20$ | $32.39± 1.17$ | $39.70± 0.78$ |
| Fully async. | $\mathbf{1.61}×$ | $\mathbf{23.97}± 0.03$ | $36.04± 0.61$ | $43.91± 0.48$ |

**Qwen2.5-7B-Instruct**

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $68.71± 0.54$ | $55.61± 0.80$ | $52.88± 0.29$ |
| Sync. (sync_interval=2) | $1.31×$ | $52.44± 0.41$ | $64.88± 1.35$ | $61.90± 1.32$ |
| Sync. (sync_interval=10) | $\mathbf{1.85}×$ | $\mathbf{37.17}± 0.15$ | $\mathbf{78.44}± 1.03$ | $\mathbf{77.77}± 0.96$ |
| One-step off-policy | $1.69×$ | $40.73± 0.57$ | $77.19± 2.26$ | $76.17± 1.56$ |
| Fully async. | $1.63×$ | $42.17± 1.06$ | $73.90± 2.00$ | $72.74± 1.82$ |
Table 2: Performance profiling for ALFWorld with 4/4 GPU partition for explorer/trainer.
**Batch Size = 4**

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $333.68± 0.06$ | $17.19± 0.58$ | $28.44± 0.37$ |
| Sync. (sync_interval=2) | $1.70×$ | $196.64± 0.59$ | $21.69± 0.18$ | $31.35± 0.06$ |
| Sync. (sync_interval=10) | $5.21×$ | $64.09± 0.39$ | $32.85± 0.18$ | $40.86± 0.58$ |
| One-step off-policy | $0.98×$ | $340.12± 3.99$ | $14.63± 1.17$ | $28.21± 0.48$ |
| Fully async. | $\mathbf{5.45}×$ | $\mathbf{61.27}± 0.35$ | $\mathbf{36.46}± 0.10$ | $\mathbf{42.51}± 0.72$ |

**Batch Size = 32**

| Mode | Speedup $\uparrow$ | Time (Minutes) $\downarrow$ | GPU Utilization (%) $\uparrow$ | GPU Power Usage (%) $\uparrow$ |
| --- | --- | --- | --- | --- |
| Sync. (sync_interval=1) | $1.00×$ | $561.21± 2.04$ | $39.37± 0.89$ | $39.93± 0.22$ |
| Sync. (sync_interval=2) | $1.13×$ | $496.80± 0.36$ | $37.74± 0.39$ | $41.90± 0.44$ |
| Sync. (sync_interval=10) | $1.59×$ | $352.11± 0.49$ | $44.50± 0.58$ | $49.95± 0.61$ |
| One-step off-policy | $1.14×$ | $494.13± 0.28$ | $34.89± 0.75$ | $43.05± 0.81$ |
| Fully async. | $\mathbf{1.65}×$ | $\mathbf{339.51}± 0.24$ | $\mathbf{45.55}± 0.20$ | $\mathbf{50.77}± 0.45$ |
Results for ALFWorld.
A particular feature of ALFWorld is its long-horizon, multi-turn interaction with the environment. To accommodate the heavy computational demands of rollout, we use a 4/4 GPU partition for the explorer/trainer. In our ALFWorld experiments, we use the Qwen2.5-3B-Instruct model and set the rollout temperature to 1.0.
The results with different batch sizes are shown in Table 2. One observation is that, with a batch size of 4 tasks, the one-step off-policy mode exhibits no efficiency advantage over the synchronous mode with sync_interval=1. This phenomenon can be attributed to the computational imbalance between the explorer and trainer. In ALFWorld, the larger computation latency of the explorer stems primarily from (1) the inherent complexity of multi-turn environment interactions, and (2) the long-tailed latency distribution that arises when certain tasks require extended rollout durations, an effect further exacerbated by a small batch size. The one-step off-policy mode cannot eliminate pipeline bubbles caused by long-tailed latencies in the explorer, whereas the synchronous mode with a large sync_interval and the asynchronous mode both mitigate them, thanks to the implementation of streaming rollout generation in Trinity-RFT. For the same reason, scaling the batch size from 4 to 32 increases the wall-clock time for the same number of training steps by only a small factor (much smaller than $8×$) in all modes, thanks to better GPU usage.
3.3.2 Experiments: Real Learning with Vanilla GRPO
Settings.
We aim to compare the real learning processes under different RL modes. For simplicity and controlled variability, we use the vanilla GRPO [29] algorithm for all RL modes, without algorithm designs specific to the asynchronous or off-policy cases. GRPO mainly relies on clipping the probability ratio (to the range $1± 0.2$ by default) to handle the off-policyness of experiences. In future work, we will investigate more advanced off-policy and asynchronous RL algorithms, and develop new ones to accommodate diverse RL modes. To encourage exploration by the rollout model, we disable the Kullback-Leibler (KL) penalty or loss in our experiments.
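For reference, the clipping mechanism mentioned above is the standard PPO-style clipped surrogate, which GRPO inherits; with per-token probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$, the objective reads:

```latex
\mathcal{L}_{\mathrm{clip}}(\theta)
  = -\,\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
  \qquad \epsilon = 0.2 \text{ by default}.
```

When the ratio drifts outside $[1-\epsilon,\, 1+\epsilon]$ in the direction favored by the advantage, the clipped branch is selected and the token contributes no gradient, which bounds the influence of off-policy experiences.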
Training.
For each RL mode, we fine-tune the Qwen2.5-7B-Instruct model for one epoch on the OpenR1-Math-46k [46] https://huggingface.co/datasets/Elliott/Openr1-Math-46k-8192 dataset, a filtered version of the OpenR1-Math-220k https://huggingface.co/datasets/open-r1/OpenR1-Math-220k dataset. The allocated GPU ratio for the explorer and trainer is 4/4. We set the batch size to 120 tasks, the rollout number per task to 32, and the learning rate to 1e-6.
Evaluation.
For each RL mode, we save a checkpoint of the rollout model once every 100 steps, evaluate the checkpoints using the bench mode, and report the best results among the checkpoints. The models are evaluated on several math benchmarks, including AIME2024 https://huggingface.co/datasets/math-ai/aime24, AIME2025 https://huggingface.co/datasets/math-ai/aime25, AMC https://huggingface.co/datasets/math-ai/amc23, and MATH500 https://huggingface.co/datasets/HuggingFaceH4/MATH-500. For AIME2024, AIME2025, and AMC, we generate 32 responses (with temperature 0.6) per task and report the average accuracy (Avg@32) to ensure the reliability of the evaluation; for MATH500, we report Avg@4 as the dataset is relatively large. For more detailed comparisons, we also plot several training metrics, including reward, response length, gradient norm, and KL distance from the initial LLM, with wall-clock time as the x-axis.
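As a minimal sketch of how the Avg@k metric above can be computed (the helper avg_at_k is hypothetical, not the framework's evaluation code):

```python
from typing import List


def avg_at_k(correct_flags: List[List[bool]]) -> float:
    """Avg@k accuracy: for each task, the fraction of its k sampled
    responses that are correct, averaged over all tasks."""
    per_task = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_task) / len(per_task)


# Two tasks, k=4 responses each: 3/4 and 1/4 correct -> Avg@4 = 0.5
flags = [[True, True, True, False], [True, False, False, False]]
assert avg_at_k(flags) == 0.5
```

Averaging over many samples per task reduces the variance of the estimate, which matters for small benchmarks such as AIME.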
Results.
Figure 9 presents the training curves. The RL modes generally show increasing trends in both reward and response length. The synchronous mode with sync_interval=1 exhibits longer responses and larger KL divergence than the other RL modes, likely because it updates the rollout model most frequently and leverages on-policy data in each step.
Table 3 presents the evaluation results. We observe that, for the synchronous mode with sync_offset=0, increasing sync_interval reduces the total training time for one epoch, at the cost of slightly compromising the average evaluation performance. In contrast, the one-step off-policy mode with sync_interval=1 matches or exceeds the other modes on several benchmarks, while delivering around a $2.66×$ wall-clock speedup over the strictly on-policy mode.
*(Figure: four line charts of reward, response length, gradient norm, and KL divergence versus wall-clock time (0 to 120 hours), comparing Sync. with sync_interval 1, 2, and 10 against One-Step Off-Policy.)*
Figure 9: Results of training for one epoch by vanilla GRPO in different RL modes. The results are smoothed using a 40-step moving average for clarity.
Table 3: Performance comparison of different RL modes.
| | AIME2024 | AIME2025 | AMC | MATH500 | Average | Runtime (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 11.15 | 6.95 | 51.35 | 70.96 | 35.10 | N/A |
| Sync. (sync_interval=1) | 14.58 | 14.06 | 61.25 | 76.25 | 41.54 | 130.33 |
| Sync. (sync_interval=2) | 15.52 | 11.67 | 57.97 | 75.15 | 40.08 | 73.57 |
| Sync. (sync_interval=10) | 14.38 | 12.71 | 57.66 | 75.05 | 39.95 | 44.43 |
| One-Step Off-Policy | 16.88 | 12.19 | 59.92 | 74.55 | 40.89 | 48.98 |
3.4 Data Processors for Tasks and Experiences
We present practical examples to demonstrate how the data pipeline concepts described in Section 2.3 are applied in Trinity-RFT. These use cases highlight how manipulating data at both the task and experience level directly improves RFT performance and provides granular control over the agent’s learning process.
3.4.1 Static Task Prioritization for Curriculum Learning
A common strategy for effective training is to present tasks in increasing order of difficulty. This use case demonstrates how Trinity-RFT facilitates curriculum learning by prioritizing tasks before exploration begins. This is particularly crucial for RFT, as it helps stabilize the initial learning phase of the explorer and prevents it from getting stuck on overly complex tasks, leading to a more efficient exploration path.
As shown in Listing 5, a user can configure this pipeline with a simple YAML file (the full configuration files can be accessed at baseline_run and priority_run). Here, we use the GSM8K mathematical reasoning dataset. The user provides a natural language instruction via dj_process_desc: “Please compute difficulty scores for these math questions.” Trinity-RFT’s data service then orchestrates a three-phase process:
1. The data processor invokes an LLM (Qwen-Max) to score the difficulty of each math problem.
2. The system prioritizes samples with lower difficulty scores, creating an easy-to-hard ordering (by setting `difficulty: -1.0` under `priority_weights`).
3. The curated and prioritized data is formatted into an RL-ready task set for the explorer.
As shown in Figure 10, this simple curation strategy yields more stable performance gains. This pattern is highly extensible: users can easily customize the difficulty metric, apply it to their own datasets, or even make the prioritization dynamic by re-ranking tasks periodically based on the agent’s current performance.
```yaml
# Core dataset configuration
data_processor:
  data_workflow_url: "http://127.0.0.1:5005/data_workflow"
  task_pipeline:
    # I/O buffers
    input_buffers:
      - name: "raw_input"
        path: "openai/gsm8k"
        storage_type: "file"
        raw: true
    output_buffer:
      name: "raw_output"
      path: "outputs/task_pipeline_output/prioritized_gsm8k.jsonl"
      storage_type: "file"
    format:
      prompt_key: "question"
      response_key: "answer"
    # data active iterator related
    dj_process_desc: "Please compute difficulty scores for these math questions."
    agent_model_name: "qwen-max"
    clean_strategy: "iterative"
    priority_weights:
      difficulty: -1.0  # easy-to-hard
```
Listing 5: Data processor configuration, applied on customizable buffers.
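How priority_weights can induce a task ordering may be sketched as follows; the function and data-layout choices here are illustrative assumptions, not the actual Trinity-RFT implementation:

```python
# Sketch: each task's priority is a weighted sum of its computed metrics,
# and tasks are served in descending priority. With difficulty weighted
# by -1.0, easier (lower-difficulty) tasks come first.
def prioritize(tasks, priority_weights):
    def priority(task):
        return sum(w * task[metric] for metric, w in priority_weights.items())
    return sorted(tasks, key=priority, reverse=True)

tasks = [
    {"question": "hard",   "difficulty": 0.9},
    {"question": "easy",   "difficulty": 0.2},
    {"question": "medium", "difficulty": 0.5},
]
ordered = prioritize(tasks, {"difficulty": -1.0})
# easy (priority -0.2) precedes medium (-0.5) precedes hard (-0.9)
```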
*(Figure: three line charts of eval/math-eval/accuracy/mean, actor/entropy_loss, and actor/KL_loss over roughly 30 training iterations, with one curve per setting.)*
Figure 10: Performance on a math reasoning task. Prioritizing tasks from easy to hard (red line) leads to faster and better convergence compared to the default setting (blue line).
3.4.2 Dynamic Experience and Reward Shaping
While task curation primes the model before exploration, experience shaping refines the learning signal after each agent-environment interaction. This is critical for RFT algorithms that rely on rich feedback, as standard rewards (e.g., binary pass/fail) are often too sparse to guide learning effectively. We demonstrate how to augment rewards with metrics for quality and diversity, transforming a sparse signal into a dense, multi-faceted one that provides clearer guidance to the trainer.
Use Case 1: Quality Reward Augmentation.
To encourage the model to generate high-quality responses, we can augment the base reward with a quality score. As illustrated in Figure 11, during each RFT step, we use the data processor to evaluate the quality of each generated rollout. For our experiment, we trained a Qwen2.5-1.5B model and used a more powerful Qwen3-32B as the scorer LLM. Specifically, we invoked the llm_quality_filter from Data-Juicer, which normalized the quality scores to the range [-0.5, 0.5] and added them to the original reward.
Crucially, this processing is applied to the experience buffer at each RFT step. This allows the reward signal to adapt dynamically to the policy model’s evolving capabilities, a more responsive approach than one-time static processing. With a sync_interval of 3 over 36 steps on the Math-500 validation set, the results in Figure 12 show that: (1) The model with quality reward augmentation (red line) achieves higher accuracy. (2) The introduced quality reward itself improves over time, confirming it is a learnable signal. (3) We observe a slight increase in response length, which likely reflects an inductive bias from the larger scorer model being implicitly distilled into the smaller policy model.
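The reward shaping described above can be sketched as a small function; this is a minimal illustration, assuming the scorer returns a raw quality score in [0, 1], and does not reproduce the actual Data-Juicer llm_quality_filter API:

```python
# Sketch of quality-reward augmentation: a scorer's raw quality score in
# [0, 1] is shifted to [-0.5, 0.5] and added to the base reward
# (e.g., binary correctness).
def shape_reward(base_reward, raw_quality_score):
    quality_bonus = raw_quality_score - 0.5  # map [0, 1] -> [-0.5, 0.5]
    return base_reward + quality_bonus

# A correct but low-quality rollout vs. a correct high-quality one:
low = shape_reward(1.0, 0.2)   # 0.7
high = shape_reward(1.0, 0.9)  # 1.4
```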
*(Figure: diagram of the GRPO workflow: a GSM8K sample is rolled out by Qwen2.5-1.5B into multiple responses, which an LLM scorer (Qwen3-32B) evaluates to produce math-accuracy, format, and DJ-quality rewards.)*
Figure 11: The enhanced math workflow with quality-reward shaping from the data processor, where DJ indicates Data-Juicer [2], from which more operators can be utilized to extend this workflow.
*(Figure: three line charts of eval accuracy, response length, and quality score over roughly 30 steps, comparing the quality-augmented run against the baseline.)*
Figure 12: Experimental results for quality-reward shaping. Augmenting the reward with a quality score (red line) improves final accuracy and provides a learnable reward signal.
Use Case 2: Diversity Reward Augmentation.
A common failure mode in RFT is policy collapse, where the agent repeatedly generates similar, suboptimal responses. To counteract this, we introduce a diversity reward that encourages the explorer to pursue different solution paths. As shown in Figure 13, we used the GTE-Qwen2-1.5B model to convert rollouts into semantic embeddings. The diversity reward was calculated from the cosine similarity between a rollout’s embedding and the mean embedding of its group, with lower similarity (i.e., higher diversity) yielding a higher reward.
To prevent exploration from becoming chaotic, we applied a simple decay schedule to the diversity reward’s weight, starting at 0.5 and decaying to 0.3 over the training steps. The experiment, using the same setting as before, yielded compelling results (Figure 14): (1) The diversity-augmented model (red line) shows a significant performance improvement over the baseline. (2) The response length is consistently longer, indicating the reward encourages more elaborate answers. (3) Most importantly, the actor entropy loss remains consistently higher, providing strong evidence that the model is maintaining a healthier, more diverse exploration strategy, which helps it escape local optima.
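The diversity reward and its decay schedule can be sketched as follows; this is a pure-Python illustration under the assumptions stated in the comments (a linear decay schedule and toy embeddings), not the actual embedding-model pipeline:

```python
# Sketch of the diversity reward: each rollout's embedding is compared
# against the mean embedding of its group; lower cosine similarity yields
# a higher reward. The weight is assumed to decay linearly from 0.5 to 0.3.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_rewards(embeddings, step, total_steps, w_start=0.5, w_end=0.3):
    n, dim = len(embeddings), len(embeddings[0])
    mean_emb = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    weight = w_start + (w_end - w_start) * (step / total_steps)  # linear decay
    return [weight * (1.0 - cosine(e, mean_emb)) for e in embeddings]

# Near-duplicate rollouts earn less diversity reward than a distinct one:
rewards = diversity_rewards([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
                            step=0, total_steps=36)
```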
*(Figure: diagram of the diversity-reward workflow: rollouts from Qwen2.5-1.5B are embedded by GTE-Qwen2, the cosine similarity of each rollout to the group-mean embedding yields a diversity reward, and this is combined with format and math-accuracy rewards in GRPO.)*
Figure 13: The enhanced math workflow with diversity-reward shaping from the data processor.
*(Figure: three line charts of eval accuracy, response length, and actor entropy loss over roughly 30 steps, comparing the diversity-augmented run against the baseline.)*
Figure 14: Experimental results for diversity-reward shaping. Rewarding diverse responses (red line) significantly improves task accuracy and maintains higher entropy.
3.5 RFT with Human in the Loop
This example demonstrates the human-in-the-loop capability in Trinity-RFT for preference modeling. As illustrated in Listing 6 and Figure 15, our framework integrates Label Studio’s annotation interface with asynchronous data pipelines through four coordinated stages: (1) task generation: auto-creating annotation batches from model rollouts; (2) interactive labeling: providing UI for side-by-side response comparison; (3) quality control: enforcing inter-annotator agreement thresholds; and (4) versioned storage: tracking preference lineage in pre-defined fields like those in DPODataModel.
This pipeline reflects Trinity-RFT’s bi-directional collaboration feature (Section 2.3.4), backed by timeout-aware task polling and atomic batch commits. It enables hybrid procedures in which initial AI pre-screening reduces human workload in production deployments, and annotation activities can scale across distributed teams through event-driven task routing. This flexibility enables rapid adaptation to diverse annotation protocols: developers can implement custom labeling interfaces through XML-based templates or integrate third-party annotation services via unified SDK endpoints. It underpins advanced use cases such as safety red-teaming datasets and online instruction tuning, where human judgment remains irreplaceable for quality-critical decisions, particularly in human-centric sociocultural contexts where data quality, difficulty, and reward signals are difficult to verify logically.
```python
# Human annotation configuration
from typing import Dict, List


class HumanAnnotationConfig:
    """Preference annotation pipeline configuration."""

    def __init__(self):
        self.process = [
            {
                "human_preference_annotation_mapper": {
                    "wait_for_annotations": True,  # Block until annotations complete
                    "timeout": 3600,               # Maximum wait time in seconds
                    "prompt_key": "prompt",        # Source field for prompts
                    "answer1_key": "answer1",      # First candidate response
                    "chosen_key": "chosen",        # Selected response key
                }
            }
        ]

    def get_pipeline(self) -> List[Dict]:
        """Get the annotation processing pipeline."""
        return self.process
```
Listing 6: Configuration for human preference annotation.
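The timeout-aware polling and atomic batch commit behind this pipeline can be sketched as follows. This is a minimal illustration of the `wait_for_annotations`/`timeout` semantics from Listing 6, not Trinity-RFT’s actual API; the `fetch_completed` callback and the fake backend are assumptions introduced for the example:

```python
import time
from typing import Callable, Dict, List, Optional


def poll_annotations(
    fetch_completed: Callable[[], List[Dict]],
    batch_size: int,
    timeout: float,
    interval: float = 0.01,
) -> Optional[List[Dict]]:
    """Poll an annotation backend until a full batch is ready or the
    deadline passes. The full batch is returned for an all-or-nothing
    commit; on timeout, None is returned and nothing partial is committed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done = fetch_completed()
        if len(done) >= batch_size:
            return done[:batch_size]  # commit the whole batch atomically
        time.sleep(interval)
    return None  # timed out; no partial batch is committed


# Simulated backend (illustrative): one annotation completes per poll.
_store: List[Dict] = []

def fake_fetch() -> List[Dict]:
    _store.append({"prompt": "q", "chosen": "a"})
    return list(_store)

batch = poll_annotations(fake_fetch, batch_size=3, timeout=5.0)
```

A caller that receives `None` can re-enqueue the tasks via event-driven routing rather than blocking the training loop.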
<details>
<summary>x13.png Details</summary>

### Visual Description
## Screenshot: Label Studio Interface - Question Answering Task
### Overview
The image is a screenshot of the Label Studio interface, specifically showing a question answering task. It presents a series of multiple-choice questions, with one question actively highlighted for annotation. The interface elements suggest a data labeling workflow for training a machine learning model.
### Components/Axes
The screenshot displays the following components:
* **Header:** Contains the Label Studio logo, project navigation (Projects / Human\_Preference\_Annotation\_Demo\_8a87e7 / Labeling), user information ("admin@example.com"), and a version indicator ("v1").
* **Question List:** A vertical list on the left side of the screen containing multiple questions. Each question has a radio button next to it.
* **Active Question Area:** The central area displaying the currently selected question.
* **Answer Options:** Two rectangular boxes labeled "Paris" and "Lyon" representing the answer choices.
* **Sidebar:** A right-side sidebar with sections for "Info", "History", "Selection Details", "Regions", and "Relations". The "Regions" section indicates that no regions have been added.
### Detailed Analysis / Content Details
The active question is: "What is the capital of France?". The answer options are "Paris" and "Lyon".
The question list contains the following questions (transcribed):
1. What is the capital of France?
2. Which planet is known as the Red Planet?
3. What is the chemical symbol for gold?
4. Who wrote Romeo and Juliet?
5. What is the largest ocean on Earth?
6. In which year did WWII end?
7. What is the square root of 64?
8. Who painted the Mona Lisa?
9. What is the main component of the air we breathe?
10. Which programmer created WWW?
The sidebar shows the following sections:
* **Info:** (Empty)
* **History:** (Empty)
* **Selection Details:** (Empty)
* **Regions:** "Regions not added"
* **Relations:** (Empty)
### Key Observations
The interface is designed for annotators to select the correct answer to a question. The "Regions" section being empty suggests that this task does not involve bounding box or polygon annotation. The presence of "History" suggests that the interface tracks annotation changes.
### Interpretation
This screenshot demonstrates a typical setup for creating a labeled dataset for question answering. The Label Studio interface provides a user-friendly way to present questions and collect annotations (in this case, selecting the correct answer). The data collected through this interface would be used to train a machine learning model to answer similar questions. The interface is focused on simple selection-based annotation, indicating the task is likely to train a model to identify the correct answer from a limited set of options. The questions cover a range of general knowledge topics. The lack of any selected answer suggests the annotator has not yet completed the task.
</details>
Figure 15: An interactive interface for human preference annotation.
3.6 Low-Code Usage and Development with Trinity-Studio
<details>
<summary>figs/studio-showcase/dashboard.png Details</summary>

### Visual Description
## Screenshot: RFT Portal Dashboard
### Overview
The image is a screenshot of a web-based dashboard interface, likely for a data science or machine learning project. It presents three main sections: Training Portal, pgAdmin, and Label Studio, each represented by a card-like element. A top navigation bar provides access to other sections of the portal.
### Components/Axes
The top navigation bar contains the following links, from left to right:
* Dashboard
* pgAdmin
* Label Studio
* Training Portal
* Settings
The main body of the screenshot consists of three cards, arranged horizontally. Each card contains:
* A circular icon with two-letter abbreviation.
* A title.
* A short description.
* A button.
### Content Details
**Card 1: Training Portal**
* Icon: "TP" (in a blue circle)
* Title: Training Portal
* Description: "Access the training portal to manage your training data and models."
* Button: "Open Training Portal" (blue button)
**Card 2: pgAdmin**
* Icon: "DB" (in a green circle)
* Title: pgAdmin
* Description: "Manage your PostgreSQL databases with pgAdmin."
* Button: "Open pgAdmin" (green button)
**Card 3: Label Studio**
* Icon: "LS" (in a red circle)
* Title: Label Studio
* Description: "Label and annotate your data with Label Studio."
* Button: "Open Label Studio" (red button)
The main title of the dashboard is "RFT Portal Dashboard", displayed prominently at the top center of the screen.
### Key Observations
The dashboard provides quick access to key tools for a data science workflow: data labeling (Label Studio), database management (pgAdmin), and model training (Training Portal). The color-coding of the cards (blue, green, red) and buttons may be intended to visually differentiate the tools. The abbreviations "TP", "DB", and "LS" are used as icons for each tool.
### Interpretation
This dashboard appears to be designed for users involved in the development and deployment of machine learning models. It centralizes access to essential tools for data preparation, model training, and database management. The "RFT" in the dashboard title likely refers to the project or organization name. The interface is clean and intuitive, suggesting a focus on usability. The dashboard's layout and functionality indicate a streamlined workflow for managing the entire machine learning lifecycle.
</details>
(a) Trinity-Studio dashboard.
<details>
<summary>figs/studio-showcase/training-portal-click-run.jpg Details</summary>

### Visual Description
## Screenshot: Training Portal Interface
### Overview
This is a screenshot of a web-based training portal interface. The interface appears to be for configuring and monitoring a machine learning training job. It includes controls for setting hyperparameters, generating a configuration file, and displaying job status.
### Components/Axes
The interface is divided into several sections:
* **Top Navigation Bar:** Contains links to "Dashboard", "pgAdmin", "Label Studio", "Training Portal", and "Settings".
* **Training Portal Header:** Displays "Training Portal" on the left and a "TOOLS" button on the right.
* **Hyperparameter Controls:** Includes controls for "Micro Batch Size Per GPU" and "Learning Rate".
* **Configuration Generation:** A button labeled "Generate Config".
* **Generated Config File Section:** Contains "Save" and "Run" buttons.
* **Job Status:** Displays a success message and a link to the Ray Dashboard.
* **Configuration Output:** Displays a code block with configuration parameters.
### Detailed Analysis or Content Details
**Top Navigation Bar:**
* Dashboard
* pgAdmin
* Label Studio
* Training Portal
* Settings
**Hyperparameter Controls:**
* **Micro Batch Size Per GPU:** Currently set to 8. There are "+" and "-" buttons to adjust the value. The label includes "blue-badge".
* **Learning Rate:** Currently set to 1.0e-6. There are "+" and "-" buttons to adjust the value. The label includes "blue-badge".
**Configuration Generation:**
* Button: "Generate Config"
**Generated Config File Section:**
* Button: "Save" (light blue background)
* Button: "Run" (light green background)
**Job Status:**
* Message: "Job submitted successfully!" (green checkmark icon)
* Link: "View progress in the Ray Dashboard: http://127.0.0.1:8265"
**Configuration Output (Code Block):**
```
mode: both
data:
total_epochs: 20
batch_size: 96
```
### Key Observations
* The interface is clean and straightforward.
* The "blue-badge" label on the hyperparameter controls might indicate a specific configuration or feature set.
* The job submission was successful, and a link to the Ray Dashboard is provided for monitoring.
* The configuration output shows a "both" mode, 20 total epochs, and a batch size of 96.
### Interpretation
The screenshot depicts a user interface for initiating and managing a machine learning training process. The user can adjust the "Micro Batch Size Per GPU" and "Learning Rate" to fine-tune the training process. The "Generate Config" button likely creates a configuration file based on the selected hyperparameters. The successful job submission and link to the Ray Dashboard suggest that the training job is running on a Ray cluster. The configuration output provides details about the training setup, including the training mode ("both"), the number of epochs (20), and the batch size (96). The "both" mode could refer to a mixed-precision training strategy or a combination of data modalities. The Ray Dashboard link (http://127.0.0.1:8265) indicates that the training is likely happening locally on the user's machine.
</details>
(b) Start training on the “Training Portal” page.
<details>
<summary>figs/studio-showcase/pgadmin-select-table.jpg Details</summary>

### Visual Description
## Screenshot: pgAdmin Interface - Table Structure
### Overview
This is a screenshot of the pgAdmin interface, displaying the structure of the `experience_buffer` table within the `testdb` database. The interface includes a top navigation bar, a database selection dropdown, a table list on the left, and a table structure display in the main content area, along with an SQL query input field.
### Components/Axes
* **Top Navigation:** Dashboard, pgAdmin, Label Studio, Training Portal, Settings.
* **Database Dropdown:** Currently selected is `testdb`.
* **Table List (Left Panel):**
* xxxx
* sft_data_buffer
* rft_dataset
* task_buffer
* experience_buffer (Highlighted)
* dpo_data_buffer
* **Table Structure (Main Content):**
* **Column:** consumed, priority, serialized\_exp, id, reward, response, prompt
* **Type:** integer, double precision, bytea, integer, double precision, character varying, character varying
* **Nullable:** YES, YES, YES, NO, YES, YES, YES
* **SQL Query Input:** A text field labeled "Enter SQL query..." with an "EXECUTE QUERY" button.
* **Top-Right Corner:** "Local Development" dropdown.
### Detailed Analysis or Content Details
The `experience_buffer` table has the following structure:
| Column | Type | Nullable |
| ------------- | ----------------- | -------- |
| consumed | integer | YES |
| priority | double precision | YES |
| serialized_exp| bytea | YES |
| id | integer | NO |
| reward | double precision | YES |
| response | character varying| YES |
| prompt | character varying| YES |
The SQL query input field is empty.
### Key Observations
The table `experience_buffer` contains columns related to storing experience data, likely for a reinforcement learning or similar application. The `id` column is the only one marked as `NOT NULL`, suggesting it's the primary key. The `serialized_exp` column uses the `bytea` type, indicating it stores binary data, potentially a serialized object. The `response` and `prompt` columns suggest a conversational or interaction-based system.
### Interpretation
This screenshot depicts a database schema designed to store experiences, likely generated from interactions within a system. The presence of `prompt` and `response` columns suggests a system that processes inputs (prompts) and generates outputs (responses). The `reward` column indicates a mechanism for evaluating the quality of these responses, which is common in reinforcement learning. The `serialized_exp` column likely stores the state of the system at a given point in time. The `priority` column could be used for sampling experiences during training. The table is part of a `testdb` database, suggesting it's used for development or testing purposes. The interface is running in a "Local Development" environment. The presence of other tables like `sft_data_buffer`, `rft_dataset`, and `dpo_data_buffer` suggests a pipeline involving supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
</details>
(c) Manage data on the “pgAdmin” page.
<details>
<summary>figs/studio-showcase/label-studio-enter.jpg Details</summary>

### Visual Description
## Screenshot: Label Studio Home Page
### Overview
This is a screenshot of the Label Studio home page, a data labeling platform. The interface displays a welcome message, recent projects, and a resources section. The top navigation bar provides access to key sections like Dashboard, pgAdmin, Label Studio, Training Portal, and Settings.
### Components/Axes
The screenshot contains the following UI elements:
* **Top Navigation Bar:** Dashboard, pgAdmin, Label Studio, Training Portal, Settings.
* **Left Sidebar:** Label Studio logo and a hamburger menu icon.
* **Main Content Area:**
* Welcome section with a smiling face emoji and introductory text.
* "Create Project" button.
* "Invite People" button.
* "Recent Projects" section with project names and progress bars.
* "View All" link for Recent Projects.
* "Resources" section with links to Documentation, API Documentation, Release Notes, LabelStudio.io Blog, and Slack Community.
* **Footer:** Label Studio Version: Community.
* **Right Sidebar:** An advertisement labeled "AD".
### Detailed Analysis or Content Details
The screenshot displays the following specific information:
* **Top Navigation:** The navigation bar is positioned at the very top of the screen.
* **Welcome Message:** "Welcome 🥳 Let's get you started."
* **Recent Projects:**
* **Project 1:** "Human\_Preference\_Annotation\_Demo\_acc038" - 10 of 10 Tasks (100%) - Progress bar is fully filled.
* **Project 2:** "Human\_Preference\_Annotation\_Demo\_8a87e7" - 10 of 10 Tasks (100%) - Progress bar is fully filled.
* **Resources:**
* Documentation - Link icon.
* API Documentation - Link icon.
* Release Notes - Link icon.
* LabelStudio.io Blog - Link icon.
* Slack Community - Link icon.
* **Footer:** "Label Studio Version: Community"
### Key Observations
* Both listed recent projects are fully completed (100% of tasks finished).
* The interface is clean and straightforward, designed for ease of use.
* The resources section provides easy access to documentation and support.
* The presence of an advertisement suggests a freemium or community-supported model.
### Interpretation
The screenshot showcases the Label Studio platform's user-friendly interface, geared towards onboarding new users and providing quick access to essential features. The completed projects indicate a successful workflow, and the readily available resources suggest a commitment to user support. The "Community" version label in the footer implies that this is a free or open-source offering, potentially with limited features compared to a paid enterprise version. The overall design emphasizes simplicity and efficiency, aiming to streamline the data labeling process. The presence of an advertisement suggests a business model that relies on upselling to premium features or services.
</details>
(d) Process data on the “Label Studio” page.
Figure 16: Snapshots of Trinity-Studio.
Trinity-Studio provides visual interaction for the core capabilities of Trinity-RFT, designed to bridge the gap between system complexity and user accessibility. As shown in Figure 16(a), its three integrated modules — “Training Portal”, “pgAdmin”, and “Label Studio” — form a cohesive interface that supports low-code usage and development with Trinity-RFT, and makes it easy to monitor and track the full RFT pipeline with transparency.
- “Training Portal” (Figure 16(b)) implements configuration-to-execution procedures through declarative YAML editing with auto-completion and live validation that prevents misconfigurations. Furthermore, the integration of runtime metrics with tools like Wandb/TensorBoard directly supports the active data optimization feature by surfacing signals such as the difficulty distribution drifts and diversity metrics mentioned in Section 3.4. This transparency ensures that users can monitor in real time how data curation strategies affect RFT performance.
- “pgAdmin” (Figure 16(c)) reflects Trinity-RFT’s end-to-end data transformation capabilities by providing a visual panel for PostgreSQL-based storage. This design supports the versioned data lineage requirements of RFT, particularly for scenarios involving asynchronous training (Section 2.3.3). With intuitive SQL query builders, users can easily adjust schemas and audit training experiences and human annotation batches with fine-grained precision. This capability is valuable for rapid validation of active learning policies by cross-referencing training outcomes with metadata (e.g., difficulty scores and staleness in asynchronous mode).
- The “Label Studio” page (Figure 16(d)) operationalizes Trinity-RFT’s bi-directional human-AI collaboration capability (Section 2.3.4). Utilizing the provided task polling and atomic batch commit mechanisms, users can annotate data or experiences directly, enabling an asynchronous way to incorporate human feedback and dynamically influence data curation.
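As a sketch of the kind of audit query the pgAdmin panel enables, the snippet below mirrors the `experience_buffer` schema shown in Figure 16(c). It uses an in-memory SQLite database purely so the example is runnable (Trinity-RFT itself uses PostgreSQL), and the sample rows are invented for illustration:

```python
import sqlite3

# Stand-in for the PostgreSQL-backed buffer; column names follow the
# ``experience_buffer`` schema from Figure 16(c).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE experience_buffer (
           id INTEGER PRIMARY KEY,
           consumed INTEGER,
           priority REAL,
           serialized_exp BLOB,
           reward REAL,
           response TEXT,
           prompt TEXT
       )"""
)
conn.executemany(
    "INSERT INTO experience_buffer (consumed, priority, reward, response, prompt)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0.9, 1.0, "42", "What is 6*7?"),   # already consumed by the trainer
        (0, 0.5, 0.0, "41", "What is 6*7?"),   # unconsumed, low priority
        (0, 0.8, 1.0, "8", "What is 2**3?"),   # unconsumed, high priority
    ],
)

# Audit example: unconsumed experiences, highest priority first.
rows = conn.execute(
    "SELECT id, priority, reward FROM experience_buffer"
    " WHERE consumed = 0 ORDER BY priority DESC"
).fetchall()
```

Queries like this let users cross-reference rewards with sampling priority before an asynchronous trainer consumes the batch.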
By unifying these capabilities in a single UI, Trinity-Studio reduces the cognitive load of managing complex RFT procedures. For example, a researcher tuning a math reasoning task could use the Training Portal to adjust difficulty scoring parameters, view the resulting distribution shifts in the pgAdmin module, and then validate human annotators’ preferences in the Label Studio page. This end-to-end visibility can be useful for debugging and iterating RFT strategies, and complements the programmatic APIs of Trinity-RFT while maintaining full compatibility with CLI procedures.
We implement Trinity-Studio with the Single-Spa framework [33]. The modular architecture enables custom view development through JavaScript plugins and flexible extensions for general-purpose usage.
4 Conclusion and Next Steps
We have presented Trinity-RFT, a general-purpose, flexible, scalable and user-friendly framework for reinforcement fine-tuning of large language models. Trinity-RFT offers a path into “the era of experience” [32], by supporting applications in diverse scenarios with complex agent-environment interaction, and serving as a unified platform for exploring advanced methodologies in each stage of the complete RFT pipeline, at both macroscopic and microscopic levels.
Further development of Trinity-RFT includes incorporating more advanced RL algorithms (especially off-policy or asynchronous ones), making the choices of hyperparameters more adaptive and less reliant on manual tuning, augmenting data pipelines with smarter sampling strategies and data processing operations, and conducting more thorough experiments and evaluations with Trinity-RFT.
Acknowledgements
Trinity-RFT is built upon many excellent open-source projects, including but not limited to: verl [30] and PyTorch’s FSDP [8] for LLM training; vLLM [15] for LLM inference; Data-Juicer [2] for data-related functionalities; AgentScope [11] for agentic workflow; and Ray [20] for distributed runtime.
References
- [1] ChatLearn. https://github.com/alibaba/ChatLearn.
- [2] Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data-juicer: A one-stop data processing system for large language models. In International Conference on Management of Data, 2024.
- [3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [4] Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. Agent-rlvr: Training software engineering agents via guidance and environment rewards. arXiv, 2025.
- [5] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, 2025.
- [6] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023.
- [7] Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Mohammad Gheshlaghi Azar, Olivier Pietquin, and Matthieu Geist. Contrastive policy gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion. In EMNLP, 2024.
- [8] Pytorch FSDP. https://pytorch.org/docs/stable/fsdp.html.
- [9] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv, 2025.
- [10] Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv, 2025.
- [11] Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Ze Yu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. Agentscope: A flexible yet robust multi-agent platform. arXiv, 2024.
- [12] Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv, 2025.
- [13] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv, 2024.
- [14] Huggingface. https://huggingface.co/.
- [15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [16] Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions. arXiv, 2025.
- [17] Model context protocol servers. https://github.com/modelcontextprotocol/servers.
- [18] Mistral-AI. Magistral. arXiv, 2025.
- [19] Modelscope. https://www.modelscope.cn/home.
- [20] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging ai applications. arXiv, 2018.
- [21] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In NIPS, 2017.
- [22] Nccl. https://github.com/NVIDIA/nccl.
- [23] Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models. In The Thirteenth International Conference on Learning Representations, 2025.
- [24] Long Ouyang, Pamela Mishkin, Jeff Wu, Carroll L. Wainwright, Jacob Hilton, Amanda Askell, and Paul Christiano. Training language models to follow instructions with human feedback. arXiv, 2022.
- [25] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- [26] Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment. arXiv, 2024.
- [27] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
- [29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024.
- [30] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient rlhf framework. arXiv, 2024.
- [31] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021.
- [32] David Silver and Richard S. Sutton. Welcome to the era of experience. https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf, 2025.
- [33] Single-spa: A javascript framework for front-end microservices, 2025.
- [34] Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023.
- [35] Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv, 2025.
- [36] Qwen Team. Qwen2.5 technical report, 2025.
- [37] ROLL Team and Other ROLL Contributors. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv, 2025.
- [38] TensorBoard. https://www.tensorflow.org/tensorboard.
- [39] Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020-2025. Open source software available from https://github.com/HumanSignal/label-studio.
- [40] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- [41] Weights & Biases. https://wandb.ai/home.
- [42] Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye HAO, Jun Wang, and Kun Shao. DistRL: An asynchronous distributed reinforcement learning framework for on-device control agent. In The Thirteenth International Conference on Learning Representations, 2025.
- [43] Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training. arXiv, 2025.
- [44] LLM-Core-Team Xiaomi. Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining. arXiv, 2025.
- [45] Tianbing Xu. Training large language models to reason via EM policy gradient. arXiv, 2025.
- [46] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv, 2025.
- [47] Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, and Bolin Ding. Group-relative reinforce is secretly an off-policy algorithm: Demystifying some myths about grpo and its friends. arXiv, 2025.
- [48] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- [49] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
- [50] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv, 2025.