# In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
## Abstract
Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
footnotetext: *Equal contribution. †Co-senior authors. Work was partially done while ZL and HZ were visiting Stanford.
Figure 1: Left: Performance of AgentFlow with a 7B-scale backbone before and after Flow-GRPO tuning across ten diverse reasoning benchmarks. Flow-GRPO substantially improves performance by enhancing planning quality and tool-calling reliability. Right: AgentFlow achieves consistent gains over top baselines, including base LLMs, tool-integrated RL models, and training-free agentic systems. All 7B results use Qwen2.5-7B-Base/Instruct as the backbone and tools.
## 1 Introduction
Recent advances in large language models (LLMs) have unlocked remarkable reasoning capabilities, largely driven by reinforcement learning (RL) from outcome-based feedback. By fine-tuning models to maximize verifiable rewards, LLMs like DeepSeek-R1 (Guo et al., 2025) and SimpleRL (Zeng et al., 2025b) have demonstrated sophisticated behaviors in self-correction and multi-step deduction.
A complementary line of work augments LLMs with external tools (e.g., web search, code execution) for knowledge retrieval and precise computation. Tool-integrated reasoning (TIR) extends reinforcement learning with verifiable rewards to learn when and how to call tools by interleaving reasoning (e.g., <think>) with tool invocations (e.g., <tool_call>) under full context (Jin et al., 2025; Song et al., 2025; Chen et al., 2025; Feng et al., 2025). Early systems supported only a single tool type, whereas recent work enables multi-tool settings by encoding tool metadata into prompts (Dong et al., 2025; Qian et al., 2025a; Zhang et al., 2025). However, these methods still train a single, monolithic policy under multi-turn full-context reasoning, which introduces scaling challenges: (i) training becomes increasingly unstable as horizons lengthen, tool diversity grows, and environments shift with tool feedback (Wang et al., 2025c; Mai et al., 2025; Moonshot AI, 2025; Xue et al., 2025); and (ii) inference-time generalization remains brittle to unseen tasks or tools (Dong et al., 2025; Hu et al., 2025b).
Agentic systems (Wu et al., 2024; Hong et al., 2024; Hu et al., 2025b) offer a promising alternative to monolithic tool-integrated reasoning models. They consist of multiple modules, often distinct LLMs with prescribed roles (e.g., planner, critic) or specialized components with dedicated tools and capabilities (e.g., executor, coder), that coordinate via shared memory and inter-module communication. By decomposing problems into sub-goals and iterating over multiple turns, these systems can tackle tasks that demand diverse tools, long horizons, or multi-stage reasoning. However, achieving robust coordination in such systems ultimately requires training, since handcrafted logic or static prompting cannot reliably capture when and how modules should collaborate, adapt to evolving tool outputs, or recover from early mistakes. At the same time, they introduce new training challenges: modules coordinate sequentially, outcome feedback propagates through long reasoning chains, and state distributions shift with evolving tool outputs. As a result, most systems remain training-free, relying on handcrafted logic or prompting heuristics. While some employ supervised fine-tuning or preference optimization for key modules (Motwani et al., 2024; Park et al., 2025), these off-policy approaches are decoupled from live dynamics and learn poorly from downstream successes or failures. Thus, agentic systems struggle with sparse rewards, brittle adaptation, and inefficient orchestration in dynamic environments.
To address the central challenge of learning long-horizon reasoning with sparse rewards in tool-integrated agentic systems, we introduce AgentFlow, a trainable framework for effective planning and tool use (Figure 2). AgentFlow comprises four specialized modules (planner, executor, verifier, and generator) that interact iteratively over multiple turns via a shared evolving memory and a toolset. The system operates in the flow, with each turn cycling through planning, execution, and verification. Unlike prior agentic systems, AgentFlow directly optimizes its planner on-policy, inside the live multi-turn loop, allowing it to dynamically adapt to trajectories shaped by tool calls, verifier signals, and memory updates. This evolving memory serves as a deterministic, structured record of the reasoning process, enabling transparent state tracking, controllable behavior, and bounded context growth.
Figure 2: (a) Overview of AgentFlow, a trainable agentic system for in-the-flow planning and tool use. Four modules (planner, executor, verifier, generator) coordinate via a shared evolving memory $M$ and toolset $K$, given a query $q$. The planner policy is optimized on-policy inside the system's multi-turn loop to enable adaptive, long-horizon reasoning. (b) A single state transition, showing the action $a^t$, execution result $e^t$, and verifier signal $v^t$ that update the memory from $M^t$ to $M^{t+1}$.
To train the planner on-policy within this agentic system, we need to overcome the long-horizon credit assignment problem inherent to sparse, trajectory-level rewards. We introduce Flow-based Group Refined Policy Optimization (Flow-GRPO, Figure 4), an on-policy algorithm designed for this setting. Flow-GRPO operates on in-the-flow rollouts, which capture the full trajectory of states, actions, and tool events induced by the live system. Instead of attempting to assign credit with brittle, intermediate heuristics, we assign a single, verifiable final-outcome reward to the entire trajectory and broadcast it to every turn. This design effectively transforms the multi-turn reinforcement learning challenge into a series of single-turn updates: at each turn, the planner has access to the full memory context and receives a consistent reward signal aligned with global success. This approach, coupled with group-normalized advantages to stabilize training, enables robust credit assignment and allows the planner to learn effective long-horizon strategies from sparse feedback.
We evaluate AgentFlow on ten benchmarks across diverse reasoning domains, with results highlighted in Figure 1. In our main setting, all four modules use Qwen2.5-7B-Instruct (Yang et al., 2024a) as a backbone, with only the planner trained via Flow-GRPO. AgentFlow substantially outperforms top-performing specialized tool-integrated reasoning models and agentic systems, achieving average accuracy gains of 14.9% on knowledge-intensive search, 14.0% on broader agentic tasks, 14.5% on mathematical reasoning, and 4.1% on scientific reasoning (§ 4.2). Notably, our 7B-backbone system even surpasses the ~200B-parameter GPT-4o (Hurst et al., 2024) across all domains. The trained planner learns to optimize planning, enhance tool-calling reliability, and discover effective solution pathways (§ 4.3). Further analyses confirm that our in-the-flow optimization with Flow-GRPO is crucial, far surpassing offline supervised tuning (§ 4.4). Moreover, our training approach proves highly efficient, leading to increased rewards and condensed responses compared to traditional tool-integrated RL methods (§ 4.5). Finally, we demonstrate that these benefits generalize, with consistent gains from scaling backbone size and turn budget (§ 4.6).
Our work makes three key contributions: (1) We present AgentFlow, a trainable in-the-flow agentic system that directly optimizes its planner inside the multi-turn loop. By coordinating specialized modules through an evolving memory, it enables adaptive long-horizon planning and robust tool orchestration. (2) We introduce Flow-GRPO, an on-policy, outcome-driven algorithm that converts multi-turn RL into a sequence of tractable single-turn policy updates by broadcasting a single, verifiable final-outcome reward to every turn. (3) Through comprehensive experiments on ten benchmarks, we show that AgentFlow with a 7B backbone outperforms specialized baselines and even larger proprietary models. Further analyses reveal improved planning, enhanced tool-calling reliability, and positive scaling with model size and turn budgets.
## 2 Preliminary
Reinforcement learning for reasoning LLMs.
Recent progress in reasoning LLMs has been significantly driven by reinforcement learning from outcome feedback, using a verifiable reward signal (Shao et al., 2024; Yu et al., 2025). This paradigm fine-tunes a language model to maximize an outcome-based reward while remaining close to a reference policy. Formally, the objective is to optimize a policy LLM $\pi_\theta$ to generate a response $o$ for a given query $q$ from dataset $D$:
$$
\max_{\pi_\theta}\ \mathbb{E}_{q\sim D,\, o\sim\pi_\theta(\cdot\mid q)}\big[R(q,o)\big]-\beta\, D_{\mathrm{KL}}\left(\pi_\theta(o\mid q)\,\|\,\pi_{\mathrm{ref}}(o\mid q)\right), \tag{1}
$$
where $R(q,o)$ is the outcome-based reward, $\pi_{\mathrm{ref}}$ is a reference model to prevent policy collapse, and $\beta$ controls KL regularization. Algorithms like Group Relative Policy Optimization (GRPO) (Shao et al., 2024) implement this by sampling groups of responses, normalizing advantages by their rewards, and updating the policy with a clipped objective to encourage high-reward outputs.
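To make this recipe concrete, below is a minimal sketch of a GRPO-style update for a single query, assuming a hypothetical policy interface with `.sample()` and per-token `.logprob()` methods; it illustrates the general scheme rather than the authors' implementation.

```python
import torch

def grpo_style_update(policy, policy_old, ref_policy, q, reward_fn,
                      G=8, eps=0.2, beta=0.01):
    """Sketch of one GRPO-style update for a single query q (hypothetical interfaces)."""
    # 1) Sample a group of G responses from the frozen behavior policy.
    group = [policy_old.sample(q) for _ in range(G)]
    rewards = torch.tensor([reward_fn(q, o) for o in group], dtype=torch.float32)

    # 2) Group-relative advantages: each response is scored against its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 3) Clipped surrogate objective plus a KL penalty toward the reference model.
    loss = torch.zeros(())
    for o, A in zip(group, adv):
        ratio = (policy.logprob(q, o) - policy_old.logprob(q, o)).exp()   # per-token ratios
        surrogate = torch.min(ratio * A, torch.clamp(ratio, 1 - eps, 1 + eps) * A).mean()
        kl_est = (policy.logprob(q, o) - ref_policy.logprob(q, o)).mean() # simple KL estimate
        loss = loss - (surrogate - beta * kl_est)
    return loss / G  # minimizing this loss maximizes the Eq. 1 objective
```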
Figure 3: Comparison of two paradigms of LLMs with tool use. (a) Monolithic tool-integrated reasoning models train a single policy to interleave reasoning (e.g., <think>) and tool calls (e.g., <tool_call>) within a single, full-context trajectory. (b) Agentic systems decompose tasks across multiple specialized modules (e.g., planner, coder) that collaborate. These systems are typically training-free, orchestrated by handcrafted logic or prompting.
Tool-integrated reasoning models (LLM agents).
LLMs can be augmented with external tools to access knowledge and perform precise computation under reinforcement learning with an outcome-based reward. As shown in Figure 3 (a), the LLM interleaves reasoning and tool calls, producing a chain of thought within <think></think> tokens followed by tool invocations (e.g., <tool_call></tool_call>). The resulting trajectory $\tau$ is a sequence of model generations and tool observations: $\tau=\{s^1,a^1,e^1,\ldots,s^T,a^T\}$, where $s^t$ denotes the context, $a^t$ the generated action (thought + tool call), and $e^t$ the tool's execution result. The policy model $\pi_\theta$ is then trained to maximize a final outcome reward. Prior work has explored single- and multi-tool settings for search and code execution (Jin et al., 2025; Chen et al., 2025; Feng et al., 2025; Qian et al., 2025a).
Agentic systems with tool usage.
An alternative approach is the use of agentic systems (Wu et al., 2024; Hong et al., 2024; Lu et al., 2025). As shown in Figure 3 (b), these frameworks deploy multiple specialized modules, often distinct LLMs with carefully designed prompts and roles, within a collaborative workflow. By decomposing tasks and assigning subproblems to modules with dedicated tools and capabilities (e.g., planner, coder, critic), they can address complex problems such as web browsing, document processing, and multi-stage programming that exceed the scope of a single model. A central limitation, however, is that these systems are typically training-free: modules remain frozen pre-trained models orchestrated by handcrafted logic or prompting heuristics.
## 3 In-the-Flow Agentic System Optimization
We aim to bridge the gap between trainable but monolithic reasoning models and flexible yet static agentic systems. We present AgentFlow, a flexible and trainable agentic system that integrates four specialized modules with an evolving memory (§ 3.1). Unlike prior agentic systems, AgentFlow directly optimizes the planner inside its multi-turn loop (§ 3.2).
### 3.1 AgentFlow: An In-the-Flow Agentic System
We propose AgentFlow, a general-purpose tool-integrated agentic framework for solving complex reasoning tasks through fine-grained planning and effective tool use within a multi-turn architecture. As shown in Figure 2, the framework comprises four specialized modules (Action Planner $P$, Tool Executor $E$, Execution Verifier $V$, and Solution Generator $G$) coordinated by a shared evolving memory $M$ and a toolset $K$. These modules interact sequentially and iteratively to perform action planning, tool execution, context verification, and solution generation, thereby enabling tool-integrated reasoning across multiple turns.
We formalize AgentFlow's problem-solving process as a multi-turn Markov Decision Process (MDP). Given a query $q$ and a toolset $K$, the system proceeds for a variable number of turns. Let $M^t$ denote the memory state before turn $t$ (with $M^1$ initialized from $q$). At turn $t$, the planner $P$ (a trainable policy $\pi_\theta$) formulates a sub-goal, selects an appropriate tool $k\in K$, and retrieves relevant context from memory, producing an action: $a^t\sim\pi_\theta(a^t\mid q,K,M^t)$.
The executor $E$ invokes the chosen tool with context, yielding an execution observation $e^t\sim E(e^t\mid a^t,K)$. The verifier $V$ then evaluates whether $e^t$ is valid and whether the accumulated memory is sufficient to solve the query, producing a binary verification signal $v^t\sim V(v^t\mid q,e^t,M^t)$. If $v^t=0$, the memory is updated deterministically to incorporate new evidence: $M^{t+1}=f_{\mathrm{mem}}(M^t,a^t,e^t,v^t)$, where $f_{\mathrm{mem}}(\cdot)$ denotes the memory-update function, which records agent-process information in a concise, structured form along with contextual details such as time, turn index, and error signals.
The process repeats until $v^t=1$ (termination) or a predefined maximum turn budget is reached. Upon termination at turn $T$, the solution generator $G$ produces the final solution $o$, conditioned on the query and the accumulated memory: $o\sim G(o\mid q,M^T)$.
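For concreteness, the sketch below traces one AgentFlow rollout following the transitions above; the module callables (`planner`, `executor`, `verifier`, `generator`) and the memory helpers (`init_memory`, `f_mem`) are hypothetical interfaces rather than the released implementation.

```python
def agentflow_rollout(q, K, planner, executor, verifier, generator,
                      init_memory, f_mem, max_turns=10):
    """One in-the-flow rollout: plan -> execute -> verify each turn, then generate."""
    M = init_memory(q)                   # M^1 initialized from the query
    trajectory = []
    for t in range(1, max_turns + 1):
        a = planner(q, K, M)             # a^t: sub-goal, selected tool, tool context
        e = executor(a, K)               # e^t: generated command + execution result
        v = verifier(q, e, M)            # v^t: 1 if memory now suffices to answer q, else 0
        trajectory.append((a, e, v))
        if v == 1:                       # termination signal
            break
        M = f_mem(M, a, e, v)            # deterministic, structured memory update
    o = generator(q, M)                  # final solution from query + accumulated memory
    return trajectory, o
```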
This formulation decomposes multi-turn, tool-integrated reasoning into structured, observable transitions. After $T$ turns, the trajectory $\tau=\{(a^t,e^t,v^t)\}_{t=1}^{T}$ records the history of planning, execution, and verification. The joint generative process can be written as
$$
p_\theta\left(\{a^t,e^t,v^t\}_{t=1}^{T},\, o\mid q\right)=\Bigg[\prod_{t=1}^{T}\pi_\theta(a^t\mid q,K,M^t)\, E(e^t\mid a^t,K)\, V(v^t\mid q,e^t,M^t)\Bigg]\, G(o\mid q,M^T), \tag{2}
$$
where $\{a^t,e^t,v^t\}_{t=1}^{T}$ are explicit realizations of the latent reasoning chain. Importantly, unlike latent thoughts behind trajectories, our memory $M$ is an explicit and deterministic record of the reasoning process, ensuring transparency and controllability of multi-turn decisions.
Figure 4: Optimization for our proposed agentic system AgentFlow. Given a query $q$ , an evolving memory $M$ , and a toolset $K$ , the policy model generates actions that target sub-goals and select tools. It is trained via Flow-based Group Refined Policy Optimization (Flow-GRPO), which enables multi-turn reinforcement learning and stable optimization under collaborative dynamics.
### 3.2 In-the-Flow Reinforcement Learning Optimization
We target tool-integrated agentic systems operating under long-horizon tasks with sparse rewards. In this setting, the Action Planner (the trainable policy of AgentFlow) selects a sequence of interdependent actions while the state $(q,K,M^t)$ evolves with tool results and verifier feedback. Conventional offline training, e.g., supervised fine-tuning or preference fine-tuning on curated traces, optimizes the planner outside the active loop (Motwani et al., 2024; Park et al., 2025). This decoupling prevents real-time coordination with the executor, verifier, and solution generator, induces distribution shift between training and deployment, and provides limited guidance about which intermediate decisions truly matter. As a result, planners often adapt poorly to multi-turn dynamics; early errors cascade, and post-hoc fixes are brittle.
In-the-flow learning.
To address these issues, we optimize the planner in the flow of execution. We roll out the full AgentFlow system under the current policy, collect the actual trajectory $\tau$ of states, actions, and tool events it induces, and update the policy within the agentic system using a verifiable final-outcome signal. This exposes the multi-turn credit-assignment problem directly and trains the planner on the exact states it will face at inference. Our objective, Flow-GRPO, is designed to stabilize learning under sparse, trajectory-level rewards over multiple turns.
As established in § 3.1, rollouts in AgentFlow define a finite-horizon MDP with a variable horizon $T$. At turn $t$, the planner observes the state $(q,K,M^t)$, selects an action $a^t$, the executor and verifier return $(e^t,v^t)$, and the memory updates deterministically to $M^{t+1}$.
Policy optimization objective.
The planner policy $\pi_\theta$ is trained to maximize the expected return over on-policy rollouts. Let $R(\tau)$ be the reward for a complete trajectory $\tau$. The objective is:
$$
J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[R(\tau)\big], \qquad \theta^{*}=\arg\max_{\theta}J(\theta), \tag{3}
$$
where a rollout $\tau$ is the sequence of decisions $\{a^t\}_{t=1}^{T}$ generated on-policy by $\pi_\theta$.
Final-outcome reward.
Assigning credit to intermediate actions is challenging because each $a^t$ influences the final solution only indirectly, and its value may emerge only after several turns (e.g., error or improvement accumulation). To avoid brittle local feedback, we adopt a final-outcome-based reward: every action within a rollout receives the same global reward signal, based on the correctness of the final solution $o$ with respect to query $q$ and ground truth $y^*$:
$$
r=R(a^t)=\bar{R}(o,q,y^*), \quad \forall\, t=1,\dots,T, \tag{4}
$$
where $\bar{R}(o,q,y^*)\in\{0,1\}$ is assigned by an LLM-as-judge rubric for semantic, numeric, and option-level equivalence (see § E.3). This propagates a trajectory-level success signal back through the reasoning chain, aligning every decision $a^t$ with global correctness.
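A minimal sketch of this broadcast, where `llm_judge` stands in for the LLM-as-judge rubric of § E.3 (a hypothetical callable returning 0 or 1):

```python
def broadcast_final_reward(o, q, y_star, num_turns, llm_judge):
    """Eq. 4: assign the single verifiable outcome reward to every turn of a rollout."""
    r_final = float(llm_judge(o, q, y_star))   # 1 if o is judged equivalent to y*, else 0
    return [r_final] * num_turns               # same reward for t = 1, ..., T
```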
Objective function.
We formalize Flow-based Group Refined Policy Optimization for the planner. The goal is to optimize the policy $\pi_\theta$ by maximizing the expected return over a group of parallel rollouts. For each query-label pair from the training corpus $(q,y^*)\sim D$, we sample a group of $G$ on-policy trajectories $\{\tau_i\}_{i=1}^{G}$ by running the current behavior policy $\pi_{\theta_{\mathrm{old}}}$ inside AgentFlow, where $\tau_i=\{a_i^1,\ldots,a_i^{T_i},o_i\}$. Let $s_i^t=(q,K,M_i^t)$ be the state at turn $t$ of rollout $i$, $a_i^t$ the planner's action (a token sequence of length $|a_i^t|$), and $o_i$ the final response. This structure is key to addressing the long-horizon credit assignment challenge: by broadcasting a single trajectory-level reward to all turns, we effectively decompose the multi-turn RL problem into a set of independent, single-turn policy updates; we provide a formal proof of this equivalence and analyze its convergence properties in § B. Each update for an action $a_i^t$ is conditioned on the full historical context encapsulated in the state $s_i^t$ and receives the same global success signal, simplifying optimization. The objective is
$$
J_{\text{Flow-GRPO}}(\theta)=\mathbb{E}_{(q,y^*)\sim D,\ \{\tau_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_i^t|}\sum_{j=1}^{|a_i^t|}\min\Big\{\rho_{i,j}^{t}A_i^t,\ \mathrm{clip}\big(\rho_{i,j}^{t},1-\varepsilon,1+\varepsilon\big)A_i^t\Big\}-\beta\, D_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)\Bigg], \tag{5}
$$
where $T_i$ is the (variable) number of turns in rollout $i$ , and
$$
\rho_{i,j}^{t}=\frac{\pi_{\theta}\big(a_{i,j}^{t}\,\big|\,s_i^{t},a_{i,1:j-1}^{t}\big)}{\pi_{\theta_{\mathrm{old}}}\big(a_{i,j}^{t}\,\big|\,s_i^{t},a_{i,1:j-1}^{t}\big)} \tag{6}
$$
is the token-level importance ratio for the $j$-th token of $a_i^t$, $\varepsilon>0$ is the PPO clipping parameter, and $\beta>0$ controls the KL penalty to a fixed reference policy $\pi_{\mathrm{ref}}$.
Group-normalized advantages.
Because the reward in Eq. 4 is a single trajectory-level signal, the per-turn advantage $A_i^t$ is constant over $t$ within a rollout $i$ . We reduce variance and sharpen credit assignment across the group by using a group-normalized advantage:
$$
A_i^{t}=\frac{\bar{R}(o_i,q,y^*)-\mathrm{mean}\big(\{\bar{R}(o_k,q,y^*)\}_{k=1}^{G}\big)}{\mathrm{std}\big(\{\bar{R}(o_k,q,y^*)\}_{k=1}^{G}\big)}. \tag{7}
$$
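Putting Eqs. 5-7 together, the following is a minimal PyTorch-style sketch of the Flow-GRPO surrogate for a single query, assuming per-token log-probabilities for each turn of each rollout have already been gathered under the current, behavior, and reference policies. The tensor layout, the default clipping value, and the simple per-turn KL estimate are illustrative choices, not the training implementation.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, outcomes, eps=0.2, beta=0.001):
    """Sketch of the Flow-GRPO objective (Eqs. 5-7) for one query.

    logp_new / logp_old / logp_ref: nested lists indexed [rollout][turn], each
        entry a 1-D tensor of per-token log-probs of action a_i^t under the
        current, behavior, and reference policies (an assumed layout).
    outcomes: tensor of shape (G,) holding the 0/1 final-outcome rewards.
    """
    # Group-normalized advantages (Eq. 7): constant across turns within a rollout.
    adv = (outcomes - outcomes.mean()) / (outcomes.std() + 1e-8)

    total, G = 0.0, len(logp_new)
    for i in range(G):
        turn_terms = []
        for t in range(len(logp_new[i])):
            ratio = torch.exp(logp_new[i][t] - logp_old[i][t])      # rho_{i,j}^t (Eq. 6)
            clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
            surrogate = torch.minimum(ratio * adv[i], clipped * adv[i])
            kl = (logp_new[i][t] - logp_ref[i][t]).mean()           # simple KL estimate
            turn_terms.append(surrogate.mean() - beta * kl)         # average over |a_i^t| tokens
        total = total + torch.stack(turn_terms).mean()              # average over T_i turns
    return -(total / G)   # negate: maximize the objective by minimizing the loss
```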
Technical contribution summary. To tackle long-horizon, sparse-reward training in multi-module agentic systems, we propose Flow-GRPO. The algorithm (i) reformulates the multi-turn RL problem in agentic systems as a series of tractable single-turn policy updates, and (ii) broadcasts a single trajectory-level outcome to every turn to align local planner decisions with global success. Training uses an LLM-based rubric to assign verifiable final-outcome rewards, with group-normalized advantages, KL regularization, and clipping to stabilize learning.
## 4 Experiments
### 4.1 Experimental Setup
| Model | Size | Bamboogle | 2Wiki | HotpotQA | Musique | Search Avg. | $\Delta$ | GAIA | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-7B-Instruct | 7B-Inst | 12.0 | 23.0 | 21.0 | 6.0 | 15.5 | 41.8 | 3.2 | 29.9 |
| Qwen-2.5-14B-Instruct | 14B-Inst | 21.6 | 26.7 | 20.0 | 8.0 | 19.1 | 38.2 | 5.5 | 27.6 |
| Qwen-2.5-32B-Instruct | 32B-Inst | 24.0 | 26.7 | 27.0 | 6.0 | 20.9 | 36.4 | 9.5 | 23.6 |
| Llama-3.3-70B-Instruct | 70B-Inst | 18.4 | 22.7 | 52.0 | 16.0 | 27.3 | 30.0 | 3.2 | 29.9 |
| GPT-4o-mini (Hurst et al., 2024) | $\sim$8B | 40.8 | 35.6 | 41.0 | 15.0 | 33.1 | 24.2 | 7.1 | 26.0 |
| GPT-4o (Hurst et al., 2024) | $\sim$200B | 68.8 | 49.5 | 54.0 | 24.0 | 49.1 | 8.2 | 17.3 | 15.8 |
| Supervised Fine-Tuning (SFT) | 7B-Inst | 12.0 | 25.9 | 22.0 | 6.6 | 16.6 | 40.7 | 3.2 | 29.9 |
| Iter-RetGen (Shao et al., 2023) | 7B-Inst | 36.8 | 33.6 | 37.4 | 17.8 | 31.4 | 25.9 | 3.9 | 29.2 |
| Search-R1 (Jin et al., 2025) | 7B-Inst | 43.2 | 38.2 | 37.0 | 14.6 | 33.3 | 24.0 | 19.1 | 14.0 |
| ZeroSearch (Sun et al., 2025) | 7B-Base | 27.8 | 35.2 | 34.6 | 18.0 | 28.9 | 28.4 | 16.5 | 16.6 |
| ReSearch (Chen et al., 2025) | 7B-Base | 42.4 | 47.6 | 43.5 | 22.3 | 39.0 | 18.3 | 17.3 | 15.8 |
| StepSearch (Wang et al., 2025d) | 7B-Base | 40.0 | 36.6 | 38.6 | 22.6 | 34.5 | 22.8 | – | – |
| VerlTool (Jiang et al., 2025) | 7B-Base | 46.4 | 45.3 | 44.8 | 19.3 | 39.0 | 18.3 | 11.2 | 21.9 |
| AutoGen (Wu et al., 2024) | 7B-Inst | 59.6 | 44.0 | 50.0 | 15.9 | 42.4 | 14.9 | 6.3 | 26.8 |
| AgentFlow | 7B-Inst | 58.4 | 60.0 | 51.3 | 19.2 | 47.2 | 12.1 | 17.2 | 15.9 |
| AgentFlow (w/ Flow-GRPO) | 7B-Inst | 69.6 | 77.2 | 57.0 | 25.3 | 57.3 | – | 33.1 | – |
Table 1: Accuracy comparison on search-intensive and agentic tasks. 7B-Base refers to Qwen-2.5-7B-Base and 7B-Inst refers to Qwen-2.5-7B-Instruct. AutoGen and our AgentFlow method are agentic systems, which use Qwen-2.5-7B-Instruct for the LLM-powered agents and tools for fair comparison. The $\Delta$ columns report the gain of AgentFlow (w/ Flow-GRPO) over each baseline.
| Model | Size | AIME24 | AMC23 | GameOf24 | Math Avg. | $\Delta$ | GPQA | MedQA | Sci. Avg. | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-7B-Instruct | 7B-Inst | 6.7 | 47.5 | 33.0 | 29.1 | 22.5 | 34.0 | 66.0 | 50.0 | 13.5 |
| Qwen-2.5-14B-Instruct | 14B-Inst | 6.7 | 60.0 | 25.0 | 30.6 | 21.0 | 31.0 | 75.0 | 53.0 | 10.5 |
| Llama-3.3-70B-Instruct | 70B-Inst | 6.7 | 47.5 | 31.0 | 28.4 | 23.1 | 35.0 | 67.0 | 51.0 | 12.5 |
| Llama-3.1-405B-Instruct | 405B-Inst | 26.7 | 47.5 | 23.0 | 32.4 | 19.1 | 30.0 | 62.0 | 46.0 | 17.5 |
| GPT-4o-mini (Hurst et al., 2024) | $\sim$8B | 13.3 | 57.5 | 16.0 | 28.9 | 22.6 | 27.0 | 66.0 | 46.5 | 17.0 |
| GPT-4o (Hurst et al., 2024) | $\sim$200B | 13.3 | 60.0 | 32.0 | 35.1 | 16.4 | 31.0 | 60.0 | 45.5 | 18.0 |
| Supervised Fine-Tuning (SFT) | 7B-Inst | 6.7 | 47.5 | 33.0 | 29.1 | 22.5 | 34.0 | 66.0 | 50.0 | 13.5 |
| SimpleRL-reason (Zeng et al., 2025b) | 7B-Base | 16.7 | 60.0 | 33.0 | 36.6 | 15.0 | 45.0 | 65.0 | 50.0 | 13.5 |
| Open-Reasoner-Zero (Hu et al., 2025a) | 7B-Base | 16.7 | 54.9 | 32.0 | 34.5 | 17.0 | 34.0 | 54.0 | 44.0 | 19.5 |
| General-Reasoner (Ma et al., 2025) | 7B-Base | 13.3 | 55.0 | 33.0 | 33.8 | 17.7 | 35.5 | 61.0 | 48.3 | 15.2 |
| Luffy (Yan et al., 2025) | 7B-Inst | 30.7 | 44.8 | 33.0 | 36.2 | 15.3 | 34.0 | 77.0 | 55.5 | 8.0 |
| TIR (Yang et al., 2024b) | 7B-Inst | 10.0 | 50.0 | 33.0 | 31.0 | 20.5 | 42.0 | 76.8 | 59.4 | 4.1 |
| ToRL (Li et al., 2025b) | 7B-Inst | 20.0 | 60.0 | 31.0 | 37.0 | 14.5 | 35.0 | 76.5 | 55.8 | 7.7 |
| AutoGen (Wu et al., 2024) | 7B-Inst | 13.3 | 57.5 | 24.0 | 31.6 | 19.9 | 42.0 | 72.0 | 57.0 | 6.5 |
| AgentFlow | 7B-Inst | 16.7 | 47.4 | 31.0 | 31.7 | 19.8 | 37.0 | 76.0 | 56.5 | 7.0 |
| AgentFlow (w/ Flow-GRPO) | 7B-Inst | 40.0 | 61.5 | 53.0 | 51.5 | – | 47.0 | 80.0 | 63.5 | – |
Table 2: Accuracy comparison on mathematical and scientific reasoning tasks. As in Table 1, AutoGen and AgentFlow use Qwen-2.5-7B-Instruct for the LLM-powered agents and tools, and the $\Delta$ columns report the gain of AgentFlow (w/ Flow-GRPO) over each baseline.
Implementation.
In our main experiments, all modules (Action Planner, Tool Executor, Executive Verifier, and Solution Generator) are instantiated with the Qwen2.5-7B-Instruct model (Yang et al., 2024a). Among these, only the Action Planner is trainable. The system operates with five interactive tools: Base Generator is an instance of Qwen2.5-7B-Instruct that acts as the default reasoning engine when the planner decides not to use an external tool; Python Coder generates and executes Python code for a given query and returns the execution result; Google Search queries the web and returns a summary of the top-K results; Wikipedia Search retrieves articles matching a given query and returns a summary; and Web Search returns summarized information from a given web page. During the RL fine-tuning phase, we mix data from Search-R1 (Jin et al., 2025) and DeepMath (He et al., 2025), which provides paired question-answer examples across search and mathematical domains.
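For concreteness, the module and tool wiring described above could be captured in a configuration such as the sketch below; the key names and structure are illustrative placeholders rather than the released configuration format, and only the model names and tool roles come from the text.

```python
# Hypothetical configuration mirroring the setup described above; key names
# and structure are illustrative, not the released format.
AGENTFLOW_CONFIG = {
    "modules": {
        "action_planner":     {"model": "Qwen2.5-7B-Instruct", "trainable": True},
        "tool_executor":      {"model": "Qwen2.5-7B-Instruct", "trainable": False},
        "executive_verifier": {"model": "Qwen2.5-7B-Instruct", "trainable": False},
        "solution_generator": {"model": "Qwen2.5-7B-Instruct", "trainable": False},
    },
    "tools": {
        "base_generator":   "default reasoning engine (Qwen2.5-7B-Instruct)",
        "python_coder":     "generates and executes Python code, returns the result",
        "google_search":    "web search, returns a summary of the top-K results",
        "wikipedia_search": "retrieves matching articles, returns a summary",
        "web_search":       "returns summarized information from a given web page",
    },
}
```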
Training.
We provide further details on the training setup for AgentFlow. Our Flow-GRPO implementation uses a learning rate of $1\times10^{-6}$. The Action Planner generates actions with a sampling temperature of $0.5$ to balance exploration and exploitation. To prevent policy collapse and stabilize training, we incorporate a KL-divergence penalty against a reference policy with coefficient $\beta=0.001$. The maximum output length for the planner is set to 2048 tokens to ensure complete exploration during rollouts. We use a batch size of 32 with 8 rollouts per sample.
To accelerate the training speed, we limit the maximum number of turns per rollout to $3$ . The final-outcome reward signal (Eq. 4) is provided by an LLM-as-judge, for which we use GPT-4o. All tool calls are executed synchronously with a 500-second timeout to handle external service latency robustly. The LLM engines within the tools are set to a temperature of 0.0 to ensure deterministic and stable outputs. The full training process was conducted on 8 NVIDIA A100 GPUs. Further details on agent prompts and the memory update mechanism are provided in § E.1.
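The hyperparameters above can be collected into a single sketch for reference; the field names are illustrative, while the values are those reported in the text.

```python
# Training hyperparameters reported above, gathered into one illustrative dict.
FLOW_GRPO_TRAINING = {
    "learning_rate": 1e-6,
    "planner_temperature": 0.5,     # sampling temperature of the Action Planner
    "kl_coefficient": 0.001,        # beta in Eq. 5
    "max_planner_tokens": 2048,
    "batch_size": 32,
    "rollouts_per_sample": 8,       # group size G
    "max_turns_per_rollout": 3,
    "reward_judge": "gpt-4o",       # LLM-as-judge for the final-outcome reward (Eq. 4)
    "tool_timeout_seconds": 500,
    "tool_llm_temperature": 0.0,    # deterministic outputs for tool-internal LLM calls
}
```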
Evaluation.
To comprehensively evaluate tool-use capabilities of AgentFlow, we conduct experiments on four types of reasoning tasks: (1) Knowledge-intensive search including Bamboogle (Press et al., 2023), 2Wiki (Ho et al., 2020), HotpotQA (Yang et al., 2018), and Musique (Trivedi et al., 2022); (2) Agentic reasoning such as GAIA (Mialon et al., 2023) (where we adopt the textual split); (3) Logic-dense mathematical reasoning including AIME2024 (Art of Problem Solving, 2025), AMC23 (MAA, 2023), and GameOf24 (Lightman et al., 2023); and (4) Scientific reasoning including GPQA (Rein et al., 2024) and MedQA (Yang et al., 2024c). To mitigate randomness, we report the average accuracy across three trials for all experiments. More evaluation details are provided in § C.
### 4.2 Main Results
Baselines.
As presented in Tables 1 and 2, we include five categories of baselines: (1) Open-source LLMs: Qwen2.5 (Yang et al., 2024a), Llama-3.1, and Llama-3.3 (Dubey et al., 2024); (2) Proprietary LLMs: GPT-4o-mini and GPT-4o; (3) Reasoning LLMs: supervised fine-tuning (Yang et al., 2024b), SimpleRL-reason, Open-Reasoner-Zero, General-Reasoner, and LUFFY; (4) Tool-integrated reasoning LLMs: both search-enhanced, including Iter-RetGen, Search-R1, ZeroSearch, ReSearch, StepSearch, and VerlTool, and code-enhanced, including TIR and ToRL; (5) Training-free agentic system: AutoGen. More details on baseline implementations are in § C.2.
Key insights.
AgentFlow consistently outperforms all baseline models by large margins. Compared to the best-performing 7B models without tool integration, AgentFlow achieves absolute gains of 40.7% on search (SFT), 29.9% on agentic reasoning (SFT), 15.0% on math (SimpleRL-reason), and 8.0% on scientific tasks (Luffy). Against specialized tool-integrated systems, AgentFlow surpasses the top models by 14.9% in search (AutoGen), 14.0% in agentic reasoning (Search-R1), 14.5% in math (ToRL), and 4.1% in science (TIR). Notably, our 7B-backbone AgentFlow even outperforms the $\sim$200B-parameter GPT-4o across all domains, with gains ranging from 8.2% to 18.0%. A detailed analysis is provided in § D.1.
### 4.3 In-depth Analysis of Optimized Planning
Flow-GRPO optimizes tool usage.
We compare tool usage distributions before and after in-the-flow RL training. Figure 5 shows results on two knowledge-intensive tasks, 2Wiki and MedQA, which exhibit distinct optimization patterns alongside improved task accuracy. For 2Wiki, which requires broad factual knowledge, Flow-GRPO optimizes the planner to increase Google Search usage by 42.0%. In contrast, for the specialized MedQA benchmark, which requires deep, domain-specific information retrieval, fine-tuning shifts the planner away from general tools, reducing Google Search calls (66.2% → 10.9%) in favor of in-document Web Search (0% → 19.5%) and specialized Wikipedia Search (0% → 59.8%). This demonstrates that the planner learns to select task-appropriate tools.
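Tool-usage statistics like those in Figure 5 can be computed from rollout logs with a simple count; the log format assumed below (one list of selected tool names per trajectory) is hypothetical.

```python
from collections import Counter

def tool_call_ratio(trajectories):
    """Percentage of planner turns assigned to each tool.

    `trajectories` is assumed to be a list of rollouts, each a list of the
    tool names the planner selected at every turn (a hypothetical log format).
    """
    counts = Counter(tool for traj in trajectories for tool in traj)
    total = sum(counts.values())
    return {tool: 100.0 * n / total for tool, n in counts.items()}

# Example with two logged rollouts:
ratios = tool_call_ratio([
    ["google_search", "python_coder"],
    ["wikipedia_search", "google_search"],
])
# -> {'google_search': 50.0, 'python_coder': 25.0, 'wikipedia_search': 25.0}
```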
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Charts: Tool Call Ratio and Accuracy Before and After Fine-tuning
### Overview
This image presents two bar charts, labeled (a) 2Wiki and (b) MedQA, comparing the "Tool Call Ratio (%)" for different search tools and a "Base Generator" at two distinct "Training Steps": "Step 0" (before fine-tuning) and "Step 32" (after fine-tuning). Each chart also displays an overall accuracy metric ("Acc") for both steps, along with the percentage change in accuracy after fine-tuning. The charts illustrate how fine-tuning impacts the utilization of various tools across two different datasets.
### Components/Axes
The image consists of two side-by-side bar charts, (a) on the left and (b) on the right, sharing a common legend positioned at the top-center.
* **Legend (Top-center):**
* Light Red: Base Generator
* Green: Google Search
* Blue: Web Search
* Purple: Wikipedia Search
* **Y-axis (Left side of both charts):**
* Title: "Tool Call Ratio (%)"
* Scale: Ranges from 0 to 80, with major grid lines and labels at 0, 10, 20, 30, 40, 50, 60, 70, 80.
* **X-axis (Bottom, shared across both charts):**
* Title: "Training Steps"
* Categories: "Step 0" and "Step 32" for each sub-chart.
* **Common Labels (Above the bars, between "Step 0" and "Step 32" for both charts):**
* Text: "After Fine-tuning"
* Visual: A gray arrow pointing from left to right, indicating the progression from "Step 0" to "Step 32".
* **Sub-chart Titles (Bottom-left of each chart):**
* (a) 2Wiki
* (b) MedQA
* **Accuracy Boxes (Top-left and Top-right above the bars for each chart):**
* **Chart (a) 2Wiki:**
* Above "Step 0": "Acc: 60.0%"
* Above "Step 32": "Acc: 77.2% (+17.2%)" (The "+17.2%" is colored red, indicating an increase).
* **Chart (b) MedQA:**
* Above "Step 0": "Acc: 76.0%"
* Above "Step 32": "Acc: 80.0% (+4.0%)" (The "+4.0%" is colored red, indicating an increase).
### Detailed Analysis
**Chart (a) 2Wiki**
* **Step 0 (Before Fine-tuning):**
* **Base Generator (Light Red):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* **Google Search (Green):** The bar reaches 28.5%.
* **Web Search (Blue):** The bar reaches 36.0%.
* **Wikipedia Search (Purple):** The bar reaches 28.8%.
* *Trend:* At Step 0, Web Search has the highest tool call ratio, followed closely by Wikipedia Search and Google Search, while the Base Generator is negligible.
* **Step 32 (After Fine-tuning):**
* **Base Generator (Light Red):** The bar remains very short, visually close to 0%, estimated at approximately 0.2%.
* **Google Search (Green):** The bar dramatically increases to 70.5%. An associated label "+42.0" (green) indicates a significant increase from Step 0.
* **Web Search (Blue):** The bar significantly decreases to 13.6%. An associated label "-22.4" (blue) indicates a decrease from Step 0.
* **Wikipedia Search (Purple):** The bar significantly decreases to 4.0%. An associated label "-24.8" (purple) indicates a decrease from Step 0.
* *Trend:* After fine-tuning, Google Search shows a massive increase in tool call ratio, becoming the dominant tool. Web Search and Wikipedia Search show substantial decreases, while the Base Generator remains minimal.
**Chart (b) MedQA**
* **Step 0 (Before Fine-tuning):**
* **Base Generator (Light Red):** The bar reaches 28.7%.
* **Google Search (Green):** The bar reaches 66.2%.
* **Web Search (Blue):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* **Wikipedia Search (Purple):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* *Trend:* At Step 0, Google Search has a very high tool call ratio, followed by the Base Generator. Web Search and Wikipedia Search are negligible.
* **Step 32 (After Fine-tuning):**
* **Base Generator (Light Red):** The bar significantly decreases to 6.3%. An associated label "-22.4" (red) indicates a decrease from Step 0.
* **Google Search (Green):** The bar significantly decreases to 10.9%. An associated label "-55.3" (green) indicates a substantial decrease from Step 0.
* **Web Search (Blue):** The bar dramatically increases to 19.5%. An associated label "+19.5" (blue) indicates a significant increase from Step 0.
* **Wikipedia Search (Purple):** The bar dramatically increases to 59.8%. An associated label "+59.8" (purple) indicates a massive increase from Step 0.
* *Trend:* After fine-tuning, Base Generator and Google Search show significant decreases in tool call ratio. Conversely, Web Search and Wikipedia Search show dramatic increases, with Wikipedia Search becoming the most utilized tool.
### Key Observations
* **Overall Accuracy Improvement:** Both datasets, 2Wiki and MedQA, show an increase in overall accuracy after fine-tuning, with 2Wiki experiencing a larger relative gain (+17.2%) compared to MedQA (+4.0%).
* **Divergent Tool Utilization Patterns:** Fine-tuning leads to drastically different tool utilization patterns between the 2Wiki and MedQA datasets.
* For **2Wiki**, fine-tuning strongly favors **Google Search**, which sees a massive increase in its tool call ratio (from 28.5% to 70.5%). Web Search and Wikipedia Search, which were moderately used before, become much less utilized. The Base Generator remains largely unused.
* For **MedQA**, fine-tuning shifts preference away from **Google Search** and the **Base Generator** (both seeing significant decreases) towards **Wikipedia Search** and **Web Search** (both seeing dramatic increases). Wikipedia Search becomes the dominant tool after fine-tuning.
* **Base Generator Role:** The "Base Generator" tool call ratio is consistently very low for 2Wiki both before and after fine-tuning. For MedQA, it starts at a moderate level (28.7%) but significantly decreases after fine-tuning (to 6.3%).
* **Magnitude of Change:** The changes in tool call ratio are substantial for most tools after fine-tuning, indicating a strong impact of the fine-tuning process on tool selection behavior.
### Interpretation
The data suggests that fine-tuning a model for specific datasets (2Wiki vs. MedQA) leads to specialized and optimized tool-calling strategies, rather than a universal improvement across all tools.
For the **2Wiki dataset**, the fine-tuning process appears to have learned that "Google Search" is the most effective tool for improving accuracy. The model's reliance on Google Search dramatically increases, while other search tools (Web Search, Wikipedia Search) become less relevant. This implies that for tasks within the 2Wiki domain, Google Search provides the most valuable information or is best integrated with the fine-tuned model's capabilities. The significant accuracy gain (+17.2%) for 2Wiki is strongly correlated with this increased reliance on Google Search.
For the **MedQA dataset**, the fine-tuning process identifies "Wikipedia Search" and "Web Search" as the primary tools for enhancing performance. The model significantly reduces its calls to "Base Generator" and "Google Search," which were initially more prominent. This indicates that for medical question-answering tasks (MedQA), information from Wikipedia and general web searches is more pertinent or effectively leveraged by the fine-tuned model. The smaller, but still positive, accuracy gain (+4.0%) for MedQA is achieved through this shift in tool preference.
The "Base Generator" generally plays a minor role, especially for 2Wiki, suggesting that for these tasks, external tools are almost always preferred over the base model's generation capabilities. Its decrease in MedQA further supports the idea that fine-tuning directs the model to more specialized external resources.
In essence, fine-tuning acts as a mechanism to learn which external tools are most beneficial for a given domain, leading to a highly specialized and efficient tool-calling strategy that maximizes accuracy, even if the preferred tools differ significantly across datasets. The "After Fine-tuning" process is not merely boosting existing tool usage but actively re-prioritizing and re-allocating tool calls based on the dataset's specific information needs.
</details>
Figure 5: Tool call ratio change by Flow-GRPO fine-tuning.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Chart Type: Line Chart - Calling Error Rate vs. Training Steps
### Overview
This image displays a 2D line chart illustrating the "Calling Error Rate (%)" on the Y-axis as a function of "Training Steps" on the X-axis. Four different methods or modelsâGAIA, 2Wiki, Bamboogle, and AIME24âare compared, each represented by a distinct colored line and marker. The chart shows how the calling error rate changes for each method as the number of training steps increases from 0 to 32. Additionally, percentage reduction values are displayed next to the final data point for each line, indicating the absolute decrease in error rate from 0 to 32 training steps.
### Components/Axes
* **X-axis Label**: "Training Steps"
* **X-axis Scale**: Numeric, ranging from 0 to 32.
* **X-axis Markers**: 0, 8, 18, 28, 32.
* **Y-axis Label**: "Calling Error Rate (%)"
* **Y-axis Scale**: Numeric, ranging from 0 to 50.
* **Y-axis Markers**: 0, 10, 20, 30, 40, 50.
* **Legend**: Located in the top-right quadrant of the plot area.
* **GAIA**: Green line with filled diamond markers.
* **2Wiki**: Magenta line with filled square markers.
* **Bamboogle**: Blue line with filled circle markers.
* **AIME24**: Orange line with filled diamond markers.
* **Additional Labels (Right side of plot)**:
* Next to the GAIA line (green): "-28.4%" (in a green box)
* Next to the 2Wiki line (magenta): "-19.4%" (in a magenta box)
* Next to the Bamboogle line (blue): "-7.8%" (in a blue box)
* Next to the AIME24 line (orange): "-8.4%" (in an orange box)
These values represent the absolute reduction in "Calling Error Rate (%)" from 0 to 32 Training Steps for each respective method.
### Detailed Analysis
The chart presents four data series, each tracking the Calling Error Rate (%) across five distinct Training Steps (0, 8, 18, 28, 32).
1. **GAIA (Green line with diamond markers)**:
* **Trend**: This line shows a consistent downward trend, indicating a decrease in calling error rate as training steps increase. It starts at the highest error rate among all methods and maintains the highest error rate throughout.
* **Data Points**:
* (Training Steps: 0, Calling Error Rate: ~51.5%)
* (Training Steps: 8, Calling Error Rate: ~40.5%)
* (Training Steps: 18, Calling Error Rate: ~33.5%)
* (Training Steps: 28, Calling Error Rate: ~26.5%)
* (Training Steps: 32, Calling Error Rate: ~23.1%)
* **Total Reduction (0 to 32 steps)**: -28.4 percentage points.
2. **2Wiki (Magenta line with square markers)**:
* **Trend**: This line also exhibits a clear downward trend, showing a reduction in calling error rate with more training steps. It starts as the second-highest error rate and remains so.
* **Data Points**:
* (Training Steps: 0, Calling Error Rate: ~34.2%)
* (Training Steps: 8, Calling Error Rate: ~28.0%)
* (Training Steps: 18, Calling Error Rate: ~20.5%)
* (Training Steps: 28, Calling Error Rate: ~18.5%)
* (Training Steps: 32, Calling Error Rate: ~14.8%)
* **Total Reduction (0 to 32 steps)**: -19.4 percentage points.
3. **Bamboogle (Blue line with circle markers)**:
* **Trend**: This line shows a relatively stable but decreasing trend in calling error rate. It starts at a moderate error rate and consistently decreases, ending as the second-lowest error rate.
* **Data Points**:
* (Training Steps: 0, Calling Error Rate: ~17.2%)
* (Training Steps: 8, Calling Error Rate: ~14.0%)
* (Training Steps: 18, Calling Error Rate: ~13.0%)
* (Training Steps: 28, Calling Error Rate: ~11.0%)
* (Training Steps: 32, Calling Error Rate: ~9.4%)
* **Total Reduction (0 to 32 steps)**: -7.8 percentage points.
4. **AIME24 (Orange line with diamond markers)**:
* **Trend**: This line shows an initial sharp decrease, then a plateau, followed by a slight increase, and finally another decrease. It starts with the lowest initial error rate and generally maintains the lowest error rate throughout, despite a minor fluctuation.
* **Data Points**:
* (Training Steps: 0, Calling Error Rate: ~11.2%)
* (Training Steps: 8, Calling Error Rate: ~2.5%)
* (Training Steps: 18, Calling Error Rate: ~2.5%)
* (Training Steps: 28, Calling Error Rate: ~5.0%)
* (Training Steps: 32, Calling Error Rate: ~2.8%)
* **Total Reduction (0 to 32 steps)**: -8.4 percentage points.
### Key Observations
* All four methods generally show a decrease in "Calling Error Rate (%)" as "Training Steps" increase, indicating that more training is beneficial for reducing errors.
* GAIA starts with the highest error rate (~51.5%) and ends with the highest (~23.1%), but also achieves the largest absolute reduction in error rate (-28.4 percentage points).
* AIME24 consistently demonstrates the lowest calling error rate across most training steps, starting at ~11.2% and ending at ~2.8%. It shows a significant initial drop by 8 training steps.
* The reduction in error rate for AIME24 is -8.4 percentage points, which is slightly more than Bamboogle's -7.8 percentage points, despite Bamboogle having a higher initial error rate.
* The rate of decrease varies among methods. GAIA and 2Wiki show steeper initial declines, while Bamboogle's decline is more gradual. AIME24 has a very sharp initial drop, then a more complex pattern.
* A notable anomaly is AIME24's slight increase in error rate between 18 and 28 training steps (from ~2.5% to ~5.0%) before dropping again.
### Interpretation
The data suggests that while all evaluated methods benefit from increased training steps in terms of reducing calling error rates, their baseline performance and the effectiveness of training vary significantly.
GAIA, despite having the highest initial error rate, shows the most substantial absolute improvement, reducing its error rate by over 28 percentage points. This indicates that GAIA might be a method with significant room for improvement through training, or perhaps it starts from a less optimized state.
AIME24 stands out as the best-performing method overall, consistently achieving the lowest calling error rates. Its ability to reach a very low error rate (~2.5%) with just 8 training steps is remarkable, suggesting high efficiency or a strong initial model. The slight increase in error rate for AIME24 between 18 and 28 training steps could be an interesting anomaly, possibly indicating a temporary overfitting, a change in the training data distribution, or a hyperparameter instability at that stage of training. However, it recovers and ends at a very low error rate.
2Wiki and Bamboogle fall in between GAIA and AIME24 in terms of performance. 2Wiki shows a good reduction in error rate, similar to GAIA's trend but at a lower magnitude. Bamboogle, while showing improvement, has the smallest absolute reduction in error rate, suggesting its performance is less sensitive to the additional training steps within this range, or it might be closer to its performance ceiling.
In summary, for applications where minimizing the calling error rate is paramount, AIME24 appears to be the superior choice, achieving very low error rates quickly. If a method starts with a high error rate but has significant potential for improvement with training, GAIA demonstrates the largest absolute gains. The choice of method would depend on the acceptable baseline error and the resources available for extensive training.
</details>
Figure 6: Tool-calling error rate over training steps.
Flow-GRPO enhances tool-calling efficacy.
A key aspect of the modelâs improvement is its increased reliability in tool usage. As shown in Figure 6, the tool-calling error rate consistently decreases across tasks during training, with a reduction of up to 28.4% on GAIA. This trend indicates that the training process not only teaches the model which tool to use but also how to invoke it correctly with proper arguments and format, leading to more robust and effective tool integration.
Flow-GRPO incentivizes autonomous discovery of new solutions.
We further examine qualitative examples in Figure 7 and additional cases in § F. These cases show that AgentFlow, trained with Flow-GRPO, develops enhanced capabilities for task planning and tool use. The planner exhibits adaptive efficiency, stronger self-correction, and spontaneous integration of new tools throughout step-by-step problem-solving, autonomously discovering effective solution pathways.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: AgentFlow Comparison (Before and After Fine-tuning)
### Overview
This image is a comparative diagram illustrating the problem-solving process of an AI agent, labeled "AgentFlow," before and after "Flow-GRPO Fine-tuning." It presents two parallel execution paths, one on the left representing the agent's behavior *before* fine-tuning and one on the right representing its behavior *after* fine-tuning, both attempting to answer the same question. Each path consists of a series of numbered steps, detailing the selected tool, sub-goal, tool call, and result, along with an indicator of success or failure.
### Components/Axes
**Global Elements (Top Section):**
* **Question (Top-left):** "Compute the check digit the Tropicos ID for the Helotiales would have if it were an ISBN-10 number."
* Associated icons: A brown mushroom, a barcode scanner, and a magnifying glass with "01**" inside.
* **Ground truth answer (Below Question):** "3"
**Flow Components (Main Body):**
The diagram is divided into two main vertical sections, each representing an AgentFlow version:
**Left Column: AgentFlow (before Flow-GRPO Fine-tuning)**
* **Title (Bottom-left):** "AgentFlow (before Flow-GRPO Fine-tuning)"
* **Logo (Bottom-left):** A grey/blue icon depicting three interconnected spheres.
* **Step Structure:** Each step is a rounded rectangular box containing:
* A grey circle with a step number (e.g., "1") on the left.
* "Selected tool:" followed by the tool name.
* "Sub-goal:" followed by the objective.
* "Tool call:" followed by the query or command.
* "Result:" followed by the outcome.
* A tool-specific icon (e.g., Wikipedia 'W', Google Chrome 'G', Python 'P') in the top-right corner of the box.
* A status indicator (green checkmark for success, red 'X' for failure) in the bottom-right corner of the result line.
* **Flow Direction:** Indicated by downward arrows connecting sequential steps.
**Right Column: AgentFlow (after Flow-GRPO Fine-tuning)**
* **Title (Bottom-right):** "AgentFlow (after Flow-GRPO Fine-tuning)"
* **Logo (Bottom-right):** A red/orange icon depicting three interconnected spheres.
* **Step Structure:** Identical to the left column, with the same elements and indicators.
* **Flow Direction:** Indicated by downward arrows connecting sequential steps.
### Detailed Analysis
**AgentFlow (before Flow-GRPO Fine-tuning) - Left Column:**
* **Step 1:**
* **Selected tool:** Wikipedia Search
* **Sub-goal:** Retrieve Tropicos ID of Order Helotiales from Wikipedia.
* **Tool call:** query="Tropicos ID for Order Helotiales"
* **Result:** No results found for query: Tropicos ID for Order Helotiales. (Failure - Red X)
* **Tool Icon:** Wikipedia 'W' logo.
* **Step 2:**
* **Selected tool:** Google Search
* **Sub-goal:** Retrieve the Tropicos ID of the Order.
* **Tool call:** query="Tropicos ID for the Order Helotiales"
* **Result:** The Tropicos ID for the Order Helotiales is 100370510. (Success - Green Checkmark)
* **Tool Icon:** Google Chrome 'G' logo.
* **Step 3:**
* **Selected tool:** Python Coder
* **Sub-goal:** Calculate the check digit for the Tropicos ID 100370510 if it were an ISBN-10 number.
* **Tool call:** query="Calculate check digit for ISBN-10 number 100370510"
* **Result:** name 'isbn' is not defined. (Failure - Red X)
* **Tool Icon:** Python 'P' logo.
* **Steps 4-9 (Summary Block):**
* **Text:** "Action steps 3 to 9 have similar sub-goals, commands, and results; the agent is stuck in error loops and unable fix the naming issue." (Failure - Red X)
* **Solution (Final Block):**
* **Text:** "The Python tool failed repeatedly with the same 'isbn' variable error). The agent failed to calculate the check digit." (Failure - Red X)
**AgentFlow (after Flow-GRPO Fine-tuning) - Right Column:**
* **Step 1:**
* **Selected tool:** Wikipedia Search (Truncated...)
* **Result:** No results found. (Failure - Red X)
* **Tool Icon:** Wikipedia 'W' logo.
* **Step 2:**
* **Selected tool:** Google Search (Truncated...)
* **Result:** Tropicos ID: 100370510. (Success - Green Checkmark)
* **Tool Icon:** Google Chrome 'G' logo.
* **Step 3:**
* **Selected tool:** Python Coder
* **Sub-goal:** Write and execute a Python script to calculate the check digit for 'tropicos_id' as if it were an ISBN-10 number.
* **Tool call:** query="Calculate the check digit for the ISBN-10 number" + tropicos_id
* **Result:** Error in execute_command: name 'tropicos_id' is not defined. (Failure - Red X)
* **Tool Icon:** Python 'P' logo.
* **Step 4:**
* **Selected tool:** Python Coder
* **Sub-goal:** Write and execute a Python script to calculate the check digit for 'tropicos_id' as if it were an ISBN-10 number.
* **Tool call:** query="Compute the check digit for 100370510 as an ISBN-10 number"
* **Result:** 3 (printed output). (Success - Green Checkmark)
* **Tool Icon:** Python 'P' logo.
* **Embedded Python Code:**
```python
def calculate_check_digit(isbn):
isbn_digits = [int(digit) for digit in isbn[:9]]
total_sum = sum(position * digit for position, digit in enumerate(isbn_digits, start=1))
check_digit = total_sum % 11
if check_digit == 10:
return 'X'
else:
return str(check_digit)
result = calculate_check_digit("100370510")
print(f"The check digit is {result}")
```
* **Solution (Final Block):**
* **Text:** "The check digit is 3, resulting in the full number 1003705103." (Success - Green Checkmark)
### Key Observations
* Both agents successfully identify the Tropicos ID (100370510) using Google Search after an initial failed Wikipedia search.
* Both agents initially encounter the same error when attempting to use the Python Coder tool: a `name 'isbn' is not defined` or `name 'tropicos_id' is not defined` error, indicating a variable scope or naming issue in their initial Python script generation.
* The "before" agent gets stuck in a repetitive error loop (steps 3-9), failing to resolve the Python variable naming issue.
* The "after" agent, despite an initial Python error in Step 3, successfully corrects its approach in Step 4 by generating a working Python script that correctly calculates the check digit.
* The "after" agent's final solution matches the ground truth answer (3), while the "before" agent fails to provide a solution.
* The Python code provided in the successful Step 4 of the "after" agent demonstrates a standard algorithm for calculating an ISBN-10 check digit.
### Interpretation
This diagram vividly illustrates the impact of "Flow-GRPO Fine-tuning" on an AI agent's ability to recover from errors and successfully complete a multi-step task.
The "AgentFlow (before Flow-GRPO Fine-tuning)" represents a common challenge in AI systems: brittleness and lack of robust error recovery. While it can perform initial information retrieval, it gets trapped in a repetitive failure mode when encountering a specific programming error. This suggests a lack of adaptive reasoning or an inability to learn from immediate feedback to correct its tool usage. The agent understands *what* it needs to do (calculate a check digit) and *what data* to use (100370510), but fails at the *how* due to a persistent technical detail (variable naming).
In stark contrast, the "AgentFlow (after Flow-GRPO Fine-tuning)" demonstrates significant improvement. Although it makes the same initial Python error as the unfined-tuned agent, it exhibits a crucial difference: it learns from this error and adapts its subsequent action. In Step 4, it successfully generates a correct Python script, indicating an enhanced capacity for self-correction, debugging, or more precise tool invocation. This suggests that the fine-tuning process has equipped the agent with a better understanding of tool constraints, error messages, or more effective strategies for generating executable code, allowing it to overcome obstacles and achieve the desired outcome. The successful execution of the Python code, which correctly computes the ISBN-10 check digit, confirms the agent's improved reasoning and problem-solving capabilities.
In essence, the diagram highlights the transition from an agent that gets stuck in a loop of failure to one that can learn, adapt, and ultimately succeed, underscoring the value of fine-tuning for building more robust and intelligent AI systems.
</details>
Figure 7: A case study example. The system before training fails with repetitive errors (left), while AgentFlow trained with Flow-GRPO explores a new solution pathway at turn 4 after two failed attempts (right).
### 4.4 Training Strategies on the Planner
We conduct an ablation study to analyze the impact of different training strategies for the Action Planner module in AgentFlow, with results reported in Table 3. The executor, verifier, and generator modules remain fixed as Qwen2.5-7B-Instruct, consistent with our main setup (§ 4.1).
| Planner | Training | Bamboogle | 2Wiki | GAIA | AIME24 | AMC23 | GameOf24 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-7B | Frozen | 58.4 | 60.0 | 17.2 | 16.7 | 47.4 | 31.0 | 38.5 |
| GPT-4o | Frozen | 65.0 ${}_{\uparrow 6.6}$ | 70.0 ${}_{\uparrow 10.0}$ | 23.6 ${}_{\uparrow 6.4}$ | 16.7 ${}_{\uparrow 0.0}$ | 48.7 ${}_{\uparrow 1.3}$ | 42.0 ${}_{\uparrow 11.0}$ | 44.3 ${}_{\uparrow 5.8}$ |
| Qwen-2.5-7B | SFT | 30.4 ${}_{\downarrow 28.0}$ | 32.7 ${}_{\downarrow 27.3}$ | 6.3 ${}_{\downarrow 10.9}$ | 3.3 ${}_{\downarrow 13.4}$ | 37.5 ${}_{\downarrow 9.9}$ | 7.0 ${}_{\downarrow 24.0}$ | 19.5 ${}_{\downarrow 19.0}$ |
| Qwen-2.5-7B | Flow-GRPO | 69.6 ${}_{\uparrow 11.2}$ | 77.2 ${}_{\uparrow 17.2}$ | 33.1 ${}_{\uparrow 15.9}$ | 40.0 ${}_{\uparrow 23.3}$ | 61.5 ${}_{\uparrow 14.1}$ | 53.0 ${}_{\uparrow 22.0}$ | 55.7 ${}_{\uparrow 17.2}$ |
Table 3: Performance comparison of AgentFlow across different planner training methods. Subscripts denote the change relative to the frozen Qwen-2.5-7B planner.
A more capable planner is beneficial, but has limits. Replacing the frozen Qwen2.5-7B-Instruct baseline with a stronger proprietary model, GPT-4o, yields only a modest 5.8% average gain. This points to a key bottleneck: while a more powerful model improves planning, its static nature prevents co-adaptation with the live dynamics of AgentFlow.
Offline SFT leads to performance collapse, while in-the-flow RL is crucial. The limitations of a static planner are further exposed when we distill GPT-4o's behavior via offline supervised fine-tuning (SFT) on its trajectories and use the resulting model as the Action Planner in AgentFlow. This results in a catastrophic performance collapse, with an average accuracy drop of 19.0% compared to the frozen baseline. This failure arises from the token-level imitation objective of SFT, which misaligns with trajectory-level task success and prevents the planner from adapting to dynamic tool feedback or recovering from compounding errors. In contrast, training the planner with our on-policy Flow-GRPO method proves highly effective: by optimizing for the final outcome, the planner learns to handle long-horizon workflows, achieving a 17.2% average gain over the frozen baseline.
### 4.5 Training Efficiency Analysis
<details>
<summary>x10.png Details</summary>

### Visual Description
## Chart Type: Dual Y-Axis Line Chart with Scatter Points: Training Reward and Response Length Trends
### Overview
This image displays a dual y-axis line chart illustrating the trends of "Training Reward (Acc.)" and "Response Length (tokens)" over "Training Steps". Both metrics are shown with smoothed trend lines and underlying scatter points representing individual data observations. The chart provides insights into how a model's performance (reward/accuracy) and output verbosity (response length) evolve during a training process.
### Components/Axes
The chart is structured with a horizontal x-axis at the bottom and two vertical y-axes, one on the left and one on the right. A legend is positioned at the top-center. A label "(a)" is present at the bottom-left corner.
* **X-axis (Bottom-Center):**
* **Label:** "Training Steps"
* **Range:** From 0 to approximately 65.
* **Major Ticks:** 0, 10, 20, 30, 40, 50, 60.
* **Minor Ticks:** Every 5 steps.
* **Left Y-axis (Left-Middle):**
* **Label:** "Training Reward (Acc.)"
* **Unit:** "(Acc.)" (likely representing Accuracy)
* **Range:** From approximately 0.45 to 0.8.
* **Major Ticks:** 0.5, 0.6, 0.7, 0.8.
* **Minor Ticks:** Every 0.05.
* **Right Y-axis (Right-Middle):**
* **Label:** "Response Length (tokens)"
* **Unit:** "(tokens)"
* **Range:** From approximately 170 to 230.
* **Major Ticks:** 170, 180, 190, 200, 210, 220, 230.
* **Minor Ticks:** Every 5 tokens.
* **Legend (Top-Center):**
* A rectangular box containing two entries:
* **Dark Blue Line:** "Training Reward Trend"
* **Orange Line:** "Response Length Trend"
### Detailed Analysis
The chart presents two primary data series, each with a trend line and corresponding scatter points.
1. **Training Reward Trend (Dark Blue Line and Light Blue/Purple Scatter Points):**
* **Visual Trend:** The dark blue line generally shows an upward trend, indicating an improvement in training reward over time, despite significant fluctuations. It starts lower, experiences a dip, then rises, fluctuates, and ends at a higher value.
* **Approximate Data Points:**
* At Training Step 0: ~0.52
* At Training Step 5: ~0.55
* At Training Step 10: ~0.50
* At Training Step 15: ~0.54
* At Training Step 20: ~0.55
* At Training Step 25: ~0.66 (local peak)
* At Training Step 30: ~0.58
* At Training Step 35: ~0.67
* At Training Step 40: ~0.62
* At Training Step 45: ~0.65
* At Training Step 50: ~0.73 (highest peak)
* At Training Step 55: ~0.63
* At Training Step 60: ~0.68
* At Training Step 65: ~0.75
* **Scatter Points:** Light blue/purple dots are distributed around the dark blue trend line, representing individual training reward values at various steps. They show a wider spread in the earlier steps and appear to cluster more closely around the trend line as training progresses.
2. **Response Length Trend (Orange Line and Light Orange Scatter Points):**
* **Visual Trend:** The orange line exhibits more volatility, particularly in the initial steps. It shows an initial increase, peaks around Training Step 20, then generally decreases and stabilizes with ongoing fluctuations towards the end of the training steps.
* **Approximate Data Points:**
* At Training Step 0: ~185 tokens
* At Training Step 5: ~178 tokens
* At Training Step 10: ~185 tokens
* At Training Step 15: ~190 tokens
* At Training Step 20: ~220 tokens (highest peak)
* At Training Step 25: ~205 tokens
* At Training Step 30: ~195 tokens
* At Training Step 35: ~205 tokens
* At Training Step 40: ~200 tokens
* At Training Step 45: ~190 tokens
* At Training Step 50: ~195 tokens
* At Training Step 55: ~190 tokens
* At Training Step 60: ~195 tokens
* At Training Step 65: ~190 tokens
* **Scatter Points:** Light orange dots are scattered around the orange trend line, representing individual response length values. These points show a significant spread, especially around the peak at Training Step 20, indicating variability in response length.
### Key Observations
* The "Training Reward Trend" (blue) generally improves from approximately 0.52 to 0.75 over 65 training steps, indicating positive learning.
* The "Response Length Trend" (orange) shows an initial surge, peaking at around 220 tokens at Training Step 20, then gradually decreases and stabilizes around 190-195 tokens by the end.
* There appears to be an interesting relationship between the two trends: the peak in "Response Length" around Training Step 20-25 coincides with the "Training Reward" beginning a more consistent upward trajectory after an initial plateau.
* Both metrics exhibit considerable fluctuations throughout the training process, suggesting dynamic and non-linear learning behavior.
* The scatter points provide a visual representation of the raw data variability, showing that the trend lines are smoothed averages of these fluctuating individual observations.
### Interpretation
This chart suggests a typical learning trajectory for a model, likely in a generative task where both performance (reward/accuracy) and output characteristics (response length) are monitored.
The initial phase (0-20 training steps) shows the model exploring its output space, as indicated by the increasing "Response Length Trend" which peaks around step 20. During this exploration, the "Training Reward" remains relatively stable or shows minor fluctuations, possibly because the model is still learning to generate coherent or relevant outputs, even if they are longer.
A critical point appears around Training Step 20-25. As the "Response Length" begins to decrease from its peak, the "Training Reward" starts a more pronounced and consistent upward climb. This could imply that:
1. The model initially learns to generate more verbose responses, perhaps to cover all possibilities or to avoid brevity penalties.
2. Subsequently, it learns to refine its responses, becoming more efficient and concise (reducing length) while simultaneously improving the quality or accuracy of those responses (increasing reward). This suggests that the model might be pruning unnecessary tokens or finding more optimal ways to convey information.
3. The initial phase of generating longer responses might be a necessary exploration step that lays the groundwork for later accuracy improvements. The model might first learn *what* to say (leading to longer responses) and then *how* to say it more effectively (leading to shorter, more accurate responses).
The continuous fluctuations in both trends are common in iterative training processes, often due to batch processing, stochastic gradient descent, or dynamic learning rates. However, the overall positive trend in "Training Reward" and the stabilization of "Response Length" at a more moderate level indicate successful learning and refinement of the model's output generation capabilities. The data demonstrates that higher accuracy does not necessarily correlate with longer responses; in fact, the model seems to achieve better performance with more optimized, potentially shorter, outputs after an initial exploratory phase.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Chart Type: Line Chart - Validation Reward vs. Training Steps
### Overview
This image displays a 2D line chart comparing the "Validation Reward (Acc.)" of two different methods, "Flow-GRPO (ours)" and "ToRL", over "Training Steps". The chart shows how the validation reward evolves as training progresses for both methods, highlighting their performance trends and relative effectiveness. The chart is labeled as "(b)" in the bottom-left corner, suggesting it might be part of a larger figure.
### Components/Axes
* **Chart Title**: No explicit title is provided, but the axes and legend indicate the content.
* **X-axis**:
* **Label**: "Training Steps"
* **Range**: Approximately from 0 to 32 steps.
* **Major Ticks**: Labeled at 0, 10, 20, 30.
* **Minor Ticks**: Present at intervals of 5 steps (e.g., 5, 15, 25).
* **Y-axis**:
* **Label**: "Validation Reward (Acc.)"
* **Range**: Approximately from 0.0 to 0.4.
* **Major Ticks**: Labeled at 0.0, 0.1, 0.2, 0.3, 0.4.
* **Minor Ticks**: Present at intervals of 0.05 (e.g., 0.05, 0.15, 0.25, 0.35).
* **Grid Lines**: Light gray dashed grid lines are present for both major X and Y axis ticks, aiding in data point estimation.
* **Legend**: Located in the top-right corner of the plot area.
* **Entry 1**: A blue line with solid circular markers (â) represents "Flow-GRPO (ours)".
* **Entry 2**: An orange line with solid square markers (â ) represents "ToRL".
* **Figure Label**: The character "(b)" is present in the bottom-left corner of the chart area.
### Detailed Analysis
The chart presents two data series, each representing the validation reward over training steps:
1. **Flow-GRPO (ours)** (Blue line with circular markers):
* **Trend**: This line generally shows an increasing trend in validation reward, especially in the latter half of the training steps, but with significant fluctuations. It starts relatively low, experiences an initial increase, a sharp dip, and then a strong recovery and ascent.
* **Data Points (approximate)**:
* At Training Step 0: Validation Reward is approximately 0.11.
* At Training Step 2: Validation Reward is approximately 0.14.
* At Training Step 4: Validation Reward is approximately 0.16.
* At Training Step 6: Validation Reward is approximately 0.16.
* At Training Step 8: Validation Reward is approximately 0.13.
* At Training Step 10: Validation Reward is approximately 0.20 (a peak).
* At Training Step 12: Validation Reward drops sharply to approximately 0.04 (a trough).
* At Training Step 14: Validation Reward recovers to approximately 0.13.
* At Training Step 16: Validation Reward is approximately 0.10.
* At Training Step 18: Validation Reward rises to approximately 0.20.
* At Training Step 20: Validation Reward remains around 0.20.
* At Training Step 22: Validation Reward peaks at approximately 0.30.
* At Training Step 24: Validation Reward drops to approximately 0.13.
* At Training Step 26: Validation Reward rises to approximately 0.27.
* At Training Step 28: Validation Reward remains around 0.27.
* At Training Step 30: Validation Reward increases to approximately 0.33.
* At Training Step 32: Validation Reward reaches its highest point at approximately 0.40.
2. **ToRL** (Orange line with square markers):
* **Trend**: This line shows a relatively stable but lower validation reward throughout the training steps, with minor fluctuations. It does not exhibit the same strong upward trend as Flow-GRPO.
* **Data Points (approximate)**:
* At Training Step 0: Validation Reward is approximately 0.17.
* At Training Step 2: Validation Reward is approximately 0.15.
* At Training Step 4: Validation Reward is approximately 0.16.
* At Training Step 6: Validation Reward is approximately 0.17.
* At Training Step 8: Validation Reward is approximately 0.16.
* At Training Step 10: Validation Reward is approximately 0.13.
* At Training Step 12: Validation Reward is approximately 0.10.
* At Training Step 14: Validation Reward rises to approximately 0.17.
* At Training Step 16: Validation Reward drops to approximately 0.10.
* From Training Step 18 to 24: Validation Reward remains consistently around 0.10.
* At Training Step 26: Validation Reward rises to approximately 0.13.
* At Training Step 28: Validation Reward remains around 0.13.
* At Training Step 30: Validation Reward drops to approximately 0.10.
* At Training Step 32: Validation Reward remains around 0.10.
### Key Observations
* **Initial Performance**: ToRL starts with a slightly higher validation reward (approx. 0.17) than Flow-GRPO (approx. 0.11) at Training Step 0.
* **Early Fluctuations**: Both methods show fluctuations in the early training steps (0-10). Flow-GRPO experiences an early peak around step 10 (0.20) before a significant drop.
* **Flow-GRPO's Dip**: A notable sharp decrease in Flow-GRPO's performance occurs around Training Step 12, where its reward drops to its lowest point (approx. 0.04), falling significantly below ToRL's performance at that point (approx. 0.10).
* **Recovery and Outperformance**: After the dip, Flow-GRPO demonstrates a strong recovery and a clear upward trend, consistently outperforming ToRL from approximately Training Step 18 onwards.
* **ToRL's Stability**: ToRL's performance remains relatively stable, hovering mostly between 0.10 and 0.17 throughout the entire training process, without significant improvements or major drops after the initial phase.
* **Final Performance**: At Training Step 32, Flow-GRPO achieves a validation reward of approximately 0.40, which is four times higher than ToRL's reward of approximately 0.10 at the same step.
### Interpretation
The data suggests that "Flow-GRPO (ours)" is a more effective method for achieving higher validation rewards over a longer training duration compared to "ToRL". While Flow-GRPO exhibits more volatility, including a significant performance dip early in training, its ability to recover and continuously improve leads to substantially better final performance. This indicates that Flow-GRPO might be exploring the reward landscape more aggressively or effectively, even if it encounters temporary setbacks.
Conversely, "ToRL" appears to be a more stable but less performant method. Its validation reward plateaus relatively early and remains consistently low, suggesting it might converge to a local optimum or have inherent limitations in achieving higher rewards within the given training steps. The initial higher performance of ToRL compared to Flow-GRPO is quickly surpassed, and ToRL fails to demonstrate any significant learning or improvement in the later stages.
The sharp dip in Flow-GRPO's performance around step 12 could be an artifact of the training process (e.g., a learning rate schedule change, exploration phase, or a particularly challenging batch of data), but its subsequent strong recovery and sustained growth highlight its robustness and potential for superior long-term performance. The "ours" designation for Flow-GRPO implies it is a novel method being presented, and the chart effectively demonstrates its advantage over the baseline or comparative method, ToRL, especially in terms of peak and final performance.
</details>
Figure 8: Training dynamics and efficiency of Flow-GRPO.
Optimized planning with increased rewards and condensed responses.
We analyze the training dynamics of the AgentFlow planner by tracking its average reward and response length on the training set (Figure 8a). Training rewards steadily increase, indicating effective policy improvement via Flow-GRPO. Meanwhile, response length, after an initial exploratory rise, progressively shortens and stabilizes. This shows the planner learns to balance conciseness and informativeness, avoiding unnecessarily long outputs.
Flow-GRPO efficiency over tool-integrated reasoning RL.
We compare AgentFlow (trained with Flow-GRPO) against a monolithic tool-integrated reasoning baseline (ToRL) on AIME24. As shown in Figure 8b, AgentFlow achieves sustained performance gains, with validation accuracy growing steadily. In contrast, ToRL's performance quickly stagnates and trends downward, highlighting the superior training efficiency of our agentic approach, which uses decomposition and stable credit assignment to avoid such instability.
### 4.6 Scaling Trends in AgentFlow
<details>
<summary>x12.png Details</summary>

### Visual Description
## Chart Type: Comparative Bar Charts of AgentFlow Accuracy
### Overview
The image displays two side-by-side bar charts, comparing the accuracy of "AgentFlow" models, specifically "Qwen-2.5-3B-Instruct" and "Qwen-2.5-7B-Instruct", across four different datasets. For each model and dataset, the charts show the accuracy "Before tuning" and "After tuning", illustrating the impact of the tuning process. The Y-axis represents "Accuracy (%)", and the X-axis lists the four datasets: "Bamboogle", "2Wiki", "GAIA", and "AIME24".
### Components/Axes
The image consists of two distinct bar charts, arranged horizontally.
**Common Elements for Both Charts:**
* **Y-axis Label**: "Accuracy (%)"
* **Y-axis Scale**: Ranges from 0 to 80, with major tick marks at 0, 20, 40, 60, and 80. Minor grid lines are present at intervals of 10.
* **X-axis Categories**: The horizontal axis for both charts displays the same four categories, from left to right: "Bamboogle", "2Wiki", "GAIA", and "AIME24".
* **Legend**: A legend is positioned in the top-right corner of the plotting area for each chart.
* A light blue rectangle represents "Before tuning".
* A red rectangle represents "After tuning".
**Left Chart Specifics:**
* **Title**: "AgentFlow (Qwen-2.5-3B-Instruct)"
**Right Chart Specifics:**
* **Title**: "AgentFlow (Qwen-2.5-7B-Instruct)"
### Detailed Analysis
**Left Chart: AgentFlow (Qwen-2.5-3B-Instruct)**
This chart evaluates the Qwen-2.5-3B-Instruct model. For each dataset, there are two vertical bars: a light blue bar for "Before tuning" and a red bar for "After tuning".
* **Bamboogle**:
* The light blue bar ("Before tuning") reaches an accuracy of 53.6%.
* The red bar ("After tuning") is significantly taller, reaching 68.8%.
* **Trend**: Tuning results in a substantial increase in accuracy for Bamboogle.
* **2Wiki**:
* The light blue bar ("Before tuning") shows an accuracy of 63.0%.
* The red bar ("After tuning") is taller, indicating an accuracy of 72.3%.
* **Trend**: Tuning leads to an improvement in accuracy for 2Wiki.
* **GAIA**:
* The light blue bar ("Before tuning") is relatively short, at 14.3%.
* The red bar ("After tuning") is significantly taller, reaching 29.1%.
* **Trend**: Tuning more than doubles the accuracy for GAIA.
* **AIME24**:
* The light blue bar ("Before tuning") is short, at 13.3%.
* The red bar ("After tuning") is taller, showing an accuracy of 20.0%.
* **Trend**: Tuning results in a notable increase in accuracy for AIME24.
**Right Chart: AgentFlow (Qwen-2.5-7B-Instruct)**
This chart evaluates the Qwen-2.5-7B-Instruct model. Similar to the left chart, each dataset has two bars: light blue for "Before tuning" and red for "After tuning".
* **Bamboogle**:
* The light blue bar ("Before tuning") shows an accuracy of 58.4%.
* The red bar ("After tuning") is taller, reaching 69.6%.
* **Trend**: Tuning leads to an improvement in accuracy for Bamboogle.
* **2Wiki**:
* The light blue bar ("Before tuning") shows an accuracy of 60.0%.
* The red bar ("After tuning") is significantly taller, reaching 77.2%.
* **Trend**: Tuning results in a substantial increase in accuracy for 2Wiki.
* **GAIA**:
* The light blue bar ("Before tuning") is relatively short, at 17.2%.
* The red bar ("After tuning") is significantly taller, reaching 33.1%.
* **Trend**: Tuning nearly doubles the accuracy for GAIA.
* **AIME24**:
* The light blue bar ("Before tuning") is short, at 16.7%.
* The red bar ("After tuning") is significantly taller, reaching 40.0%.
* **Trend**: Tuning results in a very substantial increase in accuracy for AIME24.
### Key Observations
1. **Consistent Improvement**: Across all four datasets and both Qwen models (3B and 7B), "After tuning" accuracy (red bars) is consistently higher than "Before tuning" accuracy (light blue bars). This indicates that the tuning process is effective in improving model performance.
2. **Varying Degrees of Improvement**: The magnitude of improvement varies.
* For the 3B model, Bamboogle and GAIA show the largest absolute gains (15.2% for Bamboogle, 14.8% for GAIA). AIME24 also shows a significant relative improvement (from 13.3% to 20.0%).
* For the 7B model, AIME24 shows the largest absolute gain (23.3%), followed by 2Wiki (17.2%) and GAIA (15.9%).
3. **Model Size Impact**:
* The Qwen-2.5-7B-Instruct model generally starts with higher "Before tuning" accuracies than the Qwen-2.5-3B-Instruct model on Bamboogle (58.4% vs 53.6%), GAIA (17.2% vs 14.3%), and AIME24 (16.7% vs 13.3%). For 2Wiki, the 3B model starts slightly higher (63.0% vs 60.0%).
* After tuning, the 7B model consistently achieves higher accuracies than the 3B model across all datasets:
* Bamboogle: 69.6% (7B) vs 68.8% (3B)
* 2Wiki: 77.2% (7B) vs 72.3% (3B)
* GAIA: 33.1% (7B) vs 29.1% (3B)
* AIME24: 40.0% (7B) vs 20.0% (3B)
4. **Dataset Performance**:
* Bamboogle and 2Wiki generally show higher baseline accuracies and higher post-tuning accuracies compared to GAIA and AIME24, for both models.
* GAIA and AIME24 start with much lower accuracies (around 13-17%) but show substantial relative improvements after tuning, especially AIME24 with the 7B model, which more than doubles its accuracy from 16.7% to 40.0%.
### Interpretation
The data strongly suggests that the tuning process applied to the "AgentFlow" models (Qwen-2.5-3B-Instruct and Qwen-2.5-7B-Instruct) is highly effective in improving their accuracy across a diverse set of tasks or datasets. The consistent upward trend from "Before tuning" to "After tuning" for every single data point underscores the value of this tuning.
Furthermore, the comparison between the 3B and 7B parameter models highlights the general advantage of larger models. The Qwen-2.5-7B-Instruct model not only tends to perform better out-of-the-box (before tuning) but also achieves higher absolute accuracies after tuning across all evaluated datasets. This indicates that increased model capacity, combined with effective tuning, leads to superior performance.
The varying degrees of improvement across datasets are also insightful. Datasets like GAIA and AIME24, which start with lower baseline accuracies, often see the most dramatic relative gains from tuning, suggesting that tuning can be particularly impactful for tasks where the base model struggles. For instance, the Qwen-2.5-7B-Instruct model's accuracy on AIME24 jumps from 16.7% to 40.0%, indicating that tuning significantly enhances its capability to handle this specific task. Conversely, for datasets like Bamboogle and 2Wiki, where baseline performance is already relatively high, tuning still provides a meaningful, albeit sometimes smaller, boost.
In essence, the charts demonstrate that tuning is a critical step for optimizing AgentFlow's performance, and that leveraging larger Qwen models (like the 7B variant) further amplifies these benefits, leading to higher overall accuracy across various benchmarks. The results validate the efficacy of the tuning methodology and the scalability of performance with model size within the AgentFlow framework.
</details>
Figure 9: Flow-GRPO fine-tuning offers consistent gains on AgentFlow as the backbone model size scales from 3B to 7B.
Training scaling in backbone size.
We study how backbone LLM scale affects AgentFlow's performance and the efficacy of Flow-GRPO. We build two versions of the system: one using Qwen2.5-3B-Instruct and another using Qwen2.5-7B-Instruct for all four modules (planner, executor, verifier, and generator) and tools. In both, only the planner is fine-tuned with Flow-GRPO. As shown in Figure 9, Flow-GRPO fine-tuning consistently improves performance across tasks for both backbones. This demonstrates that our in-the-flow optimization is effective across model capacities, enhancing AgentFlow regardless of LLM size.
| Avg. turns | $T_{max}=3$ | $T_{max}=5$ | $T_{max}=7$ | $T_{max}=10$ |
| --- | --- | --- | --- | --- |
| 2Wiki | 2.22 | 3.18 | 3.81 | 4.44 |
| GameOf24 | 1.63 | 2.12 | 2.36 | 2.67 |
| AIME24 | 1.63 | 1.63 | 1.86 | 1.90 |
| GAIA | 2.43 | 3.46 | 4.28 | 5.42 |
<details>
<summary>x13.png Details</summary>

### Visual Description
## Chart Type: Line Chart - Accuracy vs. Max Allowed Turns
### Overview
This image displays a 2D line chart illustrating "Accuracy (%)" on four benchmark datasets (2Wiki, GameOf24, AIME24, GAIA) as a function of "Max Allowed Turns." Each benchmark is represented by a distinct colored line with unique markers. The chart also includes annotations indicating the total percentage-point increase in accuracy for each benchmark across the observed range of "Max Allowed Turns."
### Components/Axes
The chart consists of a main plotting area, an X-axis, a Y-axis, and a legend.
* **X-axis:**
* **Title:** "Max Allowed Turns"
* **Scale:** Numeric, ranging from 3 to 10.
* **Markers:** 3, 5, 7, 10.
* **Y-axis:**
* **Title:** "Accuracy (%)"
* **Scale:** Numeric, ranging from 20 to 80.
* **Markers:** 20, 30, 40, 50, 60, 70, 80.
* **Legend:**
* Positioned in the top-left quadrant of the chart.
* It lists four entries, each with a specific color and marker:
* **2Wiki:** Green line with solid pentagon markers.
* **GameOf24:** Magenta line with solid square markers.
* **AIME24:** Blue line with solid circle markers.
* **GAIA:** Orange line with solid diamond markers.
### Detailed Analysis
The chart presents four data series, each tracking the accuracy percentage across four discrete values of "Max Allowed Turns" (3, 5, 7, 10).
1. **2Wiki (Green line with Pentagon markers):**
* **Trend:** The line initially remains flat, then shows a steady increase, followed by a sharp upward trend.
* **Data Points:**
* At 3 Max Allowed Turns: Approximately 61% Accuracy.
* At 5 Max Allowed Turns: Approximately 61% Accuracy.
* At 7 Max Allowed Turns: Approximately 67% Accuracy.
* At 10 Max Allowed Turns: Approximately 77% Accuracy.
* **Annotation:** A green box near the final data point indicates "+15.8%". This represents the total accuracy increase from 3 to 10 Max Allowed Turns.
2. **GameOf24 (Magenta line with Square markers):**
* **Trend:** The line shows an initial increase, a slight dip, and then a very sharp upward trend.
* **Data Points:**
* At 3 Max Allowed Turns: Approximately 33% Accuracy.
* At 5 Max Allowed Turns: Approximately 37% Accuracy.
* At 7 Max Allowed Turns: Approximately 35% Accuracy.
* At 10 Max Allowed Turns: Approximately 53% Accuracy.
* **Annotation:** A magenta box near the final data point indicates "+20.0%". This represents the total accuracy increase from 3 to 10 Max Allowed Turns.
3. **AIME24 (Blue line with Circle markers):**
* **Trend:** The line shows a sharp initial increase, followed by a more gradual upward trend, and then a plateau.
* **Data Points:**
* At 3 Max Allowed Turns: Approximately 24% Accuracy.
* At 5 Max Allowed Turns: Approximately 37% Accuracy.
* At 7 Max Allowed Turns: Approximately 39% Accuracy.
* At 10 Max Allowed Turns: Approximately 40% Accuracy.
* **Annotation:** A blue box near the final data point indicates "+16.7%". This represents the total accuracy increase from 3 to 10 Max Allowed Turns.
4. **GAIA (Orange line with Diamond markers):**
* **Trend:** The line shows a consistent, gradual upward trend across all data points.
* **Data Points:**
* At 3 Max Allowed Turns: Approximately 27% Accuracy.
* At 5 Max Allowed Turns: Approximately 30% Accuracy.
* At 7 Max Allowed Turns: Approximately 32% Accuracy.
* At 10 Max Allowed Turns: Approximately 33% Accuracy.
* **Annotation:** An orange box near the final data point indicates "+6.3%". This represents the total accuracy increase from 3 to 10 Max Allowed Turns.
### Key Observations
* **2Wiki** consistently maintains the highest accuracy across all "Max Allowed Turns" values, starting at ~61% and reaching ~77%.
* **GameOf24** shows the largest percentage increase in accuracy (+20.0%) over the range, despite starting at a lower accuracy (~33%). Its most significant gain occurs between 7 and 10 turns.
* **GAIA** has the lowest accuracy values throughout the chart, ranging from ~27% to ~33%, and also exhibits the smallest percentage increase (+6.3%).
* **AIME24** demonstrates a strong initial improvement between 3 and 5 turns, but its growth significantly slows down thereafter, almost plateauing between 7 and 10 turns.
* All benchmarks generally show an improvement in accuracy as "Max Allowed Turns" increases, with varying degrees of steepness and consistency.
### Interpretation
The data suggests that providing more "turns" (opportunities for planning, tool calls, and verification) generally improves AgentFlow's accuracy on every benchmark, though the extent and pattern of improvement vary.
**2Wiki** shows the highest accuracy throughout, performing well even with few turns and continuing to improve as the budget grows.
**GameOf24** benefits most dramatically from additional turns, particularly between 7 and 10, suggesting that this puzzle-style task requires deeper iterative search and verification before a correct expression is found.
**AIME24** gains quickly when the budget grows from 3 to 5 turns but nearly plateaus afterward, indicating diminishing returns from extra turns beyond a certain depth.
**GAIA** remains the hardest benchmark, with the lowest accuracy throughout and only marginal gains from a larger budget.
In summary, a larger turn budget generally correlates with higher accuracy, but how efficiently the extra turns are used depends on the benchmark: 2Wiki scales steadily, GameOf24 shows the largest headroom, AIME24 saturates early, and GAIA improves only modestly.
</details>
Figure 10: Average turns and accuracy with increased $T_{max}$.
Inference scaling in turn budgets.
We investigate how the maximum number of allowed turns ($T_{max}$) affects reasoning depth and final performance of AgentFlow during test-time inference with the Qwen2.5-7B-Instruct backbone. As shown in Figure 10, increasing $T_{max}$ from 3 to 10 consistently improves outcomes across all tasks, accompanied by a rise in the average number of turns consumed. On knowledge-intensive benchmarks such as 2Wiki and GAIA, a larger turn budget enables AgentFlow to retrieve information more deeply. On mathematical benchmarks like GameOf24 and AIME24, it supports decomposing sub-goals, exploring alternative strategies, and refining errors. Final performance peaks at $T_{max}=10$ for all tasks, confirming that a longer reasoning horizon benefits the system without causing degenerate loops. This validates that AgentFlow adapts its turn allocation to problem complexity and achieves better solutions through iterative refinement.
## 5 Related Work
Reinforcement learning (RL) from outcome-based rewards has become a dominant paradigm for training LLMs to use external tools. Much of this work trains a single, monolithic policy to interleave reasoning with tool calls. This strategy has proven effective in specialized, single-tool settings, such as code execution for mathematical problems (Mai et al., 2025; Xue et al., 2025; Feng et al., 2025; Li et al., 2025b) and web search for knowledge-intensive questions (Chen et al., 2025; Jin et al., 2025; Song et al., 2025; Li et al., 2025a; Sun et al., 2025). Recent efforts have extended this monolithic framework to multi-tool environments by focusing on data synthesis (Dong et al., 2025), unified training infrastructure (Jiang et al., 2025), and principled reward design (Qian et al., 2025a; Zhang et al., 2025). However, this monolithic approach scales poorly as task complexity and planning horizons grow. The central challenge is long-horizon credit assignment; attributing a final outcome to specific intermediate tool calls remains difficult, even with fine-grained, turn-level rewards (Zeng et al., 2025a; Wang et al., 2025d). This difficulty leads to training instability and brittle inference-time generalization, manifesting as strategic deficiencies like tool overuse or "cognitive offloading" (Wang et al., 2025b; Qian et al., 2025b), suboptimal personalization (Cheng et al., 2025), and poor alignment with user preferences for tool invocation (Huang et al., 2025).
Agentic systems with tool use.
Agentic systems offer an alternative to monolithic models by decomposing tasks across specialized modules. Many such systems are training-free, orchestrating pre-trained LLMs with handcrafted logic and prompting, as seen in frameworks like AutoGen (Wu et al., 2024), MetaGPT (Hong et al., 2024), and OctoTools (Lu et al., 2025). This static approach, however, limits their ability to learn and adapt collaborative strategies from experience. Recognizing this, recent work explores training these systems to improve coordination (Deng et al., 2025; Liao et al., 2025). However, most training paradigms are offline, relying on supervised fine-tuning or preference optimization on static datasets (Motwani et al., 2024; Park et al., 2025). These methods are decoupled from the live, multi-turn dynamics of the system, preventing modules from learning to adapt to evolving tool outputs or recover from early mistakes. Training directly in the flow with on-policy RL is difficult due to sparse rewards and long-horizon credit assignment, where feedback is delayed across long reasoning chains and shifting state distributions (Wang et al., 2025c). Consequently, these systems often suffer from brittle adaptation and require complex reward shaping to learn effectively (Wang et al., 2025a).
## 6 Conclusion
We presented AgentFlow, a trainable, in-the-flow agentic system that coordinates four specialized modules via an evolving memory and optimizes its planner directly inside the multi-turn loop. To enable stable on-policy learning under long-horizon, sparse-reward settings, we introduced Flow-GRPO, which converts multi-turn RL into a sequence of tractable single-turn policy updates by broadcasting a single, verifiable trajectory-level outcome to every turn and stabilizing credit assignment with group-normalized advantages. Comprehensive experiments show that AgentFlow achieves strong cross-domain performance, surpassing specialized baselines and even larger proprietary models. In-depth analyses confirm improved planning and tool-calling reliability, along with positive scaling trends in model size and allowed turn budgets. Future research will focus on extending in-the-flow optimization to other modules, incorporating more fine-grained reward signals, and scaling the framework to tackle more complex, open-ended tasks.
## Acknowledgment
We would like to thank Yihe Deng, Xuehang Guo, and Kunlun Zhu for their valuable input during the early stages of this work. We are grateful to Lambda for providing GPU resources. This work was partially supported by the Hoffman-Yee Research Grants program at Stanford HAI, the AI for Math Fund by Renaissance Philanthropy, ONR MURI N00014-24-1-2748, and the AI Research Hub Project through KAIST.
## References
- Art of Problem Solving (2025) Art of Problem Solving. Aime problems and solutions, 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.
- Chen et al. (2025) Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. ReSearch: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025.
- Cheng et al. (2025) Zihao Cheng, Hongru Wang, Zeming Liu, Yuhang Guo, Yuanfang Guo, Yunhong Wang, and Haifeng Wang. ToolSpectrum: Towards personalized tool utilization for large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 20679–20699, 2025.
- Deng et al. (2025) Yingfan Deng, Anhao Zhou, Yuan Yuan, Xian Zhang, Yifei Zou, and Dongxiao Yu. Pe-ma: Parameter-efficient co-evolution of multi-agent systems. arXiv preprint arXiv:2506.11803, 2025.
- Dong et al. (2025) Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning. arXiv preprint arXiv:2505.16410, 2025.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Feng et al. (2025) Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536, 2025.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- He et al. (2025) Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025.
- Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625, 2020.
- Hong et al. (2024) Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024.
- Hu et al. (2025a) Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025a.
- Hu et al. (2025b) Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, et al. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. arXiv preprint arXiv:2505.23885, 2025b.
- Huang et al. (2025) Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, and Shuo Shang. TTPA: Token-level tool-use preference alignment training framework with fine-grained evaluation. arXiv preprint arXiv:2505.20016, 2025.
- Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jiang et al. (2025) Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. VerlTool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055, 2025.
- Jin et al. (2025) Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
- Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- Li et al. (2025a) Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366, 2025a.
- Li et al. (2025b) Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025b.
- Liao et al. (2025) Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang. Marft: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR), 2023.
- Lile (2024) Nathan Lile. Math twenty four (24s game) dataset. https://huggingface.co/datasets/nlile/24-game, 2024.
- Lu et al. (2025) Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. OctoTools: An agentic framework with extensible tools for complex reasoning. arXiv preprint arXiv:2502.11271, 2025.
- Ma et al. (2025) Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025.
- MAA (2023) MAA. American mathematics competitions. In American Mathematics Competitions, 2023.
- Mai et al. (2025) Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Yingying Zhang, and Wenqiang Zhang. Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving. arXiv preprint arXiv:2505.07773, 2025.
- Mialon et al. (2023) Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations (ICLR), 2023.
- Moonshot AI (2025) Moonshot AI. Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities. https://moonshotai.github.io/Kimi-Researcher/, June 2025.
- Motwani et al. (2024) Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip HS Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. Malt: Improving reasoning with multi-agent llm training. arXiv preprint arXiv:2412.01928, 2024.
- Park et al. (2025) Chanwoo Park, Seungju Han, Xingzhi Guo, A. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025. URL https://api.semanticscholar.org/CorpusId:276580906.
- Press et al. (2023) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, 2023.
- Qian et al. (2025a) Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025a.
- Qian et al. (2025b) Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. SMART: Self-aware agent for tool overuse mitigation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 4604–4621, 2025b.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897. PMLR, 2015.
- Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9248–9274, 2023.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Song et al. (2025) Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025.
- Sun et al. (2025) Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025.
- Trivedi et al. (2022) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL), 10:539–554, 2022.
- Wang et al. (2025a) Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, and Wenjie Li. SPA-RL: Reinforcing llm agents via stepwise progress attribution. arXiv preprint arXiv:2505.20732, 2025a.
- Wang et al. (2025b) Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870, 2025b. URL https://arxiv.org/pdf/2504.14870.
- Wang et al. (2025c) Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. RAGEN: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025c.
- Wang et al. (2025d) Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025d.
- Wu et al. (2024) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling (COLM), 2024.
- Xue et al. (2025) Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025.
- Yan et al. (2025) Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
- Yang et al. (2024a) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a.
- Yang et al. (2024b) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024b.
- Yang et al. (2024c) Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, and Xin Wang. Llm-medqa: Enhancing medical question answering through case studies in large language models. arXiv preprint arXiv:2501.05464, 2024c.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380, 2018.
- Yu et al. (2025) Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- Zeng et al. (2025a) Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. arXiv preprint arXiv:2505.11821, 2025a.
- Zeng et al. (2025b) Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025b.
- Zhang et al. (2025) Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-research-tool-n1: Tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025.
## Appendix A Training Algorithm of AgentFlow
We provide pseudocode for the overall training algorithm of AgentFlow (§ 3) in Algorithm 1.
Algorithm 1 In-the-Flow Optimization for AgentFlow
0: Dataset $D$, Action Planner policy $\pi_\theta$, Tool Executor $E$, Executive Verifier $V$, Solution Generator $G$, Toolset $K$, and Shared Evolving Memory $M$
0: Optimized Action Planner parameters $\theta^*$
1: for each training iteration do
2: for each query–label pair $(q, y^*) \sim D$ do
3: 1. In-the-Flow Rollout Generation
4: Initialize: $t \leftarrow 1$, $M^t \leftarrow q$
5: repeat
6: $a^t \sim \pi_\theta(a^t \mid q, K, M^t)$ {Plan Action}
7: $e^t \sim E(e^t \mid a^t, K)$ {Execute Action}
8: $v^t \sim V(v^t \mid q, e^t, M^t)$ {Verify Result}
9: $M^{t+1} = f_{mem}(M^t, a^t, e^t, v^t)$ {Update Memory}
10: $t \leftarrow t + 1$
11: until termination condition met
12: $o \sim G(o \mid q, M^T)$ {Generate Final Solution}
13: 2. Reward Computation
14: $R(a^t) = \bar{R}(o, q, y^*), \ \forall t = 1, \dots, T$
15: 3. Policy Update
16: Update the Action Planner policy $\pi_\theta$ by maximizing the Flow-GRPO objective (Eq. 5)
17: end for
18: end for
19: return optimized parameters $\theta^*$
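To make the control flow concrete, the sketch below follows steps 3–14 of Algorithm 1 for a single query: the planner, executor, and verifier interact through the evolving memory, and the single verifiable outcome reward is broadcast to every turn. The callables `planner`, `executor`, `verifier`, `generator`, and `judge` are hypothetical stand-ins for the corresponding modules and reward function; this is a minimal illustration of the rollout loop, not the released implementation.

```python
def rollout(query, label, planner, executor, verifier, generator, judge,
            toolset, max_turns=10):
    """One on-policy rollout; returns a (state, action, reward) tuple per turn."""
    memory = [query]                                    # M^1 initialized with the query
    turns = []
    for _ in range(max_turns):
        state = (query, toolset, tuple(memory))         # s^t = (q, K, M^t)
        action = planner(query, toolset, memory)        # a^t ~ pi_theta(. | s^t)
        result = executor(action, toolset)              # e^t: tool execution outcome
        verdict = verifier(query, result, memory)       # v^t: verification signal (dict)
        memory.append((action, result, verdict))        # M^{t+1} = f_mem(M^t, a^t, e^t, v^t)
        turns.append((state, action))
        if verdict.get("stop", False):                  # termination condition met
            break
    answer = generator(query, memory)                   # o ~ G(. | q, M^T)
    reward = float(judge(answer, query, label))         # verifiable outcome reward in {0, 1}
    # Step 14: broadcast the single trajectory-level reward to every turn.
    return [(s, a, reward) for (s, a) in turns]
```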
## Appendix B Theoretical Analysis of Flow-GRPO
### B.1 Preliminaries and Notation
We adopt the notation from the paper to formalize our analysis.
**Definition B.1 (Core Components)**
*Here we list the core variable definitions.*

| Symbol | Description |
| --- | --- |
| $\pi_\theta$ | The trainable planner policy, parameterized by $\theta$. |
| $\pi_{\theta_{old}}$ | The behavior policy used to sample trajectories. |
| $s^t$ | The state at turn $t$, defined as $s^t = (q, K, M^t)$. |
| $a^t$ | The action (a sequence of tokens) generated at state $s^t$, where $a^t \sim \pi_\theta(\cdot \mid s^t)$. |
| $\tau$ | A trajectory of states and actions over $T$ time steps, $\tau = \{(s^t, a^t)\}_{t=1}^{T}$. |
| $R(\tau)$ | The outcome-based reward for trajectory $\tau$, where $R(\tau) \in \{0, 1\}$. |
| $A_\tau$ | The group-normalized advantage for trajectory $\tau$. A crucial property (Eq. 7) is that the advantage is constant for all timesteps within a trajectory: $A^t = A_\tau, \ \forall (s^t, a^t) \in \tau$. |
| $\rho_{i,j}^{t}$ | The token-level importance sampling ratio, defined as $\rho_{i,j}^{t} = \frac{\pi_\theta\big(a_{i,j}^{t} \mid s_i^{t}, a_{i,1:j-1}^{t}\big)}{\pi_{\theta_{old}}\big(a_{i,j}^{t} \mid s_i^{t}, a_{i,1:j-1}^{t}\big)}$. |
| $L_{clip}(\rho, A)$ | The PPO clipped objective term, defined as $L_{clip}(\rho, A) = \min\big(\rho A, \ \mathrm{clip}(\rho, 1-\varepsilon, 1+\varepsilon) A\big)$. |
**Definition B.2 (Objective Functions)**
*The global policy objective is the expected trajectory-level reward:*
$$
J(\theta) := \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]. \tag{8}
$$
*The single-turn optimization objective for a given state $s^t$ is defined as:*
$$
J_{local}(\theta; s^t) := \mathbb{E}_{a^t \sim \pi_{\theta_{old}}(\cdot \mid s^t)}\left[\frac{1}{|a^t|}\sum_{j=1}^{|a^t|} L_{clip}\big(\rho_{i,j}^{t}, A_i^{t}\big)\right]. \tag{9}
$$
*The full Flow-GRPO objective function in the multi-turn setting is given by:*
$$
J_{Flow\text{-}GRPO}(\theta) := \mathbb{E}_{\substack{(q, y^*) \sim D \\ \{\tau_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_i^{t}|}\sum_{j=1}^{|a_i^{t}|} L_{clip}\big(\rho_{i,j}^{t}, A_i^{t}\big)\right] - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big). \tag{10}
$$
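For exposition, the sketch below shows how the objective in Eq. 10 could be assembled for one group of $G$ rollouts: rewards are group-normalized into a shared advantage per trajectory, the token-level clipped surrogate is averaged within and across turns, and a KL penalty toward the reference policy is subtracted. The tensor layout, the helper name `flow_grpo_loss`, and the simplified KL estimator are assumptions made for illustration, not the authors' training code.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards, turn_ids, eps=0.2, beta=0.01):
    """
    logp_new / logp_old / logp_ref: lists of G 1-D tensors with per-token log-probs
        of the sampled actions under pi_theta, pi_theta_old, and pi_ref.
    rewards: tensor of shape (G,) with trajectory-level outcome rewards R(tau_i).
    turn_ids: list of G integer tensors mapping each token to its turn index t.
    """
    # Group-normalized advantage (Eq. 7): identical for every token of trajectory i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    losses = []
    for i in range(len(logp_new)):
        ratio = (logp_new[i] - logp_old[i]).exp()                    # rho_{i,j}^t
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        surrogate = torch.minimum(ratio * adv[i], clipped * adv[i])  # L_clip term
        # Average tokens within each turn (1/|a_i^t|), then average over turns (1/T_i).
        per_turn = torch.stack([surrogate[turn_ids[i] == t].mean()
                                for t in turn_ids[i].unique()])
        kl = (logp_new[i] - logp_ref[i]).mean()                      # simple KL proxy
        losses.append(-(per_turn.mean() - beta * kl))
    return torch.stack(losses).mean()                                # 1/G over the group
```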
### B.2 Equivalence Proof for Optimization Objectives
**Theorem B.1**
*In Flow-GRPO, maximizing the global multi-turn objective is mathematically equivalent to maximizing the expected token-level local objective at each time step under the on-policy induced state distribution, given standard sampling assumptions (trajectories sampled i.i.d. from the behavior policy with a fixed, finite number of turns $T$).*
*Proof.*
Let us denote the clipping part of the Flow-GRPO objective as $J_{clip}(\theta)$. First, by the linearity of expectation, we can simplify the expectation over a group of $G$ trajectories. Since the trajectories $\{\tau_i\}$ are sampled independently and identically (i.i.d.) from the behavior policy $\pi_{\theta_{old}}$, the expectation of their average is equal to the expectation over a single trajectory.
$$
\begin{aligned}
J_{clip}(\theta) &= \mathbb{E}_{(q,y^*)\sim D}\left[\mathbb{E}_{\{\tau_i\}_{i=1}^{G}\sim \pi_{\theta_{old}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\left(\frac{1}{|a_i^{t}|}\sum_{j=1}^{|a_i^{t}|} L_{clip}\big(\rho_{i,j}^{t}, A_i^{t}\big)\right)\right]\right] \\
&= \mathbb{E}_{(q,y^*)\sim D}\left[\mathbb{E}_{\tau \sim \pi_{\theta_{old}}(\cdot \mid q)}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\frac{1}{|a^{t}|}\sum_{j=1}^{|a^{t}|} L_{clip}\big(\rho_{j}^{t}, A_\tau\big)\right)\right]\right].
\end{aligned} \tag{11}
$$
Here, $\tau = \{(s^t, a^t)\}_{t=1}^{T}$ represents a single, arbitrarily sampled trajectory with advantage $A_\tau$. Next, we can re-interpret the expectation over trajectories as an expectation over the state-visitation distribution induced by the policy $\pi_{\theta_{old}}$. Let $d^{\pi_{\theta_{old}}}$ be the on-policy distribution of states visited, where each state $s^t$ in a trajectory of length $T$ is weighted by $1/T$. The expectation can be rewritten as:
$$
J_{clip}(\theta) = \mathbb{E}_{(q,y^*)\sim D}\left[\mathbb{E}_{s^t \sim d^{\pi_{\theta_{old}}}}\left[\mathbb{E}_{a^t \sim \pi_{\theta_{old}}(\cdot \mid s^t)}\left[\frac{1}{|a^{t}|}\sum_{j=1}^{|a^{t}|} L_{clip}\big(\rho_{j}^{t}, A^{t}\big)\right]\right]\right]. \tag{13}
$$
Note that $A^t$ is the advantage corresponding to the trajectory from which $s^t$ was sampled. We now recognize that the inner expectation is precisely the definition of the local, per-state objective, $J_{local}(\theta; s^t)$.
$$
J_{clip}(\theta) = \mathbb{E}_{(q,y^*)\sim D,\ s^t \sim d^{\pi_{\theta_{old}}}}\left[J_{local}(\theta; s^t)\right]. \tag{14}
$$
Adding the KL-divergence term back, we arrive at the final equivalence:
$$
J_{Flow\text{-}GRPO}(\theta) = \mathbb{E}_{(q,y^*)\sim D,\ s^t \sim d^{\pi_{\theta_{old}}}}\left[J_{local}(\theta; s^t)\right] - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big). \tag{15}
$$
This proves that maximizing the global multi-turn Flow-GRPO objective is equivalent to maximizing the expected token-level local objective at each time step under the on-policy induced state distribution. ∎
### B.3 Convergence Analysis
Having established the structural validity of the objective, we now analyze its convergence properties. The analysis builds on the monotonic improvement guarantee provided by trust-region methods (Schulman et al., 2015).
**Lemma B.2 (Policy Performance Difference)**
*For two policies $\pi_\theta$ and $\pi_{\theta_{\rm old}}$, the difference in expected return can be expressed as*
$$
J(\theta) - J(\theta_{\rm old}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=1}^{T} A_{\theta_{\rm old}}(s^t, a^t)\right], \tag{16}
$$
*where $A_{\theta_{\rm old}}$ is the advantage function under the old policy.*
This lemma enables the construction of a lower bound on policy improvement.
**Theorem B.3 (Monotonic Improvement Guarantee)**
*Define the surrogate objective*
$$
L_{\theta_{\rm old}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\rm old}}}\left[\sum_{t=1}^{T} \frac{\pi_\theta(a^t \mid s^t)}{\pi_{\theta_{\rm old}}(a^t \mid s^t)}\, A_{\theta_{\rm old}}(s^t, a^t)\right]. \tag{17}
$$
*Then the performance improvement satisfies the lower bound*
$$
J(\theta) - J(\theta_{\rm old}) \geq L_{\theta_{\rm old}}(\theta) - C \cdot \bar{D}_{KL}\big(\pi_{\theta_{\rm old}}, \pi_\theta\big), \tag{18}
$$
*where $C > 0$ is a constant depending on the horizon and reward scale, and $\bar{D}_{KL}$ denotes the average KL-divergence between the two policies.*
By optimizing the right-hand side of the above inequality, we can expect to improve the performance of $\pi_\theta$ as long as the policy update remains within a trust region. While the clipping mechanism approximately enforces this constraint, it does not offer strict guarantees. Empirically, for policies $\pi_{\theta_{\rm old}}$ and $\pi_\theta$ obtained from successive updates, we typically observe:
$$
J(\theta) \gtrsim J(\theta_{\rm old}), \tag{19}
$$
where $\gtrsim$ denotes near-monotonic improvement in practice.
Conclusion. This analysis establishes that Flow-GRPO optimizes a theoretically grounded surrogate objective that approximates trust region methods. The combination of clipping and KL regularization promotes stable policy improvement and reliable convergence to locally optimal policies, as validated by our experiments.
## Appendix C Experimental Details
### C.1 Evaluation Details
Here, we outline the specifics of our evaluation protocol. For evaluation, we increase the maximum number of turns per rollout to $T=10$ to allow for more extensive and deeper reasoning. The planner's sampling temperature is set to 0.7 to encourage diverse solution paths. Unless otherwise specified, all tool LLM engines are initialized with Qwen2.5-7B-Instruct.
For fair and consistent evaluation, we adopt the methodology of prior work (Lu et al., 2025) while standardizing tools: we replace search tools in search-enhanced models with our Google Search tool and code tools in code-enhanced models with our Python Coder tool. We use GPT-4o as an LLM-based judge to determine the correctness of final answers. This approach provides a robust measure of semantic and numerical equivalence, which is critical for complex reasoning tasks. The specific judging prompt is detailed in § E.3, and additional information on evaluation datasets can be found in § C.3. To mitigate randomness, we report the average accuracy with standard deviation across three trials for all experiments.
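A minimal sketch of this protocol is given below, assuming hypothetical callables `run_agentflow` (the end-to-end system) and `llm_judge` (the GPT-4o-based correctness judge); it reports the mean and standard deviation of accuracy over three independent trials.

```python
import statistics

def evaluate(dataset, run_agentflow, llm_judge, num_trials=3, max_turns=10):
    """Report mean accuracy (%) and standard deviation over independent trials."""
    trial_accs = []
    for _ in range(num_trials):
        correct = 0
        for example in dataset:                      # e.g. {"question": ..., "answer": ...}
            prediction = run_agentflow(example["question"], max_turns=max_turns)
            correct += int(llm_judge(example["question"], prediction, example["answer"]))
        trial_accs.append(100.0 * correct / len(dataset))
    return statistics.mean(trial_accs), statistics.stdev(trial_accs)
```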
### C.2 Compared Baselines
Proprietary LLMs:
- Qwen2.5 Series (Yang et al., 2024a), created by Alibaba, comes in multiple configurations. These models undergo training on multilingual corpora covering 29 different languages, demonstrating superior performance in cross-lingual applications. Furthermore, Qwen2.5 showcases robust proficiency in programming and mathematical domains.
- Llama-3 Series (Dubey et al., 2024), created by Meta AI, encompasses various iterations. Each model configuration within the Llama family provides dual versions: foundational and instruction-following variants. Training incorporates diverse dataset combinations spanning multiple domains and linguistic varieties. The Llama model family demonstrates excellent results in logical reasoning, software development, and cross-lingual comprehension evaluations. Through progressive enhancements in fine-tuning methodologies and expanded sequence lengths, these models become more applicable to practical deployment scenarios.
- GPT-4o Series (Hurst et al., 2024), produced by OpenAI, includes several model variants such as GPT-4o and GPT-4o-mini, with training leveraging extensive multimodal datasets encompassing text, vision, and audio modalities. The series achieves outstanding performance in complex reasoning tasks, creative generation, and multimodal understanding benchmarks with continuous refinements in alignment techniques and enhanced processing capabilities.
Reasoning LLMs:
- SFT (Zeng et al., 2025b) serves as our basic baseline following Search-R1 (Jin et al., 2025). We fine-tune models using supervised fine-tuning on GPT-4o-generated reasoning chains.
- SimpleRL-Zoo (Zeng et al., 2025b) investigates zero reinforcement learning training across 10 diverse base models spanning different families and sizes using GRPO algorithm with simple rule-based rewards, achieving substantial improvements in reasoning accuracy.
- Open-Reasoner-Zero (Hu et al., 2025a) presents the first open-source implementation of large-scale reasoning-oriented RL training using PPO with GAE and straightforward rule-based rewards, without KL regularization. The framework demonstrates that minimalist design can successfully scale both response length and benchmark performance.
- General-Reasoner (Ma et al., 2025) extends LLM reasoning capabilities beyond mathematics to diverse domains using RLVR through a 230K verifiable reasoning questions dataset spanning physics, chemistry, and finance.
- LUFFY (Yan et al., 2025) addresses limitations in on-policy RLVR by introducing an off-policy framework that augments training with external reasoning demonstrations using Mixed Policy GRPO and regularized importance sampling.
Search-Integrated Reasoning LLMs:
- Iter-RetGen (Shao et al., 2023) addresses limitations in retrieval-augmented language models by introducing iterative retrieval-generation synergy, where a modelâs previous response serves as context for retrieving more relevant knowledge in subsequent iterations.
- Search-R1 (Jin et al., 2025) represents a reinforcement learning approach that develops a model from the ground up to invoke search functionality throughout the reasoning process.
- ZeroSearch (Sun et al., 2025) addresses high API costs in RL-based search training by using an LLM to simulate search engines, employing lightweight supervised fine-tuning to transform an LLM into a retrieval module that generates both useful and noisy documents. The framework combines this with a curriculum-based rollout strategy that progressively degrades document quality, achieving better performance than real search engine-based methods while incurring zero API costs.
- ReSearch (Chen et al., 2025) proposes a reinforcement learning framework that trains LLMs to integrate search operations as components of the reasoning chain without supervised data on reasoning steps, treating search decisions as guided by text-based thinking.
- StepSearch (Wang et al., 2025d) addresses the sparse reward problem in multi-hop reasoning by training search LLMs using step-wise proximal policy optimization with intermediate rewards and token-level process supervision based on information gain and redundancy penalties.
- VerlTool (Jiang et al., 2025) addresses fragmentation and synchronization bottlenecks in Agentic Reinforcement Learning with Tool use by introducing a unified modular framework that extends beyond single-turn RLVR paradigms, providing upstream VeRL alignment and unified tool management with asynchronous rollout execution achieving a near 2× speedup.
Code-Integrated Reasoning LLMs:
- TIR (Yang et al., 2024b) is a basic baseline that demonstrates the model's ability to generate code for tool utilization. In our implementation, we directly prompt the model to write code that calls the programming interpreter and processes the returned results to generate the final answer.
- ToRL (Li et al., 2025b) is a code-enhanced architecture developed via reinforcement learning that empowers models to independently activate code execution environments for mathematical reasoning tasks.
Training-free Agentic System
- AutoGen (Wu et al., 2024) introduces an agentic conversation framework that enables developers to build LLM applications through conversable agents that can operate using combinations of LLMs, human inputs, and tools.
### C.3 Evaluation Datasets
We provide a detailed introduction to the search-intensive and agentic benchmarks in our experiments as follows:
- Bamboogle (Press et al., 2023) presents a demanding multi-step reasoning dataset containing manually constructed questions requiring up to four inferential steps. The dataset evaluates models' capacity for intricate compositional reasoning across interconnected facts.
- 2Wiki (2WikiMultihopQA) (Ho et al., 2020) constitutes a comprehensive multi-step QA corpus combining structured Wikidata knowledge with unstructured Wikipedia text. The dataset encompasses varied question formats and annotated reasoning chains to facilitate interpretable sequential inference. We randomly sample 100 examples as a test set for efficiency.
- HotpotQA (Yang et al., 2018) represents a widely-adopted question answering corpus featuring multi-step queries constructed from Wikipedia entries. We randomly sample 100 examples as a test set for efficiency.
- Musique (Trivedi et al., 2022) comprises a multi-step reasoning corpus requiring sequential inference where each reasoning stage depends on information derived from preceding steps. We conduct evaluations using the development partition of this particularly challenging dataset. We randomly sample 100 examples as a test set for efficiency.
- GAIA (Mialon et al., 2023) constitutes a benchmark engineered to assess general AI systems and agents, demanding capabilities including sequential reasoning, web navigation, and comprehensive tool utilization skills. We utilize the text-exclusive portion of this dataset, designed to challenge base language models in our experimental setup.
Furthermore, we also conduct a series of experiments on math and scientific reasoning benchmarks:
- AIME24 (Art of Problem Solving, 2025) is a collection of 30 demanding mathematical problems sourced from the 2024 American Invitational Mathematics Examination (AIME), encompassing algebra, geometry, number theory, and combinatorics. Each JSONL-formatted record contains the problem identifier, question text, comprehensive solution methodology, and the final numerical result. Created to assess large language models' sophisticated mathematical reasoning abilities, the dataset presents substantial difficulty, systematic multi-phase solutions, and distinctive answers, establishing it as a robust benchmark for evaluating advanced analytical capabilities.
- AMC23 (MAA, 2023) contains mathematical problems derived from the 2023 American Mathematics Competition, emphasizing areas such as functional equations and complex analysis.
- GameOf24 (Lile, 2024) derives from the traditional numerical puzzle known as 24 (alternatively called the 24 numbers game). The challenge requires utilizing four given numbers with fundamental arithmetic operations (addition, subtraction, multiplication, division) to create an expression yielding 24. For instance, with numbers 4, 9, 10, and 13, a correct solution would be "(10 - 4) × (13 - 9) = 24". Successfully solving requires computational proficiency along with iterative attempts to validate potential solutions. Each challenge is formatted as an open-ended inquiry.
- GPQA or Graduate Level Google-Proof Q&A Benchmark (Rein et al., 2024) comprises a collection of demanding text-based multiple choice problems authored by subject specialists in biology, physics, and chemistry, intentionally crafted to be "exceptionally challenging". We randomly sample 100 examples as a test set for efficiency.
- MedQA (Jin et al., 2021) features text-based multiple choice problems assembled from professional medical licensing examinations. Problems encompass comprehensive medical knowledge and clinical reasoning skills.
## Appendix D More Discussion about Experiment Results
### D.1 Main Result Analysis
Our main results are presented in Tables 1 and 2. Overall, AgentFlow consistently outperforms all baseline models across diverse domains, including search-intensive tasks, agentic tasks, and mathematical and scientific reasoning tasks. These comprehensive results yield several key insights:
Monolithic LLMs are insufficient for complex reasoning. While scaling up model size (from 7B model to GPT-4o) improves average performance, their monolithic nature presents limitations when facing complex tasks that require multi-turn reasoning and sub-goal decomposition. In contrast, our proposed AgentFlow consistently outperforms these larger models. Specifically, it achieves an average improvement of 8.2% over GPT-4o on search-intensive tasks (57.3% vs. 49.1% in Table 1), and a remarkable 15.8% gain over GPT-4o on agentic tasks (33.1% vs. 17.3% in Table 1). For mathematical reasoning benchmarks, AgentFlow obtains a substantial improvement of 16.4% over GPT-4o (51.5% vs. 35.1% in Table 2). Furthermore, it surpasses the strong Llama-3.3-70B by 12.5% on scientific reasoning tasks (63.5% vs. 51.0% in Table 2). These results demonstrate that the carefully designed agentic system of AgentFlow, despite being built on a 7B-parameter backbone, can deliver superior and more efficient performance compared to substantially larger monolithic LLMs.
Specialized reasoning models exhibit strong in-domain focus but limited generalizability. While domain-specific fine-tuning and tailored tool integration provide clear benefits over base LLMs, they fail to deliver robust cross-domain performance due to fundamental scaling limitations. Our evaluation across three reasoning domains substantiates these limitations. On search-intensive tasks, specialized models such as Search-R1 (33.3%) and VerlTool (39.0%) perform well within their narrow scope yet fall substantially short of AgentFlow (57.3%) as shown in Table 1. Similarly, in mathematical reasoning, methods like SimpleRL-reason (36.6%) and ToRL (37.0%) trail significantly behind AgentFlow (51.5%) in Table 2. Even in scientific reasoning, where models such as Luffy (55.5%) offer competitive results, they are consistently surpassed by AgentFlow (63.5%) in Table 2. These findings demonstrate that while specialized reasoning models excel within narrow domains, their reliance on a single monolithic policy introduces poor generalization, making them brittle when confronted with diverse, cross-domain challenges.
AgentFlow demonstrates superior, versatile reasoning through its adaptive agentic system. AgentFlow establishes a new state-of-the-art agentic system by achieving an average accuracy of 57.3% on search-intensive tasks, 33.1% on agentic tasks, 51.5% on mathematical reasoning, and 63.5% on scientific reasoning. Our method's advantage stems from combining an agentic system with targeted planning policy refinement via on-policy reinforcement learning in an online fashion. When compared to AutoGen, a general agent framework with the same backbone model, AgentFlow demonstrates a substantial improvement of 14.9% on search tasks and 19.9% on math tasks. This underscores that the core advantage comes from our dedicated trainable agentic system that integrates our novel Flow-GRPO for in-system on-policy optimization, enabling effective agent planning and tool utilization to solve complex, long-horizon problems across diverse domains.
### D.2 In-depth Analysis of Optimized Planning
AgentFlow adapts to inference-time tool scaling.
We scale two tools, the Base Generator and Python Coder, to GPT-4o-powered versions. Empirical results on search and math datasets (Figure 11) show that AgentFlow with these GPT-4o-powered tools substantially outperforms its counterpart with Qwen2.5-7B-Instruct-powered tools, achieving improvements of 1.0% on GAIA, 6.0% on AMC23, and a notable 13.0% on HotpotQA. This finding supports a consistent trend: after in-the-flow RL training, the planner can adaptively leverage improvements in the underlying tools to enhance the agentic system's overall performance.
Flow-GRPO spontaneously shifts tool usage preferences.
We further compare tool usage distributions before and after in-the-flow RL training on Musique. Figure 12 shows that, because Musique requires drawing on diverse sources of information, Flow-GRPO steers the planner to invoke Web Search more often so it can follow up on the URLs returned by other search tools. This shift yields a steady performance improvement of 6.1%.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: Accuracy Comparison of Qwen2.5-7B-Instruct vs. GPT-4o Across Benchmarks
### Overview
This image displays a bar chart comparing the "Accuracy (%)" of two language models, "Qwen2.5-7B-Instruct" and "GPT-4o", across three different benchmarks: "GAIA", "AMC23", and "HotpotQA". The chart uses grouped bars to show the performance of each model on each benchmark, with numerical labels indicating precise accuracy percentages and the performance difference of GPT-4o over Qwen2.5-7B-Instruct.
### Components/Axes
* **Chart Type**: Grouped Bar Chart.
* **Legend**: Positioned at the top-center of the chart.
* Light blue bar color represents: "Qwen2.5-7B-Instruct"
* Dark blue bar color represents: "GPT-4o"
* **Y-axis (Left)**:
* Title: "Accuracy (%)"
* Scale: Ranges from 20 to 70, with major grid lines at 10-unit intervals (20, 30, 40, 50, 60, 70).
* **X-axis (Bottom)**:
* Categories (from left to right): "GAIA", "AMC23", "HotpotQA".
* **Data Labels**: Numerical values are displayed directly on top of each bar, indicating the exact accuracy percentage.
* **Difference Labels**: Numerical values prefixed with a "+" sign are displayed above the dark blue (GPT-4o) bars, indicating the absolute difference in accuracy between GPT-4o and Qwen2.5-7B-Instruct for that specific benchmark.
### Detailed Analysis
The chart presents three groups of bars, each representing a benchmark, with two bars per group comparing the two models.
1. **GAIA Benchmark**:
* The light blue bar (Qwen2.5-7B-Instruct) shows an accuracy of **33.1%**.
* The dark blue bar (GPT-4o) shows an accuracy of **34.1%**.
* The difference label above the GPT-4o bar is **+1.1**, indicating GPT-4o performed 1.1 percentage points better than Qwen2.5-7B-Instruct.
* Trend: GPT-4o shows a slight but positive improvement over Qwen2.5-7B-Instruct on the GAIA benchmark.
2. **AMC23 Benchmark**:
* The light blue bar (Qwen2.5-7B-Instruct) shows an accuracy of **61.5%**.
* The dark blue bar (GPT-4o) shows an accuracy of **67.5%**.
* The difference label above the GPT-4o bar is **+6.0**, indicating GPT-4o performed 6.0 percentage points better than Qwen2.5-7B-Instruct.
* Trend: GPT-4o demonstrates a noticeable improvement over Qwen2.5-7B-Instruct on the AMC23 benchmark.
3. **HotpotQA Benchmark**:
* The light blue bar (Qwen2.5-7B-Instruct) shows an accuracy of **57.0%**.
* The dark blue bar (GPT-4o) shows an accuracy of **70.0%**.
* The difference label above the GPT-4o bar is **+13.0**, indicating GPT-4o performed 13.0 percentage points better than Qwen2.5-7B-Instruct.
* Trend: GPT-4o exhibits a substantial improvement over Qwen2.5-7B-Instruct on the HotpotQA benchmark, marking the largest performance gap among the three tasks.
### Key Observations
* GPT-4o consistently outperforms Qwen2.5-7B-Instruct across all three benchmarks presented.
* The performance gap between GPT-4o and Qwen2.5-7B-Instruct varies significantly across benchmarks, ranging from a minimal +1.1% on GAIA to a substantial +13.0% on HotpotQA.
* Both models achieve their highest accuracy on the HotpotQA benchmark for GPT-4o (70.0%) and AMC23 for Qwen2.5-7B-Instruct (61.5%).
* The lowest accuracy for both models is observed on the GAIA benchmark.
### Interpretation
This bar chart strongly suggests that GPT-4o generally possesses superior accuracy compared to Qwen2.5-7B-Instruct across the evaluated benchmarks. The consistent positive differences indicate a robust performance advantage for GPT-4o.
The varying magnitudes of the performance gap are particularly insightful. On tasks like GAIA, the models are relatively close in performance, implying that Qwen2.5-7B-Instruct might be competitive in certain domains or for specific types of questions. However, on benchmarks like AMC23 and especially HotpotQA, GPT-4o demonstrates a significantly higher capability. The large difference on HotpotQA, which is often a complex multi-hop question answering dataset, could indicate GPT-4o's advanced reasoning or information synthesis abilities.
Overall, the data highlights GPT-4o as a more accurate model for these specific tasks, with its strengths becoming more pronounced on more challenging or complex benchmarks. This information would be crucial for developers or researchers deciding which model to utilize for applications requiring high accuracy in similar domains.
</details>
Figure 11: Tool scaling study. AgentFlow's performance improves when its tools are upgraded from Qwen2.5-7B-Instruct to GPT-4o.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Chart: Impact of Finetuning on Tool Call Ratio and Overall Accuracy
### Overview
The image presents a grouped bar chart illustrating the "Tool Call Ratio (%)" for four different categories of generators/search mechanisms at two distinct stages: "Step 0" (before finetuning) and "Step 32" (after finetuning). Additionally, it displays the overall accuracy percentages associated with these two stages, highlighting the change due to finetuning.
### Components/Axes
* **Chart Type**: Grouped Bar Chart.
* **Overall Context**: The chart's primary purpose is to show the effect of a "Fine-tuning" process on tool utilization and system accuracy.
* **Y-axis**: Labeled "Tool Call Ratio (%)". The scale ranges from 0 to 60, with major grid lines and numerical labels at 0, 10, 20, 30, 40, 50, and 60.
* **X-axis**: Divided into two main categories representing different stages: "Step 0" and "Step 32". A prominent gray arrow points from "Step 0" to "Step 32", with the text "After Fine-tuning" positioned directly above the arrow, indicating the transition or process applied.
* **Legend (Top-left)**: The legend defines the color coding for the different categories of tools/generators:
* Light Red/Pink: "Base Generator"
* Green: "Google Search"
* Blue: "Web Search"
* Purple: "Wikipedia Search"
* **Accuracy Indicators (Top-center)**: Two white boxes display overall accuracy figures:
* Left box (above "Step 0"): "Acc:19.2%"
* Right box (above "Step 32"): "Acc: 25.2% (+6.21%)"
### Detailed Analysis
**Overall Accuracy Change**:
* At "Step 0", the overall accuracy is **19.2%**.
* At "Step 32" (After Finet-tuning), the overall accuracy increases to **25.2%**, representing a positive change of **+6.21%** from the initial accuracy.
**Tool Call Ratio (%) - Step 0 (Before Finetuning)**:
* **Base Generator** (Light Red/Pink bar, left-most in the group): The bar indicates a "Tool Call Ratio" of **3.1%**.
* **Google Search** (Green bar): The bar indicates a "Tool Call Ratio" of **38.7%**.
* **Web Search** (Blue bar): The bar indicates a "Tool Call Ratio" of **18.4%**.
* **Wikipedia Search** (Purple bar, right-most in the group): The bar indicates a "Tool Call Ratio" of **38.5%**.
**Tool Call Ratio (%) - Step 32 (After Finetuning)**:
* **Base Generator** (Light Red/Pink bar, left-most in the group): The bar indicates a "Tool Call Ratio" of **0.9%**. Positioned above this bar is an additional label: **-2.2**.
* *Trend*: The tool call ratio for the Base Generator significantly decreased from 3.1% to 0.9%. The label "-2.2" accurately reflects this change (3.1 - 0.9 = 2.2).
* **Google Search** (Green bar): The bar indicates a "Tool Call Ratio" of **13.6%**. Positioned above this bar is an additional label: **-1.5**.
* *Trend*: The tool call ratio for Google Search significantly decreased from 38.7% to 13.6%. The label "-1.5" does not correspond to this change (38.7 - 13.6 = 25.1).
* **Web Search** (Blue bar): The bar indicates a "Tool Call Ratio" of **13.6%**. Positioned above this bar is an additional label: **+5.2**.
* *Trend*: The tool call ratio for Web Search decreased from 18.4% to 13.6%. The label "+5.2" does not correspond to this change (18.4 - 13.6 = 4.8).
* **Wikipedia Search** (Purple bar, right-most in the group): The bar indicates a "Tool Call Ratio" of **13.6%**. Positioned above this bar is an additional label: **-4.7**.
* *Trend*: The tool call ratio for Wikipedia Search significantly decreased from 38.5% to 13.6%. The label "-4.7" does not correspond to this change (38.5 - 13.6 = 24.9).
### Key Observations
* Finetuning demonstrably improves the overall system accuracy by 6.21 percentage points, from 19.2% to 25.2%.
* The "Base Generator" consistently exhibits a very low tool call ratio, which further decreases after finetuning.
* Before finetuning (Step 0), "Google Search" and "Wikipedia Search" have the highest tool call ratios, both around 38.5-38.7%. "Web Search" has a moderate ratio of 18.4%.
* After finetuning (Step 32), the tool call ratios for "Google Search", "Web Search", and "Wikipedia Search" all converge to an identical value of 13.6%. This represents a substantial reduction for Google and Wikipedia Search, and a smaller reduction for Web Search.
* The numerical labels positioned above the "Google Search", "Web Search", and "Wikipedia Search" bars at Step 32 (-1.5, +5.2, -4.7) do not represent the direct change in "Tool Call Ratio (%)" from Step 0 to Step 32 for their respective categories. Only the "-2.2" label for "Base Generator" accurately reflects this change.
### Interpretation
The data strongly suggests that the finetuning process significantly optimizes how the system utilizes external tools, leading to a notable improvement in overall accuracy.
1. **Enhanced Accuracy**: The increase in overall accuracy from 19.2% to 25.2% (+6.21%) is the most direct evidence of the finetuning's success. This indicates that the model, after finetuning, is better at performing its tasks, likely by making more appropriate decisions regarding tool usage.
2. **Strategic Tool Utilization**:
* The "Base Generator" is shown to have a minimal role in tool calling, which is further reduced post-finetuning. This implies that the finetuned model relies almost exclusively on specialized search tools when a tool call is deemed necessary, rather than its inherent generation capabilities.
* The most striking pattern is the convergence of "Google Search", "Web Search", and "Wikipedia Search" tool call ratios to 13.6% after finetuning. This suggests that the finetuning process has instilled a more standardized and perhaps more efficient strategy for invoking these external search tools. The significant reduction in calls for Google and Wikipedia Search (from ~38% to 13.6%) while overall accuracy improves implies that the model learns to be more selective and precise, avoiding unnecessary or redundant tool calls. It's possible the model learns to extract information more effectively from fewer calls, or to prioritize internal knowledge over external searches more often.
3. **Ambiguous Labels**: The discrepancy between the displayed change labels (-1.5, +5.2, -4.7) and the actual change in "Tool Call Ratio (%)" for Google, Web, and Wikipedia Search at Step 32 is a critical point. These labels likely refer to a different metric, such as the change in accuracy *attributable to each specific tool*, or perhaps a contribution to the overall accuracy gain. For instance, the "+5.2" for "Web Search" might indicate that this tool's *contribution to accuracy* increased by 5.2 percentage points, even though its *call ratio* decreased. Without further context, their precise meaning remains open to interpretation, but they are clearly intended to convey additional performance insights beyond just the tool call frequency. This highlights the importance of comprehensive documentation for such technical charts.
</details>
Figure 12: Tool call optimization on Musique. AgentFlow's planner increases Web Search usage after Flow-GRPO training.
## Appendix E Instruction Templates in AgentFlow
### E.1 Modules and Memory
#### E.1.1 Action Planner
Tool Metadata can be found in § E.2.
Instruction for Action Planner
Task: Determine the optimal next step to address the query using available tools and previous context.
Context:
Query: {Question}
Available Tools: [Base Generator, Python Coder, Google Search, Wikipedia Search, Web Search]
Toolbox Metadata: [Tool Metadata1, Tool Metadata2, …]
Previous Steps: {Actions from Memory}
Instructions:
1. Analyze the current objective, the history of executed steps, and the capabilities of the available tools.
2. Select the single most appropriate tool for the next action.
3. Consider the specificity of the task (e.g., calculation vs. information retrieval).
4. Consider the source of required information (e.g., general knowledge, mathematical computation, a specific URL).
5. Consider the limitations of each tool as defined in the metadata.
6. Formulate a clear, concise, and achievable sub-goal that precisely defines what the selected tool should accomplish.
7. Provide all necessary context (e.g., relevant data, variable names, file paths, or URLs) so the tool can execute its task without ambiguity.
Response Format:
1. Justification: Explain why the chosen tool is optimal for the sub-goal, referencing its capabilities and the task requirements.
2. Context: Provide all prerequisite information for the tool.
3. Sub-Goal: State the exact objective for the tool.
4. Tool Name: State the exact name of the selected tool (e.g., Wikipedia Search).
Rules:
Select only one tool per step.
The Sub-Goal must be directly and solely achievable by the selected tool.
The Context section must contain all information the tool needs; do not assume implicit knowledge.
The final response must end with the Context, Sub-Goal, and Tool Name sections in that order. No additional text should follow.
#### E.1.2 Tool Executor
Instruction for Tool Executor
Task: Generate a precise command to execute the selected tool.
Context:
Query: {Question}
Sub-Goal: {Sub Goal from Next Step Plan}
Tool Name: {Selected Tool from Next Step Plan}
Toolbox Metadata: {Selected Tool Metadata from Next Step Plan}
Relevant Data: {Context from Next Step Plan}
Instructions:
1. Analyze the tool's required parameters from its metadata.
2. Construct valid Python code that addresses the sub-goal using the provided context and data.
3. The command must include at least one call to tool.execute().
4. Each tool.execute() call must be assigned to a variable named execution.
5. Use exact numbers, strings, and parameters in the tool.execute() call based on the context.
Output Format: Present your response in the following structured format. Do not include any extra text or explanations.
Example 1: Generated Command:
execution = tool.execute(query="Summarize the following problem: Isaac has 100 toys, Masa gets ...., how many do they have together?")
Example 2: Generated Command:
execution = tool.execute(query=["Methanol", "function of hyperbola", "Fermat's Last Theorem"])
#### E.1.3 Execution Verifier
Instruction for Execution Verifier
Task: Evaluate if the current memory is complete and accurate enough to answer the query, or if more tools are needed.
Context:
Query: {Question}
Available Tools: [Base Generator, Python Coder, Google Search, Wikipedia Search, Web Search]
Toolbox Metadata: [Tool Metadata1, Tool Metadata2, …]
Memory (Tools Used & Results): {Actions from Memory}
Instructions:
1. Review the original query, the initial analysis, and the complete history of actions and results in the memory.
2. Does the accumulated information fully address all aspects of the query?
3. Are there any unanswered sub-questions or missing pieces of information?
4. Are there any inconsistencies or contradictions between different steps?
5. Is any information ambiguous, potentially hallucinated, or in need of verification?
6. Determine if any unused tools could provide critical missing information based on their metadata.
Final Determination:
If the memory is sufficient to form a complete and accurate answer, explain why and conclude with "Conclusion: STOP".
If more information is needed, clearly state what is missing, suggest which tool(s) could help, and conclude with "Conclusion: CONTINUE".
Rules:
The response must end with either exactly "Conclusion: STOP" or "Conclusion: CONTINUE".
Do not include any text after the conclusion statement.
Your justification must be concise and directly tied to the query and memory.
#### E.1.4 Solution Generator
Instruction for Solution Generator
Task: Generate a concise final answer to the query based on all provided context.
Context:
Query: {Question}
Initial Analysis: {Query Analysis}
Actions Taken: {Actions from Memory}
Instructions:
1. Carefully review the original user query, the initial analysis, and the complete sequence of actions and their results.
2. Synthesize the key findings from the action history into a coherent narrative.
3. Construct a clear, step-by-step summary that explains how each action contributed to solving the query.
4. Provide a direct, precise, and standalone final answer to the original query.
Output Structure:
1. Process Summary: A clear, step-by-step breakdown of how the query was addressed. For each action, state its purpose (e.g., "To verify X") and summarize its key result or finding in one sentence.
2. Answer: A direct and concise final answer to the query. This should be a self-contained statement that fully resolves the user's question.
Rules:
The response must follow the exact two-part structure above.
The Process Summary should be informative but concise, focusing on the logical flow of the solution.
The Answer must be placed at the very end and be clearly identifiable.
Do not include any additional sections, explanations, or disclaimers beyond the specified structure.
#### E.1.5 Evolving Memory
Example Memory Entry
" Query ": Where is the largest shopping mall besides Tokyoâs biggest metropolitan station? " Action Turn 1 ": { "Tool Name": "Wikipedia Search", "Sub-Goal": "Retrieve detailed information about Tokyoâs metropolitan area from Wikipedia.", "Command": "execution = tool.execute(query="Tokyo metropolitan area details")", "Result": "The Greater Tokyo Area is the largest metropolitan area in the world...", "Verification Status": " Brief Review of the Query, Initial Analysis, and Previous Memory. Assessment of Completeness and Accuracy. Conclusion: The memory is not complete and accurate enough to answer the query. Additional tools are needed to verify or generate more solutions. Final Determination: CONTINUE " }, " Action Turn 2 ": { ... }, ... " Action Turn t ": { ... "Verification Status": " Brief Review of the Query, Initial Analysis, and Previous Memory. Assessment of Completeness and Accuracy. (Including Time Dilation Calculation, Geographic Precise, Inconsistencies or Contradictions, Unit Conversion, etc. ) Conclusion: The memory is complete and accurate enough to answer the query. No additional tools are needed to verify or generate more solutions. Final Determination: STOP " }
Our shared evolving memory system creates a deterministic, structured record that captures the reasoning process across three integrated modules: the Action Planner, Tool Executor, and Execution Verifier. By sequentially stacking crucial information from each action step, the system enables transparent state tracking, controllable behavior, and bounded context growth.
The memory reading and matching process employs regular expressions to parse the outputs generated by different system components, which adhere to the standardized formats defined in their respective instructions. For the Action Planner, we use a relatively permissive regular expression: it matches the content immediately following "Sub-Goal" as the sub-goal and the content immediately following "Tool Name" as the selected tool, and this extracted information populates the next memory entry. For the Tool Executor, the regular expression captures the entire command line starting with execution = tool.execute(...); in addition, the value passed to the query parameter within this command is parsed and saved into memory for future reference. All results returned by the tools are stored directly in the Result field of the memory. The Verification Status is extracted from the Execution Verifier; it contains a brief analysis of the current tool result and the previous memory, followed by a conclusion on whether the loop should CONTINUE or STOP.
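To make the parsing behavior concrete, the sketch below shows one way the permissive extraction described above could be implemented in Python. The function names and exact patterns are illustrative assumptions, not the verbatim AgentFlow implementation.

```python
import re

def parse_planner_output(text: str) -> dict:
    # Permissively grab the content following "Sub-Goal:" and "Tool Name:".
    sub_goal = re.search(r"Sub-Goal:\s*(.+)", text)
    tool_name = re.search(r"Tool Name:\s*(.+)", text)
    return {
        "Sub-Goal": sub_goal.group(1).strip() if sub_goal else None,
        "Tool Name": tool_name.group(1).strip() if tool_name else None,
    }

def parse_executor_output(text: str) -> dict:
    # Capture the whole command line and the value passed to its query parameter.
    command = re.search(r"execution\s*=\s*tool\.execute\(.*\)", text)
    query = re.search(r'tool\.execute\(\s*query\s*=\s*"([^"]*)"', text)
    return {
        "Command": command.group(0) if command else None,
        "Query": query.group(1) if query else None,
    }
```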
### E.2 Toolset Metadata
This section details the implementation and metadata of the tools used in our main results. We employ a suite of specialized tools, each designed for distinct tasks. Below, we present core metadata for each tool, including its functionality, input/output schema, limitations, and best practices.
#### E.2.1 Base Generator
Tool Metadata of Base Generator
Description: A generalized tool that takes a query from the user and answers the question step by step to the best of its ability. It can also accept an image. Input: query: str - The query from the user that guides the agent to generate a response. Output: str - The generated response to the original query. Demo Commands:
Command:
execution = tool.execute(query="Summarize the following text in a few lines")
Description: Generate a short summary given the query from the user.
Limitation
The Base Generator may provide hallucinated or incorrect responses.
Best Practice
1. Use it for general queries or tasks that don't require specialized knowledge or specific tools in the toolbox.
2. Provide a clear, specific query.
3. Use it to answer the original query through step-by-step reasoning for tasks without complex or multi-step reasoning.
4. For complex queries, break them down into subtasks and use the tool multiple times.
5. Use it as a starting point for complex tasks, then refine with specialized tools.
6. Verify important information from its responses.
LLM Engine Required: True
#### E.2.2 Python Coder
Tool Metadata of Python Coder
Description: A tool that generates and executes simple Python code snippets for basic arithmetical calculations and math-related problems. The generated code runs in a highly restricted environment with only basic mathematical operations available. Input: query: str - A clear, specific description of the arithmetic calculation or math problem to be solved, including any necessary numerical inputs. Output: dict - A dictionary containing the generated code, calculation result, and any error messages. Output prompt: Given a query, generate a Python code snippet that performs the specified operation on the provided data. Please think step by step. Ensure to break down the process into clear, logical steps. Make sure to print the final result in the generated code snippet with a descriptive message explaining what the output represents. The final output should be presented in the following format:
```python
<code snippet>
```
Demo Commands:
Command:
execution = tool.execute(query="Find the sum of prime numbers up to 50")
Description: Generate a Python code snippet to find the sum of prime numbers up to 50.
Command:
query=" Given the list [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], calculate the sum of squares of odd numbersâ execution = tool.execute(query=query)
Description: Generate a Python function for a mathematical operation on a given list of numbers.
Limitation
1. Restricted to basic Python arithmetic operations and built-in mathematical functions.
2. Cannot use any external libraries or modules, including those in the Python standard library.
3. Limited to simple mathematical calculations and problems.
4. Cannot perform any string processing, data structure manipulation, or complex algorithms.
5. No access to any system resources, file operations, or network requests.
6. Cannot use "import" statements.
7. All calculations must be self-contained within a single function or script.
8. Input must be provided directly in the query string.
9. Output is limited to numerical results or simple lists/tuples of numbers.
10. Output should be kept to a single numerical result or a simple list/tuple of numbers.
11. DO NOT generate loop output.
Best Practice
1. Provide clear and specific queries that describe the desired mathematical calculation.
2. Include all necessary numerical inputs directly in the query string.
3. Keep tasks focused on basic arithmetic, algebraic calculations, or simple algorithms.
4. Ensure all required numerical data is included in the query.
5. Verify that the query only involves mathematical operations and does not require any data processing or complex algorithms.
6. Review generated code to ensure it only uses basic Python arithmetic operations and built-in math functions.
LLM Engine Required: True
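The restricted environment described above can be approximated with a guarded call to exec. The snippet below is a minimal sketch under that assumption; the whitelist of builtins, the crude import check, and the function name are illustrative and do not describe AgentFlow's actual sandbox.

```python
import io
import contextlib

def run_restricted(code_snippet: str) -> str:
    """Run a generated snippet with no imports and only basic builtins available (illustrative sketch)."""
    if "import" in code_snippet:  # crude illustrative guard against import statements
        return "Error: 'import' statements are not allowed."
    allowed_builtins = {
        "abs": abs, "min": min, "max": max, "sum": sum, "round": round,
        "range": range, "len": len, "print": print, "enumerate": enumerate,
    }
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code_snippet, {"__builtins__": allowed_builtins})
    except Exception as exc:
        return f"Error: {exc}"
    return buffer.getvalue()

print(run_restricted("result = (13 - 1) * (1 + 1)\nprint('Result:', result)"))  # -> Result: 24
```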
#### E.2.3 Google Search
Tool Metadata of Google Search
Description: A web search tool powered by Google Search that provides real-time information from the internet with citation support. Input: query: str - The search query to find information on the web. Input: add_citations: bool - Whether to add citations to the results. If True, the results will be formatted with citations. By default, it is True. Output: str - The search results of the query. Demo Commands:
Command:
execution = tool.execute(query="What is the capital of France?")
Description: Search for general information about the capital of France with default citations enabled.
Command:
execution = tool.execute(query="Who won the euro 2024?", add_citations=False)
Description: Search for information about the Euro 2024 winner without citations.
Command:
execution = tool.execute(query="Physics and Society article arXiv August 11, 2016", add_citations=True)
Description: Search for specific academic articles with citations enabled.
Limitation
1. This tool is only suitable for general information search.
2. This tool contains less domain-specific information.
3. This tool is not suitable for searching and analyzing videos on YouTube or other video platforms.
Best Practice
1. Choose this tool when you want to search for general information about a topic.
2. Choose this tool for question-type queries, such as "What is the capital of France?" or "Who invented the telephone?".
3. The tool will return summarized information.
4. This tool is more suitable for definitions, world knowledge, and general information search.
LLM Engine Required: False
#### E.2.4 Wikipedia Search
Wikipedia Search first calls the Wikipedia API to retrieve relevant URLs with snippets. The RAG (Retrieval-Augmented Generation) process then extracts raw text content from each webpage URL, cleans it to remove HTML elements and retain only meaningful text, and considers at most the first 1M words of each URL. This content is split into overlapping chunks of approximately 200 words each, with a 20-word overlap to preserve context across segments. Next, both the user's query and the document chunks are embedded into the vector space using the OpenAI text-embedding-3-small (https://platform.openai.com/docs/models/text-embedding-3-small) model. The system computes the cosine similarity between the query embedding and each chunk embedding to rank the chunks by relevance; the top 10 most similar chunks are selected and passed forward as context. Finally, a base LLM engine summarizes the extracted context.
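The chunking step described above can be sketched as follows. The chunk size, overlap, and word cap come from the description; the function name and signature are illustrative assumptions.

```python
def chunk_text(page_text: str, chunk_size: int = 200, overlap: int = 20, max_words: int = 1_000_000) -> list[str]:
    """Split cleaned page text into overlapping ~200-word chunks with a 20-word overlap."""
    words = page_text.split()[:max_words]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```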
Tool Metadata of Wikipedia Search
Description: A tool that searches Wikipedia and returns relevant pages with their page titles, URLs, abstract, and retrieved information based on a given query. Input: query: str - The search query for Wikipedia. Output: dict - A dictionary containing search results, all matching pages with their content, URLs, and metadata. Demo Commands:
Command:
execution = tool.execute(query="What is the exact mass in kg of the moon")
Description: Search Wikipedia and get the information about the mass of the moon.
Command:
execution = tool.execute(query="Function of human kidney")
Description: Search Wikipedia and get the information about the function of the human kidney.
Command:
execution = tool.execute(query="When was the first moon landing?")
Description: Search Wikipedia and get the information about the first moon landing.
Limitation
1. It is designed specifically for retrieving grounded information from Wikipedia pages only.
2. Filtering of relevant pages depends on LLM model performance and may not always select optimal pages.
3. The returned information accuracy depends on Wikipedia's content quality.
Best Practice
1. Use specific, targeted queries rather than broad or ambiguous questions.
2. The tool automatically filters for relevant pages using LLM-based selection - trust the "relevant_pages" results.
3. If initial results are insufficient, examine the "other_pages" section for additional potentially relevant content.
4. Use this tool as part of a multi-step research process rather than a single source of truth.
5. You can use the Web Search to get more information from the URLs.
LLM Engine Required: True
#### E.2.5 Web Search
Web Search directly accesses the URL given in the query. The RAG (Retrieval-Augmented Generation) process then splits the page content into overlapping chunks of approximately 200 words each, with a 20-word overlap to preserve context across segments, considering at most the first 1M words of each URL. Next, both the user's query and the document chunks are embedded into the vector space using the OpenAI text-embedding-3-small (https://platform.openai.com/docs/models/text-embedding-3-small) model. The system computes the cosine similarity between the query embedding and each chunk embedding to rank the chunks by relevance; the top 10 most similar chunks are selected and passed forward as context. Finally, a base LLM engine summarizes the extracted context.
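The embedding and ranking step shared by Wikipedia Search and Web Search can be sketched as below, assuming the official OpenAI Python client is available and configured; the helper name and exact call pattern are illustrative, not the verbatim implementation.

```python
import numpy as np
from openai import OpenAI  # assumes the openai>=1.0 client is installed and configured

client = OpenAI()

def top_k_chunks(query: str, chunks: list[str], k: int = 10) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding and keep the top k."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[query] + chunks)
    vectors = np.array([item.embedding for item in resp.data])
    query_vec, chunk_vecs = vectors[0], vectors[1:]
    sims = chunk_vecs @ query_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```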
Tool Metadata of Web Search
Description: A specialized tool for answering questions by retrieving relevant information from a given website using RAG (Retrieval-Augmented Generation). Input: query: str - The search query for the website. Input: url: str - The URL of the website to retrieve information from. Output: str - The answer to the user's query based on the information gathered from the website. Demo Commands:
Command:
execution = tool.execute(query="What is the exact mass in kg of the moon?", url="https://en.wikipedia.org/wiki/Moon")
Description: Retrieve information about the moon's mass from Wikipedia.
Command:
execution = tool.execute(query="What are the main features of Python programming language?", url="https://www.python.org/about/apps/")
Description: Get information about Python features from the official website.
Limitation
1. Requires valid URLs that are accessible and contain text content.
2. May not work with JavaScript-heavy websites or those requiring authentication.
3. Performance depends on the quality and relevance of the website content.
4. May return incomplete or inaccurate information if the website content is not comprehensive.
5. Limited by the chunking and embedding process, which may miss context.
6. Requires OpenAI API access for embeddings and LLM generation.
Best Practice
1. Use specific, targeted queries rather than broad questions.
2. Ensure the URL is accessible and contains relevant information.
3. Prefer websites with well-structured, text-rich content.
4. For complex queries, break them down into smaller, specific questions.
5. Verify important information from multiple sources when possible.
6. Use it as part of a multi-step research process rather than a single source of truth.
7. It is highly recommended to use this tool after calling other web-based tools (e.g., Google Search, Wikipedia Search) to obtain real, accessible URLs.
LLM Engine Required: True
### E.3 LLM-based Judging
We employ GPT-4o as our judge model using a two-step "analyze-then-judge" instruction paradigm to ensure both accuracy and efficiency.
Reward Function Instruction in Training
Task: Determine if the Model Response is equivalent to the Ground Truth.
Instructions:
1. Extract: Isolate the final answer from the Model Response, ignoring all reasoning steps. Look specifically for content within $...$ or the concluding statement.
2. Normalize & Compare: Assess equivalence after normalization:
3. Mathematical Answers: Must be mathematically identical (e.g., $\frac{1}{2}$ is equivalent to $0.5$).
4. Numerical/Textual Answers: Ignore formatting (commas, spaces), case sensitivity, and extraneous units/currency (e.g., "1,000" == "1000", "Paris" == "PARIS").
5. Multiple Choice Questions (MCQ): The answer must match either the correct option's content (e.g., "Paris") or its identifier (e.g., "A" or "1st").
6. Verdict: Return "True" only if the normalized answers are semantically or mathematically equivalent.
Inputs:
Question: {Question}
Model Response: {Final Response from Solution Generator}
Ground Truth: {GT}
Output Format: Present your response in the following structured format. Do not include any extra text or explanations.
<analysis>: Brief analysis of the comparison.
<true_false>: "True" or "False".
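During training, the judge's structured verdict must be mapped to the verifiable trajectory-level outcome reward used by Flow-GRPO. A minimal sketch of such a mapping is shown below; the parsing regex and the fallback behavior are assumptions for illustration, not the exact implementation.

```python
import re

def outcome_reward(judge_output: str) -> float:
    """Convert the judge's <true_false> field into a binary outcome reward."""
    match = re.search(r"<\s*true_false\s*>\s*:?\s*[\"']?(True|False)", judge_output, re.IGNORECASE)
    if match is None:
        return 0.0  # treat unparseable verdicts as failures (illustrative choice)
    return 1.0 if match.group(1).lower() == "true" else 0.0
```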
## Appendix F Case Studies
In this section, we conduct case studies to demonstrate how AgentFlow, trained with Flow-GRPO, enhances problem-solving performance with greater elegance, efficiency, and robustness. We present solution comparisons showing brief outputs from the memory of the Action Planner (Qwen2.5-7B-Instruct) before (w/o) and after (w/) Flow-GRPO tuning, with the methodology detailed in § 3.2.
### F.1 Example 1: Efficient Search for Simple Tasks
This case demonstrates that, with Flow-GRPO tuning, the Action Planner can effectively leverage the search engine to retrieve correct answers for simple tasks in a highly efficient manner, unlike the untuned baseline, which requires multiple trials.
Example from # GameOf24
Question: Using the numbers [1, 1, 1, 13], create an expression that equals 24. You must use basic arithmetic operations (+, -, *, /) and parentheses. Answer: (13-1)*(1+1)
Success Case: AgentFlow w/ Flow-GRPO
Action Step 1:
* Tool: Google Search
* Sub-goal: Find a valid arithmetic expression that equals 24 using the numbers [1, 1, 1, 13].
* Command: tool.execute(query="[1, 1, 1, 13] arithmetic expression to get 24")
* Result: Here's an arithmetic expression using the numbers 1, 1, 1, and 13 to get 24: (13 - 1) * (1 + 1) = 24

Execution Verifier: PASS, STOP
Solution Generator: The arithmetic expression is $((13-1)\times(1+1))=24$.
Failure Case: AgentFlow w/o Flow-GRPO
Action Step 1:
* Tool: Python Coder
* Sub-goal: Generate and test all possible combinations of the numbers [1, 1, 1, 13] using basic arithmetic operations and parentheses to find an expression that equals 24.
* Command: tool.execute(query="[1, 1, 1, 13] execution = (1 * (13 - 1 - 1))")
* Result:

# Define the values
value_list = [1, 1, 1, 13]
operation_result = 1 * (13 - 1 - 1)
# Print the result with a descriptive message
print(f"The result of the operation is: {operation_result}")

The result of the operation is: 11

Action Step 2:
* Tool: Python Coder
* Sub-goal: Generate and test all possible combinations of the numbers [1, 1, 1, 13] using basic arithmetic operations and parentheses to find an expression that equals 24.
* Command: tool.execute(query="[1, 1, 1, 13] -> ((1 + 1 + 1) * 13)")
* Result: Error: unexpected indent (<string>, line 2).

Action Steps 3-8:
* Tool: No matched tool given (Tool Not Found)
* Sub-goal: Generate and test all possible combinations of the numbers [1, 1, 1, 13] using basic arithmetic operations and parentheses to find an expression that equals 24.
* Command: No command is generated because no matching tool was found.
* Result: No result is generated because no matching tool was found.
* Note: These steps are identical repetitions where the agent failed to find a matching tool.

Execution Verifier: STOP, Tool Not Found
Solution Generator: The agent incorrectly concludes that $(1+1+1)\times 13=24$.
### F.2 Example 2: Spontaneous Brute-force
This case demonstrates that, when tuned with Flow-GRPO, the Action Planner first attempts several solutions, recognizes their ineffectiveness, resorts to a brute-force approach, and finally verifies the result using a search engine; a minimal sketch of such a brute-force search is given at the end of this example.
Example from # GameOf24
Question: Using the numbers [1, 1, 6, 9], create an expression that equals 24. You must use basic arithmetic operations (+, -, *, /) and parentheses. Answer: (1+1)*9+6
Success Case: AgentFlow w/ Flow-GRPO
Action Step 1: Initial Attempt
* Tool: Python Coder
* Sub-goal: Generate and test all possible arithmetic expressions using the numbers [1, 1, 6, 9] exactly once to check if any combination equals 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9], create an expression that equals 24 using basic arithmetic operations and parentheses.")
* Result:

# Given numbers
numbers = [1, 1, 6, 9]
# Calculate the expression
result = (numbers[0] + numbers[1]) * (numbers[3] - numbers[2])
# Print the result
print(f"The result of the expression (numbers[0] + numbers[1]) * (numbers[3] - numbers[2]) is: {result}")

The result of the expression (1 + 1) * (9 - 6) is: 6

Action Step 2: Tool Error
* Tool: Python Coder
* Sub-goal: Generate and test all possible permutations of the numbers with different operator placements to find an expression that equals 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9], create an expression with +,-,*,/ that equals 24")
* Result: Error: eval() arg 1 must be a string, bytes or code object

Action Step 3: Second Attempt
* Tool: Python Coder
* Sub-goal: Generate all possible combinations of the four numbers with every permutation of operators (+,-,*,/) to find a valid equation resulting in exactly 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9] create an expression with +,-,*,/ that equals to '24'")
* Result:

# Given numbers
numbers = [1, 1, 6, 9]
# Calculate the expression
result = (numbers[2] - numbers[0]) * (numbers[3] / numbers[1]) / numbers[1]
# Print the result
print(f"The result of the expression is: {result}")

The result of the expression is: 45.0

Action Step 4: Systematic Search (Breakthrough)
* Tool: Python Coder
* Sub-goal: Generate and evaluate all possible permutations of the four numbers with every permutation of operators (+,-,*,/) to find a valid equation resulting in exactly 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9] create an expression that equals 24 using +,-,/,*")
* Result: Successfully found valid expressions: The expression (1+1)*9+6 equals 24. The expression 6+(1+1)*9 equals 24. The expression 6+(9*(1+1)) equals 24. The expression 9*(1+1)+6 equals 24.

Action Step 5: Confused Verification
* Tool: Python Coder
* Sub-goal: Simplify and verify if any minor adjustments to the existing promising expression can achieve an exact match of 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9] and basic arithmetic operations create an expression that equals 24: (9 * (6 * (1 + 1)))")
* Result: The result of the expression (9 * (1 + 1) * 6) is: 108

Action Step 6: Final Verification
* Tool: Google Search
* Sub-goal: Search online for examples or discussions related to creating expressions using given numbers that equal a target value.
* Command: tool.execute(query="using the numbers [1, 1, 6, 9] create an expression that equals 24", add_citations=True)
* Result: Here's an expression using the numbers that equals 24: (1 + 1) * 9 + 6 = 24

Execution Verifier: PASS, STOP (via exhaustive search and verification)
Solution Generator: After a multi-step process of trial, error, and verification, the final correct expression is (1 + 1) * 9 + 6 = 24.
Failure Case: AgentFlow w/o Flow-GRPO
Action Step 1:
* Tool: Python Coder
* Sub-goal: Generate and test Python code snippets to find an arithmetic expression using the numbers [1, 1, 6, 9] exactly once that equals 24.
* Command: tool.execute(query="Using the numbers [1, 1, 6, 9], create an expression that equals 24 using basic arithmetic operations and parentheses.")
* Result:

# Define the numbers
numbers = [1, 1, 6, 9]
# Calculate the expression
result = (6 * 9) - ((1 + 1) * 15)
# Print the result with a descriptive message
print("The result of the expression is:", result)

The result of the expression is: 24

Execution Verifier: STOP
Solution Generator: The agent incorrectly concludes the answer is $(6\times 9)-((1+1)\times 15)$, failing to notice it used the number 15, which was not in the allowed input set.
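For reference, the exhaustive search that the tuned planner converges to in Action Step 4 can be reproduced with a short brute-force program. This is a minimal sketch (the solver name and parenthesization patterns are illustrative), not the code the Python Coder actually emitted.

```python
from itertools import permutations, product

def solve_24(numbers, target=24, eps=1e-6):
    """Brute-force all orderings, operator choices, and parenthesizations of four numbers."""
    ops = ["+", "-", "*", "/"]
    patterns = [                      # the five parenthesizations of a o b o c o d
        "(({a}{o1}{b}){o2}{c}){o3}{d}",
        "({a}{o1}({b}{o2}{c})){o3}{d}",
        "({a}{o1}{b}){o2}({c}{o3}{d})",
        "{a}{o1}(({b}{o2}{c}){o3}{d})",
        "{a}{o1}({b}{o2}({c}{o3}{d}))",
    ]
    solutions = set()
    for a, b, c, d in permutations(numbers):
        for o1, o2, o3 in product(ops, repeat=3):
            for pattern in patterns:
                expr = pattern.format(a=a, b=b, c=c, d=d, o1=o1, o2=o2, o3=o3)
                try:
                    if abs(eval(expr) - target) < eps:
                        solutions.add(expr)
                except ZeroDivisionError:
                    continue
    return sorted(solutions)

print(solve_24([1, 1, 6, 9]))  # includes ((1+1)*9)+6, i.e., (1 + 1) * 9 + 6 = 24
```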
### F.3 Example 3: A Good Initial Plan is Essential
This case demonstrates that a well-crafted initial search with a highly relevant query is far more effective than issuing numerous queries down wrong paths. When tuned with Flow-GRPO, the Action Planner in AgentFlow can identify the optimal search engine and formulate the most effective query, leading to a correct and targeted answer in a single trial.
Example from # GAIA
Question: Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name. Answer: Claude Shannon
Success Case : AgentFlow w/ Flow-GRPO
Action Turn 1:
* Tool: Google Search
* Sub-goal: Search for details on "The Thinking Machine" YouTube video and identify key figures making predictions about AI.
* Command: tool.execute(query="Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting sooner thinking machines or robots?")
* Result: The scientist who predicted the sooner emergence of thinking machines or robots in "The Thinking Machine" video from the 1960s was Claude Shannon. He stated, "I confidently expect that within 10 or 15 years we will find emerging from the laboratories something not too far from the robots of science-fiction fame." This prediction was made in the 1961 documentary, which was produced by CBS to honor MIT's 100th birthday.

Execution Verifier: PASS, STOP
Solution Generator: The scientist who made the prediction is Claude Shannon.
Failure Case: AgentFlow w/o Flow-GRPO
Action Turn 1:
* Tool: Web Search
* Sub-goal: Retrieve information from the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s) to identify the name of the scientist who made the prediction about thinking machines or robots.
* Command: tool.execute(query="Who predicted the sooner thinking machines or robots in the YouTube video 'The Thinking Machine' (Artificial Intelligence in the 1960s)?", url="https://www.youtube.com/watch?v=your_video_id")
* Result: The reference information provided does not contain any details regarding predictions made about thinking machines or robots in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s). Therefore, I cannot provide an answer to the query.

Action Turn 2:
* Tool: Google Search
* Sub-goal: Find the name of the scientist who predicted the sooner development of thinking machines or robots in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s).
* Command: tool.execute(query="In the YouTube video The Thinking Machine (Artificial Intelligence in the 1960s), which scientist predicted the sooner development of thinking machines or robots?", add_citations=True)
* Result: In the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)," Jerome Wiesner, then Director of the Research Laboratory of Electronics at MIT, predicted the sooner development of thinking machines. He stated, "if you come back in four or 5 years I'll say 'Sure they really do think.'"

Action Turn 3:
* Tool: Google Search
* Sub-goal: Find the name of the scientist who predicted the sooner development of thinking machines or robots in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s).
* Command: tool.execute(query="In the YouTube video The Thinking Machine (Artificial Intelligence in the 1960s), which scientist predicted the sooner development of thinking machines or robots?", add_citations=True)
* Result: In the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)," Professor Jerome B. Wiesner, director of the research laboratory of electronics at MIT, predicted the sooner development of thinking machines. He stated, "I suspect if you come back in four or 5 years I'll say 'Sure they really do think.'" Another scientist in the video confidently expected "within a matter of 10 or 15 years something will emerge from the laboratories which is not too far from robot a science fiction fan."

Action Turn 4:
* Tool: Google Search
* Sub-goal: Find the exact year of the interview in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s) where Jerome B. Wiesner made the prediction about thinking machines or robots.
* Command: tool.execute(query="In the YouTube video The Thinking Machine (Artificial Intelligence in the 1960s), which scientist predicted the sooner development of thinking machines or robots?", add_citations=True)
* Result: In the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)," a scientist confidently predicted that "within a matter of 10 or 15 years something will emerge from the laboratories which is not too far from the robot of science fiction fame." While the video features interviews with several pioneering scientists in artificial intelligence, including Jerome Wiesner, Oliver Selfridge, and Claude Shannon, the specific speaker for this particular prediction about the emergence of robots within 10 to 15 years is not explicitly named in the provided transcripts. Jerome Wiesner, one of the co-hosts, also made a prediction, stating he suspected that in "four or 5 years" machines would "really do think."

Action Turn 5:
* Tool: Google Search
* Sub-goal: Find the exact year of the interview in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s).
* Command: tool.execute(query="When was the YouTube video 'The Thinking Machine' released?", add_citations=True)
* Result: The original television program titled "The Thinking Machine" aired on the CBS TV network on October 26, 1960. It was a special program produced by CBS News in conjunction with the Massachusetts Institute of Technology (MIT). This documentary explored early concepts of artificial intelligence and computers. While the original broadcast was in 1960, various versions of "The Thinking Machine" have been uploaded to YouTube at different times.

Action Turn 6:
* Tool: Google Search
* Sub-goal: Find the exact year of the interview in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s) where Jerome Wiesner made his prediction about thinking machines or robots.
* Command: tool.execute(query="In which year did Jerome Wiesner make his prediction about thinking machines or robots in the YouTube video 'The Thinking Machine' (Artificial Intelligence in the 1960s)?", add_citations=True)
* Result: Jerome Wiesner made his prediction about thinking machines or robots in the YouTube video "The Thinking Machine" (Artificial Intelligence in the 1960s) in 1960. The documentary, which features Wiesner, originally aired on October 26, 1960, as part of a CBS series titled "Tomorrow". In the video, Wiesner stated his confident expectation that "within a matter of 10 or 15 years something will emerge from the laboratories which is not too far from robot a science fiction fan".

Execution Verifier: STOP, Initial Tool Failure & Inefficient Search Path
Solution Generator: The answer is Jerome B. Wiesner. However, the initial direct video search failed, leading to multiple, somewhat redundant Google searches to extract the information.
### F.4 Example 4: Robust Self-Correction and Adaptation
This side-by-side comparison illustrates the critical impact of Flow-GRPO tuning on strategic tool usage. The trained AgentFlow agent demonstrates adaptive planning: recovering from failed searches, refining input formulations, and ultimately achieving a correct solution in a single effective trial. In contrast, the untrained agent, despite accessing the correct information early, fails to properly utilize the Python Coder tool and becomes trapped in a repetitive error loop, unable to learn or adjust (a minimal reference computation of the ISBN-10 check digit is sketched at the end of this example). This highlights Flow-GRPO's role in enabling not just tool selection, but strategic resilience and goal-directed reasoning.
Example from # GAIA
Question: Compute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number. Answer: 3
Success Case: AgentFlow w/ Flow-GRPO
Action Step 1:
* Tool: Wikipedia Search
* Sub-goal: Retrieve the Tropicos ID from the Wikipedia page on Helotiales.
* Command: tool.execute(query="Tropicos ID of Order Helotiales")
* Result: No results found for query: Tropicos ID of Order Helotiales

Action Step 2:
* Tool: Google Search
* Sub-goal: Retrieve the Tropicos ID of Order Helotiales from a reliable online source.
* Command: tool.execute(query="Tropicos ID of Order Helotiales")
* Result: The Tropicos ID for the Order Helotiales is 100370510.

Action Step 3:
* Tool: Python Coder
* Sub-goal: Write and execute a Python script to calculate the check digit for tropicos_id as if it were an ISBN-10 number.
* Command: tool.execute(query="Calculate the check digit for the ISBN-10 number " + tropicos_id)
* Result: Error in execute_tool_command: name 'tropicos_id' is not defined

Action Step 4:
* Tool: Python Coder
* Sub-goal: Write and execute a Python script to calculate the check digit for tropicos_id as if it were an ISBN-10 number.
* Command: tool.execute(query="Compute the check digit for 100370510 as an ISBN-10 number")
* Result:

def calculate_check_digit(isbn):
    # Extract the first 9 digits of the ISBN-10
    isbn_digits = [int(digit) for digit in isbn[:9]]
    # Calculate the sum of products
    total_sum = sum(position * digit for position, digit in enumerate(isbn_digits, start=1))
    # Determine the check digit
    check_digit = total_sum % 11
    if check_digit == 10:
        return 'X'
    else:
        return str(check_digit)

# Given ISBN-10 number
isbn_10 = "100370510"
# Compute the check digit
check_digit_result = calculate_check_digit(isbn_10)
# Print the final result
print(f"The check digit for the ISBN-10 number {isbn_10} is {check_digit_result}")

Printed output: The check digit for the ISBN-10 number 100370510 is 3

Action Step 5:
* Tool: Python Coder
* Sub-goal: Execute the existing Python script to calculate and print the check digit for tropicos_id as an ISBN-10 number.
* Command: tool.execute(query="Compute the check digit for tropicos_id 100370510 as an ISBN-10 number")
* Result:

def calculate_isbn10_check_digit(tropicos_id):
    # Step 1: Extract the first 9 digits of the Tropicos ID
    isbn_digits = tropicos_id[:9]
    # Step 2: Initialize the sum for the check digit calculation
    total_sum = 0
    # Step 3: Calculate the sum of the products of each digit and its position
    for i, digit in enumerate(isbn_digits, start=1):
        total_sum += i * int(digit)
    # Step 4: Calculate the check digit
    check_digit = total_sum % 11
    # If the check digit is 10, it is represented as 'X'
    if check_digit == 10:
        check_digit = 'X'
    # Step 5: Format the ISBN-10
    isbn_10 = f"{isbn_digits}{check_digit}"
    return isbn_10

# Given Tropicos ID
tropicos_id = "100370510"
# Step 6: Generate the ISBN-10 and print the result
isbn_10 = calculate_isbn10_check_digit(tropicos_id)
print(f"The ISBN-10 number for the Tropicos ID {tropicos_id} is: {isbn_10}")

Printed output: The ISBN-10 number for the Tropicos ID 100370510 is: 1003705103

Execution Verifier: PASS, STOP
Solution Generator: The check digit is 3, resulting in the full number 1003705103.
Failure Case: AgentFlow w/o Flow-GRPO
Action Step 1:
* Tool: Wikipedia Search
* Sub-goal: Retrieve the Tropicos ID of the Order Helotiales from Wikipedia.
* Command: tool.execute(query="Tropicos ID for the Order Helotiales")
* Result: No results found for query: Tropicos ID for the Order Helotiales

Action Step 2:
* Tool: Google Search
* Sub-goal: Retrieve the Tropicos ID of the Order Helotiales from a reliable web source.
* Command: tool.execute(query="Tropicos ID for the Order Helotiales", add_citations=True)
* Result: The Tropicos ID for the Order Helotiales is 100370510.[1][2]

Action Steps 3-9:
* Tool: Python Coder
* Sub-goal: Calculate the check digit for the Tropicos ID 100370510 if it were an ISBN-10 number.
* Command: tool.execute(query="Calculate the check digit for the ISBN-10 number 100370510")
* Result: Error: name 'isbn' is not defined
* Note: Action Steps 3 through 9 all have identical sub-goals, commands, and error results; the agent is stuck in a repetitive error loop, unable to fix the variable naming issue.

Execution Verifier: STOP, Stuck in Repetitive Tool Error
Solution Generator: The Python tool failed repeatedly with the same 'isbn' variable error across 7 consecutive attempts (Steps 3-9). The agent was unable to recover from this error and failed to calculate the check digit.
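For reference, the ISBN-10 check digit computation at the heart of this example is simple: weight the first nine digits by their positions (1 through 9), sum, and take the result modulo 11, writing 10 as 'X'. A minimal standalone sketch (function name illustrative):

```python
def isbn10_check_digit(first_nine_digits: str) -> str:
    """ISBN-10 check digit: position-weighted digit sum modulo 11, with 10 rendered as 'X'."""
    total = sum(pos * int(d) for pos, d in enumerate(first_nine_digits[:9], start=1))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

print(isbn10_check_digit("100370510"))  # -> '3', matching the expected answer
```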
### F.5 Example 5: New Combo: Retrieve with Specific URL
This case highlights how both agents eventually succeed, but with markedly different efficiency and strategy. The Flow-GRPO-tuned AgentFlow agent learns to refine its queries effectively and, upon recognizing the limitations of Wikipedia search, switches strategically to a targeted web search on the most task-relevant URL, achieving success with minimal redundancy. In contrast, the untrained agent persists in issuing dense, ineffective queries within the same tool despite diminishing returns, only escaping the loop by eventually switching to Google Search. While both reach the correct answer, the latter exhibits inefficient exploration, delayed adaptation, and little path consistency, underscoring Flow-GRPO's role in fostering not just correctness, but strategic focus and timely tool transitions.
Example from # HotpotQA
Question: Which part of Handel's Messiah covers the birth of a child who, according to the gospels of Luke and Matthew, was born in Bethlehem? Answer: Part I
Success Case: AgentFlow w/ Flow-GRPO
Action Steps 1-2: Initial Failures
* Tool: Wikipedia Search
* Sub-goal: Retrieve information about the part of Handel's Messiah that refers to the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah refers to the birth of Jesus in Bethlehem?")
* Result: Initial long-form queries failed to find relevant pages (returned irrelevant pages like "Historical Jesus", "Advent", "Passion of Jesus").

Action Step 3: Query Refinement
* Tool: Wikipedia Search
* Sub-goal: Identify the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Handel's Messiah birth of Jesus Bethlehem")
* Result: This successfully identified relevant pages like "Structure of Handel's Messiah" and "Messiah Part I".

Action Steps 4-6: Further Search
* Tool: Wikipedia Search
* Sub-goal: Retrieve the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah refers to the birth of Jesus in Bethlehem?") and tool.execute(query="birth of Jesus in Handel's Messiah")
* Result: The agent gets briefly confused and re-runs a failed query, but another refined query again locates the main "Messiah (Handel)" Wikipedia pages.

Action Step 7: Tool Switching (Breakthrough)
* Tool: Web Search (on a specific URL)
* Sub-goal: Retrieve the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah covers the birth of Jesus in Bethlehem?", url="https://en.wikipedia.org/wiki/Handel%27s_Messiah")
* Result: Success! The tool extracted the answer from the page: The part of Handel's Messiah that covers the birth of Jesus in Bethlehem is found in Part I, specifically in the scene that includes the prophecy of Christ's birth. This section features the well-known chorus "For unto us a child is born," which celebrates the nativity of Jesus.

Execution Verifier: PASS, STOP
Solution Generator: The relevant section is Part I, which includes the famous chorus "For unto us a child is born".
Success Case (stuck in dense information): AgentFlow w/o Flow-GRPO
Action Step 1: Initial Wikipedia Search
* Tool: Wikipedia Search
* Sub-goal: Retrieve information about the part of Handel's Messiah that refers to the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah refers to the birth of Jesus in Bethlehem?")
* Result: Returned irrelevant pages (e.g., "Historical Jesus", "Advent"). (No direct answer)

Action Step 2: Refined Wikipedia Search
* Tool: Wikipedia Search
* Sub-goal: Retrieve information about the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah refers to the birth of Jesus in Bethlehem?")
* Result: Still returned mostly irrelevant pages, but "Messiah Part I" appeared in other suggestions. (Hint found, but no direct answer)

Action Step 3: Keyword Wikipedia Search
* Tool: Wikipedia Search
* Sub-goal: Retrieve information about the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Handel's Messiah birth of Jesus Bethlehem")
* Result: Still no directly relevant pages, but "Structure of Handel's Messiah" and "Messiah Part I" again appeared. (Still no direct answer)

Action Steps 4-5: Persistent Wikipedia Queries
* Tool: Wikipedia Search
* Sub-goal: Retrieve the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah refers to the birth of Jesus in Bethlehem?") and tool.execute(query="Handel's Messiah birth of Jesus Bethlehem")
* Result: Consistently returned irrelevant pages like "Historical Jesus" and "Christmas Oratorio". (Wikipedia tool stuck/ineffective for a direct answer)

Action Step 6: External Search (Breakthrough)
* Tool: Google Search
* Sub-goal: Find the specific part of Handel's Messiah that mentions the birth of Jesus in Bethlehem.
* Command: tool.execute(query="Which part of Handel's Messiah covers the birth of a child who, according to the gospels of Luke and Matthew, was born in Bethlehem?", add_citations=True)
* Result: Successfully found the answer: Handel's Messiah addresses the birth of a child born in Bethlehem primarily in Part I of the work. Key elements within Part I include the chorus "For unto us a child is born" and the scene depicting the annunciation to the shepherds.

Execution Verifier: PASS, STOP (via query refinement and external search after Wikipedia limitations)
Solution Generator: The part of Handel's Messiah referring to the birth of Jesus in Bethlehem is found in Part I, particularly the chorus "For unto us a child is born" and the scene depicting the annunciation to the shepherds.
### F.6 Example 6: Rapid and Correct Physics Calculation
This GPQA example reveals a fundamental difference in reasoning quality between the tuned and untuned agents. The Flow-GRPO-enhanced AgentFlow correctly identifies the core challenge, relativistic time dilation over an interstellar distance, and applies the appropriate physics-based computation in minimal steps, arriving at the correct answer (81 years) efficiently. In contrast, the untrained agent misinterprets the astronaut's age as the travel duration, leading to a cascade of erroneous calculations across multiple tool calls. Despite eventually retrieving the distance via search, it fails to integrate this information coherently or recognize its conceptual mistake. This highlights that Flow-GRPO not only improves tool-usage efficiency but also promotes correct problem formulation, enabling the agent to distinguish between proper time, coordinate time, and mission constraints, a critical capability for complex scientific reasoning.
Example from # GPQA
Question: An intelligent civilization in the Large Magellanic Cloud has engineered an extraordinary spacecraft capable of traveling at a substantial fraction of the speed of light. The average lifetime of these aliens is roughly 150 solar years. Now, having Earth as their destination in mind, they are determined to travel with this spacecraft at a constant speed of 0.99999987*c, where c is the speed of light. Approximately, how long will it take for their 22 years old astronaut (from the point of view of the astronaut) to reach the Earth using this incredibly fast spacecraft? Choices: A. 81 years B. 77 years C. 72 years D. The astronaut will die before reaching to the Earth. Answer: A. 81 years
Success Case: AgentFlow w/ Flow-GRPO
Action Step 1: â˘
Tool: Python Coder â˘
Sub-goal: Calculate the time it takes for the astronaut to reach Earth from the perspective of the astronaut. â˘
Command:
tool.execute(query="A spacecraft travels at a speed of 0.99999987*c, where c is the speed of light. The distance to Earth is 165000 light-years. If the astronaut is 22 years old, how long will it take for the astronaut to reach Earth from the perspective of the astronaut? Using the formula time = distance / speed, calculate the time dilation effect with the velocity given and determine the perceived time for the astronaut.") â˘
Result:
```python
import math

# Constants
speed_of_light = 1              # Speed of light in arbitrary units
distance_to_earth = 165000      # Distance to Earth in light-years
astronaut_age = 22              # Astronaut's age in years
velocity_factor = 0.99999987    # Velocity as a fraction of the speed of light

# Step 1: Calculate the time it takes for the spacecraft to travel to Earth
travel_time = distance_to_earth / velocity_factor

# Step 2: Calculate the Lorentz factor
lorentz_factor = 1 / math.sqrt(1 - velocity_factor**2)

# Step 3: Calculate the perceived time for the astronaut
perceived_time = travel_time * lorentz_factor

# Print the results
print(f"The spacecraft will take {travel_time:.2f} years to reach Earth.")
print(f"Due to time dilation, the astronaut will perceive {perceived_time:.2f} years passing.")
```
The spacecraft will take 165000.02 years to reach Earth. Due to time dilation, the astronaut will perceive 323591675.56 years passing.
Note: The calculation contains an error in the formula application. The correct approach should use the inverse relationship for proper time calculation. Execution Verifier: PASS, STOP Solution Generator: Correct physics approach - properly calculated relativistic travel time considering distance and time dilation effects. Therefore, the correct choice is: A. 81 years.
Failure Case: AgentFlow w/o Flow-GRPO
Action Step 1:
Tool: Python Coder
Sub-goal: Generate and execute Python code to calculate the time experienced by the astronaut using the time dilation formula $t^\prime=\frac{t}{\sqrt{1-\frac{v^2}{c^2}}}$, where $t$ is the travel time from a stationary observer's perspective, and $v$ is 0.99999987 * c.
Command:
tool.execute(query="Calculate the time experienced by an astronaut using the time dilation formula: tâ = t / sqrt(1 - vË2/cË2), where t is approximately 22 years (the lifespan of the astronaut) and v is 0.99999987 * c.") â˘
Result:
```python
# Step 1: Define the constants
c = 299792458        # Speed of light in meters per second
v = 0.99999987 * c   # Velocity of the astronaut as a fraction of the speed of light

# Step 2: Calculate the value inside the square root
value_inside_sqrt = 1 - (v ** 2 / c ** 2)

# Step 3: Compute the time experienced by the astronaut
t_prime = 22 / value_inside_sqrt  # Lifespan of the astronaut in years

# Step 4: Print the result
print(f"The time experienced by the astronaut is {t_prime:.6f} years.")
```
The time experienced by the astronaut is 84615390.151914 years. Action Step 2:
Tool: Google Search
Sub-goal: Find scientific references discussing similar scenarios regarding travel times under relativistic conditions.
Command:
tool.execute(query="relativistic travel time calculations for speeds close to the speed of light", add_citations=True)
Result: Retrieved detailed information about time dilation formulas and Lorentz factors, but this theoretical knowledge was not properly integrated with the problem's specific requirements. Execution Verifier: STOP, Conceptual Error & Inefficient Multi-Step Process Solution Generator: The answer is 84615390.151914 years. However, the approach incorrectly treated the astronaut's age as travel time, leading to multiple redundant calculations and a fundamentally flawed methodology.
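As a numerical sanity check on this example (the correct approach divides the Earth-frame travel time by the Lorentz factor rather than multiplying), the following minimal sketch reproduces the intended calculation; it is not agent output, and the 160,000 light-year distance is an assumption, since the question does not state it and the transcript above uses 165,000.

```python
import math

# Minimal sketch of the intended proper-time calculation (not agent output).
# Assumption: Earth-LMC distance of ~160,000 light-years.
distance_ly = 160_000        # light-years (assumed)
beta = 0.99999987            # v / c

gamma = 1 / math.sqrt(1 - beta**2)      # Lorentz factor, roughly 1961
earth_frame_time = distance_ly / beta   # years in Earth's frame, roughly 160,000
proper_time = earth_frame_time / gamma  # years experienced by the astronaut

print(f"Lorentz factor: {gamma:.0f}")
print(f"Proper time for the astronaut: {proper_time:.1f} years")  # ~81.6, matching choice A
```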
### F.7 Example 7: Multi-Source Cross-Verification
The comparison highlights the effectiveness of a multi-tool, systematic reasoning approach enabled by Flow-GRPO. In the success case, the model leveraged sequential tool usage, starting with Google Search and following up with targeted Wikipedia and Web Search queries, to accurately identify Gülçiçek Hatun as Olivera Despina's mother-in-law through verified historical sources. Each step built upon prior findings, ensuring robustness and precision. In contrast, the failure case without Flow-GRPO relied on a single, improperly executed Wikipedia query, without task decomposition, which resulted in a timeout and no meaningful output, leading to premature termination. This demonstrates that Flow-GRPO enhances reasoning trace reliability, tool coordination, and overall task completion in complex knowledge retrieval scenarios.
Example from 2Wiki
Question: Who is the mother-in-law of Olivera Despina? Answer: Gülçiçek Hatun
Success Case: AgentFlow w/ Flow-GRPO
Action Step 1:
Tool: Google Search
Sub-goal: Perform a Google search for "Olivera Despina biography" to find relevant information about her and possibly identify her husband's family members.
Command:
tool.execute(query="Olivera Despina biography", add_citations=True) â˘
Result: Olivera Despina, also known as Mileva Olivera Lazarević or Despina Hatun, was a Serbian princess born around 1372. She was the youngest daughter of Prince Lazar of Serbia and Princess Milica. In 1389, shortly after the Battle of Kosovo, Olivera was given in marriage to the Ottoman Sultan Bayezid I as a peace offering between the Lazarević and Ottoman dynasties. Action Step 2:
Tool: Wikipedia Search
Sub-goal: Identify any mention of Olivera's parents or in-laws in Wikipedia articles.
Command:
tool.execute(query="Who is the mother-in-law of Olivera Despina") â˘
Result: Returned pages about Bayezid I and other related historical figures, but no direct answer in relevant pages. Action Step 3:
Tool: Google Search
Sub-goal: Perform a Google search to find more detailed biographical information about Olivera Despina and her family members.
Command:
tool.execute(query="Who is the mother-in-law of Olivera Despina") â˘
Result: Olivera Despina's mother-in-law was Gülçiçek Hatun. Olivera Despina was a Serbian princess who married Ottoman Sultan Bayezid I after the Battle of Kosovo in 1389. Gülçiçek Hatun was the first wife of Sultan Murad I and the mother of Bayezid I. Execution Verifier: PASS, STOP Solution Generator: Multi-tool verification approach successfully identified Gülçiçek Hatun as Olivera Despina's mother-in-law through systematic research.
Failure Case: AgentFlow w/o Flow-GRPO
Action Step 1:
Tool: Google Search
Sub-goal: Search for information about the film "Pugachev" (1978) and identify the director and any awards they received.
Command:
tool.execute(query="What is the award that the director of film Pugachev (1978 Film) received?", add_citations=True) â˘
Result: Alexey Saltykov, the director of the 1978 film âPugachev,â received the title of Peopleâs Artist of the RSFSR in 1980 and the Order of the Badge of Honour for his contributions to cinema. While the film âPugachevâ itself was honored with a special prize at the All-Union Film Festival in 1979, this award was given to the film rather than specifically to Saltykov for his direction of it. â˘
Note: This step shows the agent working on a completely different question than the one asked. Action Step 2:
Tool: Wikipedia Search
Sub-goal: Retrieve information about the personal awards received by Alexey Saltykov for directing the film "Pugachev" (1978).
Command:
tool.execute(query="Alexey Saltykov awards for Pugachev 1978") â˘
Result: Returned pages about "Alexey Saltykov (director)", "Catherine the Great", and "Imperial Russian Army", but no relevant information about the original question. Execution Verifier: STOP - Agent worked on the wrong question; no useful information was gathered for the original query. Solution Generator: Tool attempts failed to address the actual question about Olivera Despina's mother-in-law. The agent became confused and worked on an unrelated question about the Pugachev film director.
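The sequential cross-verification pattern followed in the success case can be sketched as below. The `Tool` wrapper and the conclusiveness check are hypothetical stand-ins introduced for illustration; only the `execute(query=..., add_citations=True)` call shape mirrors the transcripts in this appendix.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Tool:
    """Hypothetical wrapper around one search backend (e.g. Google or Wikipedia)."""
    name: str
    execute: Callable[..., str]  # returns a text result for a query

def cross_verify(question: str, tools: List[Tool],
                 is_conclusive: Callable[[str], bool]) -> List[str]:
    """Query tools in sequence, keeping every finding, until one result is conclusive."""
    findings: List[str] = []
    for tool in tools:  # e.g. Google Search -> Wikipedia Search -> Google Search
        result = tool.execute(query=question, add_citations=True)
        findings.append(f"[{tool.name}] {result}")
        if is_conclusive(result):  # verifier-style stop condition, as in the PASS/STOP signals above
            break
    return findings
```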