# Agent-as-Tool: A Study on the Hierarchical Decision Making with Reinforcement Learning
**Authors**:
- Yanfei Zhang (Independent Researcher)
Abstract
Large Language Models (LLMs) have emerged as one of the most significant technological advancements in artificial intelligence in recent years. Their ability to understand, generate, and reason with natural language has transformed how we interact with AI systems. With the development of LLM-based agents and reinforcement-learning-based reasoning models, applying reinforcement learning within agent frameworks has become a new research focus. However, previous studies face the challenge of handling the tool calling process and the reasoning process simultaneously, and the chain of reasoning relies solely on unprocessed raw tool results that contain redundant information and task-irrelevant symbols, which imposes a heavy burden on the model's reasoning capability. Therefore, we propose Agent-as-tool, a hierarchical framework that detaches the tool calling process from the reasoning process, enabling the model to focus on verbal reasoning while tool calling is handled by another agent. Our framework achieves comparable results with only a slight reinforcement fine-tuning on 180 samples, and performs exceptionally well on Bamboogle (Press et al., 2023), reaching 63.2% exact match and 75.2% cover exact match, exceeding Search-R1 by 4.8% in exact match and 3.2% in cover exact match.
1 Introduction
Large Language Models (LLMs) have achieved remarkable progress in a wide range of natural language understanding and generation tasks (Liu et al., 2025; Zhang et al., 2024). As the complexity of tasks increases, a common approach is to augment LLMs with access to external tools, such as web search engines, calculators, or code interpreters. This tool-augmented paradigm enables agents to interact with the environment and perform planning, reasoning, and execution steps beyond the model’s pretraining distribution.
Recent advancements have explored integrating reinforcement learning (RL) into these agent frameworks, aiming to improve decision-making over tool usage and multi-hop reasoning steps (Guo et al., 2025; Jin et al., 2025). However, a major limitation remains: existing RL-enhanced agents conflate the tool invocation process with the verbal reasoning process. This tight coupling leads to several challenges: (1) The agent must learn tool selection, input construction, and reasoning jointly, which increases training difficulty and noise; (2) Reasoning often proceeds over noisy, unstructured outputs returned directly from external tools, which degrades answer quality.
To address these challenges, we propose Agent-as-tool, a hierarchical reasoning architecture in which reasoning and tool execution are explicitly decoupled as shown in Figure 1. The framework introduces a Planner and a Toolcaller as two separate agent components. The Planner focuses on natural language reasoning and high-level decision-making, while the Toolcaller is responsible for managing the tool interface (e.g., invoking web search) and returning structured observations.
The advantages of this design are twofold: (1) It simplifies the RL optimization process by assigning each sub-agent a focused objective; (2) It improves reasoning accuracy by allowing the Planner to operate on cleaner, more structured inputs. Furthermore, we apply a lightweight reinforcement fine-tuning procedure using GRPO on just 180 samples to demonstrate the efficiency of our framework.
This paper makes the following contributions:
- We propose Agent-as-tool, a hierarchical agent framework that separates reasoning and tool usage via a Planner and a Toolcaller.
- We introduce a reinforcement learning protocol that enhances Planner behavior while masking Toolcaller outputs to preserve credit assignment integrity.
- We empirically validate our framework on multiple multi-hop QA datasets and achieve state-of-the-art performance on Bamboogle.
- We provide qualitative insights showing that hierarchical decoupling improves reasoning clarity and decomposition over existing baselines like Search-R1.
[Figure 1 diagram: three tool-calling configurations for the example question "Invincible is based on the story of which Philadelphia Eagles player?" — (a) vanilla tool calling + LLM reasoning (<think> → <tool_calling> → <raw_obs> → <think> → <answer>); (b) multi-step tool calling with unprocessed results, looping N times over <think> → <tool_calling> → <raw_obs>; (c) Agent-as-tool (ours), where each <tool_calling> sub-query is routed to an Agent-Toolcaller that interacts with the tools and returns <processed_obs> to the Planner, looping N times before the final <answer>.]
Figure 1: The trajectory of a single sample from a batch of questions processed under different research configurations. In our Agent-as-tool method, the agent is employed as a tool instead of calling tools directly: the Planner is responsible for the reasoning process and for deciding when and what to call, while the Toolcaller handles the tool calling process and provides sufficiently processed observations.
2 Literature Review
2.1 Agent Frameworks based on Pre-defined Reasoning Steps
Several agent frameworks are designed to perform tasks with pre-trained LLMs and pre-defined reasoning steps, including CAMEL (Li et al., 2023a), OpenManus (Liang et al., 2025), and MetaGPT (Hong et al., 2023). These works extend the capabilities of the pre-trained LLM with additional rule-based reasoning steps to 'stimulate' its internal reasoning capabilities and achieve better performance.
Specifically, for search and information retrieval in task-completion scenarios, there is also considerable work, including Search-o1 (Li et al., 2025a) and OpenResearcher (Zheng et al., 2024), the latter focusing mainly on scientific research.
2.2 RL Reasoning Agents
With the development of RL training frameworks and the DeepSeek-R1 (Guo et al., 2025) setting, considerable work has implemented R1-style training paradigms on LLM-based agents. Search and information retrieval tasks were the first to be considered in this setting, including R1-Searcher (Song et al., 2025) and DeepResearcher (Zheng et al., 2025).
Several other works integrate additional external tools under this framework to complete different tasks, including ToRL (Li et al., 2025b), which integrates a Python interpreter; ToolRL (Qian et al., 2025), which flexibly integrates different toolkits with pre-defined datasets (e.g., API-Bank (Li et al., 2023b)); and SWiRL (Goldie et al., 2025), which controls tool selection with dedicated labels (<calculator> for the calculator tool and <search_query> for the web search tool).
The generic tool-calling process in these works can be summarized as a sequence of thinking (<think>), followed by a tool query enclosed in <tool_query>, after which the tool returns observations (<obs>). By reasoning at each step, the final answer is produced once the agent judges the gathered evidence sufficient. This is a relatively simple configuration with a ReAct-like tool calling process (Yao et al., 2023), on top of which reinforcement learning is applied to explore whether the model can exhibit capabilities beyond simple reasoning to reach the next hop, shown in Figure 1 as Multi-step Tool Calling with Unprocessed Results.
3 Methodology
We propose the Agent-as-tool framework as a hierarchical design for multi-hop reasoning tasks. It separates the planning and tool usage responsibilities between two agent components: a high-level Planner and a subordinate Toolcaller. The Planner manages reasoning and task decomposition, while the Toolcaller executes external actions such as web search. This section outlines the design of both components and the reinforcement learning procedure employed to optimize the Planner.
3.1 Agent Architecture
3.1.1 Planner
The Planner is a language model agent responsible for high-level reasoning and tool invocation decisions. It reasons about the current task state and emits tool usage instructions in natural language.
Reasoning: The Planner conducts internal reasoning enclosed in <think>...</think> tags, in line with DeepSeek-R1 conventions (Guo et al., 2025). It uses previous observations and the original query to plan the next subtask.
Tool Invocation: Tool calls are expressed as sub-queries wrapped in <tool_calling>...</tool_calling> tags. These queries are interpreted by the Toolcaller, and the results are returned to the Planner as <obs>...</obs> blocks for further reasoning.
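To make this interaction concrete, the sketch below illustrates one possible Planner rollout loop. It is a minimal illustration rather than our exact implementation: the tag names follow the paper, while `generate` (the Planner's text generator, assumed to stop after a closing tag) and `toolcaller` are hypothetical placeholders.

```python
import re

# Matches either a tool-calling sub-query or a final answer emitted by the Planner.
STEP_TAGS = re.compile(
    r"<tool_calling>(.*?)</tool_calling>|<answer>(.*?)</answer>", re.DOTALL)

def planner_rollout(question, generate, toolcaller, max_rounds=10):
    """Minimal sketch of the Planner loop: reason, delegate sub-queries to the
    Toolcaller, and stop once an <answer> block is produced."""
    context = question
    for _ in range(max_rounds):
        # The Planner emits <think>...</think> followed by either a
        # <tool_calling> sub-query or a final <answer>.
        step = generate(context)
        context += step
        match = STEP_TAGS.search(step)
        if match is None:                      # malformed step, stop early
            break
        if match.group(2) is not None:         # final answer reached
            return match.group(2).strip(), context
        sub_query = match.group(1).strip()     # delegate the sub-query
        obs = toolcaller(sub_query)            # processed, structured observation
        context += f"\n<obs>{obs}</obs>\n"
    return None, context
```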
3.1.2 Toolcaller
The Toolcaller is a dedicated LLM-based agent designed to interface with external tools. In our implementation, it wraps a web search tool and processes queries issued by the Planner.
We implement the Toolcaller using a CAMEL-style chat agent (Li et al., 2023a), powered by GPT-4o-mini (Hurst et al., 2024). It can retrieve the top-k search results multiple times and return structured summaries to the Planner. Although our current prototype uses only web search, the architecture supports extension to tools such as calculators, code interpreters, and MCP-based tool servers.
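The following sketch outlines how such a Toolcaller could be structured, assuming a generic `chat_agent` callable (e.g., a CAMEL-style agent backed by GPT-4o-mini) and a `web_search` function returning top-k title/snippet/link results; both names are placeholders rather than the actual implementation.

```python
def toolcaller(sub_query, chat_agent, web_search, top_k=5, max_tool_rounds=3):
    """Minimal sketch of the Toolcaller: an LLM-based agent that may query the
    web search tool one or more times and returns a processed summary instead
    of raw results. `chat_agent` and `web_search` are placeholder callables."""
    evidence = []
    query = sub_query
    for _ in range(max_tool_rounds):
        evidence.extend(web_search(query, top_k=top_k))   # raw titles/snippets/links
        # Ask the agent whether the evidence suffices; if not, get a refined query.
        follow_up = chat_agent(
            f"Sub-query: {sub_query}\nEvidence so far: {evidence}\n"
            "Reply with a refined search query, or DONE if the evidence suffices.")
        if follow_up.strip().upper() == "DONE":
            break
        query = follow_up.strip()
    # Return a concise, structured observation for the Planner's <obs> block.
    return chat_agent(
        f"Answer the sub-query '{sub_query}' concisely using only the evidence "
        f"below, and cite the most relevant source link:\n{evidence}")
```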
3.2 Reinforcement Learning with GRPO
3.2.1 Training Objective
We employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) to fine-tune the Planner. The objective is:
$$
\begin{split}
\mathcal{J}(\Theta) &= \mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\text{old}}(\cdot|x)}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\Big[\\
&\quad \min\left(\frac{\pi_{\Theta}(y_{i}|x)}{\pi_{\text{old}}(y_{i}|x)}A_{i},\ \text{clip}\left(\frac{\pi_{\Theta}(y_{i}|x)}{\pi_{\text{old}}(y_{i}|x)},\,1-\varepsilon,\,1+\varepsilon\right)A_{i}\right)\\
&\quad -\beta\, D_{\text{KL}}\big(\pi_{\Theta}\,\|\,\pi_{\text{ref}}\big)\Big]\Bigg]
\end{split} \tag{1}
$$
where $x$ is sampled from dataset $\mathcal{D}$, $y_{i}$ is a rollout, $A_{i}$ is the advantage, $\varepsilon$ is the clipping threshold, and $\beta$ regulates the KL penalty.
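For illustration, a minimal per-token implementation of this objective for a single group of rollouts might look like the PyTorch sketch below. It assumes group-normalized advantages and the k3 KL estimator as in Shao et al. (2024); the tensor shapes and the default values for ε and β are our own assumptions, not the paper's reported settings.

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, mask,
              eps=0.2, beta=0.01):
    """Minimal GRPO sketch for one group of G rollouts of a single prompt.
    logprobs, old_logprobs, ref_logprobs: (G, T) per-token log-probabilities;
    rewards: (G,) scalar rollout rewards; mask: (G, T) with 1.0 for
    Planner-generated tokens (observation tokens are masked out, Sec. 3.2.2)."""
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)        # (G,)
    adv = adv.unsqueeze(-1)                                          # (G, 1)

    # PPO-style clipped policy-gradient term with a per-token ratio.
    ratio = torch.exp(logprobs - old_logprobs)                       # (G, T)
    pg = torch.min(ratio * adv,
                   torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)

    # Token-level KL penalty against the reference policy (k3 estimator).
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1.0

    per_token = pg - beta * kl
    # Average over valid tokens, then over the group; negate to minimize.
    per_rollout = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)
    return -per_rollout.mean()
```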
3.2.2 Observation Masking
To prevent reward leakage through Toolcaller-generated outputs, we mask the <obs> blocks during reward modeling and training. These segments are replaced with the special token <fim_pad>, whose embedding is trained to stay close to zero, so that Toolcaller outputs receive no credit during the Planner's policy update.
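A minimal sketch of how such masking could be applied is shown below; the regular expression, helper names, and the token-level mask construction are illustrative assumptions rather than the exact implementation.

```python
import re

OBS_BLOCK = re.compile(r"<obs>.*?</obs>", re.DOTALL)

def mask_observations(rollout_text, pad_token="<fim_pad>"):
    """Replace every Toolcaller-generated <obs>...</obs> block with the special
    pad token so these segments carry no content during reward modeling."""
    return OBS_BLOCK.sub(pad_token, rollout_text)

def loss_mask(token_ids, pad_token_id):
    """Per-token mask for the policy update: 1 for Planner-generated tokens,
    0 for pad tokens, so no credit is assigned to Toolcaller outputs."""
    return [0 if t == pad_token_id else 1 for t in token_ids]
```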
3.2.3 Reward Function
Our reward function balances correctness and formatting constraints:
$$
\text{Reward}=\begin{cases}\text{F1 score} & \text{if the answer is correctly formatted}\\ -2 & \text{otherwise}\end{cases} \tag{2}
$$
The model receives a high reward for generating a valid and correct response, and a penalty when the output is malformed.
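A minimal sketch of this reward is given below; the token-level F1 computation and the check that the rollout ends with a well-formed <answer> block are assumptions about the formatting check, not the exact scoring script.

```python
import re
from collections import Counter

def answer_f1(prediction, ground_truth):
    """Token-level F1 between the predicted and gold answer strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def reward(rollout_text, ground_truth):
    """Eq. (2): F1 score if the rollout ends with a well-formed
    <answer>...</answer> block, otherwise a fixed penalty of -2."""
    match = re.search(r"<answer>(.*?)</answer>\s*$", rollout_text, re.DOTALL)
    if match is None:                  # malformed or missing answer block
        return -2.0
    return answer_f1(match.group(1).strip(), ground_truth)
```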
4 Experiments
4.1 Experiment Settings
4.1.1 Model and Hyperparameters
We use Qwen-2.5-7B-Instruct (Qwen et al., 2025) as our base model. Training is conducted with a customized rollout implementation and a customized implementation of GRPO built on trl (von Werra et al., 2020). At each training step, we sample a batch of training data from the training set and compute the reward for each rollout; we then update the policy to maximize the reward.
The batch size is set to 3 per training step, with 12 rollouts sampled for each prompt. Each rollout contains at most 10 rounds of tool calling.
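Put together, the rollout-related hyperparameters reported here can be summarized as a small configuration sketch; the field names are illustrative rather than the exact arguments of our customized trainer, and the clipping and KL coefficients are assumed defaults, not reported values.

```python
# Illustrative configuration; field names and the last two values are assumptions.
train_config = dict(
    base_model="Qwen/Qwen2.5-7B-Instruct",
    prompts_per_step=3,        # batch of 3 training samples per step
    rollouts_per_prompt=12,    # G = 12 rollouts sampled for each prompt
    max_tool_rounds=10,        # at most 10 rounds of tool calling per rollout
    clip_eps=0.2,              # epsilon in Eq. (1) (assumed, not reported)
    kl_beta=0.01,              # beta in Eq. (1) (assumed, not reported)
)
```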
4.1.2 Training Settings
We conducted only a small-scale training run for the Agent-as-tool: 60 steps, each containing 3 training samples with 12 rollouts per sample, for a total of only 180 samples and 2,160 rollouts. The training data were selected from the HotpotQA (Yang et al., 2018) and 2WikiMultiHopQA (Ho et al., 2020) datasets with the same ratio as R1-Searcher (Song et al., 2025).
During training, the loss of the Agent-as-tool is unstable for the first 30 steps, likely due to the small amount of training data; after 30 steps, the loss stabilizes close to 0 and the performance of the Agent-as-tool also stabilizes.
[Figure 2 plots: two line charts over training steps 0 to 60. Top panel, Loss (0 to 8000): large fluctuations before step 10 (peaks around 6700 to 8300), stabilizing near 0 after roughly step 20 apart from a small spike near step 31. Bottom panel, Average Reward (-2.00 to -0.25): fluctuates roughly between -1.5 and -0.75, with dips near steps 10 and 33 and a peak near step 43.]
Figure 2: Training progress of the Agent-as-tool model showing loss convergence over training steps. The loss becomes stable after approximately 30 training steps.
The training curve illustrated in Figure 2 shows the convergence behavior of our model during the reinforcement learning process.
4.1.3 Benchmark Settings
In order to evaluate the performance of the Agent-as-tool, we conducted experiments on open-domain question answering. We selected multiple multi-hop reasoning benchmarks, including HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and Bamboogle (Press et al., 2023).
4.1.4 Baseline Settings
We use a single information retrieval tool: web search.
We then compare the performance of the Agent-as-tool with the following baselines:
- direct IO: This baseline employs the direct output of the Qwen-2.5-7B-Instruct (Qwen et al., 2025) as the answer without any external tool calling.
- direct IO with web search: This baseline employs the direct output of the Qwen-2.5-7B-Instruct (Qwen et al., 2025), but enables the web search to process the original question and return the top-k results as additional observations.
- CAMEL Agent: This baseline employs the CAMEL (Li et al., 2023a) chat agent driven by the GPT-4o-mini (Hurst et al., 2024).
- CAMEL Agent with web search: This baseline employs the CAMEL (Li et al., 2023a) chat agent driven by the GPT-4o-mini (Hurst et al., 2024) and the same tool setting as the Agent-as-tool with web search tool only. This baseline is used as the reference for multi-hop reasoning tasks conducted with the rule-based agent framework.
- Search-R1: We directly compare the Agent-as-tool with Search-R1 (Jin et al., 2025) under our configuration with the web search tool for a fair comparison. As Search-R1 cannot be directly integrated with the CAMEL (Li et al., 2023a) chat agent, the raw web search results are returned to it directly rather than through a Toolcaller.
We conducted experiments with both the pre-finetuned and post-finetuned Agent-as-tool models.
In line with the DeepSeek-R1 setting (Guo et al., 2025), we adopted the same prompt setting for all baselines and the Agent-as-tool, except Search-R1 (Jin et al., 2025), which keeps its original prompt setting; we also modified the tool calling process to enable Search-R1 to access the unprocessed web search results.
4.1.5 Evaluation Metrics
In this paper, we focus on the correctness of the answer; therefore, we employ the exact match (EM) and cover exact match (CEM) metrics to evaluate the performance of the Agent-as-tool.
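For clarity, the two metrics can be computed as in the sketch below, assuming the standard open-domain QA normalization (lowercasing, stripping punctuation and articles); the exact normalization used in our evaluation scripts may differ.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    """EM: the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(ground_truth))

def cover_exact_match(prediction, ground_truth):
    """CEM: the normalized gold answer appears within the normalized prediction."""
    return float(normalize(ground_truth) in normalize(prediction))
```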
4.2 Quantitative Experiment Results
The quantitative results are shown in Table 1. The Agent-as-tool outperforms most baselines, except for the EM metric on HotpotQA, 2WikiMultiHopQA, and MuSiQue, where Search-R1 still performs best. In terms of the CEM metric, our model improves substantially over all baselines, except on HotpotQA where Search-R1 still performs best (64.2% vs 57.4%). On the Bamboogle dataset (Press et al., 2023), the Agent-as-tool with the web search tool integrated into the Toolcaller (a CAMEL (Li et al., 2023a) agent) achieves the best performance, with an EM of 63.2% and a CEM of 75.2%.
Table 1: Performance Comparison Across Different Datasets
| Dataset | Method | EM (%) | CEM (%) |
| --- | --- | --- | --- |
| Bamboogle | Direct IO | 17.6 | 26.4 |
| | Direct IO + Web Search | 29.6 | 42.4 |
| | CAMEL | 36.8 | 47.2 |
| | CAMEL + Web Search | 51.2 | 62.4 |
| | Search-R1 + Web Search | 58.4 | 72.0 |
| | Agent-as-tool-Base + Web Search | 60.0 | 71.2 |
| | Agent-as-tool-Instruct + Web Search | 63.2 | 75.2 |
| HotpotQA | Direct IO | 20.0 | 27.2 |
| | Direct IO + Web Search | 32.6 | 52.8 |
| | CAMEL | 23.2 | 44.2 |
| | CAMEL + Web Search | 32.4 | 59.4 |
| | Search-R1 + Web Search | 47.2 | 64.2 |
| | Agent-as-tool-Base + Web Search | 35.0 | 55.2 |
| | Agent-as-tool-Instruct + Web Search | 37.2 | 57.4 |
| 2WikiMultiHopQA | Direct IO | 22.6 | 25.4 |
| | Direct IO + Web Search | 27.2 | 40.2 |
| | CAMEL | 20.8 | 34.6 |
| | CAMEL + Web Search | 35.0 | 69.4 |
| | Search-R1 + Web Search | 52.4 | 68.0 |
| | Agent-as-tool-Base + Web Search | 42.8 | 68.0 |
| | Agent-as-tool-Instruct + Web Search | 44.6 | 70.0 |
| MuSiQue | Direct IO | 4.8 | 9.0 |
| | Direct IO + Web Search | 14.0 | 18.0 |
| | CAMEL | 9.2 | 18.8 |
| | CAMEL + Web Search | 16.0 | 29.4 |
| | Search-R1 + Web Search | 20.8 | 28.6 |
| | Agent-as-tool-Base + Web Search | 15.6 | 28.8 |
| | Agent-as-tool-Instruct + Web Search | 18.4 | 29.8 |
We compared the performance of the Agent-as-tool before and after reinforcement fine-tuning; the results are shown in Table 2. Reinforcement fine-tuning based on GRPO (Shao et al., 2024) consistently improves the performance of the Agent-as-tool on all datasets, with an average improvement of 2.5% in EM and 2.3% in CEM.
Table 2: Performance improvements after reinforcement fine-tuning
| Dataset | Pre-finetuned EM (%) | Pre-finetuned CEM (%) | Post-finetuned EM (%) | Post-finetuned CEM (%) | ΔEM (%) | ΔCEM (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Bamboogle | 60.0 | 71.2 | 63.2 | 75.2 | +3.2 | +4.0 |
| HotpotQA | 35.0 | 55.2 | 37.2 | 57.4 | +2.2 | +2.2 |
| 2WikiMultiHopQA | 42.8 | 68.0 | 44.6 | 70.0 | +1.8 | +2.0 |
| MuSiQue | 15.6 | 28.8 | 18.4 | 29.8 | +2.8 | +1.0 |
| Average | 38.4 | 55.8 | 40.9 | 58.1 | +2.5 | +2.3 |
Compared with the CAMEL baseline with the web search tool integrated (CAMEL + Web Search), both the pre-finetuned and post-finetuned Agent-as-tool achieve substantial improvements in EM and CEM, indicating that a framework in which the model controls when and what to call during tool calling is more effective for multi-hop reasoning tasks.
Compared with the Search-R1 baseline (Search-R1 + Web Search), the current best-performing work of its kind, the Agent-as-tool-Instruct shows substantial improvements on the Bamboogle dataset, improving EM by 4.8% and CEM by 3.2%, demonstrating the effectiveness of the Agent-as-tool in multi-hop reasoning tasks. Moreover, the Agent-as-tool was fine-tuned on only 180 samples, indicating the efficiency of the fine-tuning process.
4.3 Qualitative Results Inspection and Analysis
4.3.1 Comparison of the Agent-as-tool and the Search-R1
Compared with the Search-R1 baseline (Search-R1 + Web Search), the Agent-as-tool-Instruct has several qualitative advantages:
- The Agent-as-tool-Instruct reasons over cleaner, more structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results containing noisy, unstructured symbols and other unrelated details.
- As the Agent-as-tool-Instruct adopts a hierarchical reasoning process that separates the reasoning process from the tool calling process, it maintains a more linear, text-based chain of reasoning than Search-R1 + Web Search.
An example qualitative comparison is shown in Figure 3 (in the Appendix), where the Agent-as-tool-Instruct reasons over processed observations while Search-R1 + Web Search must work directly with raw search results and unrelated details.
4.3.2 Comparison of the Results before and after the Reinforcement Fine-tuning
Compared with the Agent-as-tool-Base, the Agent-as-tool-Instruct has several qualitative advantages:
- The Agent-as-tool-Instruct correctly decomposes the question, identifying the first hop and the second hop to be solved by the agent, whereas the Agent-as-tool-Base fails to decompose the multi-hop question correctly and directly feeds the agent the whole question (only slightly rephrased from its original form). If the agent cannot reason through the multi-hop question on its own, the Agent-as-tool-Base cannot answer the question correctly.
- As the Agent-as-tool-Instruct was trained to reason together with the agent powered by the pretrained model, the fine-tuned model poses more structured and reasonable questions to the agent than the Agent-as-tool-Base does.
An example qualitative comparison is shown in Figure 4 (in the Appendix), where the Agent-as-tool-Instruct correctly decomposes the question into the first and second hops, while the Agent-as-tool-Base fails to decompose the multi-hop question and directly feeds the agent the whole question.
5 Conclusions and Future Work
5.1 Conclusions
In this paper, we studied multi-hop reasoning tasks with the Agent-as-tool framework. We found that the Agent-as-tool achieves substantial improvements on multi-hop reasoning tasks, especially on the Bamboogle dataset (Press et al., 2023). We also found that the Agent-as-tool reasons over cleaner, more structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results containing noisy, unstructured symbols and other unrelated details.
5.2 Limitations and Future Work
This paper assigns only the search tool to the agent (in other words, a search agent), so the scope is limited to open-domain multi-hop search tasks. Because only one model was provided, dynamic assignment of tools to agents was not considered. In future work, we will assign more tools to the agent and explore dynamic tool assignment, in other words, turning the Planner into a Tool Orchestrator.
References
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Goldie et al. [2025] A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning. Synthetic data generation and multi-step rl for reasoning and tool use, 2025. URL https://arxiv.org/abs/2504.04736.
- Guo et al. [2025] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Ho et al. [2020] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
- Hong et al. [2023] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023.
- Hurst et al. [2024] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jin et al. [2025] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URL https://arxiv.org/abs/2503.09516.
- Li et al. [2023a] G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023a.
- Li et al. [2023b] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li. Api-bank: A comprehensive benchmark for tool-augmented llms, 2023b. URL https://arxiv.org/abs/2304.08244.
- Li et al. [2025a] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025a. URL https://arxiv.org/abs/2501.05366.
- Li et al. [2025b] X. Li, H. Zou, and P. Liu. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025b.
- Liang et al. [2025] X. Liang, J. Xiang, Z. Yu, J. Zhang, S. Hong, S. Fan, and X. Tang. Openmanus: An open-source framework for building general ai agents, 2025. URL https://doi.org/10.5281/zenodo.15186407.
- Liu et al. [2025] B. Liu, X. Li, J. Zhang, J. Wang, T. He, S. Hong, H. Liu, S. Zhang, K. Song, K. Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990, 2025.
- Press et al. [2023] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models, 2023. URL https://arxiv.org/abs/2210.03350.
- Qian et al. [2025] C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
- Qwen et al. [2025] Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Shao et al. [2024] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Song et al. [2025] H. Song, J. Jiang, Y. Min, J. Chen, Z. Chen, W. X. Zhao, L. Fang, and J.-R. Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025.
- Trivedi et al. [2022] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition, 2022. URL https://arxiv.org/abs/2108.00573.
- von Werra et al. [2020] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- Yang et al. [2018] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
- Yao et al. [2023] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629.
- Zhang et al. [2024] J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.
- Zheng et al. [2024] Y. Zheng, S. Sun, L. Qiu, D. Ru, C. Jiayang, X. Li, J. Lin, B. Wang, Y. Luo, R. Pan, et al. Openresearcher: Unleashing ai for accelerated scientific research. arXiv preprint arXiv:2408.06941, 2024.
- Zheng et al. [2025] Y. Zheng, D. Fu, X. Hu, X. Cai, L. Ye, P. Lu, and P. Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.
Appendix A Qualitative Results Inspection
A.1 Qualitative results inspection: Agent-as-tool-Instruct + Web Search vs. Search-R1 + Web Search
[Figure 3 diagram: side-by-side trajectories of Search-R1 + Web Search and Agent-as-tool (ours) on the question "Where was the mother of Ernst I, Prince of Hohenlohe-Langenburg born?". Search-R1 reasons over raw <information> snippets, mistakenly identifies Princess Feodora of Leiningen (mentioned in a snippet about Ernst I's marriage) as the mother, and answers "Amorbach" (wrong). Agent-as-tool first asks the Toolcaller who the mother is, receives the processed observation "Countess Amalie Henriette of Solms-Baruth", then asks where she was born and answers "Kliczków" (correct). Annotations contrast the "fuzzier result to be processed" in Search-R1 with the "clearer result to be processed" in the Agent-as-tool.]
Figure 3: The Agent-as-tool-Instruct reasons over cleaner, more structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results with unrelated details. Search-R1 was misled by the raw search results and pursued the wrong entity for the second hop (Princess Feodora of Leiningen), while the Agent-as-tool-Instruct used the agent to preprocess the web search results and obtained the correct entity (Countess Amalie Henriette of Solms-Baruth) for the second hop.
A.2 Qualitative results inspection: Agent-as-tool-Instruct + Web Search vs. Agent-as-tool-Base + Web Search
[Figure 4 diagram: side-by-side trajectories of Agent-as-tool-Base and Agent-as-tool-Instruct on the question "Where did Edward Dunn (Bishop)'s father graduate from?" (gold answer: Corpus Christi College, Cambridge). The Base model asks the Toolcaller about both hops at once, receives vague observations, and answers "Marlborough and Pembroke College, Cambridge" (wrong). The Instruct model first asks who Edward Dunn's father is (Andrew Hunter Dunn), then asks where he graduated from, and answers "Corpus Christi College, Cambridge" (correct). Annotations note that the Instruct model learned to ask the prerequisite question before the next hop, while the Base model tried to ask two hops simultaneously.]
Figure 4: The Agent-as-tool-Instruct + Web Search correctly decomposes the question, identifying the first hop to be answered by the agent before proceeding to the second hop, whereas the Agent-as-tool-Base + Web Search barely decomposes the question and essentially asks the agent the whole question in a slightly rephrased form, failing to reason through the multi-hop question correctly.