2507.01489v1
# Agent-as-Tool: A Study on the Hierarchical Decision Making with Reinforcement Learning
**Authors**:
- Yanfei Zhang (Independent Researcher)
## Abstract
Large Language Models (LLMs) have emerged as one of the most significant technological advancements in artificial intelligence in recent years. Their ability to understand, generate, and reason with natural language has transformed how we interact with AI systems. With the development of LLM-based agents and reinforcement-learning-based reasoning models, applying reinforcement learning within agent frameworks has become a new research focus. However, previous studies face the challenge of deciding the tool calling process and the reasoning process simultaneously: the chain of reasoning relies solely on unprocessed raw tool results that contain redundant information and task-unrelated symbols, which imposes a heavy burden on the model's reasoning capability. Therefore, we propose Agent-as-tool, a hierarchical framework that detaches the tool calling process from the reasoning process, enabling the model to focus on verbal reasoning while tool calling is handled by another agent. Our method achieves comparable results with only a slight reinforcement fine-tuning on 180 samples, and performs exceptionally well on Bamboogle (Press et al., 2023), reaching 63.2% exact match and 75.2% cover exact match, exceeding Search-R1 by 4.8% in exact match and 3.2% in cover exact match.
## 1 Introduction
Large Language Models (LLMs) have achieved remarkable progress in a wide range of natural language understanding and generation tasks (Liu et al., 2025; Zhang et al., 2024). As the complexity of tasks increases, a common approach is to augment LLMs with access to external tools, such as web search engines, calculators, or code interpreters. This tool-augmented paradigm enables agents to interact with the environment and perform planning, reasoning, and execution steps beyond the model’s pretraining distribution.
Recent advancements have explored integrating reinforcement learning (RL) into these agent frameworks, aiming to improve decision-making over tool usage and multi-hop reasoning steps (Guo et al., 2025; Jin et al., 2025). However, a major limitation remains: existing RL-enhanced agents conflate the tool invocation process with the verbal reasoning process. This tight coupling leads to several challenges: (1) The agent must learn tool selection, input construction, and reasoning jointly, which increases training difficulty and noise; (2) Reasoning often proceeds over noisy, unstructured outputs returned directly from external tools, which degrades answer quality.
To address these challenges, we propose Agent-as-tool, a hierarchical reasoning architecture in which reasoning and tool execution are explicitly decoupled as shown in Figure 1. The framework introduces a Planner and a Toolcaller as two separate agent components. The Planner focuses on natural language reasoning and high-level decision-making, while the Toolcaller is responsible for managing the tool interface (e.g., invoking web search) and returning structured observations.
The advantages of this design are twofold: (1) It simplifies the RL optimization process by assigning each sub-agent a focused objective; (2) It improves reasoning accuracy by allowing the Planner to operate on cleaner, more structured inputs. Furthermore, we apply a lightweight reinforcement fine-tuning procedure using GRPO on just 180 samples to demonstrate the efficiency of our framework.
This paper makes the following contributions:
- We propose Agent-as-tool, a hierarchical agent framework that separates reasoning and tool usage via a Planner and a Toolcaller.
- We introduce a reinforcement learning protocol that enhances Planner behavior while masking Toolcaller outputs to preserve credit assignment integrity.
- We empirically validate our framework on multiple multi-hop QA datasets and achieve state-of-the-art performance on Bamboogle.
- We provide qualitative insights showing that hierarchical decoupling improves reasoning clarity and decomposition over existing baselines like Search-R1.
<details>
<summary>extracted/6585337/model_graph.png Details</summary>

### Visual Description
Flowchart comparing three LLM tool-calling methods on the question "Invincible is based on the story of which Philadelphia Eagles player?", with steps color-coded as `<tool_calling>` (pink), `<raw_obs>` (yellow), `<reasoning>` (blue), and `<final_answer>` (green):

1. **Vanilla Tool Calling + LLM Reasoning** (a): tool calling → raw observations → LLM reasoning → final answer.
2. **Multi-step Tool Calling with Unprocessed Results** (b): repeated rounds of tool calling and raw observations, then LLM reasoning → final answer.
3. **Agent-as-tool (ours)** (c): tool calling, processed observations, and LLM reasoning interleaved in an iterative loop until the final answer.
</details>
Figure 1: The trajectory of a single sample from a batch of questions processed under different research configurations. In our Agent-as-tool method, we employ an agent as a tool instead of calling the tool directly: the Planner is responsible for the reasoning process and for deciding when and what to call, while the Toolcaller handles the tool calling process and returns sufficiently processed observations.
## 2 Literature Review
### 2.1 Agent Frameworks based on Pre-defined Reasoning Steps
Several agent frameworks are designed to perform tasks with pre-trained LLMs and pre-defined reasoning steps, including CAMEL (Li et al., 2023a), OpenManus (Liang et al., 2025), and MetaGPT (Hong et al., 2023). These works extend the capabilities of a pre-trained LLM with additional rule-based reasoning steps to 'stimulate' its internal reasoning capabilities and achieve better performance.
Specifically, for search and information retrieval in task-completion scenarios, there is also considerable work, including Search-o1 (Li et al., 2025a) and OpenResearcher (Zheng et al., 2024), the latter focusing mainly on scientific research.
### 2.2 RL Reasoning Agents
With the development of RL training frameworks and the DeepSeek-R1 (Guo et al., 2025) setting, considerable work has implemented R1-style training paradigms on LLM-based agents. Search and information retrieval tasks were the first to be considered in this scenario, including R1-searcher (Song et al., 2025) and DeepResearcher (Zheng et al., 2025).
Several works also integrate other external tools under this framework to complete different tasks, including ToRL (Li et al., 2025b), which integrates a Python interpreter tool; ToolRL (Qian et al., 2025), which flexibly integrates different toolkits with different pre-defined datasets (e.g., API-Bank (Li et al., 2023b)); and SWiRL (Goldie et al., 2025), which controls tool selection with different labels (<calculator> for the calculator tool and <search_query> for the web search tool).
The generic process of calling a tool in these works can be summarized as a sequence of thinking enclosed in <think>, followed by a tool calling query enclosed in <tool_query>, after which the tool returns observations in <obs>. Reasoning at each step, the agent produces the final answer once it judges the gathered evidence sufficient. This is essentially a ReAct-like tool calling process (Yao et al., 2023), on top of which reinforcement learning is applied to explore whether the model can exhibit capabilities beyond simple reasoning to reach the next hop, shown in Figure 1 as Multi-step Tool Calling with Unprocessed Results.
## 3 Methodology
We propose the Agent-as-tool framework as a hierarchical design for multi-hop reasoning tasks. It separates the planning and tool usage responsibilities between two agent components: a high-level Planner and a subordinate Toolcaller. The Planner manages reasoning and task decomposition, while the Toolcaller executes external actions such as web search. This section outlines the design of both components and the reinforcement learning procedure employed to optimize the Planner.
### 3.1 Agent Architecture
#### 3.1.1 Planner
The Planner is a language model agent responsible for high-level reasoning and tool invocation decisions. It reasons about the current task state and emits tool usage instructions in natural language.
Reasoning: The Planner conducts internal reasoning enclosed in <think>...</think> tags, in line with DeepSeek-R1 conventions (Guo et al., 2025). It uses previous observations and the original query to plan the next subtask.
Tool Invocation: Tool calls are expressed as sub-queries wrapped in <tool_calling>...</tool_calling> tags. These queries are interpreted by the Toolcaller, and the results are returned to the Planner as <obs>...</obs> blocks for further reasoning.
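The two tag conventions above can be parsed with a small helper. The following is an illustrative sketch, not the authors' code; the function name and the example turn are our own.

```python
import re

# Illustrative helper (ours): pull the reasoning and the tool sub-query
# out of a single Planner turn, using the paper's tag conventions.
def parse_planner_turn(text):
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_calling>(.*?)</tool_calling>", text, re.DOTALL)
    return (think.group(1).strip() if think else None,
            call.group(1).strip() if call else None)

turn = ("<think>I need the player the film is based on.</think>"
        "<tool_calling>Invincible film Philadelphia Eagles player</tool_calling>")
reasoning, query = parse_planner_turn(turn)
```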
#### 3.1.2 Toolcaller
The Toolcaller is a dedicated LLM-based agent designed to interface with external tools. In our implementation, it wraps a web search tool and processes queries issued by the Planner.
We implement the Toolcaller using a CAMEL-style chat agent (Li et al., 2023a), powered by GPT-4o-mini (Hurst et al., 2024). It can retrieve the top-k search results multiple times and return structured summaries to the Planner. Although our current prototype uses only web search, the architecture supports extension to tools such as calculators, code interpreters, and MCP-based tool servers.
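Putting the two components together, the Planner–Toolcaller interaction can be sketched as a simple loop. This is a minimal sketch under our own naming assumptions: `run_episode`, `planner_generate`, and `toolcaller_answer` are hypothetical stand-ins for the two LLM agents, and the `<final_answer>` tag follows Figure 1.

```python
import re

# Minimal sketch (ours) of the hierarchical loop: the Planner reasons and
# emits tool sub-queries; the Toolcaller turns each sub-query into a
# processed observation appended to the transcript.
def run_episode(question, planner_generate, toolcaller_answer, max_rounds=10):
    transcript = question
    for _ in range(max_rounds):
        turn = planner_generate(transcript)
        transcript += turn
        done = re.search(r"<final_answer>(.*?)</final_answer>", turn, re.DOTALL)
        if done:  # the Planner has committed to an answer
            return done.group(1).strip(), transcript
        call = re.search(r"<tool_calling>(.*?)</tool_calling>", turn, re.DOTALL)
        if call:  # delegate the sub-query to the Toolcaller agent
            obs = toolcaller_answer(call.group(1).strip())
            transcript += f"<obs>{obs}</obs>"
    return None, transcript
```

The cap of 10 rounds mirrors the rollout limit used in our training settings.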
### 3.2 Reinforcement Learning with GRPO
#### 3.2.1 Training Objective
We employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) to fine-tune the Planner. The objective is:
$$
\begin{split}\mathcal{J}(\Theta)&=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\text{old}}(\cdot|x)}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\Big[\\
&\min\left(\frac{\pi_{\Theta}(y_{i}|x)}{\pi_{\text{old}}(y_{i}|x)}A_{i},\,\text{clip}\left(\frac{\pi_{\Theta}(y_{i}|x)}{\pi_{\text{old}}(y_{i}|x)},1-\varepsilon,1+\varepsilon\right)A_{i}\right)\\
&-\beta D_{\text{KL}}(\pi_{\Theta}\,\|\,\pi_{\text{ref}})\Big]\Bigg]\end{split} \tag{1}
$$
where $x$ is sampled from dataset $\mathcal{D}$, $y_{i}$ is a rollout, $A_{i}$ is its advantage, $\varepsilon$ is the clipping threshold, and $\beta$ regulates the strength of the KL penalty.
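As a sketch, the surrogate for one group of $G$ rollouts can be computed as follows. The KL term is omitted for brevity; `grpo_loss` is our own illustrative function, and the group-relative advantage $A_i=(r_i-\text{mean}(r))/\text{std}(r)$ follows the standard GRPO recipe rather than any detail stated in the paper.

```python
import torch

# Illustrative sketch of Eq. 1 for one group of rollouts (KL term omitted).
# logp_new/logp_old are per-rollout sequence log-probabilities under the
# current and old policies; rewards are normalized within the group.
def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(ratio * adv, clipped)
    return -surrogate.mean()  # minimize the negative of the objective
```

With identical old and new policies the ratio is 1, so the loss reduces to the (zero-mean) negative advantage average.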
#### 3.2.2 Observation Masking
To prevent reward leakage through Toolcaller-generated outputs, we mask the <obs> blocks during reward modeling and training. These segments are replaced with a special token, <fim_pad>, whose embedding is trained to be close to zero.
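The masking idea can be sketched as follows. This illustration (ours) builds the 0/1 mask at the character level for clarity; in training the same mask would be built over token spans.

```python
import re

# Illustrative sketch: positions inside <obs>...</obs> get weight 0, so
# Toolcaller-generated text contributes no policy gradient. Character-level
# here for clarity; in practice the mask covers token spans.
def obs_loss_mask(text):
    mask = [1] * len(text)
    for m in re.finditer(r"<obs>.*?</obs>", text, re.DOTALL):
        for i in range(m.start(), m.end()):
            mask[i] = 0  # masked: belongs to the Toolcaller, not the Planner
    return mask

trajectory = "<think>a</think><obs>noisy page</obs>done"
mask = obs_loss_mask(trajectory)
```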
#### 3.2.3 Reward Function
Our reward function balances correctness and formatting constraints:
$$
\text{Reward}=\begin{cases}\text{F1 score}&\text{if answer is correctly
formatted}\\
-2&\text{otherwise}\end{cases} \tag{2}
$$
The model receives a high reward for generating a valid and correct response, and a penalty when the output is malformed.
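Equation 2 can be sketched in code as follows. This is our own illustration: we assume a standard token-level F1 and use a `<final_answer>` tag (following Figure 1) as the "correctly formatted" check, since the paper does not spell out the format test.

```python
import re
from collections import Counter

# Illustrative sketch of Eq. 2 (ours): token-level F1 against the gold
# answer when the rollout contains a well-formed <final_answer> block,
# -2 otherwise. The tag-based format check is an assumption.
def f1_score(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def reward(rollout, gold):
    m = re.search(r"<final_answer>(.*?)</final_answer>", rollout, re.DOTALL)
    return f1_score(m.group(1).strip(), gold) if m else -2.0
```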
## 4 Experiments
### 4.1 Experiment Settings
#### 4.1.1 Model and Hyperparameters
We use Qwen-2.5-7B-Instruct (Qwen et al., 2025) as our base model. Training uses a customized implementation of rollout and of GRPO on trl (von Werra et al., 2020). At each training step, we sample a batch of training data from the training set, calculate the reward for each rollout, and update the policy by maximizing the reward.
The batch size is set to 3 per training step, with 12 rollouts per prompt. Each rollout contains at most 10 rounds of tool calling.
#### 4.1.2 Training Settings
We conducted training at quite a small scale. We trained Agent-as-tool for 60 steps, each containing 3 training samples with 12 rollouts each, for a total of only 180 samples and 2160 rollouts. The training data entries were selected from the HotpotQA (Yang et al., 2018) and 2WikiMultiHopQA (Ho et al., 2020) datasets, with the same ratio as R1-searcher (Song et al., 2025).
During training we observed that the loss was unstable for the first 30 steps, likely due to the small training set; after 30 steps the loss stabilized close to 0 and the performance of Agent-as-tool also stabilized.
<details>
<summary>extracted/6585337/training_graph.png Details</summary>

### Visual Description
Two stacked line graphs over training steps (0–60). Top (loss): spikes to roughly 8000 around step 5, with smaller peaks near steps 8, 15, and 30, then stabilizes near 0 after step 30. Bottom (average reward): fluctuates between about −1.75 and −0.25, dips sharply to about −2.00 near step 32, recovers to about −1.00 by step 45, and stabilizes thereafter.
</details>
Figure 2: Training progress of the Agent-as-tool model showing loss convergence over training steps. The loss becomes stable after approximately 30 training steps.
The training curve illustrated in Figure 2 shows the convergence behavior of our model during the reinforcement learning process.
#### 4.1.3 Benchmark Settings
To evaluate the performance of Agent-as-tool, we conducted experiments on open-domain question answering. We selected multiple multi-hop reasoning benchmarks: HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and Bamboogle (Press et al., 2023).
#### 4.1.4 Baseline Settings
We have one information retrieval tool: web search.
We then compare the performance of the Agent-as-tool with the following baselines:
- Direct IO: This baseline uses the direct output of Qwen-2.5-7B-Instruct (Qwen et al., 2025) as the answer, without any external tool calling.
- Direct IO with web search: This baseline also uses the direct output of Qwen-2.5-7B-Instruct (Qwen et al., 2025), but enables web search to process the original question and returns the top-k results as additional observations.
- CAMEL Agent: This baseline employs the CAMEL (Li et al., 2023a) chat agent driven by the GPT-4o-mini (Hurst et al., 2024).
- CAMEL Agent with web search: This baseline employs the CAMEL (Li et al., 2023a) chat agent driven by the GPT-4o-mini (Hurst et al., 2024) and the same tool setting as the Agent-as-tool with web search tool only. This baseline is used as the reference for multi-hop reasoning tasks conducted with the rule-based agent framework.
- Search-R1: We directly compare Agent-as-tool with Search-R1 (Jin et al., 2025) under our configuration with the web search tool for a fair comparison. As Search-R1 cannot be directly integrated with the CAMEL (Li et al., 2023a) chat agent, we return the raw search results directly instead of routing them through another Toolcaller.
We conducted experiments with Agent-as-tool using both pre-finetuned and post-finetuned models.
In line with the DeepSeek-R1 setting (Guo et al., 2025), we adopted the same prompt setting for all baselines and Agent-as-tool, except Search-R1 (Jin et al., 2025), which keeps its original prompt setting; we also modified the tool calling process to enable Search-R1 to access the unprocessed web search results.
#### 4.1.5 Evaluation Metrics
In this paper we focus on the correctness of the answers produced by Agent-as-tool; we therefore employ the exact match (EM) and cover exact match (CEM) metrics.
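The two metrics can be implemented as follows. This sketch (ours) assumes a standard answer normalization (lowercase, strip punctuation, collapse whitespace); the exact normalization used in the paper is not specified.

```python
import string

# Illustrative implementations (ours) of the two metrics under a standard
# normalization; the paper does not specify its normalization details.
def normalize(s):
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    return " ".join(s.split())

def exact_match(pred, gold):
    # EM: the normalized prediction must equal the gold answer exactly.
    return normalize(pred) == normalize(gold)

def cover_exact_match(pred, gold):
    # CEM: the gold answer merely has to be covered by the prediction.
    return normalize(gold) in normalize(pred)
```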
### 4.2 Quantitative Experiment Results
The quantitative results are shown in Table 1. Agent-as-tool outperforms most baselines, except for the EM metric on the HotpotQA, 2WikiMultiHopQA, and MuSiQue datasets, where Search-R1 still performs best. In terms of CEM, our model improves substantially over all baselines, except on HotpotQA where Search-R1 remains best (64.2% vs 57.4%). On the Bamboogle dataset (Press et al., 2023), Agent-as-tool with the web search tool integrated into the Toolcaller (a CAMEL (Li et al., 2023a) agent) achieves the best performance, with an EM of 63.2% and a CEM of 75.2%.
Table 1: Performance Comparison Across Different Datasets
| Dataset | Method | EM (%) | CEM (%) |
| --- | --- | --- | --- |
| Bamboogle | Direct IO | 17.6 | 26.4 |
| | Direct IO + Web Search | 29.6 | 42.4 |
| | CAMEL | 36.8 | 47.2 |
| | CAMEL + Web Search | 51.2 | 62.4 |
| | Search-R1 + Web Search | 58.4 | 72.0 |
| | Agent-as-tool-Base + Web Search | 60.0 | 71.2 |
| | Agent-as-tool-Instruct + Web Search | 63.2 | 75.2 |
| HotpotQA | Direct IO | 20.0 | 27.2 |
| | Direct IO + Web Search | 32.6 | 52.8 |
| | CAMEL | 23.2 | 44.2 |
| | CAMEL + Web Search | 32.4 | 59.4 |
| | Search-R1 + Web Search | 47.2 | 64.2 |
| | Agent-as-tool-Base + Web Search | 35.0 | 55.2 |
| | Agent-as-tool-Instruct + Web Search | 37.2 | 57.4 |
| 2WikiMultiHopQA | Direct IO | 22.6 | 25.4 |
| | Direct IO + Web Search | 27.2 | 40.2 |
| | CAMEL | 20.8 | 34.6 |
| | CAMEL + Web Search | 35.0 | 69.4 |
| | Search-R1 + Web Search | 52.4 | 68.0 |
| | Agent-as-tool-Base + Web Search | 42.8 | 68.0 |
| | Agent-as-tool-Instruct + Web Search | 44.6 | 70.0 |
| MuSiQue | Direct IO | 4.8 | 9.0 |
| | Direct IO + Web Search | 14.0 | 18.0 |
| | CAMEL | 9.2 | 18.8 |
| | CAMEL + Web Search | 16.0 | 29.4 |
| | Search-R1 + Web Search | 20.8 | 28.6 |
| | Agent-as-tool-Base + Web Search | 15.6 | 28.8 |
| | Agent-as-tool-Instruct + Web Search | 18.4 | 29.8 |
We compared the performance of Agent-as-tool before and after reinforcement fine-tuning; the results are shown in Table 2. Reinforcement fine-tuning based on GRPO (Shao et al., 2024) substantially improves performance on all datasets, with an average improvement of 2.5% in EM and 2.3% in CEM.
Table 2: Performance improvements after reinforcement fine-tuning
| Dataset | Pre-finetuned EM (%) | Pre-finetuned CEM (%) | Post-finetuned EM (%) | Post-finetuned CEM (%) | ΔEM | ΔCEM |
| --- | --- | --- | --- | --- | --- | --- |
| Bamboogle | 60.0 | 71.2 | 63.2 | 75.2 | +3.2 | +4.0 |
| HotpotQA | 35.0 | 55.2 | 37.2 | 57.4 | +2.2 | +2.2 |
| 2WikiMultiHopQA | 42.8 | 68.0 | 44.6 | 70.0 | +1.8 | +2.0 |
| MuSiQue | 15.6 | 28.8 | 18.4 | 29.8 | +2.8 | +1.0 |
| Average | 38.4 | 55.8 | 40.9 | 58.1 | +2.5 | +2.3 |
Compared with the CAMEL baseline with the web search tool integrated (CAMEL + Web Search), both the pre-finetuned and post-finetuned Agent-as-tool achieve substantial improvements in EM and CEM, indicating that letting the model control when and what to call in a tool invocation makes Agent-as-tool a more effective framework for multi-hop reasoning tasks.
Compared with the Search-R1 baseline (Search-R1 + Web Search), the current best-performing work of its kind, Agent-as-tool-Instruct shows substantial improvements on the Bamboogle dataset, improving EM by 4.8% and CEM by 3.2%, demonstrating the effectiveness of Agent-as-tool on multi-hop reasoning tasks. Moreover, Agent-as-tool was fine-tuned on only 180 samples, indicating the efficiency of the fine-tuning process.
### 4.3 Qualitative Results Inspection and Analysis
#### 4.3.1 Comparison of the Agent-as-tool and the Search-R1
Compared with the Search-R1 baseline (Search-R1 + Web Search), Agent-as-tool-Instruct has several qualitative advantages:
- Agent-as-tool-Instruct reasons over cleaner, more structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results containing noisy, unstructured symbols and other unrelated details.
- Because Agent-as-tool-Instruct adopts a hierarchical reasoning process that separates reasoning from tool calling, the agent maintains a more linear, text-based reasoning process than Search-R1 + Web Search.
A qualitative comparison is shown as an example in Figure 3 (in Appendix).
#### 4.3.2 Comparison of the Results before and after the Reinforcement Fine-tuning
Compared with Agent-as-tool-Base, Agent-as-tool-Instruct has several qualitative advantages:
- Agent-as-tool-Instruct correctly decomposes the question, identifying the first and second hops for the agent to solve, whereas Agent-as-tool-Base fails to decompose the multi-hop question and instead feeds the agent the whole question (only slightly changed from the original). If the agent cannot reason through the multi-hop question on its own, Agent-as-tool-Base cannot answer correctly.
- Because Agent-as-tool-Instruct was instructed to reason together with the agent powered by the pretrained model, the fine-tuned model poses more structured and reasonable questions to the agent than Agent-as-tool-Base does.
A qualitative comparison is shown as an example in Figure 4 (in Appendix).
## 5 Conclusions and Future Work
### 5.1 Conclusions
In this paper we studied multi-hop reasoning tasks with the Agent-as-tool framework. We found that Agent-as-tool achieves substantial improvements on multi-hop reasoning tasks, especially on the Bamboogle dataset (Press et al., 2023). We also found that Agent-as-tool can reason over cleaner, more structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results containing noisy, unstructured symbols and other unrelated details.
### 5.2 Limitations and Future Work
This paper assigns only the search tool to the agent (in other words, a search agent), so the scope is limited to open-domain multi-hop search tasks. Moreover, because only one model was provided, dynamic assignment of tools to agents is not considered. In future work we will assign more tools to the agent and explore dynamic tool assignment, in other words, turning the Planner into a Tool Orchestrator.
## References
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Goldie et al. [2025] A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning. Synthetic data generation and multi-step rl for reasoning and tool use, 2025. URL https://arxiv.org/abs/2504.04736.
- Guo et al. [2025] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Ho et al. [2020] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
- Hong et al. [2023] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023.
- Hurst et al. [2024] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jin et al. [2025] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URL https://arxiv.org/abs/2503.09516.
- Li et al. [2023a] G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023a.
- Li et al. [2023b] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li. Api-bank: A comprehensive benchmark for tool-augmented llms, 2023b. URL https://arxiv.org/abs/2304.08244.
- Li et al. [2025a] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025a. URL https://arxiv.org/abs/2501.05366.
- Li et al. [2025b] X. Li, H. Zou, and P. Liu. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025b.
- Liang et al. [2025] X. Liang, J. Xiang, Z. Yu, J. Zhang, S. Hong, S. Fan, and X. Tang. Openmanus: An open-source framework for building general ai agents, 2025. URL https://doi.org/10.5281/zenodo.15186407.
- Liu et al. [2025] B. Liu, X. Li, J. Zhang, J. Wang, T. He, S. Hong, H. Liu, S. Zhang, K. Song, K. Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990, 2025.
- Press et al. [2023] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models, 2023. URL https://arxiv.org/abs/2210.03350.
- Qian et al. [2025] C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
- Qwen et al. [2025] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Shao et al. [2024] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Song et al. [2025] H. Song, J. Jiang, Y. Min, J. Chen, Z. Chen, W. X. Zhao, L. Fang, and J.-R. Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025.
- Trivedi et al. [2022] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition, 2022. URL https://arxiv.org/abs/2108.00573.
- von Werra et al. [2020] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- Yang et al. [2018] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
- Yao et al. [2023] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629.
- Zhang et al. [2024] J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.
- Zheng et al. [2024] Y. Zheng, S. Sun, L. Qiu, D. Ru, C. Jiayang, X. Li, J. Lin, B. Wang, Y. Luo, R. Pan, et al. Openresearcher: Unleashing ai for accelerated scientific research. arXiv preprint arXiv:2408.06941, 2024.
- Zheng et al. [2025] Y. Zheng, D. Fu, X. Hu, X. Cai, L. Ye, P. Lu, and P. Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.
## Appendix A Qualitative Results Inspection
### A.1 Qualitative results: Agent-as-tool-Instruct + Web Search vs. Search-R1 + Web Search
<details>
<summary>Visual description of Figure 3</summary>

### Visual Description
## Screenshot: Comparison of Search-R1 vs Agent-as-tool for Ernst I's Mother's Birthplace
### Overview
The image compares two AI agent approaches to answering the question: "Where was the mother of Ernst I, Prince of Hohenlohe-Langenburg born?" The expected answer is Kliczków. The left side shows Search-R1's flawed reasoning process, while the right side demonstrates Agent-as-tool's correct solution path.
### Components/Axes
- **Question**: "Where was the mother of Ernst I, Prince Of Hohenlohe-Langenburg born?"
- **Expected Answer**: KliczkĂłw
- **Methods Compared**:
1. **Search-R1** (Left side)
- Color coding: Red (#FF6B6B) for thinking, Orange (#FFB74D) for search/tool calls, Green (#81C784) for answers
- Process flow: Sequential thinking → search → answer
2. **Agent-as-tool** (Right side)
- Color coding: Red (#FF6B6B) for thinking, Orange (#FFB74D) for tool calls, Green (#81C784) for answers
- Process flow: Thinking → tool calls → answer
### Detailed Analysis
#### Search-R1 Process (Left)
1. **Initial Question**: Identifies Ernst I's birthplace as Hohenlohe-Langenburg
2. **First Search**: Finds Ernst I's Wikipedia page mentioning marriage to Princess Feodora of Leiningen
3. **Incorrect Reasoning**: Concludes Feodora's birthplace (Amorbach) is the answer without verifying her connection to Ernst I
4. **Final Answer**: Amorbach (Wrong Answer)
#### Agent-as-tool Process (Right)
1. **Initial Question**: Identifies need to find Ernst I's mother
2. **Tool Call**: Uses Wikipedia to find Ernst I's mother is Countess Amalie Henriette of Solms-Baruth
3. **Second Tool Call**: Searches for Amalie Henriette's birthplace
4. **Correct Answer**: Kliczków (Correct Answer)
### Key Observations
1. **Critical Difference**: Agent-as-tool explicitly identifies the mother's name before searching for her birthplace, while Search-R1 assumes the wife's birthplace is relevant
2. **Color Coding**: Both methods use consistent color schemes for cognitive steps, but Agent-as-tool demonstrates better tool utilization
3. **Wikipedia References**: Both methods cite Wikipedia pages, but only Agent-as-tool correctly follows the maternal lineage
4. **Temporal Context**: Search-R1's error occurs in 1807 (Feodora's birth), while Agent-as-tool correctly identifies 1768 (Amalie Henriette's birth)
### Interpretation
The image demonstrates the importance of explicit relationship reasoning in knowledge graphs. Agent-as-tool succeeds by:
1. **Entity Disambiguation**: Correctly identifying Ernst I's mother rather than assuming his wife's relevance
2. **Temporal Reasoning**: Recognizing that maternal lineage (1768) differs from marital connections (1807)
3. **Tool Chaining**: Using sequential tool calls to resolve nested relationships (prince → mother → birthplace)
The failure of Search-R1 highlights common pitfalls in knowledge retrieval:
- **Assumption Errors**: Mistaking spousal relationships for maternal ones
- **Context Collapse**: Failing to distinguish between different historical timelines
- **Information Silos**: Not connecting separate Wikipedia entries about related individuals
This comparison illustrates how structured reasoning with explicit tool utilization outperforms simple search-based approaches in complex genealogical queries.
</details>
Figure 3: Agent-as-tool-Instruct reasons over concise, structured observations, whereas Search-R1 + Web Search must reason over unprocessed web search results cluttered with unrelated details. Search-R1 was misled by the raw results into producing the wrong entity for the second hop (Princess Feodora of Leiningen), while Agent-as-tool-Instruct applied its tool agent to preprocess the search results and return the correct entity (Countess Amalie Henriette of Solms-Baruth) for the second hop.
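The two-level loop contrasted in Figure 3 can be sketched as follows. This is a minimal, runnable illustration only: the toy "web pages", the bracketed noise tokens, and all function names (`web_search`, `search_agent`, `planner`) are hypothetical stand-ins, and the trivial token filter plays the role that an LLM-based tool agent plays in the paper.

```python
# Toy corpus: each entry mimics a raw search result with distracting markup.
RAW_PAGES = {
    "mother of Ernst I, Prince of Hohenlohe-Langenburg":
        "[nav][login] Ernst I ... married Princess Feodora of Leiningen ... "
        "his mother was Countess Amalie Henriette of Solms-Baruth [share][footer]",
    "birthplace of Countess Amalie Henriette of Solms-Baruth":
        "[nav] Amalie Henriette ... born in Kliczków ... [footer]",
}

def web_search(query: str) -> str:
    """Stand-in for a real web search: returns the noisy raw page text."""
    return RAW_PAGES.get(query, "")

def search_agent(sub_question: str) -> str:
    """Tool-calling agent: fetches the raw result and distills it.
    Here 'distillation' just strips bracketed noise tokens; in the paper
    this preprocessing is done by a separate LLM agent."""
    raw = web_search(sub_question)
    return " ".join(tok for tok in raw.split() if not tok.startswith("["))

def planner(hops: list[str]) -> str:
    """Planner: walks a chain of sub-questions, reasoning only over the
    preprocessed observations returned by the search agent, never the
    raw pages."""
    observation = ""
    for sub_question in hops:
        observation = search_agent(sub_question)
    return observation

answer = planner([
    "mother of Ernst I, Prince of Hohenlohe-Langenburg",
    "birthplace of Countess Amalie Henriette of Solms-Baruth",
])
```

The design point is the division of labor: the planner never sees `[nav]`-style clutter, only the distilled observation, which is what lets it keep its verbal chain of reasoning clean.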
### A.2 Qualitative results: Agent-as-tool-Instruct + Web Search vs. Agent-as-tool-Base + Web Search
<details>
<summary>Visual description of Figure 4</summary>

### Visual Description
## Screenshot: Agent Comparison for Multi-Hop Question Answering
### Overview
The image compares two AI agent implementations (Agent-as-tool-Base vs. Agent-as-tool-Instruct) attempting to answer the question: "Where did Edward Dunn (Bishop)'s father graduate from?" The expected answer is **Corpus Christi College, Cambridge**. The comparison shows the agents' reasoning processes, tool calls, and final answers through a structured dialogue format with color-coded annotations.
---
### Components/Axes
1. **Left Panel (Agent-as-tool-Base)**:
- **Structure**: Sequential thought process with tool calls and observations.
- **Annotations**: Orange boxes highlight key decision points (e.g., "Planner try to ask 2 hops simultaneously").
- **Final Answer**: Incorrectly identifies **Marlborough and Pembroke College, Cambridge**.
2. **Right Panel (Agent-as-tool-Instruct)**:
- **Structure**: Step-by-step reasoning with explicit tool calls and observations.
- **Annotations**: Orange boxes emphasize strategic decisions (e.g., "Learned to Ask the prerequisite question before the next hop").
- **Final Answer**: Correctly identifies **Corpus Christi College, Cambridge**.
---
### Detailed Analysis
#### Left Panel (Agent-as-tool-Base)
1. **Initial Query**: Asks for information on Edward Dunn's father's education.
2. **First Observation**: Fails to find specific details about the father's education but provides info on Edward Dunn's own education.
3. **Second Attempt**: Tries to ask two hops simultaneously but receives no useful data.
4. **Final Answer**: Incorrectly cites Marlborough and Pembroke College, Cambridge, based on incomplete data.
#### Right Panel (Agent-as-tool-Instruct)
1. **Initial Query**: Asks for information on Edward Dunn's father.
2. **First Observation**: Identifies Andrew Hunter Dunn as the father and notes his role as Bishop of Quebec.
3. **Second Attempt**: Explicitly asks where Andrew Hunter Dunn graduated from.
4. **Final Answer**: Correctly cites Corpus Christi College, Cambridge, using precise tool calls.
---
### Key Observations
1. **Agent-as-tool-Base**:
- Struggles with multi-hop reasoning, attempting parallel queries that yield no results.
- Final answer is incorrect due to reliance on ambiguous or unrelated data.
2. **Agent-as-tool-Instruct**:
- Uses sequential, targeted tool calls to isolate the correct information.
- Final answer matches the expected result, demonstrating effective structured reasoning.
---
### Interpretation
The comparison highlights the importance of **structured, sequential reasoning** in multi-hop question answering. The Agent-as-tool-Instruct successfully isolates the correct information by:
1. **Asking prerequisite questions** before proceeding (e.g., identifying the father first).
2. **Explicitly targeting** the father's educational background after establishing his identity.
3. **Leveraging precise tool calls** to extract specific data points.
In contrast, the Agent-as-tool-Base's attempt to handle multiple hops simultaneously leads to confusion and incorrect conclusions. This underscores the value of **stepwise, hypothesis-driven reasoning** in complex information retrieval tasks.
</details>
Figure 4: Agent-as-tool-Instruct + Web Search correctly decomposes the question into a first and a second hop, whereas Agent-as-tool-Base + Web Search barely decomposes it and merely rephrases the whole question. That is, Agent-as-tool-Base + Web Search could not reason through the multi-hop question, while Agent-as-tool-Instruct + Web Search first identified the first hop for the agent to answer and then proceeded to the second hop.
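The contrast in Figure 4 reduces to query ordering, which the following toy sketch illustrates. The lookup table and the `ask` helper are hypothetical: a single-hop question the agent can resolve succeeds, while a fused two-hop question returns nothing, mirroring the "no useful data" observation in the left panel.

```python
from typing import Optional

# Toy stand-in for what the search agent can resolve in one call.
KNOWN = {
    "father of Edward Dunn (Bishop)": "Andrew Hunter Dunn",
    "where Andrew Hunter Dunn graduated": "Corpus Christi College, Cambridge",
}

def ask(question: str) -> Optional[str]:
    """Hypothetical single tool call: answers only resolvable questions."""
    return KNOWN.get(question)

# Agent-as-tool-Base style: fuse both hops into one query -> no useful data.
fused = ask("where father of Edward Dunn (Bishop) graduated")

# Agent-as-tool-Instruct style: resolve the bridge entity (the father)
# first, then use it to phrase the second hop explicitly.
father = ask("father of Edward Dunn (Bishop)")
answer = ask(f"where {father} graduated") if father else None
```

The sequential variant succeeds precisely because the prerequisite question makes the second hop a concrete, answerable query instead of a nested one.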