# Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
Abstract
Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-Improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
1 Introduction
Large language models (LLMs) have achieved remarkable advances in complex reasoning tasks (Wei et al., 2023; Kojima et al., 2023), fueled by innovations in model architecture, scale, and training. Among these, reinforcement learning (RL) (Sutton et al., 1998) stands out as a particularly effective method for amplifying LLMs’ reasoning abilities. By leveraging outcome-based reward signals, RL enables models to autonomously refine their reasoning strategies (Shao et al., 2024), resulting in longer, more coherent chains of thought (CoT). Notably, RL-trained LLMs show strong improvements not only during training but also at test time (Zuo et al., 2025), dynamically adapting their reasoning depth and structure to each task’s complexity. This has led to state-of-the-art results on a range of benchmarks, highlighting RL’s promise for scaling LLM reasoning.
However, despite these advances, RL-enhanced LLMs still rely primarily on internal knowledge and language modeling (Wang et al., 2024). This reliance becomes a major limitation for time-sensitive or knowledge-intensive questions, where the model’s static knowledge base may be outdated or incomplete, often resulting in inaccuracies or hallucinations (Hu et al., 2024). Moreover, such models struggle with complex, domain-specific, or open-ended problems that require precise computation, structured manipulation, or specialized tool use. These challenges underscore the inadequacy of purely text-based reasoning and the need for LLMs to access and integrate external information sources during the reasoning process.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Agent-Environment Interaction Loop
### Overview
The image depicts a diagram illustrating an agent-environment interaction loop, likely representing a reinforcement learning or similar system. It shows a sequence of steps involving a task, a policy model, reasoning, tools, an environment, and a reward, culminating in an answer. The diagram uses arrows to indicate the flow of information and actions.
### Components/Axes
The diagram consists of the following components, arranged horizontally:
1. **Task:** Represented by a blue rectangle with a document icon.
2. **Policy Model:** Represented by a yellow rectangle with a robot icon.
3. **Reasoning:** Represented by a light-green rectangle.
4. **Tools:** Represented by a lavender rectangle.
5. **Environment:** Represented by a light-blue rectangle with icons representing observations ("Obs") and actions ("Action").
6. **Answer:** Represented by a grey rectangle.
7. **Reward:** Represented by a pink rectangle with a ribbon icon.
Arrows connect these components, indicating the flow of information and actions. The arrows are black with arrowheads.
### Detailed Analysis or Content Details
The flow of the diagram is as follows:
1. A **Task** is presented.
2. The **Task** is input to a **Policy Model**.
3. The **Policy Model** outputs to **Reasoning**.
4. **Reasoning** outputs to **Tools**.
5. **Tools** interact with the **Environment**, receiving observations ("Obs") and sending actions ("Action").
6. The **Environment** produces an **Answer**.
7. The **Answer** is evaluated by a **Reward** system.
8. The **Reward** is fed back to the **Policy Model**, completing the loop.
There are no numerical values or scales present in the diagram. The diagram is purely conceptual.
### Key Observations
The diagram highlights a closed-loop system where an agent (represented by the Policy Model, Reasoning, and Tools) interacts with an environment to achieve a task and receives feedback in the form of a reward. The inclusion of "Tools" suggests a more complex agent capable of utilizing external resources. The explicit labeling of "Obs" and "Action" within the Environment block emphasizes the input/output relationship between the agent and its surroundings.
### Interpretation
This diagram illustrates a common paradigm in artificial intelligence, particularly in reinforcement learning and embodied AI. The agent learns to perform a task by iteratively interacting with the environment, receiving rewards for successful actions, and adjusting its policy to maximize those rewards. The "Reasoning" component suggests a level of cognitive processing beyond simple stimulus-response behavior. The "Tools" component indicates the agent is not limited to its inherent capabilities but can leverage external resources to improve performance. The loop structure emphasizes the iterative nature of learning and adaptation. The diagram is a high-level representation and does not specify the details of the algorithms or mechanisms used within each component. It is a conceptual model for understanding how an intelligent agent can interact with and learn from its environment.
</details>
Figure 1: The ARTIST architecture. Agentic reasoning is achieved by interleaving text-based thinking, tool queries, and tool outputs, enabling dynamic coordination of reasoning, tool use, and environment interaction within a unified framework.
Agentic reasoning addresses these limitations by enabling LLMs to dynamically interact with both external resources and environments throughout the reasoning process (Xiong et al., 2025; Patil, 2025). External resources include web search, code execution, API calls, and structured memory, while environments encompass interactive settings such as web browsers (e.g., WebArena (Zhou et al., 2024)) or operating systems (e.g., WindowsAgentArena (Bonatti et al., 2024)). Unlike conventional approaches that rely solely on internal inference, agentic reasoning empowers models to orchestrate tool use and adaptively coordinate research, computation, and logical deduction, actively engaging with external resources and environments to solve complex, real-world problems (Wu et al., 2025; Rasal and Hauer, 2024).
Many real-world tasks such as math reasoning (Mouselinos et al., 2024), multi-step mathematical derivations, programmatic data analysis (Zhao et al., 2024), or up-to-date fact retrieval demand capabilities that extend beyond the scope of language modeling alone. For example, function-calling frameworks like BFCLv3 (Yan et al., 2024) allow LLMs to invoke specialized functions for file operations, data parsing, or API-based retrieval, while agentic environments such as WebArena empower models to autonomously plan and execute sequences of web interactions to accomplish open-ended objectives. In math problem solving, external libraries like SymPy (Meurer et al., 2017) or NumPy (Harris et al., 2020) empower LLMs to perform symbolic algebra and computation with far greater accuracy and efficiency than text-based reasoning alone.
Despite these advances, current tool-integration strategies for LLMs face major scalability and robustness challenges (Sun et al., 2024). Most approaches rely on hand-crafted prompts or fixed heuristics, which are labor-intensive and do not generalize to complex or unseen scenarios (Luo et al., 2025). While prompting (Gao et al., 2023) and supervised fine-tuning (Gou et al., 2024; Paranjape et al., 2023) can teach tool use, these methods are limited by curated data and struggle to adapt to new tasks or recover from tool failures. As a result, models often misuse tools or revert to brittle behaviors, highlighting the need for scalable, data-efficient, and adaptive tool-use frameworks.
To overcome the limitations of text-only reasoning, we introduce a new paradigm for LLMs: agentic reasoning through tool integration. Our framework, ARTIST (Agentic Reasoning and Tool Integration in Self-Improving Transformers), empowers LLMs to learn optimal strategies for leveraging external tools and interacting with complex environments via reinforcement learning (see Figure 1). Here, “self-improving transformers” refers to models that iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining the standard transformer architecture. ARTIST interleaves tool queries and tool outputs directly within the reasoning chain, treating tool usage as a first-class operation. In this framework, tool usage is not limited to isolated API calls or computations, but can involve active interaction within external environments—such as navigating a web browser or operating system interface—through sequences of tool invocations. Specifically, the reasoning process alternates between segments of text-based thinking (<think>...</think>), tool queries (e.g., <tool_name>...</tool_name>), and tool outputs (<output>...</output>), allowing the model to seamlessly coordinate reasoning, tool usage and environment interaction.
Motivating Example. Consider a Math Olympiad problem requiring a closed-form expression or complex integral. Traditional RL-trained LLMs (DeepSeek-AI et al., 2025) rely on text-based reasoning, often compounding errors in symbolic manipulation. In contrast, ARTIST enables the model to generate Python code, call a Python interpreter, and use libraries like SymPy (Meurer et al., 2017) for precise computation—seamlessly integrating results into the reasoning chain. This dynamic interplay of language and tool use boosts accuracy and allows iterative self-correction, overcoming the limitations of purely text-based approaches.
This agentic structure enables the model to autonomously decide not only which tools to use, but also when and how to invoke them during multi-turn reasoning—continually adapting its strategy based on context and feedback from both the environment and tool outputs. Tool results inform subsequent reasoning, creating a tightly coupled loop between text-based inference and tool-augmented actions. Notably, ARTIST requires no supervision for intermediate steps or tool calls; instead, we employ reinforcement learning—specifically, the GRPO (Shao et al., 2024) algorithm—to guide the model using only outcome-based rewards. This enables LLMs to develop adaptive, robust, and generalizable tool-use behaviors, overcoming the brittleness and scalability issues of prior methods.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Agentic Reasoning + Tool Integration
### Overview
The image is a diagram illustrating a system for agentic reasoning and tool integration. It depicts a process flow starting with an "Input", passing through a "Policy Model", generating an "Answer", and then calculating "Rewards" and "Advantage". The diagram highlights the iterative nature of reasoning and tool use within the policy model.
### Components/Axes
The diagram consists of the following components:
* **Input:** The starting point of the process.
* **Policy Model:** A rectangular block containing multiple layers of "RTO" (Reasoning, Tool, Output) sequences.
* **Agentic Reasoning + Tool Integration:** A label describing the functionality of the Policy Model.
* **Answer:** The output generated by the Policy Model.
* **Reference Model:** A block that receives the "Answer" and outputs a validation indicator ("A" with a checkmark or cross).
* **Reward Model:** A block that receives the "Answer" and outputs "Rewards" ("r").
* **Group Computation:** A block that receives "Rewards" and outputs "Advantage" ("Adv").
* **Rewards:** The output of the Reward Model.
* **Advantage:** The output of the Group Computation.
The diagram also includes labels within the Policy Model:
* **R:** Reasoning
* **T:** Tool
* **O:** Output
### Detailed Analysis or Content Details
The Policy Model is the most complex component. It consists of three horizontal layers, each containing a sequence of "RTO" blocks.
* **Layer 1:** Contains a longer sequence of "RTO" blocks, extending to "..." indicating continuation.
* **Layer 2:** Contains a shorter sequence of "RTO" blocks.
* **Layer 3:** Contains a sequence of "RTO" blocks, followed by "R", "T", "O".
The "Answer" is generated from the Policy Model and is fed into both the "Reference Model" and the "Reward Model". The Reference Model outputs "A" with either a checkmark (positive validation) or a cross (negative validation). The Reward Model outputs "r", which represents the reward signal. The "Rewards" are then processed by the "Group Computation" block to produce "Advantage", represented by "Adv".
### Key Observations
* The iterative "RTO" structure within the Policy Model suggests a cyclical process of reasoning, tool application, and output generation.
* The parallel processing of the "Answer" through the Reference and Reward Models indicates a dual evaluation of the generated output – both for correctness (Reference Model) and value (Reward Model).
* The "Group Computation" step implies that the "Advantage" is derived from a collective assessment of the "Rewards".
* The "..." notation suggests that the process can continue indefinitely, with the Policy Model refining its reasoning and tool usage over time.
### Interpretation
This diagram illustrates a reinforcement learning framework for agentic reasoning. The agent (represented by the Policy Model) interacts with an environment (implicitly represented by the "Tool" component) to achieve a goal. The agent's actions are evaluated by both a Reference Model (ensuring correctness) and a Reward Model (quantifying progress). The Advantage signal is then used to update the Policy Model, guiding it towards more effective reasoning and tool usage. The iterative "RTO" loop suggests a process of trial and error, where the agent learns from its mistakes and refines its strategy over time. The diagram highlights the importance of both accuracy and reward in shaping the agent's behavior. The use of a Reference Model alongside a Reward Model is a notable feature, suggesting a focus on both correctness and efficiency in the agent's decision-making process. The diagram does not provide specific data or numerical values, but rather a conceptual overview of the system's architecture and functionality.
</details>
Figure 2: Overview of the ARTIST methodology. The framework illustrates how reasoning rollouts alternate between internal thinking, tool use, and environment interaction, with outcome-based rewards guiding learning. This enables the model to iteratively refine its reasoning and tool-use strategies through reinforcement learning.
To rigorously assess the effectiveness and generality of ARTIST, we conduct extensive experiments in two core domains: complex mathematical problem solving and multi-turn function calling. Our evaluation covers a broad range of challenging benchmarks, including MATH-500 (Hendrycks et al., 2021), AIME (hug, a), AMC (hug, b), and Olympiad Bench (He et al., 2024) for math reasoning, as well as $\tau$ -bench (Yao et al., 2024) and multiple BFCL v3 (Yan et al., 2024) subcategories for multi-turn tool use. We benchmark ARTIST on both 7B and 14B Qwen2.5-Instruct models (Qwen et al., 2025), comparing against a comprehensive suite of baselines—spanning frontier LLMs (e.g., GPT-4o (OpenAI, 2024), DeepSeek-R1 (DeepSeek-AI et al., 2025)), open-source tool-augmented models (e.g., ToRA (Gou et al., 2024), NuminaMath-TIR (Beeching et al., 2024)), prompt-based tool integration, and base models with or without external tools.
ARTIST consistently outperforms all baselines, achieving substantial gains on the most complex tasks. In mathematical reasoning, it delivers up to 22% absolute improvement over base models and surpasses GPT-4o and other leading models on AMC, AIME, and Olympiad. For multi-turn function calling, ARTIST more than doubles the accuracy of base and prompt-based models on $\tau$-bench and achieves strong gains on the hardest BFCL v3 subsets. Detailed ablations and metric analyses show that agentic RL training in ARTIST leads to deeper reasoning, more effective tool use, and higher-quality solutions.
Across both domains, we observe that ARTIST exhibits emergent agentic behaviors—including adaptive tool selection, iterative self-correction, and context-aware multi-step reasoning—all arising naturally from its unified agentic RL framework. These findings provide compelling evidence that agentic reasoning, when tightly integrated with reinforcement learning and dynamic tool use, marks a paradigm shift in LLM capabilities. As a result, ARTIST not only surpasses prior approaches but also sets a new standard for robust, interpretable, and generalizable problem-solving in real-world scenarios.
Our key contributions are:
- A unified agentic RL framework: We introduce ARTIST, the first framework to tightly couple agentic reasoning, dynamic tool integration, and reinforcement learning for LLMs, enabling adaptive, multi-turn problem solving across diverse domains.
- Generalizable tool use and environment interaction: ARTIST supports seamless integration with arbitrary external tools and environments, allowing LLMs to learn not just which tools to use, but when and how to invoke them within complex reasoning chains.
- Extensive, rigorous evaluation: We provide the most comprehensive evaluation to date, covering both mathematical and multi-turn function calling tasks, a wide range of benchmarks, model scales, and baseline categories, along with detailed ablations and metric analyses.
2 ARTIST Overview
We present ARTIST (Agentic Reasoning and Tool Integration in Self-Improving Transformers), a general and extensible framework that enables large language models (LLMs) to reason with and act upon external tools and environments via reinforcement learning. Unlike prior methods (Schick et al., 2023; Gao et al., 2023) that focus on isolated tool use or narrow domains, ARTIST supports seamless integration with a wide range of tools—including code interpreters, web search engines, and domain-specific APIs—as well as interactive environments such as web browsers and operating systems. This section details the methodology, RL training procedure, prompt templates for structuring reasoning and tool interaction, and our approach to reward modeling.
2.1 Methodology
ARTIST treats tool usage and environment interaction as core components of the model’s reasoning trajectory. The LLM dynamically decides which tools or environments to engage, when to invoke them, and how to incorporate their outputs into multi-step solutions, making it applicable to a broad spectrum of real-world tasks.
Figure 2 illustrates the methodology: for each input, the policy model generates multiple reasoning rollouts, alternating between text-based reasoning (<think>...</think>) and tool interactions. At each step, the model determines which tool to call, when to invoke it, and how to formulate the query based on the current context. The tool interacts with the external environment (e.g., executing code, searching the web, or calling an API) and returns an output, which is incorporated back into the reasoning chain. This iterative process allows the model to explore diverse reasoning and tool-use trajectories. By making tool usage and environment interaction first-class operations, ARTIST enables LLMs to develop flexible, adaptive, and context-aware strategies for complex, multi-step tasks. This agentic approach supports robust self-correction and iterative refinement through continuous interaction with external resources.
2.2 Reinforcement Learning Algorithm
Training agentic LLMs with tool and environment integration requires a reinforcement learning (RL) algorithm that is sample-efficient, stable, and effective with outcome-based rewards. Recent work, such as DeepSeek-R1 (DeepSeek-AI et al., 2025), has shown that Group Relative Policy Optimization (GRPO) (Shao et al., 2024) achieves strong performance in language model RL by leveraging groupwise outcome rewards and removing the need for value function approximation. This reduces training cost and simplifies optimization, making GRPO well-suited for our framework.
Group Relative Policy Optimization.
GRPO extends Proximal Policy Optimization (PPO) (Schulman et al., 2017) by eliminating the critic and instead estimating the baseline from a group of sampled responses. For each question $q$ , a group of responses $\{y_{1},y_{2},...,y_{G}\}$ is sampled from the old policy $\pi_{\text{old}}$ , and the policy model $\pi_{\theta}$ is optimized by maximizing the following objective:
$$
\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\text{old}}(\cdot\mid x;\mathcal{R})}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{\sum_{t=1}^{|y_{i}|}\mathbb{I}(y_{i,t})}\sum_{\substack{t=1\\ \mathbb{I}(y_{i,t})=1}}^{|y_{i}|}\min\bigg(\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t};\mathcal{R})}{\pi_{\text{old}}(y_{i,t}\mid x,y_{i,<t};\mathcal{R})}\hat{A}_{i,t},\;\text{clip}\Big(\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t};\mathcal{R})}{\pi_{\text{old}}(y_{i,t}\mid x,y_{i,<t};\mathcal{R})},1-\epsilon,1+\epsilon\Big)\hat{A}_{i,t}\bigg)-\beta\,\mathbb{D}_{\text{KL}}\big[\pi_{\theta}\,\|\,\pi_{\text{ref}}\big]\Bigg] \tag{1}
$$
where $\epsilon$ and $\beta$ are hyperparameters, $\mathbb{I}(y_{i,t})$ indicates whether token $y_{i,t}$ is model-generated (used for loss masking, described below), and $\hat{A}_{i,t}$ represents the advantage, computed based on the relative rewards of outputs within each group.
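The groupwise advantage can be sketched in a few lines; this is a minimal illustration assuming the common mean-zero, unit-variance normalization of outcome rewards within each group (the exact normalization used in practice may differ):

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize a group of G outcome rewards into advantages:
    rollouts that beat the group mean get positive advantage,
    rollouts below the mean get negative advantage."""
    G = len(rewards)
    mean = sum(rewards) / G
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct (reward 2.0) and two incorrect (reward 0.0) rollouts:
print(group_advantages([2.0, 0.0, 2.0, 0.0]))
```

Because the baseline is estimated from the group itself, no learned value function is needed, which is what removes the critic from PPO.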
Adapting GRPO for Agentic Reasoning with Tool Integration.
In ARTIST, rollouts alternate between model-generated reasoning steps and tool outputs, capturing agentic interactions with external tools and environments. Applying token-level loss uniformly can cause the model to imitate deterministic tool outputs, rather than learning effective tool invocation strategies. To prevent this, we employ a loss masking strategy: tokens from tool outputs are masked during loss computation, ensuring gradients are only propagated through model-generated tokens. This focuses optimization on the agent’s reasoning and decision-making, while avoiding spurious updates from deterministic tool responses. The complete training procedure for ARTIST with GRPO is summarized in Algorithm 1.
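The loss masking strategy can be sketched as follows. The span-based representation here is illustrative, not the framework's actual data structures: the point is simply that positions covered by tool outputs contribute zero gradient.

```python
def loss_mask(tokens, tool_spans):
    """Return a 0/1 mask over token positions: 1 for model-generated
    tokens (gradients flow), 0 for tool-output tokens (masked out).
    tool_spans is a list of (start, end) half-open index ranges
    covering the deterministic tool outputs inside the rollout."""
    mask = [1] * len(tokens)
    for start, end in tool_spans:
        for t in range(start, end):
            mask[t] = 0
    return mask

# Example: a 10-token rollout whose positions 4..6 are an <output> block.
print(loss_mask(list(range(10)), [(4, 7)]))  # → [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

In training, this mask plays the role of $\mathbb{I}(y_{i,t})$ in the objective: the per-token loss is multiplied by the mask before averaging, so only model-generated tokens are optimized.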
Algorithm 1 Training ARTIST with Group Relative Policy Optimization (GRPO)
1: Policy model $\pi_{\theta}$ , old policy $\pi_{\text{old}}$ , task dataset $\mathcal{D}$ , group size $G$ , masking function $\mathcal{M}$
2: for each training iteration do
3: for each task $q$ in batch do
4: Sample $G$ rollouts $\{y_{1},...,y_{G}\}$ from $\pi_{\text{old}}$ :
5: for each rollout $y_{i}$ do
6: Initialize reasoning chain
7: while not end of episode do
8: Generate next segment: <think> or <tool_name>
9: if tool is invoked then
10: Interact with environment, obtain <output>
11: Append output to reasoning chain
12: end if
13: end while
14: Compute outcome reward $R(y_{i})$
15: end for
16: Compute groupwise advantages $\hat{A}_{i,t}$ for all $y_{i}$
17: Compute importance weights $r_{i,t}$
18: Apply loss masking $\mathcal{M}$ to exclude tool output tokens
19: Compute GRPO loss $\mathcal{L}_{\text{GRPO}}$ and update $\pi_{\theta}$
20: end for
21: end for
2.3 Rollouts in ARTIST
In ARTIST, rollouts are structured to alternate between internal reasoning and interaction with external tools or environments. Unlike standard RL rollouts that consist solely of model-generated tokens, ARTIST employs an iterative framework where the LLM interleaves text generation with tool and environment queries.
Prompt Template.
ARTIST uses a structured prompt template that organizes outputs into four segments: (1) internal reasoning (<think> … </think>), (2) tool or environment queries (<tool_name> … </tool_name>), (3) tool outputs (<output> … </output>), and (4) the final answer (<answer> … </answer>). Upon issuing a tool query, the model invokes the corresponding tool or environment, appends the output, and continues the reasoning cycle until the answer is produced. The complete prompt template is provided in Appendix A.
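A sketch of how such a structured rollout might be parsed into ordered segments; the regex and the use of `<python>` as a stand-in for a concrete `<tool_name>` tag are illustrative assumptions, not the framework's actual parser:

```python
import re

# Hypothetical parser for the four-segment rollout format;
# <python> stands in for a concrete <tool_name> tag.
SEGMENT_RE = re.compile(r"<(think|python|output|answer)>(.*?)</\1>", re.DOTALL)

def parse_rollout(text):
    """Return the rollout as an ordered list of (segment_type, content)."""
    return [(m.group(1), m.group(2).strip()) for m in SEGMENT_RE.finditer(text)]

rollout = (
    "<think>Compute 12*34.</think>"
    "<python>print(12*34)</python>"
    "<output>408</output>"
    "<answer>408</answer>"
)
print(parse_rollout(rollout))
```

Keeping the segments machine-parsable in this way is what makes both tool dispatch and the format reward (Section 2.4) straightforward to implement.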
Rollout Process.
Each rollout consists of these structured segments, with the policy model deciding at each step whether to reason internally or interact with an external resource. Tool invocations may include code execution, API calls, web search, file operations, or actions in interactive environments like web browsers or operating systems. Outputs from these interactions are incorporated back into the reasoning chain, enabling iterative refinement and adaptive strategy adjustment based on feedback. See Appendix B for illustrative rollout examples in various scenarios.
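The rollout loop described above can be sketched as follows. `policy_generate` and `run_tool` are hypothetical stand-ins for the LLM and the tool backend, and `<python>` is used as the example tool tag; this is an outline under those assumptions, not the framework's implementation:

```python
def rollout(policy_generate, run_tool, prompt, max_turns=8):
    """One ARTIST-style rollout: alternate model generation with tool
    execution until the model emits an <answer> block."""
    chain = prompt
    for _ in range(max_turns):
        segment = policy_generate(chain)           # generate up to a tool call or answer
        chain += segment
        if "<answer>" in segment:                  # terminal: final answer produced
            break
        if "</python>" in segment:                 # tool query issued
            code = segment.split("<python>")[1].split("</python>")[0]
            result = run_tool(code)                # execute in the environment
            chain += f"<output>{result}</output>"  # feed result back into the chain
    return chain
```

Note that the tool output is appended to the context before the next generation step, which is what lets the model condition its subsequent reasoning on environment feedback.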
2.4 Reward Design
A well-designed reward function is essential for effective RL training, as it provides the optimization signal that steers the policy toward desirable behaviors. In GRPO, outcome-based rewards have proven both efficient and effective, supporting robust policy improvement without requiring dense intermediate supervision. However, ARTIST introduces new challenges for reward design: beyond producing correct final answers, the model must also structure its reasoning, tool use, and environment interactions coherently and reliably. To address this, we use a composite reward mechanism that provides fine-grained feedback for each rollout. The reward function in ARTIST consists of three key components:
Answer Reward:
This component assigns a positive reward when the model generates the correct final answer, as identified within the <answer> … </answer> tags. The answer reward directly incentivizes the model to solve the task correctly, ensuring that the ultimate objective of the reasoning process is met.
Format Reward:
To promote structured and interpretable reasoning, we introduce a format reward that encourages adherence to the prescribed prompt template. This reward checks two main criteria: (1) the correct order of execution—reasoning (<think>), tool call (<tool_name>), and tool output (<output>) is maintained throughout the rollout; and (2) the final answer is properly enclosed within <answer> tags. The format reward helps the model learn to organize its outputs in a way that is both consistent and easy to parse, which is essential for reliable tool invocation and downstream evaluation.
Tool Execution Reward:
During each tool interaction, the model’s queries may or may not be well-formed or executable. To encourage robust and effective tool use, we introduce a tool execution reward, defined as the fraction of successful tool calls:
$$
\text{Tool Execution Reward}=\frac{\text{Tool}_{\text{success}}}{\text{Tool}_{\text{total}}}
$$
where $\text{Tool}_{\text{success}}$ and $\text{Tool}_{\text{total}}$ denote the number of successful and total tool calls, respectively. This reward ensures that the model learns to formulate tool queries that are syntactically correct and executable in the target environment.
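Putting the three components together, a per-rollout reward might look like the sketch below. The relative weights are illustrative assumptions for this example only; the math-specific values actually used are given in Section 3.1:

```python
def composite_reward(answer_correct, format_ok, tool_success, tool_total):
    """Illustrative combination of the three ARTIST reward components.
    Weights here are assumptions, not the paper's reported values."""
    r_answer = 1.0 if answer_correct else 0.0
    r_format = 0.5 if format_ok else 0.0
    # Fraction of well-formed, executable tool calls (0 if no calls made).
    r_tool = tool_success / tool_total if tool_total else 0.0
    return r_answer + r_format + r_tool
```

Because all three terms are computed from the final rollout alone, the signal remains purely outcome-based: no step-level labels are required.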
3 Case Study
We present two case studies to illustrate the versatility of ARTIST. The first focuses on complex mathematical reasoning, where ARTIST leverages external tools like a Python interpreter for multi-step computation. The second examines multi-turn function calling, demonstrating how ARTIST orchestrates sequential tool use and adapts its reasoning strategy in dynamic, real-world environments.
3.1 Complex Mathematical Reasoning with Agentic Tool Use
While LLMs have demonstrated strong performance on mathematical reasoning tasks using natural language, they often struggle with complex problems that require precise, multi-step calculations—such as multiplying large numbers or evaluating definite integrals. Purely text-based reasoning in these cases can be inefficient and error-prone. To address this, ARTIST augments LLMs with access to an external Python interpreter, enabling the model to offload complex computations and verify intermediate results programmatically.
Prompt Template
During rollouts, the model is prompted to structure its output using special tokens. Internal reasoning is enclosed within <think> … </think> tags, while any code intended for execution is placed within <python> … </python> tags. The code is executed by an external Python interpreter, and the resulting output is returned to the model inside <output> … </output> tags. The final answer is provided within <answer> … </answer> tags. The complete prompt used for guiding the model is listed in Appendix A.
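The interpreter step can be sketched as below: extract the `<python>` block, run it, and wrap the captured stdout in `<output>` tags. This is a minimal illustration only; a real deployment would sandbox execution and enforce timeouts:

```python
import io
import re
from contextlib import redirect_stdout

def execute_python_block(rollout_text):
    """Run the last <python>...</python> block from a rollout and return
    its stdout wrapped in <output> tags, as fed back to the model."""
    blocks = re.findall(r"<python>(.*?)</python>", rollout_text, re.DOTALL)
    if not blocks:
        return "<output></output>"
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(blocks[-1], {})      # executed in a fresh namespace
        result = buf.getvalue().strip()
    except Exception as e:            # execution errors are surfaced to the model
        result = f"Error: {e}"
    return f"<output>{result}</output>"

print(execute_python_block("<python>print(2**10)</python>"))  # → <output>1024</output>
```

Surfacing the error message (rather than silently failing) is what gives the model the feedback it needs for the self-correction behavior discussed later.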
Reward Design
To guide the RL algorithm, we employ three reward components tailored to the mathematical reasoning setting:
- Answer Reward: The model receives a reward of 2 if the final answer exactly matches the ground truth, and 0 otherwise:
$$
R_{\text{answer}}=\begin{cases}2,&\text{if }y_{\text{pred}}=y_{\text{ground}}\\0,&\text{otherwise}\end{cases}
$$
- Format Reward: To encourage structured outputs, we provide both relaxed and strict format rewards:
- Relaxed: For each of the four required tag pairs (<think>, <python>, <output>, <answer>) present in the rollout, a reward of 0.125 is given, up to a maximum of 0.5.
- Strict: An additional reward of 0.5 is awarded if (1) all tags are present, (2) the internal order of opening/closing tags is correct, and (3) the overall structure follows the sequence: <think> → <python> → <output> → <answer>.
- Tool Execution Reward: This reward is proportional to the fraction of successful Python code executions:
$$
R_{\text{tool}}=\frac{\text{Tool}_{\text{success}}}{\text{Tool}_{\text{total}}}
$$
where $\text{Tool}_{\text{success}}$ and $\text{Tool}_{\text{total}}$ denote the number of successful and total tool calls, respectively. The reward ranges from 0 (no Python code was produced, or every code block failed to execute) to a maximum of 1 (all Python code executed successfully).
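The relaxed and strict format rewards above can be sketched as a simple tag check. The string-matching implementation is an illustrative simplification (a production checker would also verify tag pairing and nesting):

```python
TAGS = ["think", "python", "output", "answer"]

def format_reward(rollout):
    """Relaxed + strict format reward: 0.125 per tag pair present
    (capped at 0.5), plus 0.5 when all tags appear in the order
    <think> -> <python> -> <output> -> <answer>."""
    relaxed = 0.0
    for tag in TAGS:
        if f"<{tag}>" in rollout and f"</{tag}>" in rollout:
            relaxed += 0.125
    relaxed = min(relaxed, 0.5)

    # Strict check: first occurrence of each opening tag must be in order.
    positions = [rollout.find(f"<{tag}>") for tag in TAGS]
    ordered = all(p >= 0 for p in positions) and positions == sorted(positions)
    return relaxed + (0.5 if ordered else 0.0)
```

A fully well-formed rollout thus earns the maximum format reward of 1.0, while a rollout with only some tags still receives partial credit, giving the policy a smoother learning signal early in training.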
Example Analysis
Appendix D shows two examples of ARTIST on complex math tasks. Across both examples, ARTIST demonstrates a robust agentic reasoning process that tightly integrates language-based thinking with programmatic tool use. The model systematically decomposes complex math problems into logical steps, alternates between internal reasoning and external computation, and iteratively refines its approach based on intermediate results.
A key strength of ARTIST is the emergence of advanced agentic reasoning capabilities:
- Self-Refinement: The model incrementally adjusts its strategy, such as increasing candidate values or restructuring code, to converge on a correct solution.
- Self-Correction: When encountering errors (e.g., tool execution failures or incorrect intermediate results), the model diagnoses the issue and adapts its subsequent actions accordingly.
- Self-Reflection: At each step, the model evaluates and explains its reasoning, validating results through repeated computation or cross-verification.
These capabilities are not explicitly supervised, but arise naturally from the agentic rollout structure and reward design in ARTIST. As a result, the model is able to solve complex, multi-step mathematical problems with high reliability, leveraging both its language understanding and external tools. This highlights the effectiveness of reinforcement learning with tool integration for enabling flexible, adaptive, and interpretable problem-solving in LLMs. See Appendix D for more details.
3.2 Multi-Turn Function Calling with Agentic Reasoning and Tool Use
Function calling is a core capability in agentic LLM applications, enabling models to answer user queries by invoking external functions to fetch information, perform deterministic operations, or automate workflows. In an agentic setup, access to a suite of functions and their responses forms an interactive environment for LLM agents. This paradigm is increasingly relevant for real-world automation, especially with the adoption of standards like the Model Context Protocol (MCP) (Hou et al., 2025) that streamline tool and context integration for LLMs.
In this work, we focus on enhancing LLMs’ ability to perform multi-turn, multi-function tasks—scenarios that require the agent to coordinate multiple function calls, manage intermediate state, and interact with users over extended dialogues. We evaluate ARTIST on BFCL v3 (Yan et al., 2024) and $\tau$-bench (Yao et al., 2024), two challenging benchmarks that require long-context reasoning, multiple user interactions, and cascaded function calls (see Experimental setup for more details).
Prompt Template
To maximize performance, we prompt the model to explicitly reason step by step using <reasoning> … </reasoning> tags before issuing function calls within <tool> … </tool> tags. This structure encourages the model to articulate its logic, plan tool use, and adaptively respond to environment feedback. The prompt is provided in Appendix A.
Reward Design
We employ two reward components tailored to the function calling setting. Each is designed to reinforce a critical aspect of agentic reasoning and tool use:
- State Reward: This reward encourages the model to maintain and update the correct state throughout a multi-turn interaction. In the context of function calling, State_match is the number of state variables (e.g., current working directory, selected files, user preferences) that the model correctly tracks or updates, while State_total is the total number of relevant state variables for the task. For example, if the model needs to keep track of which files have been selected and which have been compressed, correctly maintaining both would yield a higher state reward. The reward is scaled by $SR_{\text{max}}$ (set to 0.5), the maximum possible state reward:
$$
R_{\text{state}}=SR_{\text{max}}\times\frac{\text{State}_{\text{match}}}{\text{State}_{\text{total}}}
$$
This reward ensures the agent maintains coherent context and does not lose track of important information across multiple tool calls.
- Function Reward: This reward incentivizes the model to issue the correct sequence of function calls. Here, Functions_matched is the number of function calls that match the expected calls (in terms of both function name and arguments), and Functions_total is the total number of function calls required for the task. For example, if a task requires three specific function calls (listing files, compressing, and emailing), and the model gets two correct, it receives partial credit. $FR_{\text{max}}$ (set to 0.5) is the maximum function reward:
$$
R_{\text{function}}=FR_{\text{max}}\times\frac{\text{Functions}_{\text{matched}}}{\text{Functions}_{\text{total}}}
$$
This reward encourages the agent to plan and execute the correct sequence of actions, mirroring how a human would follow a checklist to complete a task.
- Format Reward: To encourage structured outputs, we provide both relaxed and strict format rewards:
- Relaxed: A reward of 0.025 is given for each of the required tags from the two tag pairs (<reasoning>, </reasoning>, <tool>, </tool>) present in the rollout, up to a maximum of 0.1.
- Strict: An additional reward of 0.1 is awarded if (1) all tags are present, (2) each opening tag is matched by its closing tag, and (3) the overall structure follows the sequence <reasoning> $→$ </reasoning> $→$ <tool> $→$ </tool>.
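Taken together, the state, function, and format components can be sketched as a single reward function. This is a minimal illustration with our own helper names and tag-checking logic; the paper's exact matching rules may differ:

```python
import re

SR_MAX, FR_MAX = 0.5, 0.5  # maximum state and function rewards
TAGS = ["<reasoning>", "</reasoning>", "<tool>", "</tool>"]

def format_reward(rollout: str) -> float:
    # Relaxed: 0.025 per required tag present, capped at 0.1.
    relaxed = min(0.025 * sum(tag in rollout for tag in TAGS), 0.1)
    # Strict: extra 0.1 if the tags appear in the required order
    # <reasoning> ... </reasoning> ... <tool> ... </tool>.
    strict = 0.1 if re.search(
        r"<reasoning>.*?</reasoning>.*?<tool>.*?</tool>", rollout, re.DOTALL
    ) else 0.0
    return relaxed + strict

def total_reward(state_match, state_total, fn_matched, fn_total, rollout):
    r_state = SR_MAX * state_match / state_total
    r_function = FR_MAX * fn_matched / fn_total
    return r_state + r_function + format_reward(rollout)
```

For instance, a rollout that tracks one of two state variables and matches two of three expected function calls, with well-formed tags, earns 0.25 + 0.33 + 0.2.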
Example Analysis
Appendix E presents four representative examples of ARTIST on multi-turn function calling tasks drawn from the BFCLv3 and $\tau$ -Bench datasets. Across these diverse scenarios—including sequential vehicle control, travel booking and cancellation, item exchange with preference handling, and persistent customer support— ARTIST exhibits a robust agentic reasoning process that tightly integrates step-by-step language-based planning with dynamic tool invocation in interactive environments. The model systematically interprets user requests, sequences and adapts multiple function calls, manages dependencies and state, handles ambiguous or incomplete information, and flexibly recovers from tool errors or workflow constraints.
A key strength of ARTIST is the emergence of advanced agentic reasoning capabilities:
- Self-Refinement: The model incrementally updates its plan in response to evolving requirements, user clarifications, or environment feedback—such as reordering actions, filtering options based on nuanced preferences, or skipping unnecessary steps to efficiently achieve the user’s goal.
- Self-Correction: When faced with tool execution errors, unmet preconditions, or mistaken assumptions (e.g., missing dependencies, unsupported tool actions, or incorrect order/item IDs), the model diagnoses the cause, executes corrective actions, and retries the intended operation without external intervention.
- Self-Reflection: At each stage, the model articulates its reasoning, summarizes the current state, confirms details with the user, and validates outcomes before proceeding, ensuring the overall workflow remains coherent, interpretable, and user-aligned.
See Appendix E for detailed analysis of the examples on multi-turn function calling tasks.
4 Experimental Setup
4.1 Dataset and Evaluation Metrics
We assess ARTIST in two key reasoning domains: complex mathematical problem solving and multi-turn function calling. For each setting, we detail the training and evaluation datasets, along with the metrics used to measure performance.
4.1.1 Complex Mathematical Reasoning
Training Dataset
We curate a training set of 20,000 math word problems, primarily sourced from NuminaMath (LI et al., 2024). The NuminaMath dataset spans a wide range of complexity, from elementary arithmetic and algebra to advanced competition-level problems, ensuring the model is exposed to diverse question types and reasoning depths during training. Each problem is paired with a ground-truth final answer, enabling outcome-based reinforcement learning without requiring intermediate step supervision.
Evaluation Dataset
To assess generalization and robustness, we evaluate on four established math benchmarks:
- MATH-500 (Hendrycks et al., 2021): A diverse set of 500 competition-style math problems.
- AIME (hug, a) and AMC (hug, b): Standardized high school mathematics competition datasets.
- Olympiad Bench (He et al., 2024): A challenging set of olympiad-level problems requiring multi-step reasoning.
Evaluation metrics
We report Pass@1 accuracy: the percentage of problems for which the model’s final answer exactly matches the ground truth. This metric reflects the model’s ability to arrive at a correct solution in a single attempt.
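Concretely, Pass@1 with a single sample reduces to an exact-match rate over final answers. A simplified scoring sketch (real graders additionally normalize mathematically equivalent answer forms, e.g. `1/2` vs. `0.5`):

```python
def pass_at_1(predictions, references):
    """Fraction of problems whose single predicted final answer exactly
    matches the ground truth, after trivial whitespace normalization."""
    assert len(predictions) == len(references)
    correct = sum(
        p.strip() == r.strip() for p, r in zip(predictions, references)
    )
    return correct / len(references)
```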
4.1.2 Multi-Turn Function Calling
Training Dataset
For multi-turn function calling, we use a subset of 100 annotated tasks from the base multi-turn category of BFCL v3 (Berkeley Function-Calling Leaderboard) as the training dataset. The remaining 100 annotated tasks are used as the validation dataset. Each task requires the agent to issue and coordinate multiple function calls in response to user queries, often involving state tracking and error recovery.
Evaluation Datasets
We evaluate on two major benchmarks:
- BFCL v3 (Yan et al., 2024): This benchmark covers a range of domains (vehicle control, trading bots, travel booking, file systems, and cross-functional APIs) and includes several subcategories:
- Missing parameters: Tasks where the agent must identify and request missing information.
- Missing function: Scenarios where no available function can fulfill the user’s request.
- Long context: Tasks with extended, information-dense user interactions.
- $\tau$ -bench (Yao et al., 2024): A conversational benchmark simulating realistic user-agent dialogues in airline (50 tasks) and retail (115 tasks) domains. The agent must use domain-specific APIs and follow policy guidelines to achieve a predefined goal state in the system database.
Evaluation Metric
We use Pass@1 accuracy, defined as the fraction of tasks for which the agent’s final response is correct and the resulting environment state matches the benchmark’s ground truth.
4.2 Implementation Details
Complex Mathematical Reasoning
We train ARTIST using the Qwen/Qwen2.5-7B-Instruct (Qwen et al., 2025) and Qwen/Qwen2.5-14B-Instruct (Qwen et al., 2025) models. For each training instance, we sample 6 reasoning rollouts per question with a temperature of 1.0 to encourage exploration. Following prior work (Xu et al., 2025), we set a high generation budget of 8,000 tokens to accommodate long-form, multi-step reasoning. During rollouts, the model alternates between text generation and tool invocation, using a Python interpreter as the external tool. The interpreter executes code via Python’s exec() function, and returns structured feedback to the model, including successful outputs, missing print statements, or detailed error messages.
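The interpreter loop described above can be approximated by a small wrapper around Python's exec(). This is a simplified sketch without the sandboxing and timeout handling a real deployment needs, and the exact feedback messages are our own:

```python
import contextlib
import io
import traceback

def run_python_tool(code: str) -> str:
    """Execute model-generated code and return structured feedback:
    captured stdout on success, a hint if nothing was printed, or the
    error message on failure."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # fresh globals for each tool call
    except Exception:
        # Return only the final line of the traceback, e.g.
        # "ZeroDivisionError: division by zero".
        last = traceback.format_exc(limit=1).strip().splitlines()[-1]
        return "Error: " + last
    out = buf.getvalue()
    return out if out else "No output: add a print() to return a result."
```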
Multi-Turn Function Calling
For multi-turn function calling, we use Qwen/Qwen2.5-7B-Instruct as the base model. Training is performed with GRPO, sampling 8 rollouts per question at a temperature of 1.0. Each rollout consists of multiple tool calls and their outputs, with the number of user turns per task set to 1 to control rollout complexity. The system returns the output of each function call in <tool_result> tags, including explicit failure messages when applicable. The maximum context window is set to 16384 tokens, and the maximum response length per rollout is 2048 tokens. Losses for rollouts which exceed the maximum completion length are masked.
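The length masking mentioned above can be illustrated with a per-token loss mask: rollouts that exceed the completion budget contribute nothing to the training loss. A pure-Python sketch (the actual trainer applies the mask to tensors):

```python
def masked_mean_loss(token_losses, max_len):
    """Average per-token loss, zeroing out rollouts whose length
    exceeds `max_len`. `token_losses` is a list of per-rollout lists."""
    masks = [
        [0.0] * len(seq) if len(seq) > max_len else [1.0] * len(seq)
        for seq in token_losses
    ]
    num = sum(l * m for seq, ms in zip(token_losses, masks)
              for l, m in zip(seq, ms))
    den = sum(m for ms in masks for m in ms)
    return num / den if den else 0.0
```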
See Appendix C for additional details, including batch size, optimizer settings, learning rate schedule, hardware specifications, and the exact formatting of tool outputs and error handling. All code, hyperparameters, and configuration files will be released soon.
4.3 Baselines
To rigorously evaluate the effectiveness of ARTIST, we compare its performance against a comprehensive set of baselines spanning four distinct categories in both complex mathematical reasoning and multi-turn function calling tasks. This diverse selection ensures a fair and thorough assessment of ARTIST ’s capabilities relative to both state-of-the-art and widely used approaches.
- Frontier LLMs (Frontier): Leading proprietary models such as GPT-4o (OpenAI, 2024) and DeepSeek R1 (DeepSeek-AI et al., 2025), representing the current state-of-the-art in large-scale language modeling and serving as strong upper bounds for text-based and reasoning performance.
- Open-Source Tool-Augmented LLMs (Tool-OS): Models such as Numina (Beeching et al., 2024), ToRA (Gou et al., 2024), and PAL (Gao et al., 2023), which are designed to leverage external tools or code execution. These models are directly relevant for comparison with ARTIST ’s tool-augmented approach.
- Base LLMs (Base): Standard open-source models such as Qwen 2.5-7B and Qwen 2.5-14B, evaluated in their vanilla form without tool augmentation. These provide a transparent, reproducible, and widely adopted baseline.
- Base LLMs + External Tools with Prompt Modifications (Base-Prompt+Tools): Base LLMs equipped with access to external tools, but relying on prompt engineering or reasoning token modifications (e.g., explicit tool-use instructions or reasoning tags). This tests the effectiveness of prompt-based tool integration and reasoning.
5 Results
| Method | MATH-500 | AIME | AMC | Olympiad |
| --- | --- | --- | --- | --- |
| Frontier LLMs | | | | |
| GPT-4o | 0.630 | 0.080 | 0.430 | 0.290 |
| Frontier Open-Source LLMs | | | | |
| DeepSeek-R1-Distill-Qwen-7B | 0.858 | 0.211 | 0.675 | 0.395 |
| DeepSeek-R1 | 0.850 | 0.300 | 0.810 | 0.460 |
| Open-Source Tool-Augmented LLMs | | | | |
| NuminaMath-TIR-7B | 0.530 | 0.060 | 0.240 | 0.190 |
| ToRA-7B | 0.410 | 0.000 | 0.070 | 0.130 |
| ToRA-Code-7B | 0.460 | 0.000 | 0.100 | 0.160 |
| Qwen2.5-7B (PAL) | 0.100 | 0.000 | 0.050 | 0.020 |
| Base LLMs | | | | |
| Qwen2.5-7B-Instruct | 0.620 | 0.040 | 0.350 | 0.210 |
| Qwen2.5-14B-Instruct | 0.700 | 0.060 | 0.330 | 0.240 |
| Base LLMs + Tools via Prompt | | | | |
| Qwen2.5-7B-Instruct + Python Tool | 0.629 | 0.122 | 0.349 | 0.366 |
| Qwen2.5-14B-Instruct + Python Tool | 0.671 | 0.100 | 0.410 | 0.371 |
| ARTIST | | | | |
| Qwen2.5-7B-Instruct + ARTIST | 0.676 | 0.156 | 0.470 | 0.379 |
| Qwen2.5-14B-Instruct + ARTIST | 0.726 | 0.122 | 0.550 | 0.420 |
Table 1: Pass@1 accuracy on four mathematical reasoning benchmarks. ARTIST consistently outperforms all baselines, especially on complex tasks.
5.1 Results: Complex Math Reasoning
We conduct a comprehensive evaluation of ARTIST against a diverse set of baselines for complex mathematical reasoning.
5.1.1 Quantitative Results and Comparison
AMC, AIME, and Olympiad: Impact of Complexity
Table 1 reports Pass@1 accuracy across four challenging benchmarks: MATH-500, AIME, AMC, and Olympiad Bench. On the most challenging benchmarks—AMC, AIME, and Olympiad— ARTIST delivers substantial absolute improvements over all baseline categories. For example, on AMC, Qwen2.5-7B- ARTIST achieves 0.47 Pass@1, outperforming the base model (0.35) by +12.0%, the prompt-based tool baseline (0.349) by +12.1%, and the best open-source tool-augmented baseline (NuminaMath-TIR, 0.24) by +23.0%. The gains are even larger for Qwen2.5-14B- ARTIST, which achieves 0.55 on AMC, an absolute improvement of +22.0% over the base model and +14.0% over the prompt-based tool baseline.
Similar trends are observed on AIME and Olympiad. On AIME, Qwen2.5-7B- ARTIST improves over the base model by +11.6% (0.156 vs. 0.04), and over the prompt-based tool baseline by +3.4% (0.156 vs. 0.122). On Olympiad, the improvements are +16.9% over the base (0.379 vs. 0.21) and +1.3% over the prompt-based tool baseline (0.379 vs. 0.366). For Qwen2.5-14B- ARTIST, the gains are even more pronounced: +18.0% on Olympiad (0.42 vs. 0.24) and +6.2% on AIME (0.122 vs. 0.06) compared to the base model. Similarly, compared with Base-Prompt+Tools, Qwen2.5-14B- ARTIST shows a boost of 14% on AMC, 5.5% on MATH-500, 5.9% on Olympiad, and 2.2% on AIME. Compared to open-source tool-augmented LLMs, ARTIST achieves up to +35.9% improvement over PAL on Olympiad and +9.6% over NuminaMath-TIR on AIME.
These results highlight that as the complexity of the reasoning task increases, the advantages of dynamic tool use and agentic reasoning become more significant. ARTIST is able to decompose complex problems, invoke external computation when needed, and iteratively refine its solutions—capabilities that are critical for high-level competition math (see Figure 4).
MATH-500: Internal Knowledge vs. Tool Use
On MATH-500, which contains a broader mix of problem difficulties but is generally less challenging than AMC, AIME, or Olympiad, the absolute improvements of ARTIST over baselines are more modest. For Qwen2.5-7B, ARTIST achieves 0.676, a +5.6% improvement over the base model (0.62) and +4.7% over the prompt-based tool baseline (0.629). For Qwen2.5-14B, the gains are +2.6% over base (0.726 vs. 0.7) and +5.5% over the prompt-based tool baseline (0.726 vs. 0.671). This suggests that for less complex problems, the model’s internal knowledge is often sufficient, and the marginal benefit of agentic tool use is reduced (see Figure 4).
Summary: ARTIST delivers the largest gains on the most complex benchmarks (AMC, AIME, Olympiad), where dynamic tool integration and multi-step agentic reasoning are essential, highlighting that RL-driven tool use is critical for solving advanced mathematical problems.
Figure 3: Qwen2.5-7B-Instruct: Performance on Math datasets.
Figure 4: Qwen2.5-14B-Instruct: Performance on Math datasets.
ARTIST vs. Base LLMs + External Tools with Prompt Modifications (Base-Prompt+Tools)
Even though the base model with prompt modifications and tool access (Base-Prompt+Tools) is provided with the ability to call external tools, it consistently underperforms compared to ARTIST. Results indicate that, without explicit agentic training, the model struggles to learn when and how to invoke tools effectively, often failing to integrate tool outputs into the broader reasoning process. In contrast, ARTIST is explicitly trained to coordinate tool use within its reasoning chain, enabling it to leverage external computation in a dynamic and context-aware manner. This highlights the importance of agentic reinforcement learning for teaching LLMs to not only access tools, but to strategically incorporate their capabilities to enhance overall problem-solving performance.
Summary: Prompting base models to use tools yields limited improvements; explicit agentic RL training in ARTIST enables models to learn effective tool use, resulting in up to 12.1% higher accuracy on AMC and consistently superior performance across all tasks.
ARTIST vs. Open-Source Tool-Augmented LLMs (Tool-OS)
Compared to tool-integrated baselines such as ToRA, NuminaMath-TIR, ToRA-Code, and PAL, ARTIST achieves substantial improvements at the same model scale. For Qwen2.5-7B, ARTIST outperforms ToRA-7B by an average of 26.7% (15.25% $→$ 42%), NuminaMath-TIR by 16.5% (25.5% $→$ 42%), ToRA-Code by 24% (18% $→$ 42%), and PAL by 37.7% (4.25% $→$ 42%) across all benchmarks. These gains highlight that RL-driven agentic reasoning enables more effective and adaptive tool use than approaches relying on tool finetuning or inference-only tool integration.
Summary: RL-based agentic training in ARTIST enables LLMs to integrate and reason with tools far more effectively than prior tool-augmented methods, yielding double-digit accuracy improvements across challenging math tasks.
ARTIST vs. Frontier LLMs (Frontier)
ARTIST outperforms GPT-4o across all benchmarks, even at the 7B scale, with absolute gains of 8.9% on Olympiad (37.9% vs. 29%), 7.6% on AIME (15.6% vs. 8%), 4.6% on MATH-500 (67.6% vs. 63%), and 4% on AMC (47% vs. 43%). At 14B, the gap widens further, with improvements of 13% on Olympiad, 12% on AMC, 9.6% on MATH-500, and 4.2% on AIME. While DeepSeek-R1 and its distilled variant perform strongly, they require large teacher models or additional supervised alignment, whereas ARTIST achieves competitive results using only outcome-based RL and tool integration. Incorporating such techniques could further enhance ARTIST, which we leave for future work.
Summary: ARTIST surpasses state-of-the-art frontier models like GPT-4o on all math benchmarks, demonstrating that agentic RL with tool integration can close and even exceed the performance gap with much larger proprietary models.
5.1.2 Metrics
We evaluate the effectiveness of ARTIST using three key metrics: (1) Reward Score (solution quality), (2) Number of Tool Calls (external tool utilization), and (3) Response Length (reasoning depth). Figure 5 compares ARTIST with the Base-Prompt+Tools across these dimensions.
Figure 5: Average reward score, Tool call and the response length metric across all math datasets (ARTIST vs. Base-Prompt+Tools).
Reward Score (Solution Quality)
ARTIST dramatically improves solution quality, as measured by the reward score, on both AMC and AIME datasets. For example, on AMC, the average reward score nearly triples from 0.9 (Base-Prompt+Tools) to 2.8 (ARTIST), while on AIME it quadruples from 0.4 to 1.7. On Olympiad, ARTIST improves the reward score from 2.4 to 2.6, and on MATH 500, from 2.9 to 3.3. This reflects ARTIST ’s ability to produce more correct, well-structured, and complete solutions, especially on challenging problems.
Number of Tool Calls (External Tool Utilization)
ARTIST learns to invoke external tools far more frequently and strategically than the baseline, particularly on harder datasets. On AIME, the Base-Prompt+Tools baseline averages only 0.3 tool calls per query, whereas ARTIST averages over 3 tool calls per query. On Olympiad, ARTIST ’s tool usage is slightly lower than the baseline, but remains high and effective for solving complex tasks. Interestingly, on MATH 500, ARTIST reduces tool usage compared to the baseline, suggesting that it selectively avoids unnecessary tool calls when internal reasoning suffices. This adaptive and context-sensitive tool usage is strongly correlated with improved reward scores, demonstrating that dynamic tool invocation based on task difficulty is critical for solving complex problems.
Response Length (Reasoning Depth)
ARTIST generates substantially longer and more detailed responses, with average response length up to twice that of the baseline, especially on AIME. This increase in length indicates that ARTIST engages in more thorough, multi-step reasoning, rather than shortcutting to an answer. On Olympiad, response length increases moderately, reflecting deeper reasoning. However, on MATH 500, the response lengths are roughly comparable between ARTIST and the baseline, indicating that ARTIST maintains concise reasoning when appropriate without sacrificing quality. Overall, the richer reasoning traces produced by ARTIST directly contribute to its superior performance on complex mathematical problems.
Summary: These metrics collectively demonstrate that ARTIST not only achieves higher solution quality, but does so by leveraging external tools more effectively and engaging in deeper, more interpretable reasoning.
5.2 Results: Multi-Turn Function Calling
We evaluate ARTIST on two challenging multi-turn function calling benchmarks: $\tau$ -bench (Airline and Retail domains) and BFCL v3 (Missing Function, Missing Parameters, and Long Context subsets). These benchmarks test the agent’s ability to reason over extended dialogues, manage state, and coordinate multiple tool calls under realistic and adversarial conditions.
| Benchmark | $\tau$-bench | | BFCL v3 | | |
| --- | --- | --- | --- | --- | --- |
| Method | Airline | Retail | Missing Function | Missing Parameters | Long Context |
| Frontier LLMs | | | | | |
| GPT-4o | 0.460 | 0.604 | 0.410 | 0.355 | 0.545 |
| Frontier Open-Source LLMs | | | | | |
| Llama-3-70B* | 0.148 | 0.144 | 0.130 | 0.105 | 0.095 |
| Deepseek-R1 | – | – | 0.155 | 0.110 | 0.115 |
| Qwen2.5-72B-Instruct | – | – | 0.245 | 0.200 | 0.155 |
| Base LLMs | | | | | |
| Qwen2.5-7B-Instruct | 0.120 | 0.180 | 0.085 | 0.060 | 0.040 |
| Base LLMs + Reasoning via Prompt | | | | | |
| Qwen2.5-7B-Instruct + Reasoning | 0.120 | 0.200 | 0.105 | 0.055 | 0.055 |
| ARTIST | | | | | |
| Qwen2.5-7B-Instruct + ARTIST | 0.260 | 0.240 | 0.105 | 0.065 | 0.130 |
Table 2: Pass@1 accuracy on five multi-turn, multi-step function calling benchmarks. ARTIST consistently outperforms baselines, especially on complex tasks. *Llama-3.1-70B was used for BFCL v3 evaluation.
5.2.1 Quantitative Results and Comparison
Table 2 reports Pass@1 accuracy across five challenging benchmarks: $\tau$ -bench (Airline, Retail) and BFCL v3 (Missing Function, Missing Parameters, Long Context).
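For reference, Pass@1 here is simply the fraction of tasks whose single attempt is judged successful. A minimal sketch (the task outcomes are hypothetical):

```python
def pass_at_1(outcomes):
    """Pass@1: fraction of tasks whose first (and only) attempt succeeds."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# e.g., 13 successful tasks out of 50 would yield 0.260, as on Airline
airline_outcomes = [True] * 13 + [False] * 37
score = pass_at_1(airline_outcomes)
```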
ARTIST vs. Base LLMs
ARTIST delivers consistent and substantial improvements over base LLMs without explicit agentic reasoning or tool integration. On $\tau$ -bench, Qwen2.5-7B-Instruct + ARTIST achieves 0.260 (Airline) and 0.240 (Retail), more than doubling the base model on Airline (0.120) and improving substantially on Retail (0.180). On BFCL v3, ARTIST improves accuracy on Long Context by +9.0% (0.130 vs. 0.040), with smaller but consistent gains on Missing Function (+2.0%, 0.105 vs. 0.085) and Missing Parameters (+0.5%, 0.065 vs. 0.060). These results demonstrate that agentic RL with tool integration enables the model to better manage multi-step workflows, recover from errors, and maintain context over extended interactions.
ARTIST vs. Base LLMs + Reasoning via Prompt (Base-Prompt+Tools)
Compared to prompt-based reasoning, ARTIST shows clear gains, particularly on the hardest tasks. On $\tau$ -bench, ARTIST outperforms the prompt baseline by +14.0% (Airline, 0.260 vs. 0.120) and +4.0% (Retail, 0.240 vs. 0.200). On BFCL v3, the improvement is most pronounced on Long Context (+7.5%, 0.130 vs. 0.055), while performance is comparable on Missing Function and Missing Parameters. This highlights that, as with math, prompt engineering alone is insufficient for robust multi-turn tool use; explicit RL-based agentic training is crucial for learning when and how to invoke tools in complex, evolving scenarios (see Figure 6).
ARTIST vs. Frontier and Open-Source LLMs (FRONT)
While GPT-4o achieves the highest scores on most BFCL v3 subsets (e.g., 0.410 on Missing Function, 0.545 on Long Context), ARTIST narrows the gap and in some cases outperforms much larger open-source models. For example, on BFCL v3 Long Context, ARTIST (0.130) surpasses Llama-3-70B (Grattafiori et al., 2024) (0.095) and Deepseek-R1 (0.115), despite their far larger parameter counts. On $\tau$ -bench, ARTIST outperforms Llama-3-70B by a wide margin (0.260 vs. 0.148 on Airline; 0.240 vs. 0.144 on Retail), despite being a much smaller model. These results demonstrate the scalability and generalization of agentic RL with tool integration.
Summary: ARTIST achieves the largest gains on the most challenging tasks (Long Context, Airline), where multi-turn reasoning, context tracking, and adaptive tool use are critical. Its ability to maintain state, recover from errors, and flexibly invoke external tools leads to substantial outperformance over base and prompt-based baselines, demonstrating that explicit agentic RL training is essential for robust, context-aware tool use and reliable function calling in complex environments.
Figure 6: Qwen2.5-7B-Instruct: Performance on $\tau$ -bench and BFCL v3 datasets for Multi-turn Function calling.
Figure 7: Average reward score at different training steps for BFCL v3.
5.2.2 Metrics Analysis for Multi-Turn Function Calling
We evaluate the effectiveness of ARTIST using correctness reward score (state reward + function reward) on BFCL v3 during training, and three key metrics on $\tau$ -bench: (1) reasoning length per tool call, (2) total correct tool calls, and (3) total steps to task completion.
Reward Score (BFCL v3).
Figure 7 shows how the reward score improves on both the train and eval sets during training. On BFCL v3, ARTIST's average training reward improves from 0.55 to 0.85 within 900 steps; on evaluation, the reward rises from 0.58 to 0.72 (a relative gain of 24%), demonstrating robust generalization and effective learning of tool-use strategies.
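The correctness reward above decomposes into a state component and a function-call component. A schematic sketch follows; the equal weights and exact-match rules are illustrative assumptions, not the paper's precise formulation:

```python
def correctness_reward(pred_state, gold_state, pred_calls, gold_calls,
                       w_state=0.5, w_func=0.5):
    """Composite reward = state reward + function reward (schematic).

    State reward checks the final environment state; function reward scores
    the fraction of expected calls the agent produced. Weights and matching
    rules are assumptions for illustration.
    """
    state_r = 1.0 if pred_state == gold_state else 0.0
    matched = sum(1 for c in gold_calls if c in pred_calls)
    func_r = matched / len(gold_calls) if gold_calls else 1.0
    return w_state * state_r + w_func * func_r

# Hypothetical airline episode: correct final state, 2 of 3 expected calls
r = correctness_reward(
    pred_state={"booked": True},
    gold_state={"booked": True},
    pred_calls=["search_flights", "book_flight"],
    gold_calls=["search_flights", "book_flight", "send_confirmation"],
)
```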
Figure 8: Metrics Analysis for Multi-Turn Function Calling on $\tau$ -bench.
Evaluation Metrics on $\tau$ -bench.
Figure 8 compares ARTIST and Base-Prompt+Tools on $\tau$ -bench in terms of reasoning length per tool call, total correct tool calls, and total steps to task completion, capturing both reasoning depth and tool-use efficiency.
- Reasoning Length per Tool Call: ARTIST achieves an average of 253 tokens per tool call, compared to 190.5 for Base-Prompt+Tools —a 33% increase. This indicates that ARTIST develops richer, more context-aware reasoning before each tool invocation, suggesting a deeper understanding of why each tool call is performed.
- Total Correct Tool Calls: ARTIST makes 676 correct tool calls to complete all tasks, compared to 517 for Base-Prompt+Tools —a 30% increase. This higher number of correct tool invocations directly translates to a substantial improvement in overall task completion, with ARTIST markedly outperforming Base-Prompt+Tools on $\tau$ -bench.
- Total Steps to Completion: Despite more frequent tool use, ARTIST completes all tasks in just 1284 steps, versus 1520 for Base-Prompt+Tools —a 15% reduction. This efficiency is due to ARTIST ’s ability to make the right tool calls at the right time, reducing unnecessary interactions and converging to solutions more quickly.
Summary: ARTIST achieves higher reward scores, richer reasoning, more correct tool calls, and faster task completion for multi-turn function calling. These results highlight the practical benefits of agentic RL: enabling LLMs to reason more deeply, use tools more effectively, and solve complex multi-turn tasks with greater efficiency than prompt-based baselines.
6 Related Work
The reasoning space for large language models (LLMs) broadly encompasses: (1) chain-of-thought (CoT) prompting and structured reasoning, (2) tool-based reasoning and integration, and (3) reinforcement learning (RL) for reasoning and alignment. While each of these areas has advanced LLM capabilities in isolation, their intersection—agentic reasoning with dynamic tool integration via RL—remains underexplored. Below, we review each area and highlight its contributions.
Chain-of-Thought Reasoning and Prompting Approaches
Chain-of-thought (CoT) prompting (Wei et al., 2023) has emerged as a foundational technique for enhancing LLM reasoning by encouraging models to generate explicit intermediate steps. CoT and its variants (Wang et al., 2023; Cai et al., 2023) have demonstrated that breaking down complex problems into smaller subproblems improves accuracy and interpretability. However, CoT primarily leverages the model’s internal knowledge and linguistic capabilities, which can be insufficient for tasks requiring up-to-date information, precise computation, or external action (Turpin et al., 2023). This limitation motivates the integration of external tools and environments to augment LLM reasoning.
Tool-Based Reasoning
Tool-based reasoning (Inaba et al., 2023; Zhuang et al., 2023) extends LLM capabilities by enabling access to external resources such as web search, code interpreters, and APIs. Approaches like PAL (Gao et al., 2023) use prompting to guide models in generating code for external execution, while frameworks such as ART (Paranjape et al., 2023) and ToRA (Gou et al., 2024) train LLMs to invoke tools as part of their reasoning process. Toolformer (Schick et al., 2023) demonstrates self-supervised tool use via API calls. While these methods showcase the potential of tool integration, they often rely on high-quality labeled trajectories or handcrafted prompts, which limit scalability and adaptability. Moreover, tool use is frequently decoupled from the model’s reasoning process, especially during training, reducing the effectiveness of tool integration in dynamic or multi-turn scenarios.
RL-Based Reasoning
Reinforcement learning (RL) (Kaelbling et al., 1996) has emerged as a powerful technique for enhancing the reasoning abilities of LLMs. In the context of large language models, RL was introduced through Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), which aligns model outputs with human preferences by fine-tuning the model using feedback from human evaluators. While foundational algorithms like Proximal Policy Optimization (PPO) (Schulman et al., 2017) established the paradigm of policy optimization with clipped objectives and reward normalization, subsequent innovations like Direct Preference Optimization (DPO) (Rafailov et al., 2024) simplified alignment by directly optimizing preference data without explicit reward modeling. More recently, methods like Simple Preference Optimization (SimPO) (Meng et al., 2024) further streamline the process by using the average log probability of a sequence as an implicit reward, removing the need for a reference model and enhancing training stability.
Group Relative Policy Optimization (GRPO) (Shao et al., 2024) represents a significant advancement in RL-based reasoning enhancement, addressing key limitations of prior approaches. GRPO eliminates the need for a value function by estimating baselines from group scores, significantly reducing training resources. It has been used effectively in models like DeepSeekMath (Shao et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025), where it contributed to improved performance on mathematical reasoning tasks, and has been shown to foster more robust, self-corrective chain-of-thought behaviors (Dao and Vu, 2025).
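The baseline-from-group-scores idea can be sketched compactly: each rollout's reward is normalized by the mean and standard deviation of its sampling group, replacing a learned value-function baseline. This is a minimal illustration of the advantage computation, not the full GRPO objective:

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampling group.

    The group mean serves as the baseline, so no value function is needed;
    eps guards against a zero std when all rollouts score identically.
    """
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# e.g., two correct and two incorrect rollouts in a group of four
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the group's better samples.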
Reasoning and Agentic Systems
Recent advances in LLMs have spurred significant progress in agentic systems (Plaat et al., 2025) —models that can autonomously plan, adapt, and interact with external environments to solve complex tasks. Agentic reasoning (Wu et al., 2025) is increasingly recognized as a critical capability for LLMs, enabling models to move beyond static, single-turn inference and instead engage in dynamic, multi-step decision making. This paradigm is essential for real-world applications where tasks are open-ended, require external information, or involve sequential tool use and adaptation.
Concurrent with our work, a few recent papers have explored integrating external tools into the reasoning process. For example, R1-Searcher (Song et al., 2025) and ReSearch (Chen et al., 2025) incorporate search tools mid-reasoning, primarily for retrieval-augmented generation (RAG) tasks such as question answering. Retool (Feng et al., 2025) leverages a code interpreter as an external tool to solve math problems, but relies on supervised fine-tuning (SFT) with additional annotated data, in contrast to our cold-start RL approach that requires no step-level supervision. Compared to these works, our framework is applied to more complex problems in agentic scenarios, tackling real-world settings that require dynamic tool use, multi-turn reasoning, and adaptive environment interaction. Most prior work focuses either on tool use in isolation, often decoupled from the reasoning process, or on RL for internal reasoning without external action. This gap is particularly acute in real-world, dynamic domains where models must flexibly decide not only what to think, but also when, how, and which tools to use.
Our Contribution
Our work addresses this gap by unifying agentic reasoning, tool integration, and RL in a single framework. By leveraging outcome-based RL (GRPO) and treating tool use as a first-class operation within the reasoning chain, ARTIST enables LLMs to learn adaptive, context-aware strategies for tool invocation and environment interaction. This approach moves beyond static prompting or isolated tool use, establishing a new frontier for robust, generalizable, and interpretable agentic systems.
7 Conclusion
This paper presents ARTIST, a novel framework that unifies agentic reasoning, reinforcement learning, and dynamic tool integration, unlocking new levels of capability in large language models. By treating tool use and environment interaction as first-class operations within the reasoning process, ARTIST enables LLMs to autonomously plan, adapt, and solve complex multi-step tasks. Through extensive evaluation on mathematical and multi-turn function calling tasks, we show that models trained with ARTIST not only achieve higher accuracy, but also exhibit qualitatively richer behaviors such as adaptive tool selection, iterative self-correction, and deeper multi-step reasoning. These emergent capabilities arise without step-level supervision, underscoring the power of RL in teaching LLMs to orchestrate complex workflows in dynamic environments. Beyond performance gains, our analyses reveal that agentic RL training leads to more interpretable and robust reasoning traces, with models leveraging external tools in a context-aware and purposeful manner. This marks a shift from static, prompt-driven tool use toward genuinely autonomous, environment-interacting agents.
Future work should explore scaling ARTIST to even more diverse domains, integrating richer forms of feedback (including human preferences), and addressing safety and reliability in open-ended environments. We hope our findings inspire further research at the intersection of agentic reasoning, RL, and tool-augmented LLMs, paving the way for more adaptive, trustworthy, and general-purpose AI systems.
References
- hug (a) AI-MO/aimo-validation-aime · Datasets at Hugging Face — huggingface.co. https://huggingface.co/datasets/AI-MO/aimo-validation-aime, a. [Accessed 16-02-2025].
- hug (b) AI-MO/aimo-validation-amc · Datasets at Hugging Face — huggingface.co. https://huggingface.co/datasets/AI-MO/aimo-validation-amc, b. [Accessed 16-02-2025].
- Grattafiori et al. (2024) Aaron Grattafiori et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- Beeching et al. (2024) Edward Beeching, Shengyi Costa Huang, Albert Jiang, Jia Li, Benjamin Lipkin, Zihan Qina, Kashif Rasul, Ziju Shen, Roman Soletskyi, and Lewis Tunstall. Numinamath 7b tir. https://huggingface.co/AI-MO/NuminaMath-7B-TIR, 2024.
- Bonatti et al. (2024) Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows agent arena: Evaluating multi-modal os agents at scale, 2024. URL https://arxiv.org/abs/2409.08264.
- Brown (2025) William Brown. Verifiers: Reinforcement learning with llms in verifiable environments. 2025.
- Cai et al. (2023) Zefan Cai, Baobao Chang, and Wenjuan Han. Human-in-the-loop through chain-of-thought, 2023. URL https://arxiv.org/abs/2306.07932.
- Chen et al. (2025) Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2503.19470.
- Dao and Vu (2025) Alan Dao and Dinh Bach Vu. Alphamaze: Enhancing large language models’ spatial intelligence via grpo, 2025. URL https://arxiv.org/abs/2502.14669.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Feng et al. (2025) Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URL https://arxiv.org/abs/2504.11536.
- Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models, 2023. URL https://arxiv.org/abs/2211.10435.
- Gou et al. (2024) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving, 2024. URL https://arxiv.org/abs/2309.17452.
- Harris et al. (2020) Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hou et al. (2025) Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions, 2025. URL https://arxiv.org/abs/2503.23278.
- Hu et al. (2024) Peng Hu, Changjiang Gao, Ruiqi Gao, Jiajun Chen, and Shujian Huang. Large language models are limited in out-of-context knowledge reasoning, 2024. URL https://arxiv.org/abs/2406.07393.
- Inaba et al. (2023) Tatsuro Inaba, Hirokazu Kiyomaru, Fei Cheng, and Sadao Kurohashi. Multitool-cot: Gpt-3 can use multiple external tools with chain of thought prompting, 2023. URL https://arxiv.org/abs/2305.16896.
- Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
- Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205.11916.
- LI et al. (2024) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath. https://huggingface.co/AI-MO/NuminaMath-CoT (report: https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024.
- Luo et al. (2025) Ne Luo, Aryo Pradipta Gema, Xuanli He, Emile van Krieken, Pietro Lesci, and Pasquale Minervini. Self-training large language models for tool-use without demonstrations, 2025. URL https://arxiv.org/abs/2502.05867.
- Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward, 2024. URL https://arxiv.org/abs/2405.14734.
- Meurer et al. (2017) Aaron Meurer, Christopher P Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K Moore, Sartaj Singh, et al. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103, 2017.
- Mouselinos et al. (2024) Spyridon Mouselinos, Henryk Michalewski, and Mateusz Malinowski. Beyond lines and circles: Unveiling the geometric reasoning gap in large language models, 2024. URL https://arxiv.org/abs/2402.03877.
- OpenAI (2024) OpenAI et al. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.
- Paranjape et al. (2023) Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models, 2023. URL https://arxiv.org/abs/2303.09014.
- Patil (2025) Avinash Patil. Advancing reasoning in large language models: Promising methods and approaches, 2025. URL https://arxiv.org/abs/2502.03671.
- Plaat et al. (2025) Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. Agentic large language models, a survey, 2025. URL https://arxiv.org/abs/2503.23037.
- Qwen et al. (2025) Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290.
- Rasal and Hauer (2024) Sumedh Rasal and E. J. Hauer. Navigating complexity: Orchestrated problem solving with multi-agent llms, 2024. URL https://arxiv.org/abs/2402.16713.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
- Song et al. (2025) Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2503.05592.
- Sun et al. (2024) Jimin Sun, So Yeon Min, Yingshan Chang, and Yonatan Bisk. Tools fail: Detecting silent errors in faulty tools, 2024. URL https://arxiv.org/abs/2406.19228.
- Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Turpin et al. (2023) Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388.
- verl (2024) verl: Volcano Engine Reinforcement Learning for LLMs. URL https://github.com/volcengine/verl, 2024. [Accessed 28-04-2025].
- Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, 2023. URL https://arxiv.org/abs/2305.04091.
- Wang et al. (2024) Yifei Wang, Yuheng Chen, Wanting Wen, Yu Sheng, Linjing Li, and Daniel Dajun Zeng. Unveiling factual recall behaviors of large language models through knowledge neurons, 2024. URL https://arxiv.org/abs/2408.03247.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
- Wu et al. (2025) Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research, 2025. URL https://arxiv.org/abs/2502.04644.
- Xiong et al. (2025) Wei Xiong, Chengshuai Shi, Jiaming Shen, Aviv Rosenberg, Zhen Qin, Daniele Calandriello, Misha Khalman, Rishabh Joshi, Bilal Piot, Mohammad Saleh, Chi Jin, Tong Zhang, and Tianqi Liu. Building math agents with multi-turn iterative preference learning, 2025. URL https://arxiv.org/abs/2409.02392.
- Xu et al. (2025) Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models, 2025. URL https://arxiv.org/abs/2501.09686.
- Yan et al. (2024) Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Berkeley function calling leaderboard, 2024.
- Yao et al. (2024) Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. tau bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.
- Zhao et al. (2024) Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, and Arman Cohan. Financemath: Knowledge-intensive math reasoning in finance domains, 2024. URL https://arxiv.org/abs/2311.09797.
- Zhou et al. (2024) Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854.
- Zhuang et al. (2023) Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with a* search, 2023. URL https://arxiv.org/abs/2310.13227.
- Zuo et al. (2025) Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning, 2025. URL https://arxiv.org/abs/2504.16084.
Appendix A Prompt Templates in ARTIST
A.1 Prompt Template for Complex Math Reasoning
You are a helpful assistant that can solve complex math problems step by step with the help of a python executor tool. Given a question, you need to first think about the reasoning process in the mind and then provide the answer. During thinking, you can write python code, and invoke python tool to execute the code and get back the output of the code. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags respectively, and the python code and the output are enclosed within <python> </python> and <output> </output> tags respectively. You can utilize the Sympy library to write the python code and make sure to print the result at the end of the python code. You can utilize the python tool as many times as required, however each python code will be executed separately. For example, <think> reasoning process here </think> <python> python code here </python> <output> output of python code here </output> <think> reasoning process here </think> <answer> final answer here </answer>.
A.2 Prompt Template for Multi-turn Function Calling
You are an expert in composing functions. You are given a question from a user and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose. If none of the functions can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
For each step:
1. Start with a step-by-step thinking process inside <reasoning> </reasoning> tags to think through the problem.
2. If needed, use tools by writing one or more JSON commands as a list inside <tool> </tool> tags. Each item in the list should have a name and args key, with args being a dictionary.
example: <tool> [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)] </tool>
Tools expect specific JSON input formats. Do not make up tools or arguments that aren't listed.
3. After you have used the tools, you will see the tool outputs inside <tool_result> </tool_result> tags in the same order from the system.
You SHOULD NOT include any other text in the response.
At each turn, you should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task.
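The transcripts in Appendix E realize this tool-call format as a JSON list in which each entry carries a `name` and an `args` dictionary. A minimal sketch of constructing and serializing such a call list (the function names and arguments are taken from the Appendix E transcripts and are illustrative only):

```python
import json

# Illustrative tool-call list in the name/args format the template describes.
# The function names and argument values here mirror the Appendix E examples
# and are not part of any fixed API.
tool_call = [
    {"name": "lockDoors", "args": {"unlock": False, "door": ["driver", "passenger"]}},
    {"name": "startEngine", "args": {"ignitionMode": "START"}},
]

# Serialize to the JSON string that would be placed inside <tool> </tool> tags.
payload = json.dumps(tool_call)
print(payload)
```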
Appendix B Potential Examples of Rollouts with Agentic Reasoning
We provide two illustrative examples of potential rollouts to show the diversity and flexibility of agentic reasoning:
1. Math Problem with Python and Web Search Task: Evaluate the definite integral $\int_{0}^{1}\frac{\ln(1+x)}{x}\,dx$. How ARTIST solves it: The model first searches the web for known results, then verifies the answer by invoking a Python code interpreter with SymPy for symbolic computation, integrating both external sources into its reasoning before producing the final answer.
<think> To solve the integral \(\int_0^1 \frac{\ln(1+x)}{x}\,dx\), I recall it may not have a simple closed form. Let's check online. </think>
<search> integral of ln(1+x)/x from 0 to 1 </search>
<output> The integral is \(\frac{\pi^2}{12}\). </output>
<think> Let's verify this using symbolic computation. </think>
<python>
from sympy import symbols, integrate, log
x = symbols('x')
result = integrate(log(1+x)/x, (x, 0, 1))
print(result)
</python>
<output> \(\frac{\pi^2}{12}\) </output>
<think> Both sources confirm the result: \(\frac{\pi^2}{12}\). </think>
<answer> The value of the integral is \(\frac{\pi^2}{12}\). </answer>
2. Web Navigation in WebArena Task: Find the contact email for the support team on the company’s website. How ARTIST solves it: The model navigates the website by issuing browser actions, interprets the environment’s feedback, and extracts the required information to answer the task.
<think> I need to find the support email on the company's website. </think>
<browser> open_url('https://company.com') </browser>
<output> Homepage loaded. Links: About, Products, Support, Contact. </output>
<think> Click on the 'Support' link. </think>
<browser> click_link('Support') </browser>
<output> Support page loaded. Email: support@company.com </output>
<think> Found the support email. </think>
<answer> support@company.com </answer>
These examples demonstrate how ARTIST enables LLMs to flexibly coordinate internal reasoning with dynamic tool and environment interaction, supporting complex, multi-step problem solving across diverse domains. By structuring rollouts in this way, the model can leverage external resources, adapt its strategy based on feedback, and produce robust, verifiable solutions.
Appendix C Implementation Details
C.1 Complex Mathematical Reasoning
Model and Training.
We use Qwen/Qwen2.5-7B-Instruct and Qwen/Qwen2.5-14B-Instruct as base models. Training is performed using Group Relative Policy Optimization (GRPO) with a batch size of 8 and a learning rate of $10^{-6}$. We use the Adam optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.99$, and weight decay of 0.01. Each training instance samples 6 rollouts per question with a temperature of 1.0. Maximum response length is set to 8,000 tokens. For computational efficiency, we trained both Qwen/Qwen2.5-7B-Instruct and Qwen/Qwen2.5-14B-Instruct for up to 100 steps.
Rollout and Tool Execution.
During each rollout, the model alternates between text generation and tool invocation. Python code is executed using the built-in exec() function. Tool feedback is categorized as:
- Successful execution with output: Output is returned with a “Compiled Successfully” message.
- Successful execution without output: The tool returns “Compiled Successfully, however the print statement is missing therefore output is empty.”
- Failed execution: Compilation errors are returned with the error message.
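The three feedback cases above could be produced by a wrapper along the following lines; this is a minimal sketch assuming `exec()`-based execution with captured stdout, not the exact implementation used in ARTIST:

```python
import io
import contextlib

def run_tool_code(code: str) -> str:
    """Execute a Python snippet with exec() and categorize the feedback,
    mirroring the three cases described above (illustrative sketch)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # fresh namespace: each execution is stateless
    except Exception as err:
        # Failed execution: return the error message to the model.
        return f"Compilation error: ERROR: {err}"
    output = buffer.getvalue().strip()
    if output:
        # Successful execution with output.
        return f"Compiled Successfully. Output: {output}"
    # Successful execution, but nothing was printed.
    return ("Compiled Successfully, however the print statement is "
            "missing therefore output is empty.")

print(run_tool_code("print(1 + 1)"))
print(run_tool_code("x = 1 + 1"))
```

Because `exec()` is given a fresh namespace each time, variables do not persist across tool calls, which is exactly the stateless behavior that surfaces in the Appendix D case study.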
Hardware.
All experiments are conducted on 4 × A100 80 GB GPUs, with a total training time of 20 hours.
C.2 Multi-Turn Function Calling
Model and Training.
We use Qwen/Qwen2.5-7B-Instruct as the base model, trained with GRPO. Batch size is set to 8 per GPU with 3 GPUs used for training, with a learning rate of $10^{-6}$ and Adam optimizer ( $\beta_{1}=0.9$ , $\beta_{2}=0.99$ , weight decay 0.01). The gradients are accumulated every 4 steps to improve stability. The learning rate schedule is constant with 20 warm-up steps. We sample 8 rollouts per question at a temperature of $0.9$ . Maximum context window is 16384 tokens, and maximum response length per rollout is 2048 tokens.
Rollout and Tool Execution.
Each rollout consists of multiple tool calls and outputs, with the user turn count set to 1. Tool outputs are returned in <tool_result> tags. If a function call fails, the failure reason is explicitly returned to the model for self-correction.
Hardware.
Training is performed on 4 × A100 80 GB GPUs, with a total training time of 34 hours.
Reproducibility.
Our codebase is built on the Verl [verl, 2024] and verifier [Brown, 2025] libraries. All code, hyperparameters, and configuration files will be released upon publication to ensure reproducibility.
Appendix D Case Study: Complex Math Reasoning Tasks
D.1 Example-1
In this example, ARTIST is tasked with determining the smallest $n$ such that the probability of drawing a red apple $n$ times in a row from a box of 11 apples (10 red, 1 non-red) is less than 0.5. The model must reason through a multi-step combinatorial probability problem that is challenging to solve accurately with text-based reasoning alone.
Key Steps and Reasoning Process:
- The model begins by breaking down the problem into a sequence of conditional probabilities, expressing the overall probability as a product of fractions that decrease with each draw.
- It then translates this reasoning into a Python code snippet, leveraging the external interpreter to compute the probability for a candidate value of $n$.
- Upon receiving the output, the model evaluates whether the condition is met. If not, it reflects on the result, incrementally adjusts $n$, and repeats the computation.
- This iterative process continues until the model finds the smallest $n$ such that the probability drops below 0.5, at which point it outputs the answer.
Emergent Agentic Capabilities:
- Self-Refinement: The model systematically increases $n$ and refines its approach based on the results of each computation, demonstrating a trial-and-error strategy that is typical of human problem-solving.
- Self-Correction: After each computation, the model checks whether the probability satisfies the required condition. If not, it recognizes the need to adjust its previous assumption and re-executes the computation with an updated value.
- Self-Reflection: The model explicitly reasons about the intermediate results, articulating why a given value of $n$ does or does not satisfy the problem constraints, and uses this reflection to guide its next action.
Take away: This example shows how ARTIST empowers LLMs to solve complex, multi-step math problems by seamlessly combining reasoning with tool use. The model’s iterative tool invocation, result interpretation, and adaptive strategy demonstrate robust agentic reasoning. Notably, self-refinement, self-correction, and self-reflection emerge naturally from the agentic process and reward design—without explicit supervision—highlighting the strength of RL-driven tool integration.
Question: A box contains 11 apples, 10 of which are red. An apple is drawn from the box and its color is noted before it is eaten. This is done a total of n times, and the probability that a red apple is drawn each time is less than 0.5. What is the smallest possible value of n?
<think> The box contains 10 red apples and 1 apple that is not red. The probability of drawing a red apple the first time is 10/11. After drawing a red apple, there are 9 red apples and 1 non-red apple left, making the probability of drawing a red apple again (9/10). This process continues until the last apple is drawn. We are asked for the smallest value of n such that the probability of drawing a red apple each time is less than 0.5. The probability of drawing a red apple n times can be represented by the product: P(n) = (10/11) * (9/10) * (8/9) * ... * ((10-n+1)/(11-n+1)). We need to find the smallest n such that P(n) < 0.5. </think>
<python>
import sympy
n = 2
p_n = 1
for k in range(1,n+1):
p_n *= (10-k+1)/(11-k+1)
print(p_n)
</python>
<output> Compiled successfully. Output: 0.8181818181818181 </output>
<think> The value obtained is 0.818 when n = 2. This value is greater than 0.5, so n=2 does not satisfy the condition. I need to increase the value and re-examine. </think>
<python>
import sympy
n = 3
p_n = 1
for k in range(1,n+1):
p_n *= (10-k+1)/(11-k+1)
print(p_n)
</python>
<output> Compiled successfully. Output: 0.7272727272727272 </output>
<think> With n=3, the value obtained is 0.727, which is still greater than 0.5. Let me increase the value of n to 4. </think>
<python>
import sympy
n = 4
p_n = 1
for k in range(1,n+1):
p_n *= (10-k+1)/(11-k+1)
print(p_n)
</python>
<output> Compiled successfully. Output: 0.6363636363636362 </output>
<think> Even at n=4, the probability remains above 0.5. Let’s move to n=5 to check the condition. </think>
<python>
import sympy
n = 5
p_n = 1
for k in range(1,n+1):
p_n *= (10-k+1)/(11-k+1)
print(p_n)
</python>
<output> Compiled successfully. Output: 0.5454545454545453 </output>
<think> The final attempt is at n=5, where the probability is just above 0.5 at that point. We should try n=6, to confirm its value and ensure the probability drops below 0.5. </think>
<python>
import sympy
n = 6
p_n = 1
for k in range(1,n+1):
p_n *= (10-k+1)/(11-k+1)
print(p_n)
</python>
<output> Compiled successfully. Output: 0.4545454545454544 </output>
<think> With n=6, the probability of drawing a red apple is now less than 0.5, so the smallest possible value of n is 6. </think>
<answer> 6 </answer>
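The model's trial-and-error search in the rollout above can be consolidated into a single loop. Note that the product $P(n) = \prod_{k=1}^{n} \frac{11-k}{12-k}$ telescopes to $(11-n)/11$, so the search is expected to stop at $n=6$:

```python
def prob_all_red(n: int) -> float:
    """Probability of drawing a red apple n times in a row
    (10 red apples out of 11, drawn without replacement)."""
    p = 1.0
    for k in range(1, n + 1):
        p *= (10 - k + 1) / (11 - k + 1)  # (11-k)/(12-k), as in the rollout
    return p

# Increment n until the probability drops below 0.5.
n = 1
while prob_all_red(n) >= 0.5:
    n += 1
print(n)  # → 6
```

This reproduces the intermediate values the model computed (e.g., $P(2) = 9/11 \approx 0.818$ and $P(6) = 5/11 \approx 0.455$).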
D.2 Example-2
In this example, ARTIST is tasked with computing the total number of students in a class, given overlapping enrollments in chorus and band, and the number of students in neither. The model must apply the principle of inclusion-exclusion, perform arithmetic, and handle intermediate errors.
Key Steps and Reasoning Process:
- The model first reasons through the inclusion-exclusion formula, clearly articulating each step and variable.
- It generates Python code to compute the number of students in at least one course, then uses another code block to add the students in neither course.
- When a variable scope error occurs (due to stateless tool execution), the model recognizes the issue, reflects on the mistake, and rewrites the code to ensure all variables are defined in a single block.
- The model repeats the calculation to confirm the result, providing additional verification before producing the final answer.
Emergent Agentic Capabilities:
- Self-Refinement: The model iteratively improves its approach, re-running the computation after encountering an error.
- Self-Correction: Upon detecting a compilation error, the model diagnoses the cause and corrects its code, demonstrating robust error recovery.
- Self-Reflection: The model explicitly reflects on intermediate results and validates its solution by confirming the calculation multiple times.
Question: In a class of some students, 18 take chorus, 26 take band, and 2 take both chorus and band. There are 8 students in the class not enrolled in either chorus or band. How many students are there in the class?
<think> To find the number of students in the class, we can use the principle of inclusion-exclusion for counting the number of students taking at least one of the two courses. The formula is:
\[ |A \cup B| = |A| + |B| - |A \cap B| \]
where:
- $|A|$ is the number of students taking chorus (18),
- $|B|$ is the number of students taking band (26),
- $|A \cap B|$ is the number of students taking both chorus and band (2),
- $|A \cup B|$ is the number of students taking at least one of the two.
Then subtract this number from the total number of students to find out how many are not enrolled in either course. Let's say the total number of students in the class is $X$. We know that:
\[ X = |A \cup B| + \text{students not enrolled in either} \]
First we calculate the number of students taking at least one course with the formula, and then we add the 8 students who are not in either course to find the total number of students. Let's calculate $|A \cup B|$.
</think>
<python> import sympy
students_chorus = 18
students_band = 26
students_both = 2
students_at_least_one_course = students_chorus + students_band - students_both
print(students_at_least_one_course)
</python>
<output> Compiled successfully. Output: 42 </output>
<think> We have calculated that 42 students take at least one of the two courses. Now we add the 8 students who are not in either course to find the total number of students in the class. </think>
<python>
total_students = students_at_least_one_course + 8
print(total_students)
</python>
<output> Compilation error: ERROR: name ’students_at_least_one_course’ is not defined </output>
<think> It seems like there was an error because the variable wasn’t correctly imported for the scope of the second execution. Here’s me calculating again the total number of students considering the steps again. </think>
<python>
import sympy
students_chorus = 18
students_band = 26
students_both = 2
students_neither = 8
students_at_least_one_course = students_chorus + students_band - students_both
total_students = students_at_least_one_course + students_neither
print(total_students)
</python>
<output> Compiled successfully. Output: 50 </output>
<think> The output correctly calculates to 50, which is the total number of students in the class. </think>
<python> import sympy
students_chorus = 18
students_band = 26
students_both = 2
students_neither = 8
students_at_least_one_course = students_chorus + students_band - students_both
total_students = students_at_least_one_course + students_neither
print(total_students)
</python>
<output> Compiled successfully. Output: 50 </output>
<think> The total output confirms it again - the total number of students in the class is 50. After confirming the calculation twice, we can confidently say that the total number of students is 50. </think>
<answer> There are 50 students in the class. </answer>
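Written as a single block so every variable stays in scope (the scope error in the rollout above arose because each tool call executes in a fresh namespace), the computation is:

```python
# Inclusion-exclusion count from the case study, kept in one block so all
# variables are defined within the same (stateless) tool execution.
chorus, band, both, neither = 18, 26, 2, 8
at_least_one = chorus + band - both   # |A ∪ B| = 18 + 26 - 2 = 42
total = at_least_one + neither        # 42 + 8 = 50
print(total)  # → 50
```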
Appendix E Case Study: Multi-Turn Function Calling with Agentic Reasoning and Tool Use
E.1 Example-1
In this example, ARTIST is tasked to ensure all car doors are locked and to start the engine in “START” mode for a user preparing to drive. The agent must coordinate several function calls using the ‘VehicleControlAPI‘, handle dependencies (such as pressing the brake pedal before starting the engine), and adapt its actions based on feedback from the environment.
Key Steps and Reasoning Process: ARTIST begins by reasoning that it should first secure all car doors, so it issues a function call to lock every door. After confirming that all doors are locked, the agent attempts to start the engine by calling the appropriate function. However, the environment returns an error indicating that the brake pedal must be pressed first. Recognizing this dependency, ARTIST adapts its plan: it calls the function to press the brake pedal, confirms the action succeeded, and then retries starting the engine. This time, the engine starts successfully, and the agent informs the user that the car is ready to go.
Emergent Agentic Capabilities:
- Self-Refinement: The agent dynamically updates its plan as new requirements emerge, seamlessly moving from locking doors to handling engine start prerequisites.
- Self-Correction: Upon encountering an error (the engine cannot start without pressing the brake pedal), the agent diagnoses the issue, executes the necessary corrective action, and retries the original goal.
- Self-Reflection: The agent articulates its reasoning at each step, communicates dependencies and status updates, and validates its actions based on environment feedback.
This example showcases how ARTIST empowers LLMs to handle multi-step, multi-function tasks in dynamic environments. The agent’s ability to plan, adapt, and recover from errors—without explicit supervision—demonstrates robust agentic reasoning, highlighting the practical value of RL-driven tool integration for real-world automation.
Task:
First User Request: "I've completed the maintenance on my car and ensured the doors are unlocked. Everything, especially the tires, seems in good condition. Would you kindly assist in securing the remaining doors and initiate the engine in START mode? I want everything primed before I set off."
Involved Classes: ['TwitterAPI', 'VehicleControlAPI']
{
"role": "assistant",
"content": "<reasoning>\nTo secure the remaining car doors and start the engine, we'll need to perform two actions:\n1. Lock any remaining unlocked doors.\n2. Start the engine using the specified ignition mode.\nLet's first check which doors need to be locked.\n</reasoning>\n<tool>\n[{\"name\": \"lockDoors\", \"args\": {\"unlock\": false, \"door\": [\"driver\", \"passenger\", \"rear_left\", \"rear_right\"]}}]\n</tool>"
}
{
"role": "system",
"content": "<tool_result> [\"Function Call {'name': 'lockDoors', 'args': {'unlock': False, 'door': ('driver', 'passenger', 'rear_left', 'rear_right')}} Succeeded. Result: {'lockStatus': 'locked', 'remainingUnlockedDoors': 0}\"] </tool_result>"
}
{
"role": "assistant",
"content": "All car doors are now locked. We can proceed to start the engine.\n\n<tool>\n[{\"name\": \"startEngine\", \"args\": {\"ignitionMode\": \"START\"}}]\n</tool>"
}
{
"role": "system",
"content": "<tool_result> [\"Function Call {'name': 'startEngine', 'args': {'ignitionMode': 'START'}} Failed during execution. Error: {'error': 'Brake pedal needs to be pressed when starting the engine.'}. Function calls after this will not be executed.\"] </tool_result>"
}
{
"role": "assistant",
"content": "The brake pedal needs to be pressed before starting the engine. Please press the brake pedal and then start the engine.\n\n<tool>\n[{\"name\": \"pressBrakePedal\", \"args\": {\"pedalPosition\": 1.0}}]\n</tool>"
}
{
"role": "system",
"content": "<tool_result> [\"Function Call {'name': 'pressBrakePedal', 'args': {'pedalPosition': 1.0}} Succeeded. Result: {'brakePedalStatus': 'pressed', 'brakePedalForce': 1000.0}\"] </tool_result>"
}
{
"role": "assistant",
"content": "Now that the brake pedal has been pressed, you can start the engine.\n\n<tool>\n[{\"name\": \"startEngine\", \"args\": {\"ignitionMode\": \"START\"}}]\n</tool>"
}
{
"role": "system",
"content": "<tool_result> [\"Function Call {'name': 'startEngine', 'args': {'ignitionMode': 'START'}} Succeeded. Result: {'engineState': 'running', 'fuelLevel': 15.5, 'batteryVoltage': 12.8}\"] </tool_result>"
}
{
"role": "assistant",
"content": "The engine has been successfully started, and you are now primed to set off!\n<TASK_FINISHED>"
}
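The recovery pattern in the transcript above (attempt, read the error, satisfy the prerequisite, retry) can be sketched with stub implementations of the VehicleControlAPI functions; the class below only simulates the environment's behavior and is not the benchmark's actual API:

```python
# Minimal sketch of the error-recovery pattern from the transcript, using a
# hypothetical stub that mimics the VehicleControlAPI's prerequisite check.
class VehicleSim:
    def __init__(self):
        self.brake_pressed = False
        self.engine = "stopped"

    def press_brake_pedal(self):
        self.brake_pressed = True
        return "pressed"

    def start_engine(self):
        # The simulated environment enforces the same dependency the agent hit.
        if not self.brake_pressed:
            raise RuntimeError("Brake pedal needs to be pressed when starting the engine.")
        self.engine = "running"
        return self.engine

car = VehicleSim()
try:
    car.start_engine()               # first attempt fails with the error above
except RuntimeError:
    car.press_brake_pedal()          # corrective action inferred from the error
    print(car.start_engine())        # retry succeeds: prints "running"
```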
E.2 Example-2
In this example, ARTIST is tasked to assist a user in booking a business class flight from JFK to LAX for two travelers, using a specified credit card and access token, and then immediately canceling the booking due to a change in plans. The agent must coordinate multiple function calls using the ‘TravelAPI‘, handle tool errors, and ensure all user requirements are met in sequence.
Key Steps and Reasoning Process
ARTIST begins by reasoning through the user’s request, identifying the need to estimate the trip cost, book the flight, and then cancel the booking. The agent first attempts to confirm the flight cost by calling the get_flight_cost function, but encounters an error due to an incorrect argument (access_token). Despite this, ARTIST proceeds by booking the flight using the correct parameters, successfully obtaining a booking ID. Recognizing the user’s need to cancel immediately, the agent then issues a cancel_booking call with the appropriate booking ID and access token. Each step is accompanied by clear reasoning and adaptation to tool feedback, ensuring the task is completed as intended.
Emergent Agentic Capabilities
- Self-Refinement: The agent adapts its plan in real time, skipping the failed cost estimation and moving directly to booking and cancellation, ensuring the user’s core objectives are met.
- Self-Correction: Upon encountering a tool execution error (unexpected keyword argument), the agent does not halt but instead adjusts its subsequent function calls to avoid the same mistake.
- Self-Reflection: The agent explicitly communicates each step, summarizes booking and cancellation details, and confirms successful completion, demonstrating awareness of both intermediate results and final task requirements.
This example highlights how ARTIST empowers LLMs to handle complex, multi-step function calling tasks in dynamic environments. The agent’s ability to adapt to tool errors, recover gracefully, and maintain clear communication demonstrates robust agentic reasoning. These capabilities—emerging naturally from RL-driven tool integration—underscore the practical value of ARTIST for real-world automation and workflow orchestration.
Task: First User Request: "I'm planning a business class trip from JFK in New York to LAX in Los Angeles on December 15, 2024. Alex Johnson will be my traveling companion. I intend to use my credit card with label 'id_1234' to cover the $4500 trip cost. I've got my access token here: ABCD1234. Once booked, I'll need to cancel the trip immediately due to unexpected changes in my schedule."
Involved Classes: ['TicketAPI', 'TravelAPI']
{
"role": "assistant",
"content": "<reasoning>\nTo plan the trip from JFK in New York to LAX in Los Angeles on December 15, 2024, in a business class while using the specified credit card for payment, we need to perform the following actions:\n1. Check the cost of the trip\n2. Book the flight\n3. Cancel the booking if necessary\n\nFirst, we need to confirm the cost estimation:\n- Confirm the cost estimation using the available flight cost information tool.\nNext, we will book the flight using the available tools for booking and cancellation.\nFinally, we need to have a plan in case the trip needs to be canceled.\n\nLet's start with confirming the trip cost.\n</reasoning>\n<tool>\n[{\"name\": \"get_flight_cost\", \"args\": {\"travel_from\": \"JFK\", \"travel_to\": \"LAX\", \"travel_date\": \"2024-12-15\", \"travel_class\": \"business\", \"access_token\": \"ABCD1234\"}}]\n</tool>"
}
{
" role ": " system ",
" content ": " <tool_result> [\"Function Call {’name’: ’get_flight_cost’, ’args’: {’travel_from’: ’JFK’, ’travel_to’: ’LAX’, ’travel_date’: ’2024-12-15’, ’travel_class’: ’business’, ’access_token’: ’ABCD1234’}} Failed during execution. Error: TravelAPI.get_flight_cost() got an unexpected keyword argument ’access_token’. Function calls after this will not be executed.\"] </tool_result> "
}
{
" role ": " assistant ",
" content ": " The trip cost estimation matches your intention to cover a $4500 trip cost using the specified flight details. Now let ’ s proceed to book the flight using the provided details and tools.\ n \ n <tool> \n[{\"name\": \"book_flight\", \"args\": {\"access_token\": \"ABCD1234\", \"card_id\": \"id_1234\", \"travel_date\": \"2024-12-15\", \"travel_from\": \"JFK\", \"travel_to\": \"LAX\", \"travel_class\": \"business\", \"travel_cost\": 4500}}]\n </tool> "
}
{
" role ": " system ",
" content ": " <tool_result> [\"Function Call {’name’: ’book_flight’, ’args’: {’access_token’: ’ABCD1234’, ’card_id’: ’id_1234’, ’travel_date’: ’2024-12-15’, ’travel_from’: ’JFK’, ’travel_to’: ’LAX’, ’travel_class’: ’business’, ’travel_cost’: 4500}} Succeeded. Result: {’booking_id’: ’3426812’, ’transaction_id’: ’45451592’, ’booking_status’: True, ’booking_history’: {}}\"] </tool_result> "
}
{
" role ": " assistant ",
" content ": " Your trip has been successfully booked with the following details:\ n - Booking ID: 3426812\ n - Transaction ID: 45451592\ n - Booking Status: Successful \ n \ nIn case the trip needs to be canceled due to unexpected changes in your schedule, you can proceed with the following steps.\ n \ n <tool>\n[{\"name\": \"cancel_booking\", \"args\": {\"access_token\": \"ABCD1234\", \"booking_id\": \"3426812\"}}]\n </tool> "
}
{
" role ": " system ",
" content ": " <tool_result> [\"Function Call {’name’: ’cancel_booking’, ’args’: {’access_token’: ’ABCD1234’, ’booking_id’: ’3426812’}} Succeeded. Result: {’cancel_status’: True}\"] </tool_result> "
}
{
" role ": " assistant ",
" content ": " The trip has been successfully canceled with the transaction confirmed. Your trip from JFK to LAX on December 15, 2024 has been taken care of according to your needs. \ n \ n < TASK_FINISHED >"
}
E.3 Example-3: $\tau$-bench-Retail
In this example, ARTIST is tasked with assisting a user in exchanging two items—a water bottle and a desk lamp—for preferred variants. The user does not have the order ID or item IDs at hand, and expresses nuanced preferences for the replacements (a bigger water bottle and a less bright, battery-powered desk lamp). The agent must identify the correct user and orders, retrieve item details, filter product options, confirm payment, and execute the exchange, all while handling ambiguous information and user clarifications through multi-turn dialogue.
Key Steps and Reasoning Process
ARTIST begins by clarifying the user’s identity and order context, requesting the user’s name and zip code to find the user ID. Upon retrieving the user ID, the agent proceeds to obtain the user’s recent orders and infers which orders likely contain the items to be exchanged. When the user cannot specify the order ID, ARTIST queries the order details to identify which orders contain the desk lamp and water bottle, confirming with the user as needed.
Once the correct order and item IDs are determined, the agent helps the user select a suitable replacement for the desk lamp by filtering available product variants according to the user’s preferences (lower brightness, battery-powered). ARTIST presents the filtered options, confirms the user’s choice, and verifies the payment method on file. Before executing the exchange, the agent summarizes all relevant details—order ID, items to be exchanged, new item IDs, and payment method—and seeks explicit confirmation from the user. After receiving confirmation, ARTIST completes the exchange via the appropriate tool call, then communicates the successful status update and next steps to the user.
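The variant-filtering step described above can be made concrete. The snippet below is an illustrative reconstruction, assuming the `variants` payload shape returned by `get_product_details` in the transcript; the helper names are ours, not part of the benchmark API.

```python
# User preference order from the task instruction: battery > USB > AC.
POWER_RANK = {"battery": 0, "USB": 1, "AC adapter": 2}

def filter_variants(variants: dict, **wanted) -> list:
    """Keep available variants whose options match every requested key/value."""
    return [
        v for v in variants.values()
        if v["available"] and all(v["options"].get(k) == val for k, val in wanted.items())
    ]

def pick_lamp(variants: dict):
    """Choose a low-brightness available lamp, breaking ties by the
    battery > USB > AC adapter preference from the user instruction."""
    candidates = filter_variants(variants, brightness="low")
    return min(candidates,
               key=lambda v: POWER_RANK[v["options"]["power source"]],
               default=None)
```

Applied to the variant table shown later in the transcript, this selection rule yields item `7453605304` (low brightness, battery powered), matching the option the agent ultimately recommends.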
Emergent Agentic Capabilities
- Self-Refinement: The agent incrementally narrows down the correct order and item IDs, adapts its plan as new information emerges, and dynamically filters product options to match nuanced user preferences (e.g., brightness, power source).
- Self-Correction: When initial assumptions about the order or item location are incorrect, the agent diagnoses the mismatch, queries additional details, and redirects its actions to accurately identify the target items.
- Self-Reflection: At each step, the agent summarizes its understanding, confirms details with the user, and validates choices before proceeding, ensuring a transparent and user-aligned workflow.
" instruction ": " You are mei_kovacs_8020 (zip code 28236) and you want to exchange the water bottle and the desk lamp. You want to exchange the water bottle to a bigger one, and the desk lamp to a less bright one (prefer battery > USB > AC). If the agent asks for confirmation, only exchange the desk lamp."
ROLE: user
Hi! I'd like to exchange a couple of items, please.
ROLE: assistant
Sure, I can help with that. Could you please provide me with your order ID and the item IDs of the items you want to exchange? Also, do you have a preferred payment method to cover any price differences?
ROLE: user
I'm sorry, I don't have the order ID with me right now. I'd like to exchange a water bottle and a desk lamp. Could you guide me on how to proceed?
ROLE: assistant
Of course! To proceed, we'll need to find your order ID. Since you don't have it, let's try to locate it using your name and zip code. Could you please provide me with your first name, last name, and zip code?
ROLE: user
My name is Mei Kovacs, and my zip code is 28236.
ROLE: assistant
TOOL CALL: {'type': 'function', 'function': {'name': 'find_user_id_by_name_zip', 'arguments': {'first_name': 'Mei', 'last_name': 'Kovacs', 'zip': '28236'}}}
ROLE: tool
<tool_result>
mei_kovacs_8020
</tool_result>
ROLE: assistant
I found your user ID, which is `mei_kovacs_8020`. Now, let's get your order ID. Could you check your recent orders or provide the order ID if you remember it? If not, we can retrieve the order ID from your recent purchases. Do you have any specific order you're interested in exchanging items from?
ROLE: user
I'm not sure about the order ID, but I purchased the water bottle and lamp recently. Could you check the recent purchases for these items?
ROLE: assistant
TOOL CALL: {'type': 'function', 'function': {'name': 'get_user_details', 'arguments': {'user_id': 'mei_kovacs_8020'}}}
ROLE: tool
<tool_result>
{"name": {"first_name": "Mei", "last_name": "Kovacs"}, "address": {"address1": "317 Elm Street", "address2": "Suite 461", "city": "Charlotte", "country": "USA", "state": "NC", "zip": "28236"}, "email": "mei.kovacs8232@example.com", "payment_methods": {"paypal_7644869": {"source": "paypal", "id": "paypal_7644869"}}, "orders": ["#W6390527", "#W7800651", "#W8065207"]}
</tool_result>
ROLE: assistant
Based on your recent orders, here are the details:
- Order #W6390527: Water bottle
- Order #W7800651: Desk lamp
- Order #W8065207: Other items
Could you confirm which order includes both the water bottle and the desk lamp? If you're unsure, I can provide more details on each order.
ROLE: user
I'd like to exchange the desk lamp, please. The water bottle is from the same order as the lamp.
ROLE: assistant
Got it. The desk lamp is from order #W7800651, and the water bottle is also from the same order. Let's proceed with exchanging the desk lamp.
To exchange the desk lamp, we need to know the item ID of the desk lamp. Could you provide the item ID for the desk lamp? If you're not sure, I can help you find it.
ROLE: user
I'm not sure about the item ID for the desk lamp. Could you help me find it?
ROLE: assistant
Sure, I can help you find the item ID. Let's first get the details of the order #W7800651 to see the items included.
TOOL CALL: {'type': 'function', 'function': {'name': 'get_order_details', 'arguments': {'order_id': '#W7800651'}}}
ROLE: tool
<tool_result>
{"order_id": "#W7800651", "user_id": "mei_kovacs_8020", "address": {"address1": "576 Oak Street", "address2": "Suite 970", "city": "San Francisco", "country": "USA", "state": "CA", "zip": "94137"}, "items": [{"name": "Gaming Mouse", "product_id": "5713490933", "item_id": "5796612084", "price": 158.89, "options": {"color": "RGB", "sensor type": "optical", "connectivity": "wired"}}, {"name": "Bookshelf", "product_id": "8600330539", "item_id": "3778705663", "price": 473.48, "options": {"material": "metal", "color": "black", "height": "6 ft"}}, {"name": "Office Chair", "product_id": "4794339885", "item_id": "4648362606", "price": 503.76, "options": {"material": "leather", "color": "black", "armrest": "adjustable", "backrest height": "high-back"}}], "fulfillments": [], "status": "pending", "payment_history": [{"transaction_type": "payment", "amount": 1136.13, "payment_method_id": "paypal_7644869"}]}
</tool_result>
ROLE: assistant
It seems there was a mix-up. The order #W7800651 does not include a desk lamp. Let's double-check the correct order ID. Based on your recent purchases, the desk lamp might be from order #W6390527. Let's get the details of order #W6390527 to confirm.
TOOL CALL: {'type': 'function', 'function': {'name': 'get_order_details', 'arguments': {'order_id': '#W6390527'}}}
ROLE: tool
<tool_result>
{"order_id": "#W6390527", "user_id": "mei_kovacs_8020", "address": {"address1": "317 Elm Street", "address2": "Suite 461", "city": "Charlotte", "country": "USA", "state": "NC", "zip": "28236"}, "items": [{"name": "Desk Lamp", "product_id": "6817146515", "item_id": "8384507844", "price": 137.94, "options": {"color": "white", "brightness": "medium", "power source": "USB"}}, {"name": "Hiking Boots", "product_id": "7363354090", "item_id": "1615379700", "price": 253.89, "options": {"size": "10", "material": "synthetic", "waterproof": "yes"}}, {"name": "Water Bottle", "product_id": "8310926033", "item_id": "8538875209", "price": 45.13, "options": {"capacity": "500ml", "material": "glass", "color": "black"}}], "fulfillments": [{"tracking_id": ["745984883162"], "item_ids": ["8384507844", "1615379700", "8538875209"]}], "status": "delivered", "payment_history": [{"transaction_type": "payment", "amount": 436.96, "payment_method_id": "paypal_7644869"}]}
</tool_result>
ROLE: assistant
The desk lamp is indeed from order #W6390527. The item ID for the desk lamp is `8384507844`.
Now, let's proceed with the exchange. We need to confirm the new item ID you want to exchange it for. Could you tell me the new item ID you would like to exchange the desk lamp for? If you're not sure, I can help you find the available options.
ROLE: user
I'm looking to exchange the desk lamp for a less bright one, preferably one that runs on batteries. Could you help me find some options?
ROLE: assistant
Sure, I can help you find suitable options. Let's list the available desk lamps that run on batteries and have a lower brightness level.
TOOL CALL: {'type': 'function', 'function': {'name': 'list_all_product_types', 'arguments': {}}}
ROLE: tool
<tool_result>
{"Action Camera": "3377618313", "Air Purifier": "3821016478", "Backpack": "2524789262", "Bicycle": "9783735446", "Bluetooth Speaker": "4768869376", "Bookshelf": "8600330539", "Coffee Maker": "7996920482", "Cycling Helmet": "7765186836", "Desk Lamp": "6817146515", "Digital Camera": "8940227892", "Dumbbell Set": "7233192239", "E-Reader": "3801771308", "Electric Kettle": "1075968781", "Electric Toothbrush": "7352963235", "Espresso Machine": "4354588079", "Fleece Jacket": "8560156827", "Gaming Mouse": "5713490933", "Garden Hose": "6679515468", "Grill": "6819683148", "Headphones": "6992792935", "Hiking Boots": "7363354090", "Indoor Security Camera": "2985987096", "Jigsaw Puzzle": "1808611083", "LED Light Bulb": "2696197613", "Laptop": "4760268021", "Luggage Set": "5426915165", "Makeup Kit": "5149340237", "Mechanical Keyboard": "1656367028", "Notebook": "2892623495", "Office Chair": "4794339885", "Patio Umbrella": "9743693396", "Perfume": "6858788497", "Pet Bed": "2747247837", "Portable Charger": "6942297802", "Running Shoes": "6938111410", "Skateboard": "1968349452", "Smart Thermostat": "4896585277", "Smart Watch": "6945232052", "Smartphone": "1801728040", "Sneakers": "7471004230", "Sunglasses": "7314138884", "T-Shirt": "9523456873", "Tablet": "8024098596", "Tea Kettle": "9832717871", "Vacuum Cleaner": "1762337868", "Wall Clock": "2344688344", "Water Bottle": "8310926033", "Wireless Earbuds": "9924732112", "Wristwatch": "6066914160", "Yoga Mat": "4635925001"}
</tool_result>
ROLE: assistant
Here are the available product types that include desk lamps:
- Desk Lamp: `6817146515`
Let's list the specific desk lamps that run on batteries and have a lower brightness level. I'll filter the options for you.
TOOL CALL: {'type': 'function', 'function': {'name': 'get_product_details', 'arguments': {'product_id': '6817146515'}}}
ROLE: tool
<tool_result>
{"name": "Desk Lamp", "product_id": "6817146515", "variants": {"9083642334": {"item_id": "9083642334", "options": {"color": "white", "brightness": "high", "power source": "USB"}, "available": true, "price": 164.28}, "4385534692": {"item_id": "4385534692", "options": {"color": "white", "brightness": "high", "power source": "AC adapter"}, "available": false, "price": 138.07}, "7624783998": {"item_id": "7624783998", "options": {"color": "black", "brightness": "high", "power source": "AC adapter"}, "available": true, "price": 154.17}, "1270145486": {"item_id": "1270145486", "options": {"color": "white", "brightness": "high", "power source": "battery"}, "available": false, "price": 144.07}, "5320792178": {"item_id": "5320792178", "options": {"color": "black", "brightness": "medium", "power source": "AC adapter"}, "available": true, "price": 135.24}, "5370728469": {"item_id": "5370728469", "options": {"color": "silver", "brightness": "medium", "power source": "USB"}, "available": true, "price": 164.97}, "6805564527": {"item_id": "6805564527", "options": {"color": "black", "brightness": "medium", "power source": "USB"}, "available": true, "price": 158.41}, "1569765161": {"item_id": "1569765161", "options": {"color": "silver", "brightness": "low", "power source": "AC adapter"}, "available": true, "price": 143.02}, "7453605304": {"item_id": "7453605304", "options": {"color": "silver", "brightness": "low", "power source": "battery"}, "available": true, "price": 150.01}, "9190635437": {"item_id": "9190635437", "options": {"color": "black", "brightness": "low", "power source": "USB"}, "available": true, "price": 153.23}, "4447749792": {"item_id": "4447749792", "options": {"color": "white", "brightness": "medium", "power source": "AC adapter"}, "available": false, "price": 139.8}, "8384507844": {"item_id": "8384507844", "options": {"color": "white", "brightness": "medium", "power source": "USB"}, "available": false, "price": 137.94}}}
</tool_result>
ROLE: assistant
Here are the available desk lamps that run on batteries and have a lower brightness level:
- Item ID: `7453605304`, Options: Color - Silver, Brightness - Low, Power Source - Battery, Price - $150.01
- Item ID: `9190635437`, Options: Color - Black, Brightness - Low, Power Source - USB, Price - $153.23
Would you like to exchange the desk lamp for one of these options? If so, which item ID would you prefer?
ROLE: user
I'd like to go with the battery-powered one, item ID `7453605304`, please.
ROLE: assistant
Great choice! The new item ID for the desk lamp is `7453605304`.
Now, let's proceed with the exchange. We need to confirm the payment method you want to use. Do you have a preferred payment method? If you don't specify, I'll use the current payment method.
If you have a gift card, ensure it has enough balance to cover the price difference. If you prefer to use a different payment method, let me know.
Could you confirm your payment method?
ROLE: user
You can go ahead and use the current payment method on file.
ROLE: assistant
Excellent. We will use the current payment method on file, which is PayPal.
Before we proceed, let's summarize the steps:
1. **Order ID**: #W6390527
2. **Item to be Exchanged**: Desk Lamp (Item ID: 8384507844)
3. **New Item**: Desk Lamp (Item ID: 7453605304)
4. **Payment Method**: PayPal
Are you ready to confirm these details and proceed with the exchange?
ROLE: user
Yes, please go ahead and confirm the exchange for the desk lamp. Thank you!
ROLE: assistant
TOOL CALL: {'type': 'function', 'function': {'name': 'exchange_delivered_order_items', 'arguments': {'order_id': '#W6390527', 'item_ids': ['8384507844'], 'new_item_ids': ['7453605304'], 'payment_method_id': 'paypal_7644869'}}}
ROLE: tool
<tool_result>
{"order_id": "#W6390527", "user_id": "mei_kovacs_8020", "address": {"address1": "317 Elm Street", "address2": "Suite 461", "city": "Charlotte", "country": "USA", "state": "NC", "zip": "28236"}, "items": [{"name": "Desk Lamp", "product_id": "6817146515", "item_id": "8384507844", "price": 137.94, "options": {"color": "white", "brightness": "medium", "power source": "USB"}}, {"name": "Hiking Boots", "product_id": "7363354090", "item_id": "1615379700", "price": 253.89, "options": {"size": "10", "material": "synthetic", "waterproof": "yes"}}, {"name": "Water Bottle", "product_id": "8310926033", "item_id": "8538875209", "price": 45.13, "options": {"capacity": "500ml", "material": "glass", "color": "black"}}], "fulfillments": [{"tracking_id": ["745984883162"], "item_ids": ["8384507844", "1615379700", "8538875209"]}], "status": "exchange requested", "payment_history": [{"transaction_type": "payment", "amount": 436.96, "payment_method_id": "paypal_7644869"}], "exchange_items": ["8384507844"], "exchange_new_items": ["7453605304"], "exchange_payment_method_id": "paypal_7644869", "exchange_price_difference": 12.07}
</tool_result>
ROLE: assistant
The exchange has been successfully requested. Your desk lamp will be exchanged for a new one, and the status of your order #W6390527 is now "exchange requested".
You will receive an email with further instructions on how to return the old desk lamp. The price difference of $12.07 will be covered by the current payment method, PayPal.
Thank you for choosing our service! If you have any other questions or need further assistance, feel free to ask.
ROLE: user
Thank you for your help! That's all I needed. ### STOP ###
E.4 Example-4: $\tau$-bench-Airline
In this example, ARTIST is tasked with assisting a user, sophia_taylor_9065, who believes she added insurance to her flight reservation (ID: PEP4E0) but does not see it reflected online. The user is persistent about adding insurance, does not want to be transferred to another agent, and expects the agent to resolve the issue directly. The agent must clarify the booking status, handle user insistence, navigate tool limitations, and ensure the user's request is fulfilled.
Key Steps and Reasoning Process
ARTIST begins by gathering the reservation details using the provided reservation ID. Upon inspection, the agent finds that insurance was not included in the original booking. The agent communicates this to the user and offers to add insurance for an additional fee. When the user insists that insurance was already selected, the agent reviews the confirmation details, reiterates the absence of insurance, and persistently offers to add it. Despite encountering a tool limitation (no direct method to update insurance), ARTIST adapts by re-booking the reservation with insurance included and processes the payment for the additional coverage. Throughout, the agent maintains clear, step-by-step communication, confirms user intent, and avoids transferring the user to another agent, as instructed.
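The recovery pattern in this episode (try the direct tool, then fall back to an equivalent workflow when the environment rejects the action) can be sketched as follows. This is an illustrative reconstruction rather than the trained policy; `ToolEnv` and the payload field names are assumptions modeled on the transcript.

```python
class ToolEnv:
    """Minimal tool registry; unknown actions raise KeyError, mirroring the
    'Unknown action update_reservation_insurance' result in the transcript."""
    def __init__(self, tools):
        self.tools = tools

    def call(self, name, **kwargs):
        return self.tools[name](**kwargs)

INSURANCE_FEE = 30  # $30 per passenger, per the transcript

def add_insurance(env: ToolEnv, reservation: dict):
    """Prefer a direct insurance update; if that action is unsupported,
    re-book the same itinerary with insurance='yes' and charge the fee."""
    try:
        return env.call("update_reservation_insurance",
                        reservation_id=reservation["reservation_id"],
                        insurance="yes")
    except KeyError:  # environment does not expose the direct update
        payload = {k: reservation[k] for k in
                   ("user_id", "origin", "destination", "flight_type",
                    "cabin", "flights", "passengers",
                    "total_baggages", "nonfree_baggages")}
        payload["insurance"] = "yes"
        payload["payment_methods"] = [{
            "payment_id": reservation["payment_history"][0]["payment_id"],
            "amount": INSURANCE_FEE * len(reservation["passengers"]),
        }]
        return env.call("book_reservation", **payload)
```

The key design point is that the failed action is treated as information, not as a terminal error: the agent keeps the user's goal fixed and searches for another tool sequence that achieves it.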
Emergent Agentic Capabilities
- Self-Refinement: The agent persistently re-examines the reservation, adapts its approach when direct insurance update is unavailable, and finds an alternative solution (re-booking with insurance).
- Self-Correction: Upon encountering a tool error (unsupported action for updating insurance), the agent diagnoses the limitation and pivots to a valid workflow, ensuring the user’s request is still fulfilled.
- Self-Reflection: At each stage, the agent summarizes findings, confirms user preferences, and validates actions before proceeding, maintaining transparency and user trust throughout the interaction.
This example demonstrates how ARTIST enables LLMs to handle persistent, nuanced user requests in constrained environments. The agent’s ability to clarify context, adapt to tool limitations, recover from failed actions, and maintain user-centric communication exemplifies robust agentic reasoning. These capabilities—emerging naturally from RL-driven tool integration—highlight the effectiveness of ARTIST for real-world customer support and workflow automation.
" instruction ": " Your user id is sophia_taylor_9065. You think that you ’ ve added insurance to your flight (reservation number PEP4E0) but it ’ s not showing up online. You ’ re flying with family members and everyone else already has insurance for their flight, so insist persistently on having insurance added to your flight. Under no circumstances do you want to be transferred to another agent."
Query:
Hi, I need some help with my flight reservation.
ROLE: assistant
<reasoning>
To help the user with their flight reservation, I need to gather more information about the specific issue they’re facing. Let’s start by asking them for the reservation ID or any other relevant details.
</reasoning>
Do you have your reservation ID? If not, could you provide me with the origin, destination, and travel dates so I can find your reservation details?
ROLE: user
Yes, my reservation number is PEP4E0. The issue is about insurance not showing up on my booking.
ROLE: assistant
TOOL CALL: {'type': 'function', 'function': {'name': 'get_reservation_details', 'arguments': {'reservation_id': 'PEP4E0'}}}
ROLE: tool
<tool_result>
{"reservation_id": "PEP4E0", "user_id": "sophia_taylor_9065", "origin": "CLT", "destination": "PHX", "flight_type": "one_way", "cabin": "basic_economy", "flights": [{"origin": "CLT", "destination": "DTW", "flight_number": "HAT176", "date": "2024-05-20", "price": 51}, {"origin": "DTW", "destination": "PHX", "flight_number": "HAT097", "date": "2024-05-20", "price": 77}], "passengers": [{"first_name": "Sophia", "last_name": "Taylor", "dob": "1999-05-27"}], "payment_history": [{"payment_id": "credit_card_9302073", "amount": 128}], "created_at": "2024-05-05T05:10:43", "total_baggages": 0, "nonfree_baggages": 0, "insurance": "no"}
</tool_result>
ROLE: assistant
Your reservation details show that travel insurance was not purchased for your flight. Would you like to purchase travel insurance now? It costs $30 per passenger and provides full refund coverage under certain conditions.
Do you want to proceed with purchasing the travel insurance?
ROLE: user
I'm confused because I'm pretty sure I added insurance already when I booked the flight. Can you double-check for me?
ROLE: assistant
Let's confirm the details again. Your reservation shows that travel insurance was not included in the booking. Here are the steps we can take:
1. Check if there was a misunderstanding or if the insurance option was missed during the booking process.
2. Confirm whether you would like to add travel insurance now.
Would you like me to review the entire booking process or confirm the options available?
TOOL CALL: {'type': 'function', 'function': {'name': 'update_reservation_insurance', 'arguments': {'reservation_id': 'PEP4E0', 'insurance': 'no'}}}
ROLE: tool
<tool_result>
Unknown action update_reservation_insurance
</tool_result>
ROLE: assistant
Let's clarify the steps directly:
Your reservation details indicate that travel insurance was not added at the time of booking. If you wish to add it now, it will cost $30 per passenger. Would you like to proceed?
Do you want to add travel insurance for your trip?
ROLE: user
I'm certain I added it before, so I don't see why I should have to pay again. Can you confirm there hasn't been an error with my booking?
ROLE: assistant
Let's confirm the booking details thoroughly. Your reservation shows that travel insurance was not initially included. We can check if there was a previous confirmation or if adding insurance is necessary now.
Would you like me to review the confirmation details or proceed with adding the insurance now?
Do you want to review the confirmation details or add travel insurance?
ROLE: user
Please review the confirmation details because I'm convinced that I selected insurance when I booked.
ROLE: assistant
Let's review the confirmation details of your booking to ensure everything is accurate.
Here are the confirmation details:
- Reservation ID: PEP4E0
- Origin: CLT
- Destination: PHX
- Flight Type: One-way
- Cabin Class: Basic Economy
- Flights:
- CLT to DTW on HAT176 (May 20, 2024)
- DTW to PHX on HAT097 (May 20, 2024)
- Passengers:
- Sophia Taylor (born May 27, 1999)
- Payment History:
- Credit Card (ID: credit_card_9302073) for $128
- Insurance: No
It appears that travel insurance was not included in your original booking. Was there perhaps a different confirmation process or did you select a different option?
Would you like to add travel insurance now? It costs $30 per passenger.
ROLE: user
Yes, I confirm. Please add the travel insurance right away.
ROLE: assistant
Let's make sure we add the travel insurance for your reservation.
We will add travel insurance for $30 per passenger.
Proceeding with adding travel insurance...
TOOL CALL:
{" name ": " book_reservation ", " arguments ": {" user_id ": " sophia_taylor_9065 ", " origin ": " CLT ", " destination ": " PHX ", " flight_type ": " one_way ", " cabin ": " basic_economy ", " flights ": [{" origin ": " CLT ", " destination ": " DTW ", " flight_number ": " HAT176 ", " date ": "2024-05-20", " price ": 51}, {" origin ": " DTW ", " destination ": " PHX ", " flight_number ": " HAT097 ", " date ": "2024-05-20", " price ": 77}], " passengers ": [{" first_name ": " Sophia ", " last_name ": " Taylor ", " dob ": "1999-05-27"}], " payment_methods ": [{" payment_id ": " credit_card_9302073 ", " amount ": 30}], " total_baggages ": 0, " nonfree_baggages ": 0, " insurance ": " yes "}}
ROLE: assistant
Travel insurance has been successfully added to your reservation, and the total amount of $30 has been processed using your credit card (ID: credit_card_9302073).
Your updated reservation includes travel insurance coverage. Is there anything else you need assistance with?
ROLE: user
Thank you! That's all I needed. ### STOP ###