## Bar Charts: Tool Call Ratio and Accuracy Before and After Fine-tuning
### Overview
This image presents two bar charts, labeled (a) 2Wiki and (b) MedQA, comparing the "Tool Call Ratio (%)" for different search tools and a "Base Generator" at two distinct "Training Steps": "Step 0" (before fine-tuning) and "Step 32" (after fine-tuning). Each chart also displays an overall accuracy metric ("Acc") for both steps, along with the percentage change in accuracy after fine-tuning. The charts illustrate how fine-tuning impacts the utilization of various tools across two different datasets.
### Components/Axes
The image consists of two side-by-side bar charts, (a) on the left and (b) on the right, sharing a common legend positioned at the top-center.
* **Legend (Top-center):**
* Light Red: Base Generator
* Green: Google Search
* Blue: Web Search
* Purple: Wikipedia Search
* **Y-axis (Left side of both charts):**
* Title: "Tool Call Ratio (%)"
* Scale: Ranges from 0 to 80, with major grid lines and labels at 0, 10, 20, 30, 40, 50, 60, 70, 80.
* **X-axis (Bottom, shared across both charts):**
* Title: "Training Steps"
* Categories: "Step 0" and "Step 32" for each sub-chart.
* **Common Labels (Above the bars, between "Step 0" and "Step 32" for both charts):**
* Text: "After Fine-tuning"
* Visual: A gray arrow pointing from left to right, indicating the progression from "Step 0" to "Step 32".
* **Sub-chart Titles (Bottom-left of each chart):**
* (a) 2Wiki
* (b) MedQA
* **Accuracy Boxes (Top-left and Top-right above the bars for each chart):**
* **Chart (a) 2Wiki:**
* Above "Step 0": "Acc: 60.0%"
* Above "Step 32": "Acc: 77.2% (+17.2%)" (The "+17.2%" is colored red, indicating an increase).
* **Chart (b) MedQA:**
* Above "Step 0": "Acc: 76.0%"
* Above "Step 32": "Acc: 80.0% (+4.0%)" (The "+4.0%" is colored red, indicating an increase).
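The layout described above can be reproduced approximately with matplotlib; a minimal sketch, using the estimated bar heights quoted later in this description (all numbers are visual estimates, not exact data, and the styling is only approximate):

```python
# Approximate reconstruction of the two-panel grouped bar chart.
# All heights are the visual estimates quoted in this description.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

tools = ["Base Generator", "Google Search", "Web Search", "Wikipedia Search"]
colors = ["lightcoral", "green", "blue", "purple"]
data = {
    "(a) 2Wiki": {"Step 0": [0.5, 28.5, 36.0, 28.8],
                  "Step 32": [0.2, 70.5, 13.6, 4.0]},
    "(b) MedQA": {"Step 0": [28.7, 66.2, 0.5, 0.5],
                  "Step 32": [6.3, 10.9, 19.5, 59.8]},
}
acc = {"(a) 2Wiki": ("Acc: 60.0%", "Acc: 77.2% (+17.2%)"),
       "(b) MedQA": ("Acc: 76.0%", "Acc: 80.0% (+4.0%)")}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (title, steps) in zip(axes, data.items()):
    x = np.arange(len(steps))  # one group of four bars per training step
    width = 0.2
    for i, (tool, color) in enumerate(zip(tools, colors)):
        heights = [steps[s][i] for s in steps]
        # Label bars only once so the shared legend has no duplicates.
        ax.bar(x + (i - 1.5) * width, heights, width, color=color,
               label=tool if title == "(a) 2Wiki" else None)
    ax.set_xticks(x)
    ax.set_xticklabels(list(steps))
    ax.set_xlabel(title)
    for xi, label in zip(x, acc[title]):
        ax.text(xi, 82, label, ha="center")  # accuracy box above each group
axes[0].set_ylabel("Tool Call Ratio (%)")
axes[0].set_ylim(0, 90)
fig.supxlabel("Training Steps")
fig.legend(loc="upper center", ncol=4)
fig.savefig("tool_call_ratio.png")
```

The gray "After Fine-tuning" arrow between the step groups is omitted here; it could be added with `ax.annotate` if a closer match is needed.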
### Detailed Analysis
**Chart (a) 2Wiki**
* **Step 0 (Before Fine-tuning):**
* **Base Generator (Light Red):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* **Google Search (Green):** The bar reaches 28.5%.
* **Web Search (Blue):** The bar reaches 36.0%.
* **Wikipedia Search (Purple):** The bar reaches 28.8%.
* *Trend:* At Step 0, Web Search has the highest tool call ratio, followed closely by Wikipedia Search and Google Search, while the Base Generator is negligible.
* **Step 32 (After Fine-tuning):**
* **Base Generator (Light Red):** The bar remains very short, visually close to 0%, estimated at approximately 0.2%.
* **Google Search (Green):** The bar dramatically increases to 70.5%. An associated label "+42.0" (green) indicates a significant increase from Step 0.
* **Web Search (Blue):** The bar significantly decreases to 13.6%. An associated label "-22.4" (blue) indicates a decrease from Step 0.
* **Wikipedia Search (Purple):** The bar significantly decreases to 4.0%. An associated label "-24.8" (purple) indicates a decrease from Step 0.
* *Trend:* After fine-tuning, Google Search shows a massive increase in tool call ratio, becoming the dominant tool. Web Search and Wikipedia Search show substantial decreases, while the Base Generator remains minimal.
**Chart (b) MedQA**
* **Step 0 (Before Fine-tuning):**
* **Base Generator (Light Red):** The bar reaches 28.7%.
* **Google Search (Green):** The bar reaches 66.2%.
* **Web Search (Blue):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* **Wikipedia Search (Purple):** The bar is very short, visually close to 0%, estimated at approximately 0.5%.
* *Trend:* At Step 0, Google Search has a very high tool call ratio, followed by the Base Generator. Web Search and Wikipedia Search are negligible.
* **Step 32 (After Fine-tuning):**
* **Base Generator (Light Red):** The bar significantly decreases to 6.3%. An associated label "-22.4" (red) indicates a decrease from Step 0.
* **Google Search (Green):** The bar significantly decreases to 10.9%. An associated label "-55.3" (green) indicates a substantial decrease from Step 0.
* **Web Search (Blue):** The bar increases to 19.5%. The associated label "+19.5" (blue) equals the Step 32 value, consistent with the near-zero Step 0 bar being treated as 0.
* **Wikipedia Search (Purple):** The bar increases dramatically to 59.8%. The associated label "+59.8" (purple) likewise equals the Step 32 value, indicating a massive increase from a near-zero Step 0.
* *Trend:* After fine-tuning, Base Generator and Google Search show significant decreases in tool call ratio. Conversely, Web Search and Wikipedia Search show dramatic increases, with Wikipedia Search becoming the most utilized tool.
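The change labels annotated on both charts can be checked against the per-step bar values quoted above; a minimal sketch, treating the near-zero Step 0 bars as exactly 0 (which is what the chart's own labels "+19.5" and "+59.8" imply, since they equal the Step 32 heights; the unlabeled 2Wiki Base Generator bar is omitted):

```python
# Sanity-check the chart's delta labels against the quoted bar heights.
step0 = {
    "2Wiki": {"Google": 28.5, "Web": 36.0, "Wiki": 28.8},
    "MedQA": {"Base": 28.7, "Google": 66.2, "Web": 0.0, "Wiki": 0.0},
}
step32 = {
    "2Wiki": {"Google": 70.5, "Web": 13.6, "Wiki": 4.0},
    "MedQA": {"Base": 6.3, "Google": 10.9, "Web": 19.5, "Wiki": 59.8},
}

def deltas(dataset):
    """Step 32 minus Step 0 tool call ratio, rounded to one decimal."""
    return {t: round(step32[dataset][t] - step0[dataset][t], 1)
            for t in step32[dataset]}

print(deltas("2Wiki"))  # {'Google': 42.0, 'Web': -22.4, 'Wiki': -24.8}
print(deltas("MedQA"))  # {'Base': -22.4, 'Google': -55.3, 'Web': 19.5, 'Wiki': 59.8}

# The accuracy gains are absolute percentage-point differences:
print(round(77.2 - 60.0, 1))  # 17.2
print(round(80.0 - 76.0, 1))  # 4.0
```

Every computed delta matches the corresponding annotated label, so the per-tool values and the change labels in the figure are internally consistent.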
### Key Observations
* **Overall Accuracy Improvement:** Both datasets, 2Wiki and MedQA, show an increase in overall accuracy after fine-tuning, with 2Wiki experiencing a larger absolute gain (+17.2 percentage points) than MedQA (+4.0 points).
* **Divergent Tool Utilization Patterns:** Fine-tuning leads to drastically different tool utilization patterns between the 2Wiki and MedQA datasets.
* For **2Wiki**, fine-tuning strongly favors **Google Search**, which sees a massive increase in its tool call ratio (from 28.5% to 70.5%). Web Search and Wikipedia Search, which were moderately used before, become much less utilized. The Base Generator remains largely unused.
* For **MedQA**, fine-tuning shifts preference away from **Google Search** and the **Base Generator** (both seeing significant decreases) towards **Wikipedia Search** and **Web Search** (both seeing dramatic increases). Wikipedia Search becomes the dominant tool after fine-tuning.
* **Base Generator Role:** The "Base Generator" tool call ratio is consistently very low for 2Wiki both before and after fine-tuning. For MedQA, it starts at a moderate level (28.7%) but significantly decreases after fine-tuning (to 6.3%).
* **Magnitude of Change:** The changes in tool call ratio are substantial for most tools after fine-tuning, indicating a strong impact of the fine-tuning process on tool selection behavior.
### Interpretation
The data suggests that fine-tuning a model for specific datasets (2Wiki vs. MedQA) leads to specialized and optimized tool-calling strategies, rather than a universal improvement across all tools.
For the **2Wiki dataset**, the fine-tuning process appears to have learned that "Google Search" is the most effective tool for improving accuracy. The model's reliance on Google Search dramatically increases, while the other search tools (Web Search, Wikipedia Search) become less relevant. This implies that for tasks in the 2Wiki domain, Google Search provides the most valuable information or is best integrated with the fine-tuned model's capabilities. The large accuracy gain (+17.2 points) for 2Wiki coincides with this increased reliance on Google Search, although the chart alone cannot establish causation.
For the **MedQA dataset**, the fine-tuning process identifies "Wikipedia Search" and "Web Search" as the primary tools for enhancing performance. The model significantly reduces its calls to "Base Generator" and "Google Search," which were initially more prominent. This indicates that for medical question-answering tasks (MedQA), information from Wikipedia and general web searches is more pertinent or effectively leveraged by the fine-tuned model. The smaller, but still positive, accuracy gain (+4.0%) for MedQA is achieved through this shift in tool preference.
The "Base Generator" generally plays a minor role, especially for 2Wiki, suggesting that for these tasks, external tools are almost always preferred over the base model's generation capabilities. Its decrease in MedQA further supports the idea that fine-tuning directs the model to more specialized external resources.
In essence, fine-tuning acts as a mechanism for learning which external tools are most beneficial in a given domain, producing a specialized tool-calling strategy that improves accuracy even though the preferred tools differ markedly across datasets. Fine-tuning does not merely amplify existing tool usage; it actively re-prioritizes and re-allocates tool calls according to each dataset's specific information needs.