## Bar Chart: Tool Call Ratio Comparison Before and After Fine-Tuning
### Overview
The image is a grouped bar chart comparing the "Tool Call Ratio (%)" of four different methods ("Base Generator", "Google Search", "Web Search", "Wikipedia Search") at two different stages: "Step 0" (before fine-tuning) and "Step 32" (after fine-tuning). The chart includes accuracy metrics for the overall system at both stages.
### Components/Axes
* **Chart Type:** Grouped bar chart.
* **Y-Axis:** Labeled "Tool Call Ratio (%)". Scale ranges from 0 to 60, with major tick marks at 0, 20, 40, and 60.
* **X-Axis:** Two categorical groups: "Step 0" and "Step 32".
* **Legend:** Positioned at the top of the chart. It defines four colored categories:
  * Pink/Salmon: `Base Generator`
  * Green: `Google Search`
  * Blue: `Web Search`
  * Purple: `Wikipedia Search`
* **Annotations:**
  * Two white boxes with black text at the top display overall accuracy:
    * Above Step 0: `Acc:19.2%`
    * Above Step 32: `Acc: 25.2% (+6.21%)`
  * A grey arrow points from the Step 0 group to the Step 32 group, labeled `After Finetuning`.
  * Numerical values are printed directly on or above each bar.
  * Change values (e.g., `-1.5`, `+5.2`) are printed in green (for decreases) or red (for increases) above the bars in the Step 32 group, indicating the change from Step 0.
### Detailed Analysis
**Step 0 (Before Fine-Tuning):**
* **Base Generator (Pink):** Bar height corresponds to a value of **3.1%**.
* **Google Search (Green):** Bar height corresponds to a value of **38.7%**.
* **Web Search (Blue):** Bar height corresponds to a value of **18.4%**.
* **Wikipedia Search (Purple):** Bar height corresponds to a value of **38.5%**.
* **Trend:** Google Search and Wikipedia Search have very high and nearly identical tool call ratios (~38.5-38.7%). Web Search is moderate (~18.4%), and Base Generator is very low (~3.1%).
**Step 32 (After Fine-Tuning):**
* **Base Generator (Pink):** Bar height corresponds to a value of **0.9%**. A green annotation above indicates a change of **-2.2** (a decrease of 2.2 percentage points from Step 0).
* **Google Search (Green):** Bar height corresponds to a value of **37.2%** (calculated as 38.7 - 1.5). A green annotation above indicates a change of **-1.5** (a decrease of 1.5 percentage points).
* **Web Search (Blue):** Bar height corresponds to a value of **23.6%** (calculated as 18.4 + 5.2). A red annotation above indicates a change of **+5.2** (an increase of 5.2 percentage points).
* **Wikipedia Search (Purple):** Bar height corresponds to a value of **33.8%** (calculated as 38.5 - 4.7). A green annotation above indicates a change of **-4.7** (a decrease of 4.7 percentage points).
* **Trend:** After fine-tuning, the tool call ratios for Google Search, Wikipedia Search, and Base Generator all decreased. The ratio for Web Search increased notably. Wikipedia Search saw the largest absolute decrease.
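The Step 32 values and the trend above follow from simple arithmetic on the Step 0 values and the annotated changes. A minimal sketch of that check, assuming the numbers transcribed in this description are accurate (the variable names are illustrative and do not come from the chart):

```python
# Tool call ratios (%) at Step 0, as transcribed from the chart.
step0 = {
    "Base Generator": 3.1,
    "Google Search": 38.7,
    "Web Search": 18.4,
    "Wikipedia Search": 38.5,
}

# Annotated changes at Step 32, in percentage points.
delta = {
    "Base Generator": -2.2,
    "Google Search": -1.5,
    "Web Search": 5.2,
    "Wikipedia Search": -4.7,
}

# Step 32 ratio = Step 0 ratio + annotated change (rounded to one decimal).
step32 = {tool: round(step0[tool] + delta[tool], 1) for tool in step0}

# Tools whose ratio went up, and the tool with the largest drop.
increased = [tool for tool, change in delta.items() if change > 0]
largest_drop = min(delta, key=delta.get)

for tool in step0:
    print(f"{tool}: {step0[tool]} -> {step32[tool]} ({delta[tool]:+.1f})")
print("Increased:", increased)
print("Largest decrease:", largest_drop)
```

Note that the four ratios do not sum to exactly 100% at either step; the chart gives no explanation for the remainder.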
### Key Observations
1. **Dominant Methods Pre-Tuning:** Google Search and Wikipedia Search were the dominant tools called at Step 0, with nearly equal usage (~38.5%).
2. **Shift in Tool Usage:** Between Step 0 and Step 32, tool usage was redistributed. Reliance on Wikipedia Search dropped the most (-4.7 points), while Web Search was the only tool whose usage increased (+5.2 points). Google Search usage decreased slightly (-1.5 points).
3. **Base Generator Reduction:** The Base Generator's already low usage was further reduced by roughly 71% in relative terms (from 3.1% to 0.9%).
4. **Overall Accuracy Improvement:** The system's accuracy improved from 19.2% to 25.2%, an absolute gain that the chart annotates as +6.21% (the rounded values shown differ by 6.0 percentage points), coinciding with the change in tool call patterns.
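The difference between absolute and relative change matters when reading these observations. A short sketch of both calculations, using only the rounded values displayed on the chart. The chart's own `+6.21%` annotation is slightly larger than the difference of the displayed values, presumably because it was computed from unrounded accuracies; that explanation is an assumption, not something the chart states.

```python
# Accuracy values (%) as displayed in the chart annotations.
acc_before, acc_after = 19.2, 25.2

# Absolute gain, in percentage points.
absolute_gain = round(acc_after - acc_before, 2)

# Relative gain, as a percentage of the Step 0 accuracy.
relative_gain = (acc_after - acc_before) / acc_before * 100

# Relative reduction in Base Generator usage (3.1% -> 0.9%).
base_relative_drop = (3.1 - 0.9) / 3.1 * 100

print(f"Absolute accuracy gain: {absolute_gain} points")
print(f"Relative accuracy gain: {relative_gain:.1f}%")
print(f"Base Generator relative drop: {base_relative_drop:.1f}%")
```

With the displayed values, the absolute gain is 6.0 percentage points, which corresponds to a relative gain of about 31% over the Step 0 accuracy.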
### Interpretation
The data suggests that fine-tuning changed the system's tool selection strategy. The main effect appears to be fewer calls to Wikipedia Search and the Base Generator and more calls to Web Search, which may provide more relevant or up-to-date information for the tasks at hand.
The **accuracy gain (annotated as +6.21%)** coincides with this shift, although two checkpoints alone cannot establish that the shift caused it. One plausible reading is that the system learned to rely less on the static knowledge of the Base Generator and the potentially less-current Wikipedia, while increasing its use of the broader Web Search. The slight decrease in Google Search calls might indicate a refinement in query specificity or a substitution effect where Web Search captured some of its previous role.
The most notable feature is the **divergent behavior of Web Search**: it is the only tool whose usage increased after fine-tuning. This suggests that the fine-tuning data or objective function favored Web Search as a particularly valuable resource for improving task performance. The chart shows a move from near-equal reliance on two search tools (Google, Wikipedia) to a more differentiated mix in which Web Search gains share, although Google Search (37.2%) and Wikipedia Search (33.8%) remain the two most-called tools. This shift accompanies a measurable improvement in system accuracy.