## Bar Charts: Tool Call Ratio vs. Training Steps for 2Wiki and MedQA
### Overview
The image presents two bar charts, labeled (a) 2Wiki and (b) MedQA, comparing the Tool Call Ratio (%) at two training steps: Step 0 and Step 32. Each chart displays the ratio for three tools: Base Generator, Google Search, and Wikipedia Search. The charts also show the accuracy (Acc) at each step, with the change after fine-tuning, in percentage points, indicated at Step 32.
### Components/Axes
* **X-axis:** Training Steps (Step 0, Step 32)
* **Y-axis:** Tool Call Ratio (%) - Scale ranges from 0 to 80.
* **Legend:**
    * Red: Base Generator
    * Green: Google Search
    * Blue: Wikipedia Search
* **Accuracy Labels:** "Acc: [value]%" displayed above each set of bars for Step 0 and Step 32; the Step 32 label adds the change from Step 0 in parentheses.
* **Arrow:** A gray arrow indicates the progression from Step 0 to Step 32, labeled "After Fine-tuning".
### Detailed Analysis
**Chart (a) 2Wiki:**
* **Step 0:**
    * Base Generator: Approximately 28.5%
    * Google Search: Approximately 36.0%
    * Wikipedia Search: Approximately 28.8%
    * Accuracy: 60.0%
* **Step 32:**
    * Base Generator: Approximately 13.6% (-22.4%)
    * Google Search: Approximately 70.5% (+42.0%)
    * Wikipedia Search: Approximately 24.8% (-4.0%)
    * Accuracy: 77.2% (+17.2%)

**Chart (b) MedQA:**
* **Step 0:**
    * Base Generator: Approximately 28.7%
    * Google Search: Approximately 66.2%
    * Wikipedia Search: Approximately 59.8%
    * Accuracy: 76.0%
* **Step 32:**
    * Base Generator: Approximately 10.9% (-55.3%)
    * Google Search: Approximately 6.3% (-22.4%)
    * Wikipedia Search: Approximately 19.5% (+19.5%)
    * Accuracy: 80.0% (+4.0%)
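
Because the readings above are approximate, it can help to check each parenthesized change against the difference between the listed Step 0 and Step 32 values. The following is a minimal sketch of such a check, not part of the original figure: the numbers are transcribed from the lists above, while the names `READINGS`, `TOLERANCE`, and `find_mismatches`, and the 0.5-point tolerance, are assumptions made for illustration.

```python
# Readings transcribed from the lists above: (step 0, step 32, annotated change).
READINGS = {
    "2Wiki": {
        "Base Generator": (28.5, 13.6, -22.4),
        "Google Search": (36.0, 70.5, +42.0),
        "Wikipedia Search": (28.8, 24.8, -4.0),
        "Accuracy": (60.0, 77.2, +17.2),
    },
    "MedQA": {
        "Base Generator": (28.7, 10.9, -55.3),
        "Google Search": (66.2, 6.3, -22.4),
        "Wikipedia Search": (59.8, 19.5, +19.5),
        "Accuracy": (76.0, 80.0, +4.0),
    },
}

# Hypothetical slack for values that were read off the chart approximately.
TOLERANCE = 0.5


def find_mismatches(readings, tolerance=TOLERANCE):
    """Return (dataset, series, computed, annotated) where the two changes disagree."""
    mismatches = []
    for dataset, series in readings.items():
        for name, (step0, step32, annotated) in series.items():
            computed = round(step32 - step0, 1)
            if abs(computed - annotated) > tolerance:
                mismatches.append((dataset, name, computed, annotated))
    return mismatches


if __name__ == "__main__":
    for dataset, name, computed, annotated in find_mismatches(READINGS):
        print(f"{dataset} / {name}: computed {computed:+.1f}, annotated {annotated:+.1f}")
```

Any row this check reports is a candidate for a mislabeled bar or a misread value, and is worth comparing against the original figure before drawing conclusions from it.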
### Key Observations
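* Google Search moves in opposite directions: it is annotated with a large increase on 2Wiki (+42.0, to about 70.5% at Step 32) and a decrease on MedQA (-22.4, to about 6.3%).
* The Base Generator is annotated with a decrease in Tool Call Ratio on both datasets after fine-tuning (-22.4 on 2Wiki, -55.3 on MedQA).
* Wikipedia Search is annotated with a moderate increase on MedQA (+19.5) and a small decrease on 2Wiki (-4.0). The MedQA readings listed above (about 59.8% at Step 0 and 19.5% at Step 32) do not match that annotation, so this entry should be checked against the figure.
* Accuracy increases on both datasets after fine-tuning: from 60.0% to 77.2% (+17.2) on 2Wiki and from 76.0% to 80.0% (+4.0) on MedQA.
* The annotated decrease for the Base Generator is larger on MedQA (-55.3) than on 2Wiki (-22.4), while the accuracy gain is larger on 2Wiki.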
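
The per-tool direction of change can be cross-checked directly from the parenthesized annotations. The sketch below is illustrative only: the values are the annotated changes transcribed from the Step 32 entries, and the names `ANNOTATED_CHANGE` and `trend` are hypothetical.

```python
# Annotated Step 0 -> Step 32 changes, transcribed from the Step 32 entries.
ANNOTATED_CHANGE = {
    "Base Generator": {"2Wiki": -22.4, "MedQA": -55.3},
    "Google Search": {"2Wiki": +42.0, "MedQA": -22.4},
    "Wikipedia Search": {"2Wiki": -4.0, "MedQA": +19.5},
}


def trend(changes):
    """Classify a tool by the sign of its annotated change across datasets."""
    values = list(changes.values())
    if all(v > 0 for v in values):
        return "increase on every dataset"
    if all(v < 0 for v in values):
        return "decrease on every dataset"
    return "mixed"


if __name__ == "__main__":
    for tool, changes in ANNOTATED_CHANGE.items():
        print(f"{tool}: {trend(changes)}")
```

Under these annotations, only the Base Generator changes in the same direction on both datasets; Google Search and Wikipedia Search each rise on one dataset and fall on the other.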
### Interpretation
The data suggests that fine-tuning improves the overall accuracy of the model on both the 2Wiki and MedQA datasets. However, the impact on the Tool Call Ratio varies significantly depending on the tool and the dataset.
On 2Wiki, the annotated +42.0-point jump for Google Search, to roughly 70.5% of tool calls, suggests that the fine-tuned model leans heavily on web retrieval for this task. On MedQA the same tool is annotated with a 22.4-point drop, so the shift toward Google Search is dataset-specific rather than general. The Base Generator is annotated with a decrease on both datasets, which suggests that the fine-tuned model relies less on the generator's internal knowledge; the chart alone cannot show whether this reflects a deliberate shift toward retrieval or a side effect of fine-tuning.
The differing behavior of the Wikipedia Search method between the two datasets could be due to the nature of the information available in Wikipedia for each task. The MedQA dataset might benefit more from the structured knowledge available in Wikipedia, while the 2Wiki dataset might require more nuanced information retrieval from Google Search.
The large negative changes in the MedQA dataset for the Base Generator and Google Search suggest that the fine-tuning process may be overfitting to the training data, or that the initial model was particularly reliant on these methods, and the fine-tuning process has altered this reliance. Further investigation is needed to understand the underlying reasons for these trends.