## [Chart Set]: PrOntoQA Ablation Study - CAPT Setting Performance
### Overview
The image displays a 3x3 grid of line charts from the "PrOntoQA Ablation Study." The charts compare the performance (Score) of three different "CAPT Settings" (null, order, random) over the course of training (Number of Epochs). The analysis is segmented by two variables: the type of reasoning task (Commonsense, Anticommonsense, Noncommonsense) and the percentage of training data used (2%, 5%, 90%).
### Components/Axes
* **Overall Title:** "PrOntoQA Ablation Study" (Top-left corner of the entire figure).
* **Chart Grid:** 3 rows (Training Data Percentage: 2%, 5%, 90%) x 3 columns (Task Type: Commonsense, Anticommonsense, Noncommonsense).
* **Individual Chart Titles:** Each subplot has a title in the format `[Task Type] - [Training %] Training`.
* **X-Axis (All Charts):** "Number of Epochs". The scale is linear, with major ticks at 0, 1000, 2000, and 3000.
* **Y-Axis (All Charts):** "Score". The scale is linear but varies per chart to best display the data range (e.g., 70-100, 75-85, 80-95).
* **Legend:** Located to the right of the top-right chart. It defines the three data series:
    * `CAPT=null` (Blue line with circle markers)
    * `CAPT=order` (Green line with square markers)
    * `CAPT=random` (Orange line with diamond markers)
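The grid layout and title format described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme: the names `TASK_TYPES`, `TRAINING_PERCENTAGES`, `subplot_title`, and `GRID` are hypothetical and do not come from the figure or its source code.

```python
# Hypothetical names; only the title format and the row/column order
# are taken from the figure description.
TASK_TYPES = ["Commonsense", "Anticommonsense", "Noncommonsense"]  # columns
TRAINING_PERCENTAGES = [2, 5, 90]                                   # rows


def subplot_title(task: str, pct: int) -> str:
    """Build a title in the form '[Task Type] - [Training %] Training'."""
    return f"{task} - {pct}% Training"


# GRID[row][col] holds the title of the subplot at that grid position.
GRID = [
    [subplot_title(task, pct) for task in TASK_TYPES]
    for pct in TRAINING_PERCENTAGES
]

for row in GRID:
    print(" | ".join(row))
```

Indexing `GRID[row][col]` follows the same row (training percentage) and column (task type) order used in the rest of this description.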
### Detailed Analysis
The following analysis breaks down each chart by row (training percentage) and column (task type). The key epochs are 0, ~250, ~500, ~1000, ~1500, and ~3200. Rather than listing a value at every epoch, each entry summarizes the visual trend of a CAPT setting by its approximate starting score, any notable peak or dip, and its final score.
**Row 1: 2% Training Data**
* **Commonsense - 2% Training (Top-Left):**
    * `CAPT=null` (Blue): Starts high (~97), quickly peaks near 100, and remains stable at ~100.
    * `CAPT=order` (Green): Starts low (~73), rises sharply to ~85 by epoch 250, then fluctuates slightly, ending near 83.
    * `CAPT=random` (Orange): Starts at ~80, rises to ~87 by epoch 250, then gradually increases to ~88.
* **Anticommonsense - 2% Training (Top-Center):**
    * `CAPT=null` (Blue): Starts at ~75, peaks at ~77, then declines to a low of ~71 before a slight recovery to ~72.
    * `CAPT=order` (Green): Starts at ~70, rises to ~79, then gradually declines to ~75.
    * `CAPT=random` (Orange): Starts at ~76, rises sharply to ~83, dips slightly, then recovers to finish at ~83.
* **Noncommonsense - 2% Training (Top-Right):**
    * `CAPT=null` (Blue): Starts at ~80, peaks at ~84, then declines to ~78.
    * `CAPT=order` (Green): Starts at ~70, rises steeply to ~83, then fluctuates, ending at ~83.
    * `CAPT=random` (Orange): Starts at ~84, dips to ~80, then recovers to ~81.
**Row 2: 5% Training Data**
* **Commonsense - 5% Training (Middle-Left):**
    * `CAPT=null` (Blue): Starts at ~97, quickly reaches and plateaus at ~100.
    * `CAPT=order` (Green): Starts at ~78, rises steadily to ~93.
    * `CAPT=random` (Orange): Starts at ~85, rises to ~95, then stabilizes around ~94.
* **Anticommonsense - 5% Training (Middle-Center):**
    * `CAPT=null` (Blue): Starts at ~72, peaks at ~77, then declines to ~70 before a slow rise to ~75.
    * `CAPT=order` (Green): Starts at ~72, rises to ~78, then continues a steady climb to ~80.
    * `CAPT=random` (Orange): Starts at ~77, rises sharply to ~88, and remains stable at that level.
* **Noncommonsense - 5% Training (Middle-Right):**
    * `CAPT=null` (Blue): Starts at ~83, peaks at ~85, then declines to ~80.
    * `CAPT=order` (Green): Starts at ~78, rises to ~85, dips slightly, then climbs to ~87.
    * `CAPT=random` (Orange): Starts at ~79, rises steeply to ~92, and remains stable.
**Row 3: 90% Training Data**
* **Commonsense - 90% Training (Bottom-Left):**
    * `CAPT=null` (Blue): Starts at ~96, quickly reaches and stays at ~100.
    * `CAPT=order` (Green): Starts at ~80, rises sharply to ~98, and stabilizes.
    * `CAPT=random` (Orange): Starts at ~79, rises sharply to ~99, and stabilizes.
* **Anticommonsense - 90% Training (Bottom-Center):**
    * `CAPT=null` (Blue): Starts at ~68, peaks at ~76, then declines to ~70 before a slow rise to ~74.
    * `CAPT=order` (Green): Starts at ~78, rises to ~90, and continues a slow climb to ~91.
    * `CAPT=random` (Orange): Starts at ~77, rises sharply to ~91, and remains stable.
* **Noncommonsense - 90% Training (Bottom-Right):**
    * `CAPT=null` (Blue): Starts at ~80, rises to ~87, then declines to ~84.
    * `CAPT=order` (Green): Starts at ~81, rises sharply to ~95, and remains stable.
    * `CAPT=random` (Orange): Starts at ~82, rises sharply to ~97, and remains stable.
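To make the per-chart comparison easier to check, the approximate final scores listed above can be tabulated and the leading setting found for each chart. This is a minimal sketch: the values are the rounded visual estimates from this description (not the underlying data), and the names `FINAL_SCORES`, `leaders`, and `wins` are hypothetical.

```python
# Approximate final scores read from the descriptions above
# (visual estimates, keyed by (task type, training percentage)).
FINAL_SCORES = {
    ("Commonsense", 2): {"null": 100, "order": 83, "random": 88},
    ("Anticommonsense", 2): {"null": 72, "order": 75, "random": 83},
    ("Noncommonsense", 2): {"null": 78, "order": 83, "random": 81},
    ("Commonsense", 5): {"null": 100, "order": 93, "random": 94},
    ("Anticommonsense", 5): {"null": 75, "order": 80, "random": 88},
    ("Noncommonsense", 5): {"null": 80, "order": 87, "random": 92},
    ("Commonsense", 90): {"null": 100, "order": 98, "random": 99},
    ("Anticommonsense", 90): {"null": 74, "order": 91, "random": 91},
    ("Noncommonsense", 90): {"null": 84, "order": 95, "random": 97},
}


def leaders(scores: dict) -> list:
    """Return every setting that shares the highest final score (ties kept)."""
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


# Count how many charts each setting leads; a tie counts for each tied setting.
wins = {}
for (task, pct), scores in FINAL_SCORES.items():
    top = leaders(scores)
    for name in top:
        wins[name] = wins.get(name, 0) + 1
    print(f"{task} - {pct}% Training: {', '.join(top)}")

print(wins)
```

Because the inputs are estimates rounded to the nearest point, a tie or a one-point lead in this table should not be read as a meaningful difference.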
### Key Observations
1. **CAPT=null (Blue) Performance:** On Anticommonsense and Noncommonsense tasks, this setting ends with the lowest score in every chart and is the most volatile: it typically peaks early and then degrades or stagnates. It performs best on Commonsense tasks, where it quickly reaches a ceiling of ~100 and outscores both other settings.
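2. **CAPT=random (Orange) Dominance:** This setting reaches the highest final score in most Anticommonsense and Noncommonsense charts; the exceptions are Anticommonsense - 90%, where it ties with `order` at ~91, and Noncommonsense - 2%, where `order` finishes higher. It shows rapid initial learning, but on Commonsense tasks it stays below `null`.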
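3. **CAPT=order (Green) Performance:** On Anticommonsense and Noncommonsense tasks, this setting generally finishes above `null` and at or below `random`; the exception is Noncommonsense - 2%, where it finishes highest (~83). Its performance improves significantly as the training data percentage increases (from 2% to 90%).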
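4. **Impact of Training Data Size:** The `order` and `random` settings reach higher and more stable scores with more training data (90% vs. 2%), whereas `null` changes little (it is already at ~100 on Commonsense and stays at ~72-75 on Anticommonsense). As a result, the gap between `order`/`random` and `null` widens with more data on the harder tasks (Anticommonsense, Noncommonsense).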
5. **Task Difficulty:** Commonsense tasks appear easiest, with scores clustering near 100. Anticommonsense tasks show the lowest overall scores and the most significant performance differences between CAPT settings.
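The widening gap noted in observation 4 can be checked with simple arithmetic on the approximate final scores from the Detailed Analysis. The sketch below compares `random` against `null` on the two harder tasks; the names `NULL_FINAL`, `RANDOM_FINAL`, `gap_by_percentage`, and `GAPS` are hypothetical, and the inputs are rounded visual estimates rather than measured values.

```python
# Approximate final scores for the two harder task types, keyed by
# training percentage (visual estimates from the Detailed Analysis).
NULL_FINAL = {
    "Anticommonsense": {2: 72, 5: 75, 90: 74},
    "Noncommonsense": {2: 78, 5: 80, 90: 84},
}
RANDOM_FINAL = {
    "Anticommonsense": {2: 83, 5: 88, 90: 91},
    "Noncommonsense": {2: 81, 5: 92, 90: 97},
}


def gap_by_percentage(task: str) -> dict:
    """Final-score gap (random minus null) at each training percentage."""
    return {
        pct: RANDOM_FINAL[task][pct] - NULL_FINAL[task][pct]
        for pct in sorted(NULL_FINAL[task])
    }


GAPS = {task: gap_by_percentage(task) for task in NULL_FINAL}

for task, gaps in GAPS.items():
    print(task, gaps)
```

Under these estimates the gap grows from about 11 to 17 points on Anticommonsense and from about 3 to 13 points on Noncommonsense as the training percentage rises from 2% to 90%.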
### Interpretation
This ablation study investigates the impact of different "CAPT" (likely a form of data augmentation or training curriculum) strategies on a model's reasoning abilities, as measured on PrOntoQA tasks. The data suggests several key insights:
* **The `random` CAPT strategy is highly effective on the harder tasks.** Its lead in most non-intuitive (Anticommonsense) and novel (Noncommonsense) charts suggests that introducing randomized elements during training may help the model generalize better and avoid overfitting to simplistic patterns.
* **The `null` strategy (no CAPT) is insufficient for the harder tasks.** Although it reaches ~100 on Commonsense, its low and unstable scores on Anticommonsense and Noncommonsense at every training data percentage suggest the base training procedure is brittle. The early peak and subsequent decline on these tasks may indicate catastrophic forgetting or an inability to escape local minima.
* **Structured augmentation (`order`) has moderate benefits.** While better than nothing on the harder tasks, a fixed-order strategy is usually less effective than a randomized one. This suggests that the diversity and unpredictability of the training signal matter more than a rigid structure for building robust reasoning skills.
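* **Data efficiency is strategy-dependent.** With only 2% of training data, the choice of CAPT strategy already matters: final scores differ by ~11 points on Anticommonsense and by ~17 points on Commonsense (where `null` leads). With 90% data, the `order` and `random` strategies converge to high performance, but `null` still lags on the harder tasks. Notably, `random` with 5% data (~88 on Anticommonsense) outscores `null` with 90% data (~74), which suggests that a better training methodology can compensate for data scarcity to a significant degree.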
In summary, the charts suggest that incorporating a randomized CAPT strategy during training helps develop robust and generalizable reasoning on PrOntoQA tasks that require non-commonsense logic (Anticommonsense, Noncommonsense), including when training data is limited. On Commonsense tasks, the `null` setting already reaches ~100.