## Composite Training Performance Charts: Policy Entropy, Action Tokens, and Test Accuracy
### Overview
The image displays three side-by-side line charts comparing the training dynamics and final performance of four methods. The charts track metrics over 100 training steps. The overall purpose is to demonstrate the behavior and effectiveness of a proposed method ("Ours (CoNL)") against three baselines: "SRT (Multi-agents)", "SRT (Single-agent)", and "RL-ground-truth".
### Components/Axes
**Common Elements Across All Charts:**
* **X-Axis:** Labeled "Training Step". Scale runs from 0 to 100 with major ticks at intervals of 20 (0, 20, 40, 60, 80, 100).
* **Legend:** Located in the top-left corner of each chart. Contains four entries with corresponding line colors:
* `Ours (CoNL)`: Blue line
* `SRT (Multi-agents)`: Orange line
* `SRT (Single-agent)`: Red line
* `RL-ground-truth`: Green line
* **Grid:** Light gray horizontal and vertical grid lines are present.
**Chart 1 (Left): Policy Entropy During Training**
* **Title:** "Policy Entropy During Training"
* **Y-Axis:** Labeled "Policy Entropy". Scale runs from 0.0 to 1.0 with major ticks at 0.2 intervals.
**Chart 2 (Center): Action Tokens During Training**
* **Title:** "Action Tokens During Training"
* **Y-Axis:** Labeled "Action Tokens per Turn". Scale runs from 2000 to 14000 with major ticks at 2000 intervals.
**Chart 3 (Right): Test Performance (DeepMath)**
* **Title:** "Test Performance (DeepMath)"
* **Y-Axis:** Labeled "Test Accuracy". Scale runs from 0.4 to 0.9 with major ticks at 0.1 intervals.
### Detailed Analysis
**Chart 1: Policy Entropy During Training**
* **Trend Verification:**
* **Ours (CoNL) [Blue]:** Shows a relatively stable, low-entropy trend with minor fluctuations. It stays well below both SRT lines for most of training, second only to RL-ground-truth.
* **SRT (Multi-agents) [Orange]:** Exhibits high volatility and a general upward trend, especially after step 40. It becomes the highest-entropy line from step ~50 onward.
* **SRT (Single-agent) [Red]:** Also volatile, with a significant upward trend starting around step 40, closely following but generally below the multi-agent version.
* **RL-ground-truth [Green]:** Shows the lowest and most stable entropy, with a slight downward trend over time.
* **Approximate Data Points (Estimated from grid):**
* At Step 0: All methods start between ~0.2 and ~0.4.
* At Step 50: Blue ~0.3, Green ~0.2, Red ~0.6, Orange ~0.8.
* At Step 100: Blue ~0.3, Green ~0.2, Red ~0.9, Orange ~1.0.
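"Policy entropy" in Chart 1 presumably refers to the Shannon entropy of the policy's action distribution, averaged over a training step; the exact normalization used by the authors is not stated in the figure. A minimal sketch of the underlying quantity, illustrating why a confident policy reads low on this axis and a near-random one reads high:

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of one action probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked (confident) policy has low entropy, like the blue/green curves...
low = policy_entropy([0.97, 0.01, 0.01, 0.01])

# ...while a near-uniform (random) policy approaches the maximum log(n),
# resembling the SRT curves late in training.
high = policy_entropy([0.25, 0.25, 0.25, 0.25])
```

Under this reading, the SRT curves drifting toward 1.0 indicates the policies becoming progressively more random.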
**Chart 2: Action Tokens During Training**
* **Trend Verification:**
* **Ours (CoNL) [Blue]:** Displays a moderately volatile but generally stable trend, oscillating between ~6000 and ~10000 tokens.
* **SRT (Multi-agents) [Orange]:** Shows a strong upward trend with high volatility, rising from ~8000 to over 14000 tokens.
* **SRT (Single-agent) [Red]:** Follows a similar upward and volatile pattern to the multi-agent version, but at a slightly lower magnitude.
* **RL-ground-truth [Green]:** Starts very low (~2500), remains low until step ~60, then exhibits a sharp, volatile increase, peaking near 12000 before dropping again.
* **Approximate Data Points (Estimated from grid):**
* At Step 0: Green ~2500, Blue ~6000, Red ~8000, Orange ~8000.
* At Step 60: Green ~3000, Blue ~8000, Red ~10000, Orange ~12000.
* At Step 100: Green ~6000, Blue ~8000, Red ~12000, Orange ~14000.
**Chart 3: Test Performance (DeepMath)**
* **Trend Verification:**
* **Ours (CoNL) [Blue]:** Shows a smooth, steady, and strong upward trend, achieving the highest final accuracy.
* **SRT (Multi-agents) [Orange]:** Increases rapidly until step ~40, then experiences a sharp decline, followed by a partial recovery.
* **SRT (Single-agent) [Red]:** Follows a similar initial rise to the multi-agent version, peaks around step 50, then declines and stabilizes at a lower level.
* **RL-ground-truth [Green]:** Rises steadily, closely tracking the blue line until step ~40, then continues a smooth ascent to become the second-best performer.
* **Approximate Data Points (Estimated from grid):**
* At Step 0: All methods start between ~0.40 and ~0.45.
* At Step 40: All methods are clustered between ~0.65 and ~0.70.
* At Step 100: Blue ~0.88, Green ~0.85, Red ~0.65, Orange ~0.58.
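The step-100 readings above can be collected into a small structure to make the final ordering explicit. Note these are eyeballed estimates from the chart grid, not exact values reported by the authors:

```python
# Approximate final (step-100) test accuracies read off Chart 3
# (eyeballed estimates from the grid, not published numbers).
final_accuracy = {
    "Ours (CoNL)": 0.88,
    "RL-ground-truth": 0.85,
    "SRT (Single-agent)": 0.65,
    "SRT (Multi-agents)": 0.58,
}

# Sorting by accuracy reproduces the ordering described above:
# CoNL first, RL-ground-truth second, the SRT variants last.
ranking = sorted(final_accuracy, key=final_accuracy.get, reverse=True)
```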
### Key Observations
1. **Stability vs. Volatility:** The proposed method (`Ours (CoNL)`) demonstrates significantly more stable training dynamics (lower policy entropy variance, moderate action token usage) compared to the highly volatile SRT methods.
2. **Performance Divergence:** While all methods improve initially on the test task, a major divergence occurs after ~40-50 training steps. The SRT methods (especially multi-agent) suffer a performance collapse, whereas `Ours (CoNL)` and `RL-ground-truth` continue to improve steadily.
3. **Entropy and Token Correlation:** The rise in policy entropy for SRT methods (Chart 1) correlates with a dramatic increase in action tokens per turn (Chart 2), suggesting their policies become more random and verbose without gaining effectiveness.
4. **Ground Truth Benchmark:** The `RL-ground-truth` line serves as a high-performance baseline. `Ours (CoNL)` matches or slightly exceeds its final test accuracy, while using somewhat more action tokens and maintaining comparably low (though slightly higher) policy entropy.
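The entropy/token correlation claimed in observation 3 can be illustrated numerically from the approximate readings above. This is only a sketch: the values are eyeballed estimates, the step-0 entropy for the orange line is taken as ~0.3 (mid-range of the stated start), and the middle token reading is from step ~60 while the middle entropy reading is from step ~50:

```python
import math

# Eyeballed SRT (Multi-agents) readings from Charts 1 and 2.
entropy = [0.3, 0.8, 1.0]      # steps ~0, ~50, ~100 (assumed start value)
tokens = [8000, 12000, 14000]  # steps ~0, ~60, ~100

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Strongly positive for these readings, consistent with entropy and
# verbosity rising together for the SRT methods.
r = pearson(entropy, tokens)
```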
### Interpretation
The data suggests that the `Ours (CoNL)` method achieves a superior balance between exploration (measured by policy entropy) and efficient, goal-directed behavior (measured by test accuracy and action-token usage). The SRT methods, particularly the multi-agent variant, appear to suffer from instability or "collapse" during training: their policies become increasingly random (high entropy) and generate excessively long action sequences that fail to improve, and ultimately harm, final task performance. This could indicate issues with credit assignment, non-stationarity, or reward hacking in the SRT setups. The charts effectively argue that the proposed CoNL method is more robust and sample-efficient, converging to a high-performing policy without the pathological behaviors exhibited by the baselines. The `RL-ground-truth` performance validates that high accuracy is achievable, and CoNL meets this benchmark.