# Technical Data Extraction: RL Training Performance with RSA
## 1. Document Overview
This image contains five line charts comparing three Reinforcement Learning (RL) training configurations, each combined with Recursive Self-Aggregation (RSA). Each chart plots the "Pass@1" metric on one benchmark as the number of "RSA Steps" increases.
### Central Annotation Box
Located in the top-center of the image:
> "Aggregation-aware RL training leads to substantial gains with RSA. Standard RL, on the other hand, hurts RSA performance."
### Global Legend
Located in the center, below the annotation box:
* **Blue Circle Line (●):** Base + RSA
* **Green Square Line (■):** Standard RL + RSA
* **Orange Diamond Line (◆):** Aggregation-aware RL + RSA
---
## 2. Component Analysis
### Shared Axis Definitions
* **X-Axis:** RSA Step (tick labels: 2, 4, 6, 8, 10; the curves themselves begin at Step 1, as the key data points for each chart show)
* **Y-Axis:** Pass@1 (Scale varies by chart)
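
The figure does not state how Pass@1 is estimated. A common convention, assumed here, is the fraction of sampled answers that are correct, averaged over problems. The sketch below illustrates that convention only; `pass_at_1` and the sample grades are hypothetical and are not taken from the figure.

```python
from typing import Sequence


def pass_at_1(correct: Sequence[Sequence[bool]]) -> float:
    """Mean, over problems, of the fraction of sampled answers graded correct."""
    if not correct or any(len(samples) == 0 for samples in correct):
        raise ValueError("need at least one problem and one sample per problem")
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)


# Hypothetical grading results: 3 problems, 4 sampled answers each.
results = [
    [True, True, False, True],     # 3 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
    [True, True, True, True],      # 4 of 4 correct
]
score = pass_at_1(results)
print(f"Pass@1 = {score:.3f}")  # prints "Pass@1 = 0.583"
```

Under this reading, each point on a curve would be one such average computed from the answers produced at that RSA step.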
### Chart 1: HMMT-25 (Top Left)
* **Y-Axis Range:** 0.30 to 0.50+
* **Trend Analysis:**
    * **Aggregation-aware RL + RSA (Orange):** Slopes sharply upward and clearly outperforms both other methods. Ends at roughly 0.55 to 0.56.
    * **Base + RSA (Blue):** Slopes upward moderately. Ends near 0.48.
    * **Standard RL + RSA (Green):** Slopes upward but plateaus early. Ends near 0.45.
* **Key Data Points (Approximate):**
    * Step 1: All start near 0.28.
    * Step 10: Orange (~0.56), Blue (~0.48), Green (~0.45).
### Chart 2: Reasoning Gym Games (Top Right)
* **Y-Axis Range:** 0.55 to 0.70+
* **Trend Analysis:**
    * **Aggregation-aware RL + RSA (Orange):** Rapid initial climb, maintains highest performance.
    * **Base + RSA (Blue):** Steady climb, remains in the middle.
    * **Standard RL + RSA (Green):** Lowest performance throughout, plateaus after Step 4.
* **Key Data Points (Approximate):**
    * Step 1: Orange (~0.58), Blue (~0.54), Green (~0.53).
    * Step 10: Orange (~0.71), Blue (~0.69), Green (~0.66).
### Chart 3: AIME-25 (Bottom Left)
* **Y-Axis Range:** 0.50 to 0.70+
* **Trend Analysis:**
    * **Base + RSA (Blue):** Slopes upward and becomes the top performer after Step 4.
    * **Aggregation-aware RL + RSA (Orange):** Slopes upward but plateaus below the Base model.
    * **Standard RL + RSA (Green):** Similar to Orange, but with a wider shaded confidence interval indicating higher variance.
* **Key Data Points (Approximate):**
    * Step 1: All start near 0.45.
    * Step 10: Blue (~0.73), Orange (~0.69), Green (~0.68).
### Chart 4: LiveCodeBench-v6 (Bottom Center)
* **Y-Axis Range:** 0.50 to 0.60
* **Trend Analysis:**
    * **Aggregation-aware RL + RSA (Orange):** Strongest upward slope, clear separation from other lines.
    * **Base + RSA (Blue):** Moderate upward slope.
    * **Standard RL + RSA (Green):** Lowest performance, plateaus early.
* **Key Data Points (Approximate):**
    * Step 1: Orange (~0.51), Blue/Green (~0.49).
    * Step 10: Orange (~0.59), Blue (~0.57), Green (~0.56).
### Chart 5: Reasoning Gym Cognition + ARC (Bottom Right)
* **Y-Axis Range:** 0.45 to 0.55+
* **Trend Analysis:**
    * **Aggregation-aware RL + RSA (Orange):** Starts near the bottom, grows steadily, and finishes highest.
    * **Standard RL + RSA (Green):** Starts highest, plateaus after Step 6, and finishes in the middle.
    * **Base + RSA (Blue):** Starts lowest, climbs, then plateaus or declines slightly after Step 6.
* **Key Data Points (Approximate):**
    * Step 1: Green (~0.47), Orange (~0.43), Blue (~0.42).
    * Step 10: Orange (~0.55), Green (~0.54), Blue (~0.52).
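
The key data points listed for the five charts can be turned into Step 1 to Step 10 gains with a few lines of Python. This is a minimal sketch: the numbers are the approximate readings transcribed above, not exact values, and the `readings` layout and `gains` helper are illustrative names. On these readings, Standard RL + RSA has the smallest gain on every benchmark, tied with Aggregation-aware RL + RSA on Reasoning Gym Games.

```python
# Approximate (Step 1, Step 10) Pass@1 readings from the key data points.
# "All start near X" is entered as X for every configuration.
readings = {
    "HMMT-25": {
        "Base": (0.28, 0.48), "Standard RL": (0.28, 0.45), "Aggregation-aware RL": (0.28, 0.56)},
    "Reasoning Gym Games": {
        "Base": (0.54, 0.69), "Standard RL": (0.53, 0.66), "Aggregation-aware RL": (0.58, 0.71)},
    "AIME-25": {
        "Base": (0.45, 0.73), "Standard RL": (0.45, 0.68), "Aggregation-aware RL": (0.45, 0.69)},
    "LiveCodeBench-v6": {
        "Base": (0.49, 0.57), "Standard RL": (0.49, 0.56), "Aggregation-aware RL": (0.51, 0.59)},
    "Reasoning Gym Cognition + ARC": {
        "Base": (0.42, 0.52), "Standard RL": (0.47, 0.54), "Aggregation-aware RL": (0.43, 0.55)},
}


def gains(table):
    """Absolute Pass@1 gain from Step 1 to Step 10, rounded to the 2 decimals the readings support."""
    return {
        bench: {cfg: round(end - start, 2) for cfg, (start, end) in cfgs.items()}
        for bench, cfgs in table.items()
    }


gains_by_bench = gains(readings)
for bench, per_cfg in gains_by_bench.items():
    row = ", ".join(f"{cfg}: +{gain:.2f}" for cfg, gain in per_cfg.items())
    print(f"{bench}: {row}")
```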
---
## 3. Summary of Findings
On four of the five benchmarks (**HMMT-25, Reasoning Gym Games, LiveCodeBench-v6, and Reasoning Gym Cognition + ARC**), the **Aggregation-aware RL + RSA** (Orange Diamond) configuration reaches the highest Pass@1 by Step 10, ahead of both Standard RL and the Base model. The margin is largest on HMMT-25 (~0.56 versus ~0.48 and ~0.45) and narrow on the other three (about 0.01 to 0.02 over the runner-up).

The **AIME-25** benchmark is the sole exception: there the **Base + RSA** (Blue Circle) configuration overtakes both RL-tuned versions after Step 4. **Standard RL + RSA** (Green Square) finishes lowest on four of the five benchmarks (all except Reasoning Gym Cognition + ARC, where it finishes in the middle), which is consistent with the figure's claim that standard RL hurts RSA performance.
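
The counts in this summary can be checked mechanically. The sketch below assumes the approximate Step 10 readings listed per chart and uses illustrative variable names; it finds the highest and lowest configuration on each benchmark and tallies them.

```python
from collections import Counter

# Approximate Step 10 Pass@1 readings, one entry per chart.
step10 = {
    "HMMT-25": {"Base": 0.48, "Standard RL": 0.45, "Aggregation-aware RL": 0.56},
    "Reasoning Gym Games": {"Base": 0.69, "Standard RL": 0.66, "Aggregation-aware RL": 0.71},
    "AIME-25": {"Base": 0.73, "Standard RL": 0.68, "Aggregation-aware RL": 0.69},
    "LiveCodeBench-v6": {"Base": 0.57, "Standard RL": 0.56, "Aggregation-aware RL": 0.59},
    "Reasoning Gym Cognition + ARC": {"Base": 0.52, "Standard RL": 0.54, "Aggregation-aware RL": 0.55},
}

# Highest- and lowest-scoring configuration per benchmark.
best = {bench: max(scores, key=scores.get) for bench, scores in step10.items()}
worst = {bench: min(scores, key=scores.get) for bench, scores in step10.items()}

best_counts = Counter(best.values())
worst_counts = Counter(worst.values())

print("Highest at Step 10:", dict(best_counts))
print("Lowest at Step 10:", dict(worst_counts))
```

With these readings, Aggregation-aware RL leads on four benchmarks and Base leads on AIME-25, while Standard RL is lowest on four and Base is lowest on Reasoning Gym Cognition + ARC. Several margins are only 0.01 to 0.02, so the ranking is sensitive to how precisely the curves were read.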