# Technical Data Extraction: RSA Performance Benchmarks
This document extracts the data and trends from a set of performance charts comparing RSA (Recursive Self-Aggregation) against a baseline model on five benchmarks.
## 1. General Metadata and Global Components
* **Primary Language:** English
* **Central Annotation Box:** "RSA consistently outperforms the reference model, improving monotonically with additional refinement steps. Larger aggregation sizes $K$ further amplify these gains."
* **Global Legend (Located Top-Center):**
    * **Aggregation size $K$:**
        * **Blue Circle (●):** $K = 1$
        * **Yellow Square (■):** $K = 2$
        * **Green Triangle (▲):** $K = 3$
        * **Orange Diamond (◆):** $K = 4$
* **Baseline Reference:** A horizontal dashed grey line labeled **"Qwen3-4B-Instruct"** is present in every chart, representing the base model performance without RSA.
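* **Axes:**
    * **X-axis:** "RSA Step" (values from 1 to 10).
    * **Y-axis:** "Pass@1" (accuracy metric; the scale varies by benchmark). The per-benchmark ranges listed below appear to be the labeled tick ranges, so a few plotted values fall slightly outside them.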
---
## 2. Benchmark Analysis (Component Isolation)
### A. HMMT-25
* **Y-axis Range:** 0.25 to 0.50.
* **Baseline (Qwen3-4B-Instruct):** Approximately 0.28.
* **Trend Verification:** All RSA series ($K=1$ to $4$) rise steeply from Step 1 to Step 4 and then plateau. Higher $K$ values sit at or above lower $K$ values throughout.
* **Key Data Points:**
    * At Step 1, all series start near 0.27–0.30.
    * By Step 10, $K=1$ reaches ~0.39, while $K=3$ and $K=4$ converge at the highest value, ~0.49.
### B. Reasoning Gym Games
* **Y-axis Range:** 0.55 to 0.70.
* **Baseline (Qwen3-4B-Instruct):** Approximately 0.54.
* **Trend Verification:** Rapid improvement between Step 1 and Step 2, followed by a gradual increase. $K=3$ and $K=4$ are nearly identical and sit clearly above $K=1$.
* **Key Data Points:**
    * Step 1: All series start at ~0.54.
    * Step 10: $K=1$ reaches ~0.65; $K=4$ peaks at ~0.69.
### C. AIME-25
* **Y-axis Range:** 0.4 to 0.7.
* **Baseline (Qwen3-4B-Instruct):** Approximately 0.44.
* **Trend Verification:** This chart shows the widest separation between aggregation sizes. $K=4$ (Orange) slopes sharply upward and keeps a clear lead over all other series.
* **Key Data Points:**
    * Step 1: All series start at ~0.44.
    * Step 10: $K=1$ (Blue) is ~0.53; $K=2$ (Yellow) is ~0.65; $K=3$ (Green) is ~0.67; $K=4$ (Orange) reaches the highest value at ~0.73.
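
As a rough illustration of how the Step 10 readings above translate into gains, the following Python sketch computes the absolute and relative improvement over the baseline for each $K$. The numbers are the approximate chart readings listed in this section, not exact measurements, and the names `BASELINE`, `STEP10`, and `gains_over_baseline` are hypothetical, introduced only for this example.

```python
# Approximate Pass@1 values read off the AIME-25 chart (not exact measurements).
BASELINE = 0.44                                 # Qwen3-4B-Instruct reference line
STEP10 = {1: 0.53, 2: 0.65, 3: 0.67, 4: 0.73}   # K -> Pass@1 at Step 10


def gains_over_baseline(baseline, scores):
    """Return {K: (absolute_gain, relative_gain)} for each aggregation size."""
    return {
        k: (round(score - baseline, 3), round((score - baseline) / baseline, 3))
        for k, score in scores.items()
    }


gains = gains_over_baseline(BASELINE, STEP10)
for k, (abs_gain, rel_gain) in gains.items():
    print(f"K={k}: +{abs_gain:.2f} Pass@1 ({rel_gain:.1%} relative to baseline)")
```

Under these readings, $K=4$ gains about 0.29 Pass@1 over the baseline, a little over three times the gain at $K=1$ (about 0.09).
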
### D. LiveCodeBench-v6
* **Y-axis Range:** 0.50 to 0.56.
* **Baseline (Qwen3-4B-Instruct):** Approximately 0.495.
* **Trend Verification:** $K=1$ shows only marginal improvement over the baseline. $K=2$, $3$, and $4$ show strong monotonic growth.
* **Key Data Points:**
    * Step 1: All series start at ~0.495.
    * Step 10: $K=1$ stays low at ~0.51; $K=4$ reaches the maximum of ~0.565.
### E. Reasoning Gym Cognition + ARC
* **Y-axis Range:** 0.425 to 0.525.
* **Baseline (Qwen3-4B-Instruct):** Approximately 0.43.
* **Trend Verification:** All series improve until Step 6, after which they plateau or fluctuate slightly. In this benchmark, $K=2$ (Yellow) performs slightly worse than $K=1$ (Blue) at later steps.
* **Key Data Points:**
    * Step 1: All series start at ~0.425.
    * Step 6: $K=4$ reaches its peak at ~0.525.
    * Step 10: $K=4$ remains highest at ~0.52; $K=2$ and $K=1$ are lower, around 0.48–0.49.
---
## 3. Summary of Findings
| Feature | Observation |
| :--- | :--- |
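| **RSA Step Impact** | Performance increases with the number of RSA steps. Most of the gain arrives in the early steps, and the curves flatten from roughly Step 4 (HMMT-25) or Step 6 (Reasoning Gym Cognition + ARC) onward. |
| **Aggregation Size ($K$)** | Increasing $K$ from 1 to 4 generally leads to higher Pass@1 scores. The one exception is Reasoning Gym Cognition + ARC, where $K=2$ falls slightly below $K=1$ at later steps. |
| **Baseline Comparison** | At Step 1 every series starts at roughly the baseline level. By Step 10, every reported RSA series (even at $K=1$) exceeds the base `Qwen3-4B-Instruct` model. |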
| **Consistency** | The "AIME-25" and "LiveCodeBench-v6" benchmarks show the cleanest monotonic separation between different $K$ values. |
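
The summary can be sanity-checked against the Step 10 readings reported in the benchmark sections. The sketch below is illustrative only: `RESULTS`, `ordering_holds`, and `k4_gains` are hypothetical names, the values are approximate chart readings, and the $K=1$ value for Reasoning Gym Cognition + ARC is an assumed midpoint of the reported 0.48–0.49 range.

```python
# Approximate chart readings from the benchmark sections, not exact measurements.
# Each entry: (baseline, K=1 at Step 10, K=4 at Step 10).
RESULTS = {
    "HMMT-25": (0.28, 0.39, 0.49),
    "Reasoning Gym Games": (0.54, 0.65, 0.69),
    "AIME-25": (0.44, 0.53, 0.73),
    "LiveCodeBench-v6": (0.495, 0.51, 0.565),
    # K=1 is reported only as a 0.48-0.49 range; 0.485 is an assumed midpoint.
    "Reasoning Gym Cognition + ARC": (0.43, 0.485, 0.52),
}


def ordering_holds(results):
    """For each benchmark, check that baseline < K=1 < K=4 at Step 10."""
    return {name: base < k1 < k4 for name, (base, k1, k4) in results.items()}


def k4_gains(results):
    """Return (benchmark, K=4 gain over baseline) pairs, largest gain first."""
    gains = [(name, round(k4 - base, 3)) for name, (base, _, k4) in results.items()]
    return sorted(gains, key=lambda item: item[1], reverse=True)


checks = ordering_holds(RESULTS)
ranked = k4_gains(RESULTS)
for name, gain in ranked:
    status = "ok" if checks[name] else "violated"
    print(f"{name:32s} K=4 gain: +{gain:.3f}  baseline < K=1 < K=4: {status}")
```

With these readings, the ordering holds on all five benchmarks, and the $K=4$ gain over the baseline is largest on AIME-25 and smallest on LiveCodeBench-v6.
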