Image ade7d2dcc9b0...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: RSA Performance Benchmarks

This document extracts the data and trends from a set of performance charts comparing RSA (Recursive Self-Aggregation) against a baseline model on five benchmarks.

## 1. General Metadata and Global Components

*   **Primary Language:** English
*   **Central Annotation Box:** "RSA consistently outperforms the reference model, improving monotonically with additional refinement steps. Larger aggregation sizes $K$ further amplify these gains."
*   **Global Legend (Located Top-Center):**
    *   **Aggregation size $K$:**
        *   **Blue Circle (●):** $K = 1$
        *   **Yellow Square (■):** $K = 2$
        *   **Green Triangle (▲):** $K = 3$
        *   **Orange Diamond (◆):** $K = 4$
*   **Baseline Reference:** A horizontal dashed grey line labeled **"Qwen3-4B-Instruct"** is present in every chart, representing the base model performance without RSA.
*   **Axes:**
    *   **X-axis:** "RSA Step" (Values ranging from 1 to 10).
    *   **Y-axis:** "Pass@1" (Accuracy metric, scale varies by benchmark).

---

## 2. Benchmark Analysis (Component Isolation)

### A. HMMT-25
*   **Y-axis Range:** 0.25 to 0.50.
*   **Baseline (Qwen3-4B-Instruct):** Approximately 0.28.
*   **Trend Verification:** All RSA series ($K=1$ to $4$) show a steep upward slope from Step 1 to Step 4, followed by a plateauing effect. Higher $K$ values consistently sit above lower $K$ values.
*   **Key Data Points:**
    *   At Step 1, all series start near 0.27–0.30.
    *   By Step 10, $K=1$ reaches ~0.39, while $K=3$ and $K=4$ converge at the highest performance of ~0.49.

### B. Reasoning Gym Games
*   **Y-axis Range:** 0.55 to 0.70.
*   **Baseline (Qwen3-4B-Instruct):** Approximately 0.54.
*   **Trend Verification:** Rapid improvement between Step 1 and Step 2, followed by a gradual increase. $K=3$ and $K=4$ are nearly identical in performance, significantly higher than $K=1$.
*   **Key Data Points:**
    *   Step 1: All start at ~0.54.
    *   Step 10: $K=1$ reaches ~0.65; $K=4$ reaches the peak at ~0.69.

### C. AIME-25
*   **Y-axis Range:** 0.4 to 0.7.
*   **Baseline (Qwen3-4B-Instruct):** Approximately 0.44.
*   **Trend Verification:** This chart shows the most significant separation between aggregation sizes. $K=4$ (Orange) slopes sharply upward, maintaining a clear lead over all other series.
*   **Key Data Points:**
    *   Step 1: All start at ~0.44.
    *   Step 10: $K=1$ (Blue) is ~0.53; $K=2$ (Yellow) is ~0.65; $K=3$ (Green) is ~0.67; $K=4$ (Orange) reaches the highest peak at ~0.73.
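
The separation described for AIME-25 can be checked with a little arithmetic. The sketch below uses only the approximate Step-10 values and baseline listed above (chart readings, not exact measurements); the function names are illustrative and not part of any RSA codebase.

```python
# Approximate AIME-25 values read off the chart (see the bullets above).
BASELINE = 0.44                                   # Qwen3-4B-Instruct, no RSA
STEP10 = {1: 0.53, 2: 0.65, 3: 0.67, 4: 0.73}     # K -> Pass@1 at Step 10


def gains_over_baseline(step10, baseline):
    """Absolute Pass@1 gain of each aggregation size K over the baseline."""
    return {k: round(v - baseline, 3) for k, v in sorted(step10.items())}


def is_strictly_increasing_in_k(step10):
    """True if Pass@1 rises with every increase in K."""
    values = [step10[k] for k in sorted(step10)]
    return all(a < b for a, b in zip(values, values[1:]))


gains = gains_over_baseline(STEP10, BASELINE)
for k, gain in gains.items():
    print(f"K={k}: +{gain:.2f} over baseline")
print("strictly increasing in K:", is_strictly_increasing_in_k(STEP10))
```

On these readings the gain roughly triples between $K=1$ and $K=4$, which matches the "most significant separation" noted in the trend description.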

### D. LiveCodeBench-v6
*   **Y-axis Range:** 0.50 to 0.56.
*   **Baseline (Qwen3-4B-Instruct):** Approximately 0.495.
*   **Trend Verification:** $K=1$ shows very marginal improvement compared to the baseline. $K=2, 3, 4$ show strong monotonic growth.
*   **Key Data Points:**
    *   Step 1: All start at ~0.495.
    *   Step 10: $K=1$ stays low at ~0.51; $K=4$ reaches the maximum of ~0.565.

### E. Reasoning Gym Cognition + ARC
*   **Y-axis Range:** 0.425 to 0.525.
*   **Baseline (Qwen3-4B-Instruct):** Approximately 0.43.
*   **Trend Verification:** All series improve until Step 6, after which they plateau or show slight variance. Interestingly, $K=2$ (Yellow) performs slightly worse than $K=1$ (Blue) at later steps in this specific benchmark.
*   **Key Data Points:**
    *   Step 1: All start at ~0.425.
    *   Step 6: $K=4$ reaches its peak at ~0.525.
    *   Step 10: $K=4$ remains highest at ~0.52; $K=2$ and $K=1$ are lower, around 0.48–0.49.

---

## 3. Summary of Findings

| Feature | Observation |
| :--- | :--- |
| **RSA Step Impact** | Performance increases as the number of RSA steps increases, typically plateauing between steps 6 and 10. |
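| **Aggregation Size ($K$)** | Increasing $K$ from 1 to 4 generally leads to higher Pass@1 scores. The exception is Reasoning Gym Cognition + ARC, where $K=2$ falls slightly below $K=1$ at later steps. |
| **Baseline Comparison** | By Step 10, every RSA configuration (even $K=1$) exceeds the base `Qwen3-4B-Instruct` model on all five benchmarks; at Step 1 the series start at roughly the baseline level. |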
| **Consistency** | The "AIME-25" and "LiveCodeBench-v6" benchmarks show the cleanest monotonic separation between different $K$ values. |
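
To put the summary in numbers, the following sketch computes the Step-10 gain of $K=4$ over the baseline for each benchmark and ranks the benchmarks by that gain. All values are the approximate chart readings from the Key Data Points in Section 2; the variable and function names are illustrative assumptions.

```python
# benchmark -> (baseline Pass@1, K=4 Pass@1 at Step 10), approximate chart readings
K4_STEP10 = {
    "HMMT-25": (0.28, 0.49),
    "Reasoning Gym Games": (0.54, 0.69),
    "AIME-25": (0.44, 0.73),
    "LiveCodeBench-v6": (0.495, 0.565),
    "Reasoning Gym Cognition + ARC": (0.43, 0.52),
}


def k4_gains(data):
    """Absolute Pass@1 gain of K=4 at Step 10 over the baseline, per benchmark."""
    return {name: round(k4 - base, 3) for name, (base, k4) in data.items()}


benchmark_gains = k4_gains(K4_STEP10)
# Rank benchmarks from largest to smallest gain.
ranked = sorted(benchmark_gains.items(), key=lambda item: item[1], reverse=True)
for name, gain in ranked:
    print(f"{name}: +{gain:.3f}")
```

Because the inputs are eyeballed from the plots, the ranking is more trustworthy than the individual differences.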

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: RSA Model Performance Analysis

## Chart Structure Overview
The image contains **five line charts** and a central text box arranged in a 2x3 grid, each chart visualizing RSA performance on a different benchmark across aggregation sizes. Key elements include:
- **X-axis**: "RSA Step" (1–10)
- **Y-axis**: "Pass@1" metric (varying ranges per chart)
- **Lines**: Color-coded by aggregation size (K=1, 2, 3, 4)
- **Shaded Regions**: Confidence intervals for each line
- **Dashed Line**: "Qwen3-4B-Instruct" baseline

---

## Legend & Color Mapping
- **Location**: Center-top of the grid
- **Colors**:
  - `blue` = K=1
  - `orange` = K=2
  - `green` = K=3
  - `red` = K=4

---

## Chart-Specific Analysis

### 1. HMMT-25
- **Y-axis Range**: 0.25–0.50
- **Trends**:
  - All lines slope **upward** (K=4 > K=3 > K=2 > K=1).
  - "Qwen3-4B-Instruct" (dashed black) remains **below all lines**.
- **Key Data Points**:
  - At RSA Step 10: K=4 ≈ 0.49, K=1 ≈ 0.39.

### 2. AIME-25
- **Y-axis Range**: 0.4–0.7
- **Trends**:
  - All lines slope **upward** (K=4 > K=3 > K=2 > K=1).
  - "Qwen3-4B-Instruct" (dashed black) remains **below all lines**.
- **Key Data Points**:
  - At RSA Step 10: K=4 ≈ 0.68, K=1 ≈ 0.52.

### 3. LiveCodeBench-v6
- **Y-axis Range**: 0.5–0.6
- **Trends**:
  - All lines slope **upward** (K=4 > K=3 > K=2 > K=1).
  - "Qwen3-4B-Instruct" (dashed black) remains **below all lines**.
- **Key Data Points**:
  - At RSA Step 10: K=4 ≈ 0.59, K=1 ≈ 0.52.

### 4. Reasoning Gym Games
- **Y-axis Range**: 0.55–0.7
- **Trends**:
  - All lines slope **upward** (K=4 > K=3 > K=2 > K=1).
  - "Qwen3-4B-Instruct" (dashed black) remains **below all lines**.
- **Key Data Points**:
  - At RSA Step 10: K=4 ≈ 0.69, K=1 ≈ 0.64.

### 5. Reasoning Gym Cognition + ARC
- **Y-axis Range**: 0.425–0.525
- **Trends**:
  - All lines slope **upward** (K=4 > K=3 > K=2 > K=1).
  - "Qwen3-4B-Instruct" (dashed black) remains **below all lines**.
- **Key Data Points**:
  - At RSA Step 10: K=4 ≈ 0.51, K=1 ≈ 0.47.
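
Since this is the second of two independent readings of the same figure, the Step-10 values can be cross-checked against the first extraction earlier in this document. The sketch below flags pairs that differ by more than a tolerance. The tolerance of 0.02 is an arbitrary assumption, and the first extraction's K=1 value for Reasoning Gym Cognition + ARC is omitted because it is given only as a range (0.48–0.49).

```python
# Step-10 Pass@1 readings, benchmark -> {K: value}; all values are approximate.
FIRST = {
    "HMMT-25": {1: 0.39, 4: 0.49},
    "AIME-25": {1: 0.53, 4: 0.73},
    "LiveCodeBench-v6": {1: 0.51, 4: 0.565},
    "Reasoning Gym Games": {1: 0.65, 4: 0.69},
    "Reasoning Gym Cognition + ARC": {4: 0.52},  # K=1 given only as a range
}
SECOND = {
    "HMMT-25": {1: 0.39, 4: 0.49},
    "AIME-25": {1: 0.52, 4: 0.68},
    "LiveCodeBench-v6": {1: 0.52, 4: 0.59},
    "Reasoning Gym Games": {1: 0.64, 4: 0.69},
    "Reasoning Gym Cognition + ARC": {1: 0.47, 4: 0.51},
}
TOLERANCE = 0.02  # assumed reading error for eyeballed chart values


def disagreements(first, second, tolerance):
    """List (benchmark, K, |difference|) where the two readings diverge."""
    flagged = []
    for benchmark, readings in first.items():
        for k, value in readings.items():
            other = second.get(benchmark, {}).get(k)
            if other is not None and abs(value - other) > tolerance:
                flagged.append((benchmark, k, round(abs(value - other), 3)))
    return flagged


flagged = disagreements(FIRST, SECOND, TOLERANCE)
for benchmark, k, diff in flagged:
    print(f"{benchmark}, K={k}: readings differ by {diff}")
```

With this tolerance only the K=4 readings for AIME-25 and LiveCodeBench-v6 are flagged, so those two values are the ones worth re-reading from the original figure.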

---

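## Embedded Text Box (Central)

The central text box reads: "RSA consistently outperforms the reference model, improving monotonically with additional refinement steps. Larger aggregation sizes K further amplify these gains."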

TECHNICAL ASSET FINGERPRINT

ade7d2dcc9b01146aee559b4