## Horizontal Stacked Bar Chart: Self-Rewarding Model vs. SFT Baseline Performance

### Overview
The image displays a horizontal stacked bar chart comparing three versions of a "Self-Rewarding" model (M₁, M₂, M₃) against a common "SFT Baseline" model. For each comparison, the chart reports the percentage of Self-Rewarding wins, ties, and SFT Baseline wins.

### Components/Axes
*   **Chart Type:** Horizontal Stacked Bar Chart.
*   **Y-Axis (Vertical):** Lists the three comparison pairs. From top to bottom:
    1.  `Self-Rewarding M₃ vs. SFT Baseline`
    2.  `Self-Rewarding M₂ vs. SFT Baseline`
    3.  `Self-Rewarding M₁ vs. SFT Baseline`
*   **X-Axis (Horizontal):** Implicitly represents percentage (0-100%), though no numerical axis labels are present. The total length of each bar represents 100% of the comparison outcomes.
*   **Legend:** Positioned at the top of the chart.
    *   **Green Square:** `Self-Rewarding Wins`
    *   **Light Blue Square:** `Tie`
    *   **Red Square:** `SFT Baseline Wins`
*   **Data Labels:** Percentage values are embedded directly within each colored segment of the bars.

### Detailed Analysis
Each bar is segmented into three colored parts corresponding to the legend. The values are transcribed as follows:

1.  **Top Bar: Self-Rewarding M₃ vs. SFT Baseline**
    *   **Green Segment (Left):** `66.0` (Self-Rewarding Wins)
    *   **Light Blue Segment (Center):** `16.0` (Tie)
    *   **Red Segment (Right):** `18.0` (SFT Baseline Wins)
    *   **Trend:** The green segment is dominant, occupying nearly two-thirds of the bar, indicating a strong performance advantage for the M₃ model.

2.  **Middle Bar: Self-Rewarding M₂ vs. SFT Baseline**
    *   **Green Segment (Left):** `56.0` (Self-Rewarding Wins)
    *   **Light Blue Segment (Center):** `24.0` (Tie)
    *   **Red Segment (Right):** `20.0` (SFT Baseline Wins)
    *   **Trend:** The green segment is still the largest, but smaller than in M₃ (56.0 vs. 66.0). The tie segment is larger than in M₃ (24.0 vs. 16.0).

3.  **Bottom Bar: Self-Rewarding M₁ vs. SFT Baseline**
    *   **Green Segment (Left):** `28.0` (Self-Rewarding Wins)
    *   **Light Blue Segment (Center):** `26.0` (Tie)
    *   **Red Segment (Right):** `46.0` (SFT Baseline Wins)
    *   **Trend:** The red segment is the largest, indicating that the SFT Baseline wins more often than the M₁ model. The green segment (28.0) is the smallest green segment across the three bars.
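
As a quick sanity check on the transcription, the three segments of each bar should sum to 100%. The sketch below encodes the values listed above; the dictionary layout and the helper name `segment_total` are assumptions made for illustration and are not part of the chart.

```python
# Percentages transcribed from the data labels in the chart.
# Tuple order is an assumption of this sketch:
# (self_rewarding_wins, tie, sft_baseline_wins)
results = {
    "M1": (28.0, 26.0, 46.0),
    "M2": (56.0, 24.0, 20.0),
    "M3": (66.0, 16.0, 18.0),
}


def segment_total(segments):
    """Return the sum of the three segments of one stacked bar."""
    return sum(segments)


# Each stacked bar should account for 100% of the comparison outcomes.
totals = {model: segment_total(segments) for model, segments in results.items()}

for model, total in totals.items():
    print(f"{model}: segments sum to {total:.1f}%")
```

If any bar did not sum to 100, that would point to a misread data label rather than a property of the underlying evaluation.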

### Key Observations
*   **Clear Performance Gradient:** There is a stark, monotonic improvement in the Self-Rewarding model's performance as the version number increases from M₁ to M₃.
*   **M₁ Underperformance:** The Self-Rewarding M₁ model loses to the SFT Baseline more often than it wins (46.0% vs. 28.0%).
*   **M₃ Dominance:** The Self-Rewarding M₃ model achieves a decisive win rate of 66.0%, losing only 18.0% of the time.
*   **Tie Rate Variation:** The percentage of ties is not constant. It is lowest for M₃ (16.0%) and highest for M₁ (26.0%), with M₂ in between (24.0%).
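
These observations can be checked directly against the transcribed values. The following is a minimal sketch; the variable names and the "net margin" measure (win percentage minus loss percentage) are assumptions introduced here for illustration, not quantities shown in the chart.

```python
# Percentages transcribed from the chart's data labels.
# Tuple order (an assumption of this sketch):
# (self_rewarding_wins, tie, sft_baseline_wins)
results = {
    "M1": (28.0, 26.0, 46.0),
    "M2": (56.0, 24.0, 20.0),
    "M3": (66.0, 16.0, 18.0),
}
order = ["M1", "M2", "M3"]

wins = [results[m][0] for m in order]
losses = [results[m][2] for m in order]

# Monotonic trend: win rate strictly rises, loss rate strictly falls.
win_rate_increasing = all(a < b for a, b in zip(wins, wins[1:]))
loss_rate_decreasing = all(a > b for a, b in zip(losses, losses[1:]))

# Net margin in percentage points: negative means the baseline is ahead.
net_margin = {m: results[m][0] - results[m][2] for m in order}

# Models with the highest and lowest tie rates.
highest_tie = max(order, key=lambda m: results[m][1])
lowest_tie = min(order, key=lambda m: results[m][1])

print("win rate increasing:", win_rate_increasing)
print("loss rate decreasing:", loss_rate_decreasing)
print("net margin:", net_margin)
print("highest tie rate:", highest_tie, "| lowest tie rate:", lowest_tie)
```

The net margin moves from negative for M₁ to positive for M₂ and M₃, which is the same win/loss inversion read off the bars.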

### Interpretation
This chart suggests that an iterative model development process, likely involving self-rewarding or reinforcement learning techniques, becomes progressively more effective. The first version (M₁) still loses to the fixed SFT (Supervised Fine-Tuning) Baseline more often than it wins, while the later versions (M₂ and M₃) clearly outperform it.

The inversion of the win/loss ratio from M₁ (where the baseline is superior) to M₃ (where the self-rewarding model is dominant) indicates a successful training or alignment trajectory. The varying tie rates may reflect changes in model confidence or the distinctness of the outputs between versions. The primary takeaway is that the "Self-Rewarding" approach, as embodied in model M₃, yields a substantially more capable model than the standard SFT baseline it was compared against.