## Horizontal Stacked Bar Chart: Self-Rewarding vs. SFT Baseline Comparison
### Overview
The image displays a horizontal stacked bar chart comparing the performance of two "Self-Rewarding" models (M³ and M²) against a "SFT Baseline" model. The chart quantifies outcomes into three categories: wins for the Self-Rewarding model, ties, and wins for the SFT Baseline. The data is presented as percentages.
### Components/Axes
* **Legend:** Positioned at the top center of the chart. It defines three color-coded categories:
  * **Green:** "Self-Rewarding Wins"
  * **Light Blue:** "Tie"
  * **Red:** "SFT Baseline Wins"
* **Bars:** Two horizontal stacked bars, one above the other, both aligned to the left.
  * **Top Bar:** Labeled "Self-Rewarding M³ vs. SFT Baseline" on the left.
  * **Bottom Bar:** Labeled "Self-Rewarding M² vs. SFT Baseline" on the left.
* **Data Labels:** Numerical percentage values are embedded within each colored segment of the bars.
### Detailed Analysis
The chart presents the following precise data points for each comparison:
**1. Self-Rewarding M³ vs. SFT Baseline (Top Bar)**
* **Self-Rewarding Wins (Green, left segment):** 50.4%
* **Tie (Light Blue, middle segment):** 32.8%
* **SFT Baseline Wins (Red, right segment):** 16.8%
**2. Self-Rewarding M² vs. SFT Baseline (Bottom Bar)**
* **Self-Rewarding Wins (Green, left segment):** 46.5%
* **Tie (Light Blue, middle segment):** 34.8%
* **SFT Baseline Wins (Red, right segment):** 18.8%
**Visual Trend Verification:**
* In both bars, the green segment ("Self-Rewarding Wins") is the largest, occupying roughly half of the total bar length (50.4% for M³, 46.5% for M²).
* The light blue segment ("Tie") is the second-largest.
* The red segment ("SFT Baseline Wins") is the smallest in both cases.
* The total length of each bar represents 100% (50.4 + 32.8 + 16.8 = 100.0 for M³; 46.5 + 34.8 + 18.8 = 100.1 for M², with the minor discrepancy likely due to rounding).
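The checks in this list can be reproduced with a short script. This is a minimal sketch: the dictionary keys and function names are illustrative choices, not labels from the chart, and the values are the percentages read from the data labels.

```python
# Percentages as read from the chart's data labels.
# "M3" and "M2" stand for the M³ and M² bars; the key names are illustrative.
results = {
    "M3": {"self_rewarding_wins": 50.4, "tie": 32.8, "sft_baseline_wins": 16.8},
    "M2": {"self_rewarding_wins": 46.5, "tie": 34.8, "sft_baseline_wins": 18.8},
}


def total(segments):
    """Sum the three segments, rounded to one decimal like the chart labels."""
    return round(sum(segments.values()), 1)


def ordering_holds(segments):
    """Check that wins > ties > baseline wins for one bar."""
    return (
        segments["self_rewarding_wins"]
        > segments["tie"]
        > segments["sft_baseline_wins"]
    )


totals = {name: total(seg) for name, seg in results.items()}
orderings = {name: ordering_holds(seg) for name, seg in results.items()}

for name in results:
    print(name, "total:", totals[name], "ordering holds:", orderings[name])
```

The M² total of 100.1 is what one would expect when each label is rounded independently to one decimal place.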
### Key Observations
1. **Dominant Performance:** The Self-Rewarding model achieves a higher win rate than the SFT Baseline in both comparisons (50.4% vs. 16.8% for M³; 46.5% vs. 18.8% for M²).
2. **Significant Tie Rate:** A substantial portion of outcomes are ties: approximately one-third of all comparisons (32.8% for M³, 34.8% for M²).
3. **Model Version Comparison:** The M³ version of the Self-Rewarding model shows a stronger performance (50.4% wins) against the baseline compared to the M² version (46.5% wins). Correspondingly, the M³ version has a slightly lower tie rate and a lower baseline win rate.
4. **Consistent Hierarchy:** The order of performance (Self-Rewarding Wins > Ties > SFT Baseline Wins) is consistent across both model versions tested.
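The version-to-version differences in observation 3 can be expressed in percentage points. The following is a small sketch using the values from the chart; the variable and key names are illustrative.

```python
# Segment percentages for each bar, as read from the chart.
m3 = {"self_rewarding_wins": 50.4, "tie": 32.8, "sft_baseline_wins": 16.8}
m2 = {"self_rewarding_wins": 46.5, "tie": 34.8, "sft_baseline_wins": 18.8}

# Change from M2 to M3 in percentage points (positive means M3 is higher).
delta = {key: round(m3[key] - m2[key], 1) for key in m3}

for key, value in delta.items():
    print(f"{key}: {value:+.1f} pp")
```

With these inputs, the win rate rises by 3.9 points while the tie rate and the baseline win rate each fall by 2.0 points.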
### Interpretation
This chart provides a quantitative comparison that favors the "Self-Rewarding" training or modeling approach over a standard "SFT (Supervised Fine-Tuning) Baseline." In both comparisons, the Self-Rewarding model is judged superior more than twice as often as the baseline: about three times as often for M³ (50.4% vs. 16.8%) and about 2.5 times as often for M² (46.5% vs. 18.8%).
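The "how many times as often" comparison can be made concrete with two derived quantities: the ratio of wins to losses, and the win share once ties are set aside. Neither quantity appears in the chart itself; this is a sketch of one way to summarize the numbers, with illustrative function names.

```python
# (self-rewarding wins %, SFT baseline wins %) per comparison, from the chart.
outcomes = {
    "M3": (50.4, 16.8),
    "M2": (46.5, 18.8),
}


def win_loss_ratio(wins, losses):
    """How many times as often the Self-Rewarding model wins as it loses."""
    return round(wins / losses, 2)


def win_share_excluding_ties(wins, losses):
    """Percentage of decided (non-tie) comparisons won by the Self-Rewarding model."""
    return round(100 * wins / (wins + losses), 1)


ratios = {name: win_loss_ratio(w, l) for name, (w, l) in outcomes.items()}
shares = {name: win_share_excluding_ties(w, l) for name, (w, l) in outcomes.items()}

for name in outcomes:
    print(name, "win/loss ratio:", ratios[name], "win share excl. ties (%):", shares[name])
```

Under these definitions, M³ wins 3.0 times as often as it loses (75.0% of decided comparisons), and M² wins about 2.47 times as often (71.2% of decided comparisons).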
The high tie rate is a critical finding. It indicates that in a large fraction of cases (~33-35%), the two models produce outputs of comparable quality or are indistinguishable based on the evaluation criteria. This could imply that the baseline is still competitive in many scenarios, or that the evaluation metric has a threshold that both models frequently meet.
The comparison between M³ and M² is consistent with iterative improvement: against the same baseline, M³ has a higher win rate (50.4% vs. 46.5%), a lower tie rate (32.8% vs. 34.8%), and a lower loss rate (16.8% vs. 18.8%). In a technical report, a chart like this would support the case for the Self-Rewarding approach while giving a more nuanced view than a simple win/loss ratio, because ties are reported separately.