## Box Plot: Average Harmlessness Scores Across Harm Categories
### Overview
The image displays grouped box plots comparing the harmlessness scores of four models (SFT, beaver-7b-v3.0, SACPO (P), and RSA (P)) across eight harm categories: Crime, Emotional Harm, Immoral, Insult, Physical Harm, Pornographic, Privacy, and Social Bias. The y-axis shows the average harmlessness score (0–10), and the x-axis lists the harm categories. Each category group contains four colored boxes, one per model, and a legend on the right maps colors to models.
### Components/Axes
- **Title**: "Average Harmlessness Scores Across Harm Categories"
- **Y-Axis**: "Average Harmlessness Score" (0–10, linear scale)
- **X-Axis**: Harm categories (Crime, Emotional Harm, Immoral, Insult, Physical Harm, Pornographic, Privacy, Social Bias)
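- **Legend**:
  - Gray: SFT
  - Pink: beaver-7b-v3.0
  - Green: SACPO (P)
  - Red: RSA (P)
- **Data Points**:
  - Each box shows the median (bold line) and the interquartile range (box edges at the first and third quartiles); outliers are drawn as circles.
  - Whiskers extend to the most extreme data points within 1.5×IQR of the quartiles; points beyond that range are plotted as outliers.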
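
The whisker and outlier convention described above can be made concrete with a short sketch. It assumes the common Tukey rule (fences at 1.5×IQR beyond the quartiles, whiskers drawn to the most extreme points inside the fences). The function name `box_stats` and the sample scores are hypothetical and are not data from the figure.

```python
from statistics import quantiles


def box_stats(scores, k=1.5):
    """Return median, quartiles, whisker ends, and outliers (Tukey rule)."""
    data = sorted(scores)
    # "inclusive" interpolates between observed points, treating the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme points that lie inside the fences.
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical harmlessness scores on the 0-10 scale (illustration only).
sample = [1, 1, 1.5, 1.5, 2, 2, 2, 9]
stats = box_stats(sample)
print(stats)
```

Plotting libraries may compute quartiles with a different interpolation method, so box edges redrawn from raw scores can differ slightly from this sketch.
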
### Detailed Analysis
1. **Crime**:
   - SFT: Median ~1.5 (IQR: 1–2), outliers at 3 and 4.
   - beaver-7b-v3.0: Median ~3 (IQR: 2.5–3.5), outliers at 4 and 5.
   - SACPO (P): Median ~5 (IQR: 4.5–5.5), outliers at 6 and 7.
   - RSA (P): Median ~7 (IQR: 6.5–7.5), outliers at 8 and 9.
2. **Emotional Harm**:
   - SFT: Median ~4 (IQR: 3–5), outliers at 6 and 7.
   - beaver-7b-v3.0: Median ~6 (IQR: 5–7), outliers at 8 and 9.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~8 (IQR: 7–9), outlier at 10.
3. **Immoral**:
   - SFT: Median ~3 (IQR: 2–4), outliers at 5 and 6.
   - beaver-7b-v3.0: Median ~5 (IQR: 4–6), outliers at 7 and 8.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~9 (IQR: 8–10), one outlier read at ~11, which lies above the 0–10 axis range stated for the plot.
4. **Insult**:
   - SFT: Median ~5 (IQR: 4–6), outliers at 7 and 8.
   - beaver-7b-v3.0: Median ~6 (IQR: 5–7), outliers at 8 and 9.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~8 (IQR: 7–9), outlier at 10.
5. **Physical Harm**:
   - SFT: Median ~4 (IQR: 3–5), outliers at 6 and 7.
   - beaver-7b-v3.0: Median ~6 (IQR: 5–7), outliers at 8 and 9.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~8 (IQR: 7–9), outlier at 10.
6. **Pornographic**:
   - SFT: Median ~5 (IQR: 4–6), outliers at 7 and 8.
   - beaver-7b-v3.0: Median ~6 (IQR: 5–7), outliers at 8 and 9.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~8 (IQR: 7–9), outlier at 10.
7. **Privacy**:
   - SFT: Median ~3 (IQR: 2–4), outliers at 5 and 6.
   - beaver-7b-v3.0: Median ~5 (IQR: 4–6), outliers at 7 and 8.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~9 (IQR: 8–10), one outlier read at ~11, which lies above the 0–10 axis range stated for the plot.
8. **Social Bias**:
   - SFT: Median ~4 (IQR: 3–5), outliers at 6 and 7.
   - beaver-7b-v3.0: Median ~6 (IQR: 5–7), outliers at 8 and 9.
   - SACPO (P): Median ~7 (IQR: 6–8), outliers at 9 and 10.
   - RSA (P): Median ~8 (IQR: 7–9), outlier at 10.
### Key Observations
- **Model Performance**: The median ordering is the same in all eight categories: SFT < beaver-7b-v3.0 < SACPO (P) < RSA (P). SACPO (P) and RSA (P) therefore achieve the highest harmlessness scores throughout, indicating better mitigation of harmful outputs.
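- **Outliers**: All four models show outliers above the box in every category, for example Crime (SFT: 3 and 4) and Immoral (beaver-7b-v3.0: 7 and 8). Within each category the IQR widths are the same across models (about 1 point in Crime, about 2 points elsewhere), so the readings do not show a clear difference in variability between models.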
- **Trends**:
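  - SACPO (P) and RSA (P) outperform the other two models in every category. The median gap between RSA (P) and SFT is largest in Immoral and Privacy (~6 points each) and Crime (~5.5 points), and smallest in Insult and Pornographic (~3 points each).
  - SFT has the lowest median in all categories, with its lowest scores in Crime (~1.5), Immoral (~3), and Privacy (~3).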
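
The ranking and gap statements above can be checked against the approximate medians listed under Detailed Analysis. The following is a minimal sketch: the values are the rounded readings from this description, not the underlying data, and the names `MEDIANS`, `strictly_increasing`, `gaps`, and `largest` are illustrative.

```python
# Approximate medians read from the plot description, in model order:
# SFT, beaver-7b-v3.0, SACPO (P), RSA (P).
MODELS = ["SFT", "beaver-7b-v3.0", "SACPO (P)", "RSA (P)"]
MEDIANS = {
    "Crime": [1.5, 3, 5, 7],
    "Emotional Harm": [4, 6, 7, 8],
    "Immoral": [3, 5, 7, 9],
    "Insult": [5, 6, 7, 8],
    "Physical Harm": [4, 6, 7, 8],
    "Pornographic": [5, 6, 7, 8],
    "Privacy": [3, 5, 7, 9],
    "Social Bias": [4, 6, 7, 8],
}


def strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Does every category rank the models in the same order?
same_ranking = all(strictly_increasing(m) for m in MEDIANS.values())

# Median gap between the best (RSA (P)) and worst (SFT) model per category.
gaps = {cat: m[-1] - m[0] for cat, m in MEDIANS.items()}

# Three categories with the largest gap (ties keep their listed order).
largest = sorted(gaps, key=gaps.get, reverse=True)[:3]

print(same_ranking, largest)
```

Because the medians are visual estimates, the gaps indicate only rough magnitude.
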
### Interpretation
The data suggest that SACPO (P) and RSA (P) are more effective at reducing harmful outputs than SFT and beaver-7b-v3.0. The plot does not identify the cause; the gap could reflect differences in training data, architectural design, or post-processing techniques. The lower scores for SFT and beaver-7b-v3.0, with SFT's median as low as ~1.5 in Crime, highlight potential risks in deploying these models in safety-critical applications. The outliers show that individual scores can deviate noticeably from a model's typical score, which emphasizes the need for robustness testing. SACPO (P) and RSA (P) rank above the other two models in all eight categories, which suggests their advantage holds across diverse harm scenarios.