## [Multi-Panel Box Plot]: Average Harmfulness Scores by Model and Category
### Overview
The image displays a series of eight adjacent box plots, each representing a different category of potential harm. Within each category, the plots compare the distribution of the "Average Harmfulness Score" (y-axis) for four AI models or methods (x-axis). The overall purpose is to visualize and compare the perceived harmfulness of outputs from these models across various ethical dimensions.
### Components/Axes
* **Chart Type:** Multi-panel (faceted) box plot.
* **Y-Axis (Common to all panels):**
    * **Label:** "Average Harmfulness Score"
    * **Scale:** Linear scale from 0 to 10, with major tick marks at 0, 2, 4, 6, 8, and 10.
* **X-Axis (Within each panel):** Four categorical models/methods.
    * **Labels (from left to right in each panel):** "GPT", "Llama2-7b-Chat-3.0", "SACLPO (P)", "RLAIF (P)".
* **Panel Titles (Top of each subplot, from left to right):**
    1. "Crime"
    2. "Emotional Harm"
    3. "Immoral"
    4. "Insult"
    5. "Physical Harm"
    6. "Pornographic"
    7. "Privacy"
    8. "Social Bias"
* **Legend/Color Mapping:** The color of each box corresponds to the model, consistent across all panels.
    * **GPT:** Grey
    * **Llama2-7b-Chat-3.0:** Light Pink/Lavender
    * **SACLPO (P):** Green
    * **RLAIF (P):** Red
* **Spatial Grounding:** The eight panels are arranged horizontally in a single row. The y-axis label is positioned vertically on the far left. Panel titles are centered above each respective plot. The x-axis labels are rotated approximately 45 degrees for readability.
### Detailed Analysis
The analysis proceeds panel by panel, from left to right. For each, the visual trend (median line, interquartile range (IQR) box, and whiskers) is described before noting approximate key values.
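
For reference, the three elements read off each box (median line, IQR box, whiskers) can be computed from raw scores as sketched below. This is a minimal illustration under assumptions: the figure does not state its whisker rule, so the common Tukey convention (whiskers reach the most extreme points within 1.5 × IQR of the box) is assumed, and `box_stats` and the `scores` list are hypothetical, not values taken from the plot.

```python
import statistics

def box_stats(values, whisker=1.5):
    """Return the median, quartiles, and whisker ends for one box."""
    data = sorted(values)
    # Quartiles with linear interpolation between observed data points.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences
    # (assumed Tukey rule; points beyond the fences count as outliers).
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(v for v in data if v >= low_fence),
        "whisker_high": max(v for v in data if v <= high_fence),
    }

scores = [0, 1, 1, 2, 2, 3, 5, 9]  # hypothetical scores on the 0-10 scale
stats = box_stats(scores)
print(stats)
```

Under this convention the value 9 in the sample lies beyond the upper fence, so it would be drawn as an outlier point rather than extending the whisker.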
**1. Crime**
* **Trend:** GPT shows the lowest harmfulness scores with a tight distribution. The other three models have medians at or above ~9 with compact boxes, although the whiskers of Llama2-7b-Chat-3.0 and SACLPO (P) extend down to ~0.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~1.5, IQR ~0.5-2, range ~0-5.
    * Llama2-7b-Chat-3.0: Median ~9, IQR ~8.5-9.5, range ~0-10.
    * SACLPO (P): Median ~9.5, IQR ~8-10, range ~0-10.
    * RLAIF (P): Median ~9.5, IQR ~9-10, range ~5-10.
**2. Emotional Harm**
* **Trend:** A clear upward trend in median harmfulness from GPT to RLAIF (P). GPT and Llama2 have wider distributions than SACLPO and RLAIF.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~7, IQR ~5-8, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~8, IQR ~6-9, range ~0-10.
    * SACLPO (P): Median ~8.5, IQR ~7-9, range ~2-10.
    * RLAIF (P): Median ~9, IQR ~8-9.5, range ~5-10.
**3. Immoral**
* **Trend:** GPT has a very low median, with its box confined to the low end and whiskers reaching ~7. Llama2-7b-Chat-3.0 and RLAIF (P) cluster at the high end with medians of ~9, while SACLPO (P) sits slightly lower (~8) with a wider IQR.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~1.5, IQR ~0.5-3, range ~0-7.
    * Llama2-7b-Chat-3.0: Median ~9, IQR ~8-10, range ~0-10.
    * SACLPO (P): Median ~8, IQR ~6-9, range ~2-10.
    * RLAIF (P): Median ~9, IQR ~8-9.5, range ~5-10.
**4. Insult**
* **Trend:** Medians increase from GPT to RLAIF (P). GPT has the widest box (IQR ~3-8); Llama2-7b-Chat-3.0 and SACLPO (P) remain fairly wide, while RLAIF (P) is tightly concentrated around ~8.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~6, IQR ~3-8, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~7, IQR ~5-8, range ~0-10.
    * SACLPO (P): Median ~7.5, IQR ~5-9, range ~3-10.
    * RLAIF (P): Median ~8, IQR ~7.5-8.5, range ~5-10.
**5. Physical Harm**
* **Trend:** As in "Emotional Harm," the median rises from GPT to SACLPO (P), but RLAIF (P) then falls back to ~7, level with GPT. Distributions are moderately wide for the first three models and widest for RLAIF (P).
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~7, IQR ~5-8, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~8, IQR ~7-9, range ~0-10.
    * SACLPO (P): Median ~8.5, IQR ~7-9.5, range ~2-10.
    * RLAIF (P): Median ~7, IQR ~3-8, range ~0-10. *(Note: This model shows a wider spread and a lower median here than in most other categories; only its "Pornographic" box is similar.)*
**6. Pornographic**
* **Trend:** GPT has a low median and a very wide spread. Llama2 has a low median but a tighter IQR. SACLPO and RLAIF have higher medians and wide spreads.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~4, IQR ~1-7, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~3, IQR ~1-6, range ~0-10.
    * SACLPO (P): Median ~5, IQR ~2-8, range ~0-10.
    * RLAIF (P): Median ~7, IQR ~3-8, range ~0-10.
**7. Privacy**
* **Trend:** GPT has a very low median and a wide spread. The other three models show very high medians (near 9-10) with compact IQRs, indicating consistently high harmfulness scores in this category.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~1.5, IQR ~0.5-3, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~9.5, IQR ~9-10, range ~0-10.
    * SACLPO (P): Median ~9, IQR ~8-10, range ~0-10.
    * RLAIF (P): Median ~9.5, IQR ~9-10, range ~5-10.
**8. Social Bias**
* **Trend:** A clear upward trend in median harmfulness from GPT to RLAIF (P). GPT has a wide distribution, while the others are more concentrated at the high end.
* **Data Points (Approximate Medians & Ranges):**
    * GPT: Median ~6, IQR ~2-8, range ~0-10.
    * Llama2-7b-Chat-3.0: Median ~8.5, IQR ~7-9, range ~0-10.
    * SACLPO (P): Median ~9, IQR ~8-9.5, range ~0-10.
    * RLAIF (P): Median ~9, IQR ~8.5-9.5, range ~5-10.
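
The box widths listed above can be compared directly rather than by eye. The sketch below transcribes the approximate IQR bounds from the eight panel listings and reports which model has the widest box in each category. The names `IQR_BOUNDS` and `widest_boxes` are illustrative, and the numbers carry the same reading error as the estimates they are copied from.

```python
MODELS = ["GPT", "Llama2-7b-Chat-3.0", "SACLPO (P)", "RLAIF (P)"]

# Approximate (Q1, Q3) per category, in MODELS order,
# transcribed by hand from the panel listings above.
IQR_BOUNDS = {
    "Crime":          [(0.5, 2), (8.5, 9.5), (8, 10),  (9, 10)],
    "Emotional Harm": [(5, 8),   (6, 9),     (7, 9),   (8, 9.5)],
    "Immoral":        [(0.5, 3), (8, 10),    (6, 9),   (8, 9.5)],
    "Insult":         [(3, 8),   (5, 8),     (5, 9),   (7.5, 8.5)],
    "Physical Harm":  [(5, 8),   (7, 9),     (7, 9.5), (3, 8)],
    "Pornographic":   [(1, 7),   (1, 6),     (2, 8),   (3, 8)],
    "Privacy":        [(0.5, 3), (9, 10),    (8, 10),  (9, 10)],
    "Social Bias":    [(2, 8),   (7, 9),     (8, 9.5), (8.5, 9.5)],
}

def widest_boxes(bounds):
    """Map each category to the model(s) with the largest Q3 - Q1."""
    result = {}
    for category, boxes in bounds.items():
        widths = [q3 - q1 for q1, q3 in boxes]
        top = max(widths)
        # Keep every model that ties for the widest box.
        result[category] = [m for m, w in zip(MODELS, widths) if w == top]
    return result

widest = widest_boxes(IQR_BOUNDS)
for category, models in widest.items():
    print(f"{category}: {', '.join(models)}")
```

The helper returns a list per category because the estimates are rounded to about half a point, so ties between models are expected.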
### Key Observations
1. **General Hierarchy:** In four of the eight categories ("Crime", "Emotional Harm", "Insult", "Social Bias"), the approximate medians follow the order GPT < Llama2-7b-Chat-3.0 < SACLPO (P) ≈ RLAIF (P). The exceptions are "Immoral" and "Privacy", where Llama2-7b-Chat-3.0 sits above SACLPO (P); "Pornographic", where Llama2-7b-Chat-3.0 falls below GPT; and "Physical Harm", where RLAIF (P) drops to GPT's level.
2. **GPT's Variability:** The GPT model (grey boxes) has the widest or tied-widest interquartile range in five of the eight categories ("Emotional Harm", "Insult", "Pornographic", "Privacy", "Social Bias"), and its whiskers span the full 0-10 range in six. Its median is the lowest in every category except "Pornographic" (where Llama2-7b-Chat-3.0 is lower) and "Physical Harm" (where it ties RLAIF (P) at ~7).
3. **High-End Clustering:** The SACLPO (P) (green) and RLAIF (P) (red) models frequently cluster at the high end of the scale, with medians between 7 and 10 in every case except SACLPO (P) in "Pornographic" (~5). RLAIF (P) in particular has compact IQRs outside "Physical Harm" and "Pornographic", suggesting these models are more consistently rated as highly harmful across most of these dimensions.
4. **Category Sensitivity:** The "Crime", "Privacy", and "Immoral" categories show the most extreme separation, with GPT's median near 1.5 and the other three models at 8 or above. The "Pornographic" category shows the most overlap and variability among the models.
5. **Outlier Note:** The RLAIF (P) model in the "Physical Harm" category breaks the general trend, showing a lower median (~7) and a much wider distribution than in most other categories; only its "Pornographic" box is comparable.
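
The ordering claim in observation 1 can be checked mechanically against the approximate medians listed in the Detailed Analysis. A minimal sketch follows, in which `MEDIANS` is a hand transcription of those estimates and `follows_order` is an illustrative helper. It treats SACLPO (P) and RLAIF (P) as roughly equal when they differ by at most 0.5, a tolerance chosen here as an assumption.

```python
# Approximate medians per category, in the order
# (GPT, Llama2-7b-Chat-3.0, SACLPO (P), RLAIF (P)),
# transcribed by hand from the panel listings.
MEDIANS = {
    "Crime":          (1.5, 9,   9.5, 9.5),
    "Emotional Harm": (7,   8,   8.5, 9),
    "Immoral":        (1.5, 9,   8,   9),
    "Insult":         (6,   7,   7.5, 8),
    "Physical Harm":  (7,   8,   8.5, 7),
    "Pornographic":   (4,   3,   5,   7),
    "Privacy":        (1.5, 9.5, 9,   9.5),
    "Social Bias":    (6,   8.5, 9,   9),
}

def follows_order(gpt, llama, saclpo, rlaif, tol=0.5):
    """True if GPT < Llama2 < SACLPO and SACLPO is within tol of RLAIF."""
    return gpt < llama < saclpo and abs(saclpo - rlaif) <= tol

matching = [c for c, m in MEDIANS.items() if follows_order(*m)]
exceptions = [c for c in MEDIANS if c not in matching]
print("follows order:", matching)
print("exceptions:", exceptions)
```

Because the medians are read off the plot to roughly half a point, a small gap such as 9.5 versus 9 should not be over-interpreted; a looser comparison would change which categories count as exceptions.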
### Interpretation
This visualization suggests a substantial difference in the perceived harmfulness of outputs from the evaluated models. GPT is generally rated as the least harmful, though in several categories its scores are spread across much of the scale. The other three models (Llama2-7b-Chat-3.0, SACLPO (P), RLAIF (P)) are generally scored as more harmful across these specific ethical dimensions.
The data suggest that the methods behind SACLPO (P) and RLAIF (P), while potentially improving other metrics, may inadvertently increase the generation of content that is scored as harmful in categories like Crime, Social Bias, and Privacy. The figure itself does not indicate who or what assigned the scores. The high scores for these models in "Privacy" are particularly notable, suggesting a potential trade-off between alignment objectives and privacy preservation.
The wide spread for GPT in most categories indicates its outputs are highly variable: sometimes harmless, sometimes very harmful. In contrast, the tighter high-end clustering of SACLPO and RLAIF suggests their outputs are more consistently within a range perceived as harmful. This could be a critical insight for safety deployment, highlighting that "alignment" does not uniformly reduce all forms of measured harm and may require category-specific safeguards. The anomaly in "Physical Harm" for RLAIF warrants further investigation to determine whether it is a measurement artifact or a genuine characteristic of that model's outputs.