## Scatter Plot: Model Performance Comparison on Helpfulness vs. Harmlessness
### Overview
The image displays two side-by-side scatter plots, labeled (a) and (b), comparing the performance of various AI alignment methods. Both plots share the same two axes: "helpfulness" (x-axis) and "harmlessness" (y-axis). A legend at the top center maps each method to its marker. A gray shaded region in the bottom-left corner of both plots marks a zone of low performance on both metrics.
### Components/Axes
* **X-Axis:** Labeled "helpfulness". Scale ranges from 0.3 to 1.0, with major tick marks at 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0.
* **Y-Axis:** Labeled "harmlessness". Scale ranges from 0.4 to 1.0, with major tick marks at 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0.
* **Legend (Top Center):** Defines eight methods:
* `☆ SFT` (Gray Star)
* `✖ Safe RLHF` (Purple X)
* `● DPO` (Blue Circle)
* `● Ra-DPO` (Orange Circle)
* `▶ SACPO (H → S)` (Blue Right-Pointing Triangle)
* `◀ RSCPO (H → S)` (Orange Left-Pointing Triangle)
* `■ SACPO (P)` (Green Square)
* `■ RSCPO (P)` (Red Square)
* **Gray Shaded Region:** Covers the area where helpfulness ≤ 0.5 and harmlessness ≤ 0.5. A dashed line marks the boundary at helpfulness=0.5 and harmlessness=0.5.
* **Baseline Points:** A gray star labeled `SFT` is positioned at approximately (0.5, 0.5), at the corner of the shaded region.
### Detailed Analysis
**Subplot (a):**
* **Safe RLHF (Purple X):** Three versions are plotted.
* `v1.0`: Positioned at approximately (0.58, 0.70).
* `v2.0`: Positioned at approximately (0.35, 0.79).
* `v3.0`: Positioned at approximately (0.55, 0.78).
* **DPO (H) (Blue Circle):** Positioned at approximately (0.76, 0.51).
* **Ra-DPO (H) (Orange Circle):** Positioned at approximately (0.80, 0.52).
* **SACPO (H → S) (Blue Triangles):** Points are labeled with numerical values (likely hyperparameters).
* Point labeled `0.1`: Positioned at (0.60, 0.68).
* Point labeled `0.05`: Positioned at (0.70, 0.83).
* Point labeled `0.025`: Positioned at (0.78, 0.88).
* Point labeled `0.01`: Positioned at (0.76, 0.90).
* **RSCPO (H → S) (Orange Triangles):** Points are labeled with numerical values.
* Point labeled `0.1`: Positioned at (0.68, 0.70).
* Point labeled `0.05`: Positioned at (0.78, 0.84).
* Point labeled `0.025`: Positioned at (0.75, 0.87).
* Point labeled `0.01`: Positioned at (0.73, 0.90).
**Subplot (b):**
* **Safe RLHF (Purple X):** Same three versions as in (a), with identical positions.
* **DPO (H) (Blue Circle):** Same position as in (a).
* **Ra-DPO (H) (Orange Circle):** Same position as in (a).
* **SACPO (P) (Green Squares):** Points are labeled with numerical values.
* Point labeled `0.25`: Positioned at (0.75, 0.60).
* Point labeled `0.5`: Positioned at (0.73, 0.69).
* Point labeled `0.75`: Positioned at (0.72, 0.75).
* Point labeled `0.9`: Positioned at (0.70, 0.90).
* Point labeled `0.95`: Positioned at (0.75, 0.91).
* **RSCPO (P) (Red Squares):** Points are labeled with numerical values.
* Point labeled `0.25`: Positioned at (0.82, 0.69).
* Point labeled `0.5`: Positioned at (0.85, 0.80).
* Point labeled `0.75`: Positioned at (0.87, 0.86).
* Point labeled `0.9`: Positioned at (0.85, 0.89).
* Point labeled `0.95`: Positioned at (0.84, 0.90).
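The approximate coordinates read off subplot (b) can be checked for Pareto dominance with a short Python sketch. All coordinates below are the visual estimates listed above, not exact figure data, and the point names are ad hoc labels for this sketch:

```python
# Approximate (helpfulness, harmlessness) estimates read off subplot (b).
points = {
    "SFT": (0.50, 0.50),
    "Safe RLHF v1.0": (0.58, 0.70),
    "Safe RLHF v2.0": (0.35, 0.79),
    "Safe RLHF v3.0": (0.55, 0.78),
    "DPO": (0.76, 0.51),
    "Ra-DPO": (0.80, 0.52),
    "SACPO (P) 0.95": (0.75, 0.91),
    "RSCPO (P) 0.75": (0.87, 0.86),
    "RSCPO (P) 0.95": (0.84, 0.90),
}

def pareto_front(pts):
    """Return names of points not dominated by any other point
    (higher is better on both axes)."""
    front = []
    for name, (x, y) in pts.items():
        dominated = any(
            ox >= x and oy >= y and (ox > x or oy > y)
            for other, (ox, oy) in pts.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

print(pareto_front(points))
# With these estimates, only SACPO (P) / RSCPO (P) points survive on the front.
```

On these estimated coordinates, every baseline (SFT, Safe RLHF, DPO, Ra-DPO) is dominated by at least one `(P)` point, which is consistent with the frontier shape described below.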
### Key Observations
1. **Performance Clustering:** Methods cluster into distinct regions. Safe RLHF variants are in the mid-harmlessness, lower-helpfulness area. DPO and Ra-DPO are in the high-helpfulness, low-harmlessness area. The SACPO and RSCPO methods (both H→S and P variants) occupy the high-harmlessness region, with helpfulness varying based on the variant and hyperparameter.
2. **Hyperparameter Trend:** For both SACPO and RSCPO, the numerical labels (likely a hyperparameter such as a penalty weight) trace a clear trajectory: as the H → S label decreases from 0.1 to 0.01, or as the P label increases from 0.25 to 0.95, each point moves upward (higher harmlessness) and often rightward (higher helpfulness).
3. **Variant Comparison:** The `(P)` variants in subplot (b) generally achieve higher helpfulness scores than their `(H → S)` counterparts in subplot (a) for similar harmlessness levels. The `RSCPO (P)` (red squares) points are the furthest to the top-right, indicating the best combined performance.
4. **Baseline:** The SFT model sits at the threshold of the low-performance gray zone (0.5, 0.5). All other methods shown improve upon this baseline in at least one dimension.
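The hyperparameter trend in observation 2 can be sanity-checked against the listed coordinates: harmlessness should rise monotonically along each sweep. A minimal sketch, using two of the sweeps above as `(label, harmlessness)` pairs (the sweep names and ordering here are this sketch's assumptions, and the values are visual estimates):

```python
# Each sweep is ordered as described in the figure: H→S labels decreasing,
# P labels increasing. Harmlessness values are estimates from the description.
sweeps = {
    "SACPO (H→S), label 0.1 → 0.01":
        [(0.1, 0.68), (0.05, 0.83), (0.025, 0.88), (0.01, 0.90)],
    "RSCPO (P), label 0.25 → 0.95":
        [(0.25, 0.69), (0.5, 0.80), (0.75, 0.86), (0.9, 0.89), (0.95, 0.90)],
}

def harmlessness_nondecreasing(sweep):
    """True if harmlessness never decreases along the sweep order."""
    vals = [h for _, h in sweep]
    return all(a <= b for a, b in zip(vals, vals[1:]))

for name, sweep in sweeps.items():
    print(name, harmlessness_nondecreasing(sweep))
```

Note that helpfulness along the same sweeps is not monotone (e.g., SACPO (H → S) moves from 0.78 back to 0.76 between labels 0.025 and 0.01), so only the harmlessness direction of the trend is checked here.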
### Interpretation
This visualization demonstrates the trade-off and potential synergy between helpfulness and harmlessness in AI alignment techniques. The data suggests that the **P variants of SACPO and RSCPO are the most effective** at simultaneously maximizing both metrics, pushing the Pareto frontier toward the ideal top-right corner (1.0, 1.0).
The clear trend with the hyperparameter labels indicates these methods offer a **tunable knob** to navigate the helpfulness-harmlessness trade-off. Lower values in the H→S variants and higher values in the P variants appear to prioritize harmlessness without severely sacrificing helpfulness.
The positioning of Safe RLHF suggests it achieves moderate harmlessness but at a significant cost to helpfulness. Conversely, standard DPO and Ra-DPO achieve high helpfulness but remain near the harmlessness baseline. The SACPO/RSCPO methods, particularly the P variants, appear to resolve this tension more effectively, representing a significant advancement in balancing these two critical and often competing objectives for AI systems. The gray zone serves as a visual reminder of the undesirable state (low on both metrics) that these alignment techniques aim to move models away from.