## Scatter Plot Comparison: Model Performance on Task Success Rate vs. Generative Quality & Controllability
### Overview
The image contains two side-by-side scatter plots. Both share the same Y-axis, "Task Success Rate (%)", but plot it against different X-axis metrics: "Gen. Quality (Aesthetic+Image Quality) ↑" on the left and "Controllability (1 - LPIPS) ↑" on the right. The plots compare various AI models, grouped into three categories: Zero-shot (pink circles), Post-trained (blue circles), and Others (green squares). The upward arrows (↑) on the X-axis labels indicate that higher values are better.
### Components/Axes
**Common Elements:**
* **Y-axis:** "Task Success Rate (%)". Scale ranges from 55 to 65, with major ticks at 1-unit intervals.
* **Legend:** Located in the top-left corner of each plot.
* Pink Circle: Zero-shot
* Blue Circle: Post-trained
* Green Square: Others
* **Data Points:** Each point is labeled with a model name. The color and shape correspond to the legend.
**Left Plot:**
* **X-axis:** "Gen. Quality (Aesthetic+Image Quality) ↑". Scale ranges from 0.325 to 0.475, with major ticks at 0.025 intervals.
**Right Plot:**
* **X-axis:** "Controllability (1 - LPIPS) ↑". Scale ranges from 0.15 to 0.50, with major ticks at 0.05 intervals.
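The two-panel layout described above can be reconstructed as a minimal matplotlib sketch. Axis limits and labels come from the description; the empty scatter series exist only to populate the legends, since the real data points are listed in the next section:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, (ax_l, ax_r) = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

# Shared Y-axis: Task Success Rate (%), 55-65.
ax_l.set_ylim(55, 65)
ax_l.set_ylabel("Task Success Rate (%)")

# Left panel: generative quality; right panel: controllability.
ax_l.set_xlim(0.325, 0.475)
ax_l.set_xlabel("Gen. Quality (Aesthetic+Image Quality) \u2191")
ax_r.set_xlim(0.15, 0.50)
ax_r.set_xlabel("Controllability (1 - LPIPS) \u2191")

# Empty series per category, styled per the legend, so each panel
# gets the same top-left legend as in the figure.
for ax in (ax_l, ax_r):
    ax.scatter([], [], c="pink", marker="o", label="Zero-shot")
    ax.scatter([], [], c="tab:blue", marker="o", label="Post-trained")
    ax.scatter([], [], c="green", marker="s", label="Others")
    ax.legend(loc="upper left")

fig.tight_layout()
```

`sharey=True` is what gives both panels the common "Task Success Rate (%)" axis noted under Common Elements.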
### Detailed Analysis
**Left Plot: Task Success Rate vs. Generative Quality**
* **Trend Verification:** There is a general, weak positive trend. Models with higher Generative Quality scores tend to have slightly higher Task Success Rates, but the correlation is not strong, and there is significant scatter.
* **Data Points (Approximate Coordinates - Gen. Quality, Task Success):**
* **Zero-shot (Pink):**
* Runway Gen4: (0.450, 65.0)
* Wan2.2 A14B: (0.450, 59.5)
* Wan2.1: (0.475, 58.3)
* Hunyuan: (0.400, 58.0)
* Wan2.2 5B: (0.400, 55.0)
* Cosmos-P2: (0.475, 55.0)
* **Post-trained (Blue):**
* Wan2.1†: (0.375, 62.5)
* SVD†: (0.360, 61.0)
* Cosmos-P2†: (0.350, 60.0)
* SVD: (0.375, 57.5)
* Pathdreamer: (0.350, 57.0)
* Wan2.2 5B†: (0.375, 56.0)
* **Others (Green):**
* LTXVideo†: (0.350, 57.5)
* SE3DS: (0.365, 57.3)
* LTXVideo: (0.375, 56.5)
* NWM: (0.325, 57.2)
**Right Plot: Task Success Rate vs. Controllability**
* **Trend Verification:** There is a clearer positive trend compared to the left plot. Models with higher Controllability scores generally achieve higher Task Success Rates.
* **Data Points (Approximate Coordinates - Controllability, Task Success):**
* **Zero-shot (Pink):**
* Runway Gen4: (0.325, 65.0)
* Wan2.2 A14B: (0.325, 59.5)
* Wan2.1: (0.275, 58.3)
* SVD: (0.325, 57.8)
* Hunyuan: (0.350, 57.5)
* LTXVideo: (0.325, 56.0)
* Wan2.2 5B: (0.325, 55.5)
* Cosmos-P2: (0.175, 55.0)
* **Post-trained (Blue):**
* Wan2.1†: (0.500, 62.5)
* SVD†: (0.500, 61.0)
* Cosmos-P2†: (0.500, 60.0)
* Wan2.2 5B†: (0.450, 56.0)
* **Others (Green):**
* LTXVideo†: (0.350, 57.5)
* SE3DS: (0.350, 57.3)
* Pathdreamer: (0.300, 57.0)
* NWM: (0.375, 57.2)
### Key Observations
1. **Top Performer:** "Runway Gen4" (Zero-shot) is the clear outlier, achieving the highest Task Success Rate (~65%) in both plots, with high Generative Quality but only moderate Controllability.
2. **Post-training Effect:** Models with the "†" suffix (indicating post-training) consistently show a significant rightward shift on the Controllability axis (right plot) compared to their base versions, while maintaining or slightly improving Task Success Rate. This effect is less pronounced on the Generative Quality axis.
3. **Metric Correlation:** Task Success Rate appears to have a stronger visual correlation with Controllability than with Generative Quality.
4. **Cluster of "Others":** The green "Others" models (NWM, SE3DS, LTXVideo) cluster in a mid-range for both metrics, generally between 56% and 58% Task Success Rate.
5. **Cosmos-P2 Anomaly:** The base "Cosmos-P2" model has the lowest Controllability score (~0.175) but a mid-range Generative Quality score, indicating a potential trade-off or specialization in its design.
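The post-training shift in observation 2 can be quantified directly from the approximate right-plot coordinates. The sketch below pairs each base model with its † variant and prints the change in controllability and task success; the values are eyeballed reads from the figure, so the deltas are illustrative:

```python
# (controllability, task success %): base -> post-trained (†),
# approximate reads from the right plot.
pairs = {
    "Wan2.1":    ((0.275, 58.3), (0.500, 62.5)),
    "SVD":       ((0.325, 57.8), (0.500, 61.0)),
    "Cosmos-P2": ((0.175, 55.0), (0.500, 60.0)),
    "Wan2.2 5B": ((0.325, 55.5), (0.450, 56.0)),
    "LTXVideo":  ((0.325, 56.0), (0.350, 57.5)),
}

for name, ((c0, s0), (c1, s1)) in pairs.items():
    print(f"{name:<9}  \u0394controllability = {c1 - c0:+.3f}   "
          f"\u0394success = {s1 - s0:+.1f} pp")
```

Every controllability delta is positive and no success delta is negative, which is exactly the "rightward shift without sacrificing Task Success Rate" pattern described in observation 2.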
### Interpretation
This comparative analysis suggests several insights about the evaluated models:
* **The Success-Controllability Link:** The stronger trend in the right plot implies that a model's ability to be precisely controlled (as measured by 1 - LPIPS) is a more reliable predictor of overall task success than its raw aesthetic or image quality. This makes intuitive sense for applied tasks where following instructions is paramount.
* **Value of Post-training:** The dramatic improvement in Controllability for post-trained models (†) highlights the effectiveness of this technique for enhancing steerability without sacrificing—and sometimes even improving—task performance. This is a key finding for model development.
* **Performance vs. Specialization:** "Runway Gen4" demonstrates that it's possible to achieve top-tier task success with a zero-shot model, but its controllability is not the highest. Conversely, post-trained models like "Wan2.1†" and "SVD†" achieve the highest controllability scores, suggesting they may be preferable for applications requiring fine-grained user input.
* **Trade-off Identification:** The position of "Cosmos-P2" suggests a model architecture or training focus that prioritizes generative quality over controllability. This isn't inherently negative but defines its use case.
In summary, the data suggest that, within this evaluation framework, optimizing for controllability is likely more impactful for task success than optimizing solely for generative quality, and that post-training is a highly effective way to achieve that optimization.