## Risk Coverage Curve & Receiver Operator Curve: MEDQA Model Evaluation
### Overview
The image displays two side-by-side performance evaluation charts for five confidence-estimation methods (linguistic, tokenprob, consistency, topk, stability) on the MEDQA dataset. The left chart is a Risk Coverage Curve, and the right chart is a Receiver Operating Characteristic (ROC) curve. Both charts compare the methods against a baseline.
### Components/Axes
**Overall Title:** MEDQA (centered at the top)
**Left Chart: Risk Coverage Curve**
* **Chart Title:** Risk Coverage Curve ↑ (top-left)
* **X-axis:** "Coverage" (scale from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **Y-axis:** "Accuracy" (scale from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **Legend (bottom-right):**
* `linguistic (AURC:0.9007)` - Blue line
* `tokenprob (AURC:0.9214)` - Orange line
* `consistency (AURC:0.9267)` - Green line
* `topk (AURC:0.9136)` - Red line
* `stability (AURC:0.9638)` - Purple line
* `baseline accuracy (0.82)` - Black dashed line
**Right Chart: Receiver Operator Curve**
* **Chart Title:** Reciever Operator Curve ↑ (top-right; "Receiver" is misspelled in the image)
* **X-axis:** "False Positive Rate" (scale from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **Y-axis:** "True Positive Rate" (scale from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **Legend (bottom-right):**
* `linguistic (AUROC:0.6704)` - Blue line
* `tokenprob (AUROC:0.8149)` - Orange line
* `consistency (AUROC:0.8286)` - Green line
* `topk (AUROC:0.7101)` - Red line
* `stability (AUROC:0.9078)` - Purple line
* `random` - Black dashed diagonal line (from (0,0) to (1,1))
### Detailed Analysis
**Risk Coverage Curve (Left Chart):**
* **Trend Verification:** All five method lines start at or near (Coverage=0.0, Accuracy=1.0). As Coverage increases, Accuracy generally decreases, but the rate of decrease varies significantly between methods.
* **Data Series & Points:**
* **stability (Purple):** Maintains the highest accuracy over the widest coverage range. It stays near Accuracy = 1.0 until Coverage ≈ 0.5, then declines gradually to meet the baseline at Coverage = 1.0. AURC = 0.9638 (highest).
* **consistency (Green):** Shows a sharp, anomalous dip in accuracy to ~0.7 at very low Coverage (≈0.05), then recovers sharply to near 1.0 before beginning a steady decline. AURC = 0.9267.
* **tokenprob (Orange):** Follows a smooth, convex curve, declining steadily from 1.0. AURC = 0.9214.
* **topk (Red):** Declines in a step-like, jagged pattern. AURC = 0.9136.
* **linguistic (Blue):** Also shows a jagged, step-like decline, generally below the tokenprob and consistency lines after the initial coverage. AURC = 0.9007 (lowest among methods).
* **Baseline (Black Dashed):** A horizontal line at Accuracy = 0.82, representing a fixed performance threshold.
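The AURC values in the legend can be read as the area under this accuracy-vs-coverage curve (higher is better, matching the ↑ in the chart title). A minimal sketch of how such a curve and its area might be computed from per-example confidences and correctness flags; `risk_coverage_curve` and `aurc` are hypothetical helper names, not part of the evaluated methods:

```python
import numpy as np

def risk_coverage_curve(confidences, correct):
    """Accuracy at each coverage level when predictions are accepted
    in order of decreasing confidence (hypothetical helper)."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    hits = np.asarray(correct, dtype=float)[order]
    n = len(hits)
    coverage = np.arange(1, n + 1) / n                 # fraction of examples kept
    accuracy = np.cumsum(hits) / np.arange(1, n + 1)   # accuracy on the kept subset
    return coverage, accuracy

def aurc(coverage, accuracy):
    """Area under the accuracy-coverage curve via the trapezoidal rule."""
    return float(np.sum(0.5 * (accuracy[1:] + accuracy[:-1]) * np.diff(coverage)))
```

A method whose confidence ranks correct answers first keeps `accuracy` near 1.0 as `coverage` grows, which is exactly the behavior the purple `stability` curve exhibits.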
**Receiver Operator Curve (Right Chart):**
* **Trend Verification:** All method lines start at (0,0) and end at (1,1). They bow toward the top-left corner, indicating better-than-random classification performance. The "random" line is the diagonal baseline.
* **Data Series & Points:**
* **stability (Purple):** Exhibits the best performance, with the curve closest to the top-left corner. It reaches a True Positive Rate (TPR) of ~0.65 at a very low False Positive Rate (FPR) of ~0.05. AUROC = 0.9078 (highest).
* **consistency (Green):** The second-best performer. It has a steep initial rise, reaching TPR ≈ 0.8 at FPR ≈ 0.2. AUROC = 0.8286.
* **tokenprob (Orange):** Shows a stepped increase, performing better than linguistic and topk. AUROC = 0.8149.
* **topk (Red):** Follows a path slightly above the linguistic line for most of the curve. AUROC = 0.7101.
* **linguistic (Blue):** The lowest-performing method, with the curve closest to the random diagonal. AUROC = 0.6704.
* **random (Black Dashed):** The diagonal line representing the performance of a random classifier (AUROC = 0.5).
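In this setting the ROC curve treats a correctly answered question as the "positive" class, so each point is the (FPR, TPR) trade-off at one confidence threshold. A minimal sketch under that assumption (`roc_curve` and `auroc` are hypothetical helper names):

```python
import numpy as np

def roc_curve(scores, labels):
    """FPR/TPR points as the acceptance threshold sweeps from high to low.
    A 'positive' is a correctly answered question, so the curve shows how
    well a confidence score separates right from wrong answers."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=int)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / max(y.sum(), 1)))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / max((1 - y).sum(), 1)))
    return fpr, tpr

def auroc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))
```

A random classifier traces the diagonal (AUROC = 0.5), which is the black dashed reference line in the right chart.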
### Key Observations
1. **Consistent Hierarchy:** The `stability` method is the top performer on both metrics (highest AURC and AUROC). The `linguistic` method is the lowest performer on both.
2. **Anomaly in Risk Coverage:** The `consistency` method (green line) shows a significant, sharp drop in accuracy at very low coverage before recovering. This suggests a potential instability or edge-case failure mode when only a very small, highest-confidence subset of predictions is accepted.
3. **Metric Discrepancy:** While the performance *ranking* of methods is similar across both charts, the *absolute performance gaps* differ. For example, the gap between `stability` and `consistency` is much larger in the ROC space (AUROC: 0.9078 vs. 0.8286) than in the Risk Coverage space (AURC: 0.9638 vs. 0.9267).
4. **Step-like Patterns:** The `topk` and `linguistic` lines in the Risk Coverage curve, and the `tokenprob` line in the ROC curve, exhibit distinct step-like patterns rather than smooth curves. This likely indicates that these methods produce only a few distinct confidence values, so many examples tie at the same score and the curves jump discretely as each tied group crosses the threshold.
### Interpretation
This visualization provides a multi-faceted evaluation of model confidence estimation or selective prediction methods on the MEDQA (medical question answering) task.
* **What the data suggests:** The `stability` method is demonstrably superior for both selective prediction (maintaining high accuracy as more predictions are accepted) and general classification discrimination (separating positive and negative classes). The `consistency` method is a strong second but has a concerning failure mode at low coverage.
* **Relationship between elements:** The two charts answer related but different questions. The Risk Coverage Curve answers: "If I only accept the model's top X% most confident predictions, how accurate will it be?" The ROC Curve answers: "How well can this method distinguish between correct and incorrect predictions across all confidence thresholds?" A good method should perform well on both.
* **Notable implications:** For a high-stakes domain like medical QA, the `stability` method appears most reliable. The poor performance of the `linguistic` method suggests that using linguistic features alone is insufficient for robust confidence estimation in this context. The step-like patterns warrant investigation, as they may reveal artifacts in how these methods compute confidence scores. The misspelling "Reciever" in the right chart title is a minor presentational error.
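The first question above ("accept only the top X% most confident predictions") can be answered directly from per-example data. A toy sketch of that selective-prediction policy; `accuracy_at_coverage` is a hypothetical helper, not one of the evaluated methods:

```python
import numpy as np

def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy when only the top `coverage` fraction of most-confident
    predictions is accepted (selective prediction; hypothetical helper)."""
    n_keep = max(1, int(round(coverage * len(confidences))))
    keep = np.argsort(-np.asarray(confidences, dtype=float))[:n_keep]
    return float(np.mean(np.asarray(correct, dtype=float)[keep]))
```

Evaluating this at every coverage level reproduces the left chart; a well-calibrated method keeps this number far above the 0.82 baseline until coverage approaches 1.0.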