Image 9dfbfb5ba9fd...

EXPERT: healer-alpha-free VERSION 1

RUNTIME: free/openrouter/healer-alpha
INTEL_VERIFIED
## Line Chart: Sensitivity to Top-K

### Overview
This is a line chart titled "Sensitivity to Top-K" that plots the performance of four different metrics (Perplexity, LN-Entropy, Lexical Similarity, and EigenScore) as a function of the "Top-K" parameter. The performance is measured by the AUROC (Area Under the Receiver Operating Characteristic Curve) score. The chart demonstrates how sensitive each metric's performance is to changes in the Top-K value.

### Components/Axes
*   **Chart Title:** "Sensitivity to Top-K" (centered at the top).
*   **Y-Axis:**
    *   **Label:** "AUROC"
    *   **Scale:** Linear, ranging from 40 to 90.
    *   **Major Ticks:** 40, 50, 60, 70, 80, 90.
*   **X-Axis:**
    *   **Label:** "Top-K"
    *   **Scale:** Appears to be categorical or logarithmic, with discrete values.
    *   **Data Points (Ticks):** 3, 5, 10, 20, 30, 50.
*   **Legend:** Located in the bottom-right corner of the plot area. It maps line colors and marker styles to metric names:
    *   **Blue line with 'x' markers:** Perplexity
    *   **Gray line with diamond markers:** LN-Entropy
    *   **Teal line with circle markers:** Lexical Similarity
    *   **Orange line with star markers:** EigenScore

### Detailed Analysis
The chart displays four data series, each showing a relatively flat trend across the range of Top-K values.

1.  **EigenScore (Orange, Star Markers):**
    *   **Trend:** The line is nearly horizontal, showing a very slight upward trend from Top-K=3 to Top-K=50.
    *   **Approximate Values:**
        *   Top-K=3: ~79
        *   Top-K=5: ~80
        *   Top-K=10: ~79
        *   Top-K=20: ~80
        *   Top-K=30: ~80
        *   Top-K=50: ~80

2.  **Lexical Similarity (Teal, Circle Markers):**
    *   **Trend:** The line is mostly flat with a minor dip around Top-K=20 before recovering.
    *   **Approximate Values:**
        *   Top-K=3: ~74
        *   Top-K=5: ~75
        *   Top-K=10: ~75
        *   Top-K=20: ~73
        *   Top-K=30: ~74
        *   Top-K=50: ~76

3.  **LN-Entropy (Gray, Diamond Markers):**
    *   **Trend:** The line is very flat, showing minimal variation.
    *   **Approximate Values:**
        *   Top-K=3: ~67
        *   Top-K=5: ~67
        *   Top-K=10: ~68
        *   Top-K=20: ~68
        *   Top-K=30: ~68
        *   Top-K=50: ~68

4.  **Perplexity (Blue, 'x' Markers):**
    *   **Trend:** The line is almost perfectly horizontal, indicating no sensitivity to Top-K.
    *   **Approximate Values:**
        *   Top-K=3: ~64
        *   Top-K=5: ~64
        *   Top-K=10: ~64
        *   Top-K=20: ~64
        *   Top-K=30: ~64
        *   Top-K=50: ~64

### Key Observations
*   **Performance Hierarchy:** There is a clear and consistent performance ranking across all Top-K values: EigenScore > Lexical Similarity > LN-Entropy > Perplexity.
*   **Low Sensitivity:** All four metrics exhibit very low sensitivity to the Top-K parameter within the tested range (3 to 50). The AUROC scores change by only 1-2 points at most.
*   **Stability:** The Perplexity metric is the most stable, showing virtually no change. EigenScore and LN-Entropy are also highly stable. Lexical Similarity shows the most variation, though it is still minimal.
*   **Visual Separation:** The lines for the four metrics are distinctly separated and do not intersect, confirming their consistent relative performance.

### Interpretation
The data suggests that for the task being evaluated, the choice of Top-K (within the range of 3 to 50) has a negligible impact on the performance of these four evaluation metrics. This is a significant finding, as it implies that model comparisons using these metrics would be robust to the specific choice of the Top-K hyperparameter.

The consistent performance hierarchy indicates that **EigenScore** is the most effective metric (highest AUROC) for this particular task, followed by **Lexical Similarity**. **Perplexity**, a common language model metric, performs the worst in this context. This could imply that the task requires evaluation criteria beyond simple next-token prediction likelihood, favoring metrics that capture semantic similarity (Lexical Similarity) or distributional properties (EigenScore, LN-Entropy).

The investigation reveals a stable evaluation landscape where the primary differentiator is the choice of metric itself, not the tuning of the Top-K parameter. This allows for more confident and less parameter-sensitive model selection and comparison.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

9dfbfb5ba9fdadbb7cf80686

FOUND IN PAPERS

EXPERT: healer-alpha-free VERSION 1