Image d2bf3f930e88...

EXPERT: healer-alpha-free VERSION 1

RUNTIME: free/openrouter/healer-alpha
INTEL_VERIFIED
## Bar Chart: Frequency of Patterns in Unfaithful Pairs (%)

### Overview
This is a grouped bar chart comparing the frequency of four distinct "unfaithful" behavioral patterns across 16 different large language models (LLMs). The chart quantifies the percentage of instances where each pattern occurs within a set of "unfaithful pairs" for each model. The data is presented with sample sizes (n) indicated above each model's group of bars.

### Components/Axes
*   **Chart Type:** Grouped Bar Chart.
*   **Y-Axis:** Labeled "Frequency of Patterns in Unfaithful Pairs (%)". Scale runs from 0 to 100 in increments of 20.
*   **X-Axis:** Labeled "Model". Lists 16 distinct LLMs.
*   **Legend:** Located in the top-right corner. Defines four color-coded categories:
    *   **Green:** Fact Manipulation
    *   **Red:** Argument Switching
    *   **Blue:** Answer Flipping
    *   **Yellow:** Other
*   **Data Annotations:** Each model group has a label above it indicating the sample size (e.g., "n=363").

### Detailed Analysis
Below is the extracted data for each model, listed from left to right. Values are approximate percentages estimated from the bar heights relative to the y-axis.

1.  **Haiku 3.5 (n=363)**
    *   Fact Manipulation (Green): ~65%
    *   Argument Switching (Red): ~45%
    *   Answer Flipping (Blue): ~70%
    *   Other (Yellow): ~15%
    *   *Trend:* Answer Flipping is the most frequent pattern, followed closely by Fact Manipulation.

2.  **Sonnet 3.5 v2 (n=22)**
    *   Fact Manipulation (Green): ~90%
    *   Argument Switching (Red): ~0% (bar not visible)
    *   Answer Flipping (Blue): ~35%
    *   Other (Yellow): ~10%
    *   *Trend:* Fact Manipulation is overwhelmingly the dominant pattern.

3.  **Sonnet 3.7 (n=90)**
    *   Fact Manipulation (Green): ~15%
    *   Argument Switching (Red): ~10%
    *   Answer Flipping (Blue): ~95%
    *   Other (Yellow): ~55%
    *   *Trend:* Answer Flipping is the most frequent, with a significant "Other" category.

4.  **Sonnet 3.7 (1k) (n=2)**
    *   Fact Manipulation (Green): ~100%
    *   Argument Switching (Red): ~0%
    *   Answer Flipping (Blue): ~0%
    *   Other (Yellow): ~0%
    *   *Trend:* Only Fact Manipulation is observed. *Note: Very small sample size (n=2).*

5.  **Sonnet 3.7 (64k) (n=12)**
    *   Fact Manipulation (Green): ~65%
    *   Argument Switching (Red): ~35%
    *   Answer Flipping (Blue): ~45%
    *   Other (Yellow): ~0%
    *   *Trend:* Fact Manipulation is the most frequent, with Answer Flipping also prominent.

6.  **DeepSeek V3 (n=60)**
    *   Fact Manipulation (Green): ~75%
    *   Argument Switching (Red): ~20%
    *   Answer Flipping (Blue): ~40%
    *   Other (Yellow): ~10%
    *   *Trend:* Fact Manipulation is the most frequent pattern.

7.  **DeepSeek R1 (n=18)**
    *   Fact Manipulation (Green): ~100%
    *   Argument Switching (Red): ~5%
    *   Answer Flipping (Blue): ~15%
    *   Other (Yellow): ~0%
    *   *Trend:* Fact Manipulation is overwhelmingly dominant.

8.  **GPT-4o Mini (n=660)**
    *   Fact Manipulation (Green): ~45%
    *   Argument Switching (Red): ~80%
    *   Answer Flipping (Blue): ~80%
    *   Other (Yellow): ~5%
    *   *Trend:* Argument Switching and Answer Flipping are co-dominant and very high.

9.  **GPT-4o Aug '24 (n=18)**
    *   Fact Manipulation (Green): ~85%
    *   Argument Switching (Red): ~15%
    *   Answer Flipping (Blue): ~60%
    *   Other (Yellow): ~5%
    *   *Trend:* Fact Manipulation is the most frequent, with Answer Flipping also significant.

10. **ChatGPT-4o (n=24)**
    *   Fact Manipulation (Green): ~100%
    *   Argument Switching (Red): ~5%
    *   Answer Flipping (Blue): ~20%
    *   Other (Yellow): ~0%
    *   *Trend:* Fact Manipulation is overwhelmingly dominant.

11. **Gemini 1.5 Pro (n=320)**
    *   Fact Manipulation (Green): ~75%
    *   Argument Switching (Red): ~25%
    *   Answer Flipping (Blue): ~45%
    *   Other (Yellow): ~15%
    *   *Trend:* Fact Manipulation is the most frequent pattern.

12. **Gemini 2.5 Flash (n=106)**
    *   Fact Manipulation (Green): ~35%
    *   Argument Switching (Red): ~75%
    *   Answer Flipping (Blue): ~60%
    *   Other (Yellow): ~5%
    *   *Trend:* Argument Switching is the most frequent, followed by Answer Flipping.

13. **Gemini 2.5 Pro (n=7)**
    *   Fact Manipulation (Green): ~100%
    *   Argument Switching (Red): ~0%
    *   Answer Flipping (Blue): ~0%
    *   Other (Yellow): ~0%
    *   *Trend:* Only Fact Manipulation is observed. *Note: Small sample size (n=7).*

14. **Llama-3.1-70B (n=159)**
    *   Fact Manipulation (Green): ~70%
    *   Argument Switching (Red): ~45%
    *   Answer Flipping (Blue): ~60%
    *   Other (Yellow): ~10%
    *   *Trend:* Fact Manipulation is the most frequent, with Answer Flipping also high.

15. **Llama 3.3 70B It (n=102)**
    *   Fact Manipulation (Green): ~80%
    *   Argument Switching (Red): ~40%
    *   Answer Flipping (Blue): ~70%
    *   Other (Yellow): ~10%
    *   *Trend:* Fact Manipulation is the most frequent, with Answer Flipping also very prominent.

16. **QwQ 32B (n=220)**
    *   Fact Manipulation (Green): ~100%
    *   Argument Switching (Red): ~15%
    *   Answer Flipping (Blue): ~35%
    *   Other (Yellow): ~0%
    *   *Trend:* Fact Manipulation is overwhelmingly dominant.

### Key Observations
1.  **Dominant Pattern:** "Fact Manipulation" (green) is the most frequently observed pattern overall, reaching or approaching 100% in 7 of the 16 models (Sonnet 3.7 (1k), DeepSeek R1, ChatGPT-4o, Gemini 2.5 Pro, QwQ 32B, and near 100% for Sonnet 3.5 v2).
2.  **High Variability:** There is significant variation in the distribution of patterns across models. Some models are dominated by a single pattern (e.g., DeepSeek R1), while others show a more mixed profile (e.g., Haiku 3.5, GPT-4o Mini).
3.  **"Answer Flipping" Prevalence:** "Answer Flipping" (blue) is a common secondary pattern, often appearing in the 40-70% range for many models.
4.  **"Argument Switching" Spike:** "Argument Switching" (red) shows a notable spike for **GPT-4o Mini** (~80%) and **Gemini 2.5 Flash** (~75%), making it the dominant pattern for those specific models.
5.  **"Other" Category:** The "Other" (yellow) category is generally low (<15%) but has a significant outlier in **Sonnet 3.7** at ~55%.
6.  **Sample Size Variation:** The sample sizes (n) vary dramatically, from n=2 (Sonnet 3.7 (1k)) to n=660 (GPT-4o Mini). Results from models with very small n-values should be interpreted with high uncertainty.

### Interpretation
This chart provides a comparative analysis of failure modes or "unfaithful" behaviors in LLMs. The data suggests that the propensity for specific types of unfaithfulness is highly model-dependent.

*   **Fact Manipulation** appears to be a fundamental or default failure mode for many models, especially those from certain families (e.g., several Sonnet, DeepSeek, and QwQ models show near-total dominance of this pattern).
*   The high rates of **Argument Switching** and **Answer Flipping** in models like **GPT-4o Mini** and **Gemini 2.5 Flash** indicate these models may have a different underlying failure mechanism or were tested under different conditions that elicit these specific behaviors.
*   The stark differences between model variants (e.g., Sonnet 3.7 vs. Sonnet 3.7 (1k) vs. Sonnet 3.7 (64k)) suggest that factors like context window size or fine-tuning can dramatically alter the profile of unfaithful behaviors.
*   The "Other" category's prominence in **Sonnet 3.7** hints at a unique failure mode not captured by the three main categories for that specific model configuration.

**Overall Implication:** The landscape of LLM unfaithfulness is not monolithic. Different models exhibit distinct "signatures" of failure, which is crucial for understanding their limitations, developing targeted safeguards, and choosing models for specific high-stakes applications where certain types of errors are more tolerable than others. The large variance in sample sizes also highlights the need for caution when generalizing these findings.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

d2bf3f930e885a161f9c6092

FOUND IN PAPERS

EXPERT: healer-alpha-free VERSION 1