## Bar Chart: Frequency of Patterns in Unfaithful Pairs Across Language Models
### Overview
This bar chart compares the frequency of four pattern types observed in "unfaithful pairs" (Fact Manipulation, Argument Switching, Answer Flipping, and Other) across a range of language models. The y-axis gives the frequency of each pattern as a percentage, ranging from 0 to 100; the x-axis lists the models being compared. Each model has four bars, one per pattern type, and an 'n=' annotation indicating the number of unfaithful pairs analyzed for that model.
### Components/Axes
* **X-axis:** Model (labeled with the following models: PaLM 3.5, Sonnet 3.5 v2, Sonnet 3.7, Sonnet 3.7 (1k), DeepSeek V3, DeepSeek R1, GPT-4o Mini, GPT-4o Aug 24, ChatGPT-4o, Gemini 1.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Pro, Llama 3.1 70B, Llama 3.3 70B IT, QwQ 32B)
* **Y-axis:** Frequency of Patterns in Unfaithful Pairs (%) (scale from 0 to 100)
* **Legend:**
* Fact Manipulation (Green)
* Argument Switching (Red)
* Answer Flipping (Blue)
* Other (Yellow)
* **Annotations:** 'n=' values above each model's group of bars, indicating that model's sample size (number of unfaithful pairs).
### Detailed Analysis
The chart covers 15 models, each with four bars representing the four pattern types. The 'n=' values vary widely across models, indicating that different numbers of unfaithful pairs were analyzed for each.
Here's a breakdown of the approximate frequencies for each model and pattern, based on visual estimation:
* **PaLM 3.5:** Fact Manipulation ~85%, Argument Switching ~10%, Answer Flipping ~5%, Other ~0%. (n=363)
* **Sonnet 3.5 v2:** Fact Manipulation ~90%, Argument Switching ~5%, Answer Flipping ~5%, Other ~0%. (n=22)
* **Sonnet 3.7:** Fact Manipulation ~80%, Argument Switching ~15%, Answer Flipping ~5%, Other ~0%. (n=90)
* **Sonnet 3.7 (1k):** Fact Manipulation ~5%, Argument Switching ~90%, Answer Flipping ~5%, Other ~0%. (n=2)
* **DeepSeek V3:** Fact Manipulation ~70%, Argument Switching ~20%, Answer Flipping ~10%, Other ~0%. (n=60)
* **DeepSeek R1:** Fact Manipulation ~80%, Argument Switching ~15%, Answer Flipping ~5%, Other ~0%. (n=18)
* **GPT-4o Mini:** Fact Manipulation ~80%, Argument Switching ~15%, Answer Flipping ~5%, Other ~0%. (n=660)
* **GPT-4o Aug 24:** Fact Manipulation ~70%, Argument Switching ~20%, Answer Flipping ~10%, Other ~0%. (n=18)
* **ChatGPT-4o:** Fact Manipulation ~60%, Argument Switching ~25%, Answer Flipping ~10%, Other ~5%. (n=24)
* **Gemini 1.5 Pro:** Fact Manipulation ~60%, Argument Switching ~25%, Answer Flipping ~10%, Other ~5%. (n=320)
* **Gemini 2.5 Flash:** Fact Manipulation ~60%, Argument Switching ~25%, Answer Flipping ~10%, Other ~5%. (n=106)
* **Gemini 2.5 Pro:** Fact Manipulation ~60%, Argument Switching ~25%, Answer Flipping ~10%, Other ~5%. (n=159)
* **Llama 3.1 70B:** Fact Manipulation ~50%, Argument Switching ~30%, Answer Flipping ~15%, Other ~5%. (n=102)
* **Llama 3.3 70B IT:** Fact Manipulation ~50%, Argument Switching ~30%, Answer Flipping ~15%, Other ~5%. (n=220)
* **QwQ 32B:** Fact Manipulation ~40%, Argument Switching ~30%, Answer Flipping ~20%, Other ~10%. (n=7)
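Since the per-model frequencies above are visual estimates, they can be collected into a small table and aggregated. The sketch below (using the approximate readings listed above, not exact figure data) computes the n-weighted average frequency of each pattern across all models:

```python
# Approximate per-model frequencies (%) read off the chart, plus sample sizes.
# Columns: Fact Manipulation, Argument Switching, Answer Flipping, Other, n.
estimates = {
    "PaLM 3.5":         (85, 10,  5,  0, 363),
    "Sonnet 3.5 v2":    (90,  5,  5,  0,  22),
    "Sonnet 3.7":       (80, 15,  5,  0,  90),
    "Sonnet 3.7 (1k)":  ( 5, 90,  5,  0,   2),
    "DeepSeek V3":      (70, 20, 10,  0,  60),
    "DeepSeek R1":      (80, 15,  5,  0,  18),
    "GPT-4o Mini":      (80, 15,  5,  0, 660),
    "GPT-4o Aug 24":    (70, 20, 10,  0,  18),
    "ChatGPT-4o":       (60, 25, 10,  5,  24),
    "Gemini 1.5 Pro":   (60, 25, 10,  5, 320),
    "Gemini 2.5 Flash": (60, 25, 10,  5, 106),
    "Gemini 2.5 Pro":   (60, 25, 10,  5, 159),
    "Llama 3.1 70B":    (50, 30, 15,  5, 102),
    "Llama 3.3 70B IT": (50, 30, 15,  5, 220),
    "QwQ 32B":          (40, 30, 20, 10,   7),
}

patterns = ["Fact Manipulation", "Argument Switching", "Answer Flipping", "Other"]
total_n = sum(row[4] for row in estimates.values())

# n-weighted mean frequency (%) of each pattern across all models.
weighted = {
    patterns[i]: sum(row[i] * row[4] for row in estimates.values()) / total_n
    for i in range(4)
}

for name, pct in weighted.items():
    print(f"{name}: {pct:.1f}%")
```

Because the four bars within each model sum to roughly 100%, the weighted averages also sum to roughly 100%, and the n-weighting lets large-n models (GPT-4o Mini, PaLM 3.5, Gemini 1.5 Pro) dominate the aggregate.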
**Trends:**
* **Fact Manipulation:** Generally the most frequent pattern, especially in PaLM 3.5, Sonnet 3.5 v2, and Sonnet 3.7. Decreases in frequency for Llama and QwQ models.
* **Argument Switching:** Spikes for Sonnet 3.7 (1k) (~90%, though n=2) and otherwise stays roughly in the 10-30% range for most models.
* **Answer Flipping:** Remains consistently low across most models, generally below 15%.
* **Other:** Generally the least frequent pattern, remaining below 10% for most models.
### Key Observations
* PaLM 3.5 and Sonnet 3.5 v2 exhibit the highest frequency of Fact Manipulation.
* Sonnet 3.7 (1k) shows a dramatic shift towards Argument Switching, with Fact Manipulation being minimal. This is likely due to the small sample size (n=2).
* The Llama 3 and QwQ 32B models show a more balanced distribution of patterns than the other models.
* The sample sizes ('n' values) vary significantly, which could influence the observed frequencies. Models with smaller sample sizes may not be representative of the overall pattern distribution.
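The caution about sample sizes can be made concrete with confidence intervals. As a sketch (assuming each bar is an independent observed proportion out of n unfaithful pairs), the Wilson score interval shows how much wider the uncertainty is for Sonnet 3.7 (1k) (n=2) than for GPT-4o Mini (n=660):

```python
import math

def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed proportion p out of n trials."""
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    halfwidth = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - halfwidth, center + halfwidth

# Argument Switching for Sonnet 3.7 (1k): ~90% of only n=2 pairs.
lo_small, hi_small = wilson_interval(0.90, 2)
# Fact Manipulation for GPT-4o Mini: ~80% of n=660 pairs.
lo_large, hi_large = wilson_interval(0.80, 660)

print(f"n=2:   {lo_small:.2f}-{hi_small:.2f}")   # spans most of [0, 1]
print(f"n=660: {lo_large:.2f}-{hi_large:.2f}")   # only a few points wide
```

With n=2, the 95% interval spans most of the 0-100% range, so the apparent "dramatic shift" in Sonnet 3.7 (1k) is statistically indistinguishable from many other pattern mixes; with n=660, the estimate is pinned down to within a few percentage points.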
### Interpretation
The chart suggests that some model families (PaLM 3.5, the Sonnet series) are especially prone to Fact Manipulation, while others (Llama 3, QwQ 32B) exhibit a more even mix of unfaithful patterns. The dramatic shift in Sonnet 3.7 (1k) toward Argument Switching, coupled with its very small sample size (n=2), underscores the importance of considering sample size when interpreting these results.
The differences across models indicate that the *type* of unfaithfulness varies: where some models primarily manipulate facts, others appear more susceptible to subtle shifts in argumentation or answer consistency. The consistent presence of "Other" suggests that some unfaithful patterns do not fit neatly into these four categories, indicating a need for further research and refinement of the pattern taxonomy.
The varying 'n' values mean the estimates differ greatly in precision. Models with larger sample sizes (e.g., GPT-4o Mini, n=660) provide far more reliable estimates of pattern frequencies than those with smaller ones (e.g., QwQ 32B, n=7; Sonnet 3.7 (1k), n=2). Therefore, caution should be exercised when comparing models with drastically different sample sizes.