## Grouped Bar Chart: Prediction Flip Rate by Dataset and Model
### Overview
The image displays two side-by-side grouped bar charts comparing the "Prediction Flip Rate" across four question-answering datasets for two different language models: Llama-3-8B (left panel) and Llama-3-70B (right panel). The charts analyze how different "anchoring" methods affect the stability of model predictions.
### Components/Axes
* **Chart Type:** Two grouped bar charts (panels).
* **Panel Titles:**
* Left: `Llama-3-8B`
* Right: `Llama-3-70B`
* **Y-Axis (Both Panels):**
* **Label:** `Prediction Flip Rate`
* **Scale:** Linear, from 0 to 80, with major tick marks at 0, 20, 40, 60, 80.
* **X-Axis (Both Panels):**
* **Label:** `Dataset`
* **Categories (from left to right):** `PopQA`, `TriviaQA`, `HotpotQA`, `NQ`.
* **Legend (Bottom Center, spanning both panels):**
* **Position:** Below the x-axis labels.
* **Categories & Colors (from left to right):**
1. `Q-Anchored (exact_question)` - Light reddish-brown (salmon) bar.
2. `Q-Anchored (random)` - Dark red (burgundy) bar.
3. `A-Anchored (exact_question)` - Light gray bar.
4. `A-Anchored (random)` - Dark gray bar.
### Detailed Analysis
The analysis is segmented by model panel. Values are approximate visual estimates from the chart.
**Panel 1: Llama-3-8B**
* **PopQA:**
* Q-Anchored (exact_question): ~72
* Q-Anchored (random): ~8
* A-Anchored (exact_question): ~38
* A-Anchored (random): ~1
* **TriviaQA:**
* Q-Anchored (exact_question): ~78
* Q-Anchored (random): ~12
* A-Anchored (exact_question): ~34
* A-Anchored (random): ~4
* **HotpotQA:**
* Q-Anchored (exact_question): ~70
* Q-Anchored (random): ~12
* A-Anchored (exact_question): ~12
* A-Anchored (random): ~6
* **NQ:**
* Q-Anchored (exact_question): ~70
* Q-Anchored (random): ~9
* A-Anchored (exact_question): ~19
* A-Anchored (random): ~1
**Panel 2: Llama-3-70B**
* **PopQA:**
* Q-Anchored (exact_question): ~73
* Q-Anchored (random): ~7
* A-Anchored (exact_question): ~31
* A-Anchored (random): ~1
* **TriviaQA:**
* Q-Anchored (exact_question): ~78
* Q-Anchored (random): ~17
* A-Anchored (exact_question): ~35
* A-Anchored (random): ~5
* **HotpotQA:**
* Q-Anchored (exact_question): ~70
* Q-Anchored (random): ~19
* A-Anchored (exact_question): ~12
* A-Anchored (random): ~6
* **NQ:**
* Q-Anchored (exact_question): ~57
* Q-Anchored (random): ~15
* A-Anchored (exact_question): ~22
* A-Anchored (random): ~6
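The approximate readings above can be collected into a small structure for rough aggregate comparison. The sketch below uses the eyeballed values from the chart (not the underlying data, which is not available here) to compute the mean flip rate per anchoring method for each panel:

```python
# Approximate values read off the chart; dataset order: PopQA, TriviaQA, HotpotQA, NQ.
flip_rates = {
    "Llama-3-8B": {
        "Q-Anchored (exact_question)": [72, 78, 70, 70],
        "Q-Anchored (random)":         [8, 12, 12, 9],
        "A-Anchored (exact_question)": [38, 34, 12, 19],
        "A-Anchored (random)":         [1, 4, 6, 1],
    },
    "Llama-3-70B": {
        "Q-Anchored (exact_question)": [73, 78, 70, 57],
        "Q-Anchored (random)":         [7, 17, 19, 15],
        "A-Anchored (exact_question)": [31, 35, 12, 22],
        "A-Anchored (random)":         [1, 5, 6, 6],
    },
}

# Mean flip rate per method, per model panel.
mean_by_method = {
    model: {method: sum(vals) / len(vals) for method, vals in methods.items()}
    for model, methods in flip_rates.items()
}
```

Because the inputs are visual estimates, the resulting means (e.g., ~72.5 for `Q-Anchored (exact_question)` on the 8B model vs. ~69.5 on the 70B model) are only indicative, but they do reflect the ordering of the four series described below.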
### Key Observations
1. **Dominant Series:** The `Q-Anchored (exact_question)` bar (light reddish-brown) is consistently the tallest across all datasets and both models, indicating the highest prediction flip rate.
2. **Secondary Series:** The `A-Anchored (exact_question)` bar (light gray) is consistently the second tallest, but significantly lower than its Q-Anchored counterpart.
3. **Low Flip Rates:** The `Q-Anchored (random)` (dark red) and `A-Anchored (random)` (dark gray) bars show very low flip rates, often below 20 and frequently below 10.
4. **Model Comparison (8B vs. 70B):** The overall pattern is similar between models. However, for the `NQ` dataset, the `Q-Anchored (exact_question)` flip rate appears noticeably lower for the 70B model (~57) compared to the 8B model (~70). Conversely, the `Q-Anchored (random)` rate for `NQ` is higher in the 70B model (~15 vs. ~9).
5. **Dataset Variation:** The `TriviaQA` dataset tends to show the highest flip rates for the `Q-Anchored (exact_question)` method in both models. The `HotpotQA` dataset shows the smallest difference between the `Q-Anchored (exact_question)` and `A-Anchored (exact_question)` methods.
### Interpretation
This chart investigates the stability of language model answers when the input prompt is slightly altered ("anchored"). A high "Prediction Flip Rate" means the model frequently changes its answer.
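In this reading, the flip rate is the percentage of examples whose prediction changes between the baseline prompt and the anchored prompt. A minimal sketch of that comparison (the function name and inputs are illustrative assumptions, not taken from the figure):

```python
def prediction_flip_rate(baseline_preds, anchored_preds):
    """Percentage of examples whose prediction changed after anchoring the prompt."""
    if len(baseline_preds) != len(anchored_preds):
        raise ValueError("prediction lists must be the same length")
    # Count positions where the anchored prediction differs from the baseline.
    flips = sum(b != a for b, a in zip(baseline_preds, anchored_preds))
    return 100.0 * flips / len(baseline_preds)

# Hypothetical example: 2 of 4 answers change, so the flip rate is 50%.
rate = prediction_flip_rate(
    ["Paris", "Berlin", "Rome", "Madrid"],
    ["Paris", "Munich", "Rome", "Lisbon"],
)
```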
* **Core Finding:** Anchoring a prompt to the **exact question** (`Q-Anchored (exact_question)`) makes model predictions highly unstable, causing them to "flip" their answers at rates of roughly 57-78%, typically around 70%. This suggests models are very sensitive to minor rephrasings of the same question.
* **Anchoring to Answers:** Anchoring to the exact answer (`A-Anchored (exact_question)`) also causes instability, but to a much lesser degree (~12-38%). This implies that providing the answer in the prompt still perturbs the model, but less than rephrasing the question.
* **Random Anchoring:** Using a random question or answer for anchoring (`random` variants) results in minimal flip rates. This is a crucial control, showing that the high flip rates are not due to randomness in the anchoring process itself, but specifically due to using the *exact* question or answer from the evaluation set.
* **Model Scale:** The larger Llama-3-70B model does not show a universal improvement in stability (lower flip rates). Its behavior is dataset-dependent, performing slightly worse (higher flip rates) on some random-anchored tasks but better on the challenging `Q-Anchored (exact_question)` task for the `NQ` dataset. This indicates that simply increasing model size does not automatically resolve sensitivity to prompt phrasing.
* **Practical Implication:** The data strongly suggests that evaluating models with multiple semantically equivalent but differently phrased questions (a common practice) may yield highly variable results, undermining the reliability of single-point accuracy metrics. The model's "answer" is not a fixed property but is highly contingent on the precise formulation of the query.