## Heatmap Comparison: Bias Scores vs. Accuracy on BBQ (4-shot)
### Overview
The image displays two side-by-side heatmaps comparing the performance of three Gemini models (1.0 Ultra, 1.5 Flash, 1.5 Pro) on the BBQ benchmark under a 4-shot evaluation setting. The left heatmap visualizes "Bias score," while the right heatmap visualizes "Accuracy." Both charts break down performance across 11 social bias categories and two question conditions: "ambiguous" and "disambiguated."
### Components/Axes
* **Chart Titles:**
    * Left: "Bias score on BBQ (4-shot)"
    * Right: "Accuracy on BBQ (4-shot)"
* **Y-Axis (Categories):** Identical for both charts. Listed from top to bottom:
    1. Sexual_orientation
    2. SES (Socioeconomic Status)
    3. Religion
    4. Race_x_gender
    5. Race_x_SES
    6. Race_ethnicity
    7. Physical_appearance
    8. Nationality
    9. Gender_identity
    10. Disability_status
    11. Age
* **X-Axis (Model Name):** Identical for both charts. Listed from left to right:
    1. Gemini 1.0 Ultra
    2. Gemini 1.5 Flash
    3. Gemini 1.5 Pro
* **Column Headers (Question Condition):** Each model column is subdivided into two sub-columns:
    * Left sub-column: "ambiguous"
    * Right sub-column: "disambiguated"
* **Color Scale/Legend:**
    * **Bias Score (Left Chart):** A vertical color bar on the right side of the chart. The scale runs from 0 (light/white) to 20 (dark orange). Values can be negative, extending below the 0 mark on the scale.
    * **Accuracy (Right Chart):** A vertical color bar on the right side of the chart. The scale runs from 0.6 (light) to 1.0 (dark orange).
### Detailed Analysis
**Left Heatmap: Bias Score on BBQ (4-shot)**
* **Data Structure:** Each cell contains a numerical bias score. The color intensity corresponds to the absolute value, with darker orange indicating a higher magnitude of bias (positive or negative).
* **Transcribed Data:**
| Category | Gemini 1.0 Ultra (Ambiguous) | Gemini 1.0 Ultra (Disambiguated) | Gemini 1.5 Flash (Ambiguous) | Gemini 1.5 Flash (Disambiguated) | Gemini 1.5 Pro (Ambiguous) | Gemini 1.5 Pro (Disambiguated) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Sexual_orientation | 2.08 | -0.23 | 0.93 | -0.99 | -4.27 | -3.1 |
| SES | 4.6 | 0.0 | 0.0 | 4.82 | 3.3 | 2.77 |
| Religion | 8.0 | 0.33 | 4.67 | 5.0 | 3.32 | 0.79 |
| Race_x_gender | 0.28 | -0.21 | 0.0 | 1.32 | 0.09 | 0.21 |
| Race_x_SES | 4.28 | 0.0 | -0.02 | -0.95 | -5.07 | 0.54 |
| Race_ethnicity | 0.09 | -0.12 | 0.0 | 0.26 | -0.12 | 0.62 |
| Physical_appearance | 5.96 | 0.0 | 0.25 | 2.01 | -0.15 | -5.1 |
| Nationality | 4.29 | 0.0 | 0.0 | 1.46 | 0.0 | 0.29 |
| Gender_identity | 4.0 | 0.0 | -0.04 | 0.04 | -0.25 | -1.47 |
| Disability_status | 10.67 | 0.0 | 0.0 | 3.73 | -0.66 | -8.53 |
| Age | **27.5** | -0.05 | 0.38 | 2.96 | 1.29 | -0.12 |
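* **Trend Verification:** The "Age" category under Gemini 1.0 Ultra (ambiguous) shows the highest bias score (27.5), indicated by the darkest orange cell in the entire chart. For Gemini 1.0 Ultra, bias scores are markedly higher in the "ambiguous" condition across many categories (e.g., Disability_status: 10.67, Religion: 8.0, Physical_appearance: 5.96), while its "disambiguated" scores all lie within ±0.33 of zero. The two 1.5 models do not follow this pattern. Gemini 1.5 Flash stays within 1 of zero in the "ambiguous" condition except for Religion (4.67), and its scores rise in the "disambiguated" condition for several categories (Religion: 5.0, SES: 4.82, Disability_status: 3.73). Gemini 1.5 Pro shows sizeable negative scores in both conditions (e.g., Race_x_SES: -5.07 ambiguous; Disability_status: -8.53 and Physical_appearance: -5.1 disambiguated).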
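
The outlier claims above can be checked against the transcribed numbers rather than the cell colors. The following is a minimal sketch, assuming the table values were transcribed correctly; the names `COLUMNS`, `BIAS`, `cells`, and `largest_magnitude` are illustrative and not part of any benchmark tooling.

```python
# Column order follows the transcribed table: each model has an
# "ambiguous" and a "disambiguated" sub-column.
COLUMNS = [
    ("Gemini 1.0 Ultra", "ambiguous"),
    ("Gemini 1.0 Ultra", "disambiguated"),
    ("Gemini 1.5 Flash", "ambiguous"),
    ("Gemini 1.5 Flash", "disambiguated"),
    ("Gemini 1.5 Pro", "ambiguous"),
    ("Gemini 1.5 Pro", "disambiguated"),
]

# Bias scores copied from the table above (not re-read from the image).
BIAS = {
    "Sexual_orientation":  [2.08, -0.23, 0.93, -0.99, -4.27, -3.1],
    "SES":                 [4.6, 0.0, 0.0, 4.82, 3.3, 2.77],
    "Religion":            [8.0, 0.33, 4.67, 5.0, 3.32, 0.79],
    "Race_x_gender":       [0.28, -0.21, 0.0, 1.32, 0.09, 0.21],
    "Race_x_SES":          [4.28, 0.0, -0.02, -0.95, -5.07, 0.54],
    "Race_ethnicity":      [0.09, -0.12, 0.0, 0.26, -0.12, 0.62],
    "Physical_appearance": [5.96, 0.0, 0.25, 2.01, -0.15, -5.1],
    "Nationality":         [4.29, 0.0, 0.0, 1.46, 0.0, 0.29],
    "Gender_identity":     [4.0, 0.0, -0.04, 0.04, -0.25, -1.47],
    "Disability_status":   [10.67, 0.0, 0.0, 3.73, -0.66, -8.53],
    "Age":                 [27.5, -0.05, 0.38, 2.96, 1.29, -0.12],
}


def cells(table):
    """Flatten the table into (category, model, condition, value) tuples."""
    return [
        (category, model, condition, value)
        for category, row in table.items()
        for (model, condition), value in zip(COLUMNS, row)
    ]


def largest_magnitude(table, k=3):
    """Return the k cells with the largest absolute value."""
    return sorted(cells(table), key=lambda cell: abs(cell[3]), reverse=True)[:k]


for category, model, condition, value in largest_magnitude(BIAS):
    print(f"{category:20s} {model:17s} {condition:14s} {value:6.2f}")
```

Sorting by absolute value rather than raw value keeps large negative scores in view, which matters here because one of the three largest-magnitude cells is negative.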
**Right Heatmap: Accuracy on BBQ (4-shot)**
* **Data Structure:** Each cell contains an accuracy score between 0 and 1. The color intensity corresponds to the score, with darker orange indicating higher accuracy.
* **Transcribed Data:**
| Category | Gemini 1.0 Ultra (Ambiguous) | Gemini 1.0 Ultra (Disambiguated) | Gemini 1.5 Flash (Ambiguous) | Gemini 1.5 Flash (Disambiguated) | Gemini 1.5 Pro (Ambiguous) | Gemini 1.5 Pro (Disambiguated) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Sexual_orientation | 0.97 | 1.0 | 0.99 | 0.94 | 0.81 | 0.59 |
| SES | 0.95 | 1.0 | 1.0 | 0.97 | 0.89 | 0.7 |
| Religion | 0.87 | 0.98 | 0.95 | 0.92 | 0.87 | 0.83 |
| Race_x_gender | 0.98 | 1.0 | 1.0 | 0.95 | 0.91 | 0.91 |
| Race_x_SES | 0.95 | 1.0 | 1.0 | 0.98 | 0.94 | 0.99 |
| Race_ethnicity | **1.0** | **1.0** | **1.0** | **1.0** | 0.98 | 0.94 |
| Physical_appearance | 0.93 | 1.0 | 1.0 | 0.79 | 0.77 | 0.71 |
| Nationality | 0.95 | 1.0 | 1.0 | 0.98 | 0.95 | 0.91 |
| Gender_identity | 0.96 | 1.0 | 1.0 | 0.97 | 0.99 | 0.97 |
| Disability_status | 0.87 | 1.0 | 1.0 | 0.95 | 0.97 | 0.82 |
| Age | 0.7 | **1.0** | **1.0** | 0.97 | 0.97 | 0.93 |
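* **Trend Verification:** Accuracy is high in most cells, but the effect of the question condition differs by model. Gemini 1.0 Ultra is at or near 1.0 in every "disambiguated" cell (minimum 0.98, Religion) and is lower in the "ambiguous" condition, most visibly for Age (0.7). Gemini 1.5 Flash shows the opposite pattern: it is at or near 1.0 in the "ambiguous" condition (minimum 0.95, Religion) and drops in the "disambiguated" condition for 10 of 11 categories (lowest: Physical_appearance, 0.79). Gemini 1.5 Pro is also less accurate in the "disambiguated" condition for 9 of 11 categories, and its "disambiguated" column contains the lowest accuracies in the chart:
    * Sexual_orientation (0.59 for Gemini 1.5 Pro).
    * SES (0.7 for Gemini 1.5 Pro).
    * Physical_appearance (0.71 for Gemini 1.5 Pro).
    * Disability_status (0.82 for Gemini 1.5 Pro).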
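
To compare the two question conditions per model without relying on visual impressions, the accuracy table can be averaged over categories. This is a rough sketch under the assumption that an unweighted mean over the 11 categories is a reasonable summary (the categories may differ in size, which the figure does not show); `MODELS`, `ACCURACY`, and `mean_accuracy_by_condition` are hypothetical names.

```python
from statistics import mean

MODELS = ["Gemini 1.0 Ultra", "Gemini 1.5 Flash", "Gemini 1.5 Pro"]

# Per category: one (ambiguous, disambiguated) accuracy pair per model,
# in MODELS order, copied from the transcribed table above.
ACCURACY = {
    "Sexual_orientation":  [(0.97, 1.0), (0.99, 0.94), (0.81, 0.59)],
    "SES":                 [(0.95, 1.0), (1.0, 0.97), (0.89, 0.7)],
    "Religion":            [(0.87, 0.98), (0.95, 0.92), (0.87, 0.83)],
    "Race_x_gender":       [(0.98, 1.0), (1.0, 0.95), (0.91, 0.91)],
    "Race_x_SES":          [(0.95, 1.0), (1.0, 0.98), (0.94, 0.99)],
    "Race_ethnicity":      [(1.0, 1.0), (1.0, 1.0), (0.98, 0.94)],
    "Physical_appearance": [(0.93, 1.0), (1.0, 0.79), (0.77, 0.71)],
    "Nationality":         [(0.95, 1.0), (1.0, 0.98), (0.95, 0.91)],
    "Gender_identity":     [(0.96, 1.0), (1.0, 0.97), (0.99, 0.97)],
    "Disability_status":   [(0.87, 1.0), (1.0, 0.95), (0.97, 0.82)],
    "Age":                 [(0.7, 1.0), (1.0, 0.97), (0.97, 0.93)],
}


def mean_accuracy_by_condition(table):
    """Map model -> (mean ambiguous, mean disambiguated, disambiguated - ambiguous)."""
    summary = {}
    for i, model in enumerate(MODELS):
        ambiguous = mean(row[i][0] for row in table.values())
        disambiguated = mean(row[i][1] for row in table.values())
        summary[model] = (ambiguous, disambiguated, disambiguated - ambiguous)
    return summary


for model, (amb, dis, gap) in mean_accuracy_by_condition(ACCURACY).items():
    print(f"{model:17s} ambiguous={amb:.3f} disambiguated={dis:.3f} gap={gap:+.3f}")
```

A positive gap means the model is more accurate on disambiguated questions; a negative gap means it is more accurate on ambiguous ones.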
### Key Observations
1. **Bias-Accuracy Inverse Relationship:** There is a visible inverse relationship between bias scores and accuracy, particularly in the "ambiguous" condition. Categories with high bias scores (e.g., Age, Disability_status for Gemini 1.0 Ultra) often correspond to lower accuracy scores in the same condition.
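2. **Condition-Dependent Performance:** The effect of the question condition is model-dependent rather than uniform. Gemini 1.0 Ultra is more accurate and shows near-zero bias on "disambiguated" questions, whereas Gemini 1.5 Flash and Gemini 1.5 Pro are generally more accurate on "ambiguous" questions. For the 1.5 models, the "disambiguated" condition appears to be the harder test.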
3. **Model Evolution:** Newer models (1.5 Flash, 1.5 Pro) do not show a uniform improvement over 1.0 Ultra. While they sometimes have lower bias scores (e.g., in Age), they also sometimes show lower accuracy in the challenging "disambiguated" condition (e.g., Sexual_orientation, Physical_appearance).
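4. **Category Sensitivity:** Certain categories are challenging for the 1.5 models. "Sexual_orientation," "Physical_appearance," and "Disability_status" show clear accuracy drops in the disambiguated setting for Gemini 1.5 Pro (0.59, 0.71, and 0.82), and "Physical_appearance" also drops for Gemini 1.5 Flash (0.79), while Gemini 1.0 Ultra stays at 1.0 on all three. "Age" shows an extreme bias score outlier (27.5) for Gemini 1.0 Ultra in the ambiguous condition.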
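
The inverse relationship described in the first observation can be quantified for the Gemini 1.0 Ultra "ambiguous" column with a Pearson correlation. This is a sketch only: it uses the 11 transcribed values, the result is dominated by the Age outlier, and the helper name `pearson` and the two list names are illustrative.

```python
from math import sqrt

# Gemini 1.0 Ultra, "ambiguous" condition, in the category order of the
# tables above (Sexual_orientation ... Age).
ULTRA_AMBIGUOUS_BIAS = [2.08, 4.6, 8.0, 0.28, 4.28, 0.09, 5.96, 4.29, 4.0, 10.67, 27.5]
ULTRA_AMBIGUOUS_ACCURACY = [0.97, 0.95, 0.87, 0.98, 0.95, 1.0, 0.93, 0.95, 0.96, 0.87, 0.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


r = pearson(ULTRA_AMBIGUOUS_BIAS, ULTRA_AMBIGUOUS_ACCURACY)
print(f"Pearson r (bias vs. accuracy, 1.0 Ultra, ambiguous) = {r:.2f}")
```

A value close to -1 supports the inverse relationship for this column. With only 11 points it should be read as a descriptive check, not a statistical test.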
### Interpretation
This data suggests a tension in how the models handle social bias contexts. In BBQ, the correct answer to an "ambiguous" question is that the context does not provide enough information, so a model that falls back on a stereotype is both less accurate and more biased. The transcribed values are consistent with this: the cells with the highest ambiguous bias scores for Gemini 1.0 Ultra (Age, Disability_status, and Religion) are also the cells with its lowest ambiguous accuracy (0.7, 0.87, and 0.87). This implies the model may be relying on stereotypical or biased heuristics to answer questions where the context is unclear.
The "disambiguated" condition provides the context needed to answer, so errors there point to a failure to use the supplied information. Gemini 1.0 Ultra is near-perfect in this condition, while Gemini 1.5 Flash and especially Gemini 1.5 Pro lose accuracy, with Pro falling to 0.59 on Sexual_orientation. The corresponding bias scores are harder to interpret: the large negative values for Gemini 1.5 Pro (e.g., -8.53 for Disability_status) may indicate that the model is over-correcting against stereotypes or struggling to weigh the provided context against its internal biases.
The outlier in "Age" bias for Gemini 1.0 Ultra is particularly noteworthy, suggesting a severe and specific failure mode in that model's handling of age-related stereotypes in ambiguous contexts. Overall, the charts demonstrate that evaluating AI bias requires looking at multiple metrics (bias score *and* accuracy) across different contextual conditions, as high accuracy alone does not indicate fair or unbiased performance.