## Grid of Confidence Distribution Histograms: Model Performance and Calibration Across Datasets
### Overview
The image displays a 5x4 grid of histograms. Each row corresponds to an evaluation dataset, and each column corresponds to a large language model (LLM). Each histogram visualizes the distribution of that model's confidence scores on that dataset, separated into correct (blue) and incorrect (red) responses. Above each histogram, three performance metrics are reported: Accuracy (ACC), Area Under the Receiver Operating Characteristic curve (AUROC), and Expected Calibration Error (ECE).
### Components/Axes
* **Grid Structure:**
* **Rows (Datasets):** Labeled on the far left. From top to bottom: `GSM8K`, `DateUND`, `StrategyQA`, `Prf-Law`, `Biz-Ethics`.
* **Columns (Models):** Labeled at the top. From left to right: `GPT3`, `GPT3.5`, `GPT4`, `Vicuna`.
* **Individual Histogram Axes:**
* **X-axis:** Labeled `Confidence (%)`. The scale runs from 50 to 100, with major tick marks at 50, 60, 70, 80, 90, and 100.
* **Y-axis:** Labeled `Count`. The scale is not numerically marked, so bar heights convey only relative frequency.
* **Legend:** Present in the top-left corner of each histogram. A red square denotes `wrong answer`, and a blue square denotes `correct answer`.
* **Metrics Header:** Above each histogram, a text string provides three metrics in the format: `ACC [value] / AUROC [value] / ECE [value]`.
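The three header metrics can be recomputed from per-answer data. The sketch below is a minimal NumPy illustration, not the figure authors' code; the function name and the equal-width 10-bin ECE scheme are my own assumptions, since the figure does not state its binning.

```python
import numpy as np

def confidence_metrics(conf, correct, n_bins=10):
    """Sketch of the three header metrics from per-answer confidences.

    conf    -- confidence scores in [0, 1] (the figure's x-axis / 100)
    correct -- True where the model's answer was correct
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # ACC: fraction of correct answers.
    acc = correct.mean()

    # AUROC via the rank-sum (Mann-Whitney) formulation: the probability
    # that a randomly chosen correct answer has higher confidence than a
    # randomly chosen incorrect one, counting ties as 0.5.
    n_pos, n_neg = int(correct.sum()), int((~correct).sum())
    if n_pos == 0 or n_neg == 0:
        auroc = float("nan")  # undefined with only one class
    else:
        order = conf.argsort()
        ranks = np.empty(len(conf))
        ranks[order] = np.arange(1, len(conf) + 1)
        for v in np.unique(conf):                 # average ranks over ties
            ranks[conf == v] = ranks[conf == v].mean()
        auroc = (ranks[correct].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    # ECE: partition into equal-width confidence bins and take the
    # frequency-weighted gap between accuracy and mean confidence per bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())

    return acc, auroc, ece
```

For example, a model that answers everything at 100% confidence but is right only a quarter of the time gets ACC 0.25, a chance-level AUROC of 0.5 (all confidences tie), and a large ECE of 0.75, which mirrors the overconfident patterns in the first row of the figure.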
### Detailed Analysis
Below is a systematic extraction of the metrics and a description of the histogram shape for each model-dataset combination. The histogram description notes where the bulk of the data (both correct and incorrect answers) is concentrated on the confidence scale.
**Row 1: GSM8K Dataset**
* **GPT3:** ACC 0.15 / AUROC 0.51 / ECE 0.83. Histogram: Nearly all responses (both correct and incorrect) are concentrated at 100% confidence. A very small number of incorrect answers appear around 80% confidence.
* **GPT3.5:** ACC 0.28 / AUROC 0.65 / ECE 0.66. Histogram: Responses are split between 90% and 100% confidence. Incorrect answers dominate at 90%, while correct answers are more prevalent at 100%.
* **GPT4:** ACC 0.47 / AUROC 0.66 / ECE 0.51. Histogram: The majority of responses are at 100% confidence, with a significant portion being correct. A smaller cluster exists at 90% confidence, with a mix of correct and incorrect answers.
* **Vicuna:** ACC 0.02 / AUROC 0.46 / ECE 0.77. Histogram: Almost all responses are incorrect and concentrated at 80% confidence. A tiny fraction of incorrect answers appear at 50%, 60%, 70%, 90%, and 100%.
**Row 2: DateUND Dataset**
* **GPT3:** ACC 0.18 / AUROC 0.50 / ECE 0.82. Histogram: Similar to GSM8K, nearly all responses are at 100% confidence, predominantly incorrect.
* **GPT3.5:** ACC 0.47 / AUROC 0.65 / ECE 0.48. Histogram: Responses are distributed across 80%, 90%, and 100% confidence. The largest group is incorrect answers at 90%. Correct answers are most frequent at 100%.
* **GPT4:** ACC 0.83 / AUROC 0.64 / ECE 0.15. Histogram: Overwhelmingly, responses are correct and at 100% confidence. A very small number of incorrect answers appear at 80% and 90%.
* **Vicuna:** ACC 0.01 / AUROC 0.48 / ECE 0.79. Histogram: Almost all responses are incorrect and concentrated at 80% confidence. A minuscule number of incorrect answers are at 90%.
**Row 3: StrategyQA Dataset**
* **GPT3:** ACC 0.58 / AUROC 0.49 / ECE 0.42. Histogram: The vast majority of responses are at 100% confidence, with a roughly even split between correct and incorrect answers.
* **GPT3.5:** ACC 0.66 / AUROC 0.53 / ECE 0.26. Histogram: Most responses are at 90% confidence, with a significant number of both correct and incorrect answers. A smaller group of correct answers is at 100%.
* **GPT4:** ACC 0.76 / AUROC 0.55 / ECE 0.16. Histogram: Responses are concentrated at 90% and 100% confidence. At 90%, there is a mix, but correct answers are more frequent. At 100%, answers are almost entirely correct.
* **Vicuna:** ACC 0.47 / AUROC 0.51 / ECE 0.42. Histogram: Responses are split between 80% and 90% confidence. At 80%, answers are mostly incorrect. At 90%, there is a mix, with correct answers being slightly more frequent.
**Row 4: Prf-Law Dataset**
* **GPT3:** ACC 0.47 / AUROC 0.47 / ECE 0.48. Histogram: Responses are spread across 80%, 90%, and 100% confidence. The largest group is incorrect answers at 100%. Correct answers are most frequent at 90%.
* **GPT3.5:** ACC 0.46 / AUROC 0.50 / ECE 0.44. Histogram: The vast majority of responses are at 90% confidence, with a substantial number of both correct and incorrect answers.
* **GPT4:** ACC 0.68 / AUROC 0.61 / ECE 0.17. Histogram: Responses are concentrated at 80% and 90% confidence. At 80%, answers are mostly incorrect. At 90%, correct answers are dominant.
* **Vicuna:** ACC 0.26 / AUROC 0.53 / ECE 0.45. Histogram: Most responses are incorrect and at 80% confidence. A smaller cluster of mixed correct/incorrect answers appears at 90%.
**Row 5: Biz-Ethics Dataset**
* **GPT3:** ACC 0.63 / AUROC 0.56 / ECE 0.32. Histogram: Responses are at 90% and 100% confidence. At 90%, answers are mostly correct. At 100%, there is a mix, with incorrect answers being more frequent.
* **GPT3.5:** ACC 0.67 / AUROC 0.55 / ECE 0.23. Histogram: The vast majority of responses are correct and at 90% confidence. A very small number of incorrect answers are at 100%.
* **GPT4:** ACC 0.81 / AUROC 0.68 / ECE 0.09. Histogram: Overwhelmingly, responses are correct and at 90% confidence. A tiny fraction of incorrect answers appear at 80% and 100%.
* **Vicuna:** ACC 0.34 / AUROC 0.58 / ECE 0.35. Histogram: Responses are split between 80% and 90% confidence. At 80%, answers are mostly incorrect. At 90%, there is a mix, with correct answers being more frequent.
### Key Observations
1. **Model Performance Hierarchy:** GPT4 consistently achieves the highest Accuracy (ACC) across all datasets, followed generally by GPT3.5, then GPT3, with Vicuna performing the worst on most tasks.
2. **Calibration (ECE):** GPT4 also demonstrates the best calibration (lowest ECE) in most cases, particularly on `DateUND` (0.15) and `Biz-Ethics` (0.09). High ECE values (e.g., GPT3 on GSM8K: 0.83) indicate poor calibration, where the model's confidence does not match its actual accuracy.
3. **Confidence Distributions:**
* **Overconfidence:** GPT3 and GPT3.5 show a strong tendency to assign very high confidence (90-100%) to their answers, even when incorrect (e.g., GPT3 on GSM8K and DateUND).
* **Vicuna's Pattern:** Vicuna often clusters its (mostly incorrect) answers around 80% confidence, suggesting a different, but still poorly calibrated, confidence profile.
* **GPT4's Shift:** GPT4's correct answers are often concentrated at the highest confidence bin (100%), but it also shows meaningful distributions at 90% on some tasks (e.g., StrategyQA, Prf-Law, Biz-Ethics), indicating more nuanced confidence estimation.
4. **Dataset Difficulty:** The `GSM8K` and `DateUND` datasets appear particularly challenging for the weaker models (GPT3, Vicuna), as shown by their very low accuracy scores.
### Interpretation
This grid provides a multifaceted view of LLM performance, moving beyond simple accuracy to examine **model calibration**—the alignment between a model's stated confidence and its probability of being correct.
* **What the data suggests:** The progression from GPT3 to GPT4 shows not just an improvement in raw accuracy, but a significant improvement in calibration. A well-calibrated model (low ECE) is more reliable and trustworthy, as its confidence score can be used as a meaningful indicator of likely correctness. GPT4's low ECE on tasks like `Biz-Ethics` (0.09) suggests its confidence is a very good proxy for accuracy on that domain.
* **Relationship between elements:** The histograms visually explain the ECE metric. A high ECE (e.g., 0.83) corresponds to a histogram where high-confidence bins (like 100%) contain a large proportion of incorrect answers (red bars). A low ECE (e.g., 0.09) corresponds to a histogram where high-confidence bins are dominated by correct answers (blue bars).
* **Notable anomalies and trends:** The stark contrast between Vicuna's performance/calibration and the GPT series is notable. Vicuna's consistent clustering of incorrect answers at ~80% confidence, regardless of dataset, may indicate a systemic issue in its confidence estimation mechanism or training. Furthermore, the variation in performance across datasets (e.g., GPT4's ACC ranging from 0.47 on `GSM8K` to 0.83 on `DateUND`) highlights that model capability is not uniform but highly dependent on the task domain. This analysis is crucial for applications where knowing *when a model is likely wrong* is as important as its overall accuracy.
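The correspondence between histogram shape and ECE can be illustrated with toy per-bin arithmetic (illustrative numbers, not read from the figure): each confidence bin contributes its share of the answers times the gap between accuracy and stated confidence.

```python
# Toy single-bin ECE contributions (illustrative numbers, not from the figure).
# Each bin contributes (fraction of answers in bin) * |accuracy - mean confidence|.

# Overconfident: every answer claims 100% confidence but only 20% are correct,
# i.e. a tall red bar at 100 -- the contribution is large.
ece_overconfident = 1.0 * abs(0.20 - 1.00)   # 0.80

# Well calibrated: answers claim 90% confidence and 90% are correct,
# i.e. mostly blue bars at 90 -- the contribution vanishes.
ece_calibrated = 1.0 * abs(0.90 - 0.90)      # 0.00
```

This is exactly the visual signature described above: red bars piled into the top confidence bin drive ECE toward the figure's worst values, while blue bars at a confidence level matching the accuracy drive it toward zero.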