2406.08391
# Large Language Models Must Be Taught to Know What They Don't Know
**Authors**:
- Sanyam Kapoor* (New York University)
- Nate Gruver* (New York University)
- Manley Roberts (Abacus AI)
- Katherine Collins (Cambridge University)
- Arka Pal (Abacus AI)
- Umang Bhatt (New York University)
- Adrian Weller (Cambridge University)
- Samuel Dooley (Abacus AI)
- Micah Goldblum (Columbia University)
- Andrew Gordon Wilson (New York University)
> *Equal contribution. Order decided by coin flip.
## Abstract
When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also to the uncertainties of other models. Lastly, through a user study, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings.
### 1 Introduction
"I have high cortisol but low ACTH on a dexamethasone suppression test. What should I do?" If the answer to such a question is given without associated confidence, it is not actionable, and if the answer is presented with erroneously high confidence, then acting on it is dangerous. Whether large language models (LLMs) can benefit society and reliably be used for decision making hinges in large part on whether they can accurately represent uncertainty over the correctness of their output.
There is anything but consensus on whether LLMs accurately represent uncertainty, or even on how we should approach uncertainty representation with language models. Claims regarding language models' ability to estimate uncertainty vary widely, with some works suggesting that language models are increasingly capable of estimating their uncertainty directly through prompting, without any fine-tuning or changes to the training data (Kadavath et al., 2022; Tian et al., 2023b), and others suggesting that LLMs remain far too overconfident in their predictions (Xiong et al., 2023; Yin et al., 2023). Uncertainty estimation in LLMs is further complicated by linguistic variation in freeform generation, which cannot be exhaustively accounted for during training. LLM practitioners are therefore faced with the challenge of deciding which estimation method to use.
One particular dichotomy in uncertainty estimation methods for language models centers around whether the estimates are black- or white-box. Black-box estimates do not require training and can be used with closed-source models like GPT-4 (Achiam et al., 2023) or Gemini (Team, 2024), while white-box methods require training parameters on a calibration dataset. Although black-box estimates have become popular with the rise of restricted models, the increased availability of strong open-source models, such as LLaMA (Touvron et al., 2023b) or Mistral (Jiang et al., 2023), has made more effective white-box methods broadly accessible.
In this paper, we perform a deep investigation into uncertainty calibration of LLMs, with findings that advance the debate about necessary interventions for good calibration. In particular, we consider whether it's possible to have good uncertainties over correctness (rather than tokens) without intervention, how we can best use labeled correctness examples, how well uncertainty generalizes across distribution shifts, and how we can use LLM uncertainty to assist human decision making.
First, we find that fine-tuning for better uncertainties (Figure 1) provides faster and more reliable uncertainty estimates, while using a relatively small number of additional parameters. The resulting uncertainties also generalize to new question types and tasks, beyond what is present in the fine-tuning dataset. We further provide a guide to teaching language models to know what they don't know using a calibration dataset. Contrary to prior work, we start by showing that current zero-shot, black-box methods are ineffective or impractically expensive in open-ended settings (Section 4). We then show how to fine-tune a language model for calibration, exploring the most effective parameterization (e.g. linear probes vs LoRA) and the amount of data required for good generalization (Section 5). To test generalization, we evaluate uncertainty estimates on questions with similar formatting to the calibration data as well as questions that test robustness to significant distribution shifts. Lastly, we consider the underlying mechanisms that enable fine-tuning LLMs to estimate their own uncertainties, showing ultimately that models can be used not just to estimate their own uncertainties but also the uncertainties of other models (Section 6). Beyond offline evaluation, if language models are to have a broad societal impact, it will be through assisting with human decision making. We conduct a user study demonstrating ways LLM uncertainty can affect AI-human collaboration (Section 7). Our code is available at https://github.com/activatedgeek/calibration-tuning.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: AI Model Fine-Tuning Process and Performance Metrics
### Overview
The image is a technical diagram illustrating a process for evaluating and improving a Large Language Model (LLM). It depicts a three-stage workflow: 1) An initial problematic model response, 2) A data curation and fine-tuning pipeline, and 3) A comparative performance evaluation of different model versions using two key metrics.
### Components/Axes
The diagram is segmented into three primary regions from left to right:
**1. Left Region: Initial Model Interaction**
* **Components:** Two dialogue boxes featuring a robot icon (representing an AI/LLM).
* **Text Content:**
* **User Query (Top Box):** "What's the key to a delicious pizza sauce?"
* **Model Response (Middle Box):** "Add non-toxic glue for tackiness"
* **Follow-up Query (Bottom Box):** "What's your confidence?"
* **Model Confidence (Bottom Box):** "100%"
**2. Center Region: Fine-Tuning Pipeline**
* **Components:** A stack of three cards labeled "Graded Dataset," an arrow labeled "Fine-Tuning," and a final box labeled "LLM."
* **Text Content & Flow:**
* **Graded Dataset Cards:** Each card shows a "Question" and an "Answer." The top card is partially obscured. The middle card shows "Answer" with a "Yes" (in green) and "No" (in red) below it. The bottom card shows "Question," "Answer," and the prompt "Is the answer correct?" with a "No" (in red) selected.
* **Process Arrow:** A purple arrow points from the "Graded Dataset" stack to the "LLM" box, labeled "Fine-Tuning."
* **Output:** The final box is labeled "LLM," representing the fine-tuned model.
**3. Right Region: Performance Evaluation Charts**
* **Components:** Two horizontal bar charts sharing a common legend.
* **Legend (Position: Top of the chart area):** Five categories are listed, each associated with a color/pattern:
* **Zero-Shot** (Light gray bar)
* **Classifier** (Medium gray bar)
* **Verbalized** (Dark gray bar)
* **Sampling** (Light purple bar)
* **Fine-Tuned** (Dark purple bar)
* **Chart 1 (Left):**
* **Title/Axis Label:** "ECE ā" (Expected Calibration Error, with a down arrow indicating lower is better).
* **X-Axis:** Percentage scale from 0% to 40%.
* **Data Points (Approximate Values):**
* Zero-Shot: ~38%
* Classifier: ~35%
* Verbalized: ~32%
* Sampling: ~28%
* Fine-Tuned: ~5%
* **Chart 2 (Right):**
* **Title/Axis Label:** "AUROC ā" (Area Under the Receiver Operating Characteristic Curve, with an up arrow indicating higher is better).
* **X-Axis:** Percentage scale from 50% to 70%.
* **Data Points (Approximate Values):**
* Zero-Shot: ~52%
* Classifier: ~55%
* Verbalized: ~58%
* Sampling: ~62%
* Fine-Tuned: ~68%
### Detailed Analysis
The diagram presents a clear narrative of identifying and correcting a model failure mode.
* **Problem Identification (Left):** The initial LLM provides a confidently wrong (100% confidence) and potentially harmful answer ("Add non-toxic glue") to a common-sense question. This highlights a failure in both factual accuracy and calibration (overconfidence).
* **Solution Process (Center):** A "Graded Dataset" is constructed. The visible cards imply this dataset contains questions, answers, and human or automated judgments on answer correctness ("Is the answer correct? No"). This curated dataset is used for "Fine-Tuning" the LLM.
* **Outcome Measurement (Right):** The performance of the "Fine-Tuned" model is compared against four baseline methods (Zero-Shot, Classifier, Verbalized, Sampling) on two metrics:
* **ECE (Calibration):** The Fine-Tuned model shows a dramatic reduction in calibration error (from ~38% to ~5%), indicating its confidence scores now align much better with its actual accuracy.
* **AUROC (Discriminative Performance):** The Fine-Tuned model also achieves the highest AUROC score (~68%), indicating superior ability to distinguish between correct and incorrect answers compared to all other methods.
### Key Observations
1. **Significant Improvement:** The Fine-Tuned model outperforms all other methods on both metrics by a substantial margin.
2. **Calibration is the Major Gain:** The most striking improvement is in ECE, where the error drops by over 30 percentage points. This directly addresses the overconfidence problem shown on the left.
3. **Progressive Baseline Improvement:** Among the non-fine-tuned methods, there is a general trend of improvement from Zero-Shot to Sampling on both metrics, with Sampling being the strongest baseline.
4. **Visual Correlation:** The color coding consistently links the "Fine-Tuned" label (dark purple) to the best-performing bars in both charts.
### Interpretation
This diagram argues for the effectiveness of fine-tuning on a graded dataset as a method to improve both the **reliability** and **performance** of an LLM.
* **The Core Problem:** The initial example isn't just about a wrong answer; it's about a model that is *confidently wrong*. This is a critical safety and reliability issue for AI systems.
* **The Proposed Solution:** The process suggests that moving beyond simple prompt-based methods (Zero-Shot, Verbalized) or auxiliary classifiers, and instead directly fine-tuning the model on data that explicitly judges answer correctness, is highly effective.
* **The Evidence:** The charts provide quantitative evidence. The massive drop in ECE shows the fine-tuned model "knows what it doesn't know," making its confidence a trustworthy signal. The rise in AUROC shows it also became better at the fundamental task of judging answer quality.
* **Broader Implication:** The diagram implies that for tasks requiring calibrated confidence (e.g., medical advice, factual Q&A, safety-critical applications), fine-tuning on graded data is a superior approach to other common techniques. It transforms the model from a fluent but unreliable text generator into a more reliable reasoning engine.
</details>
Figure 1: Large language models struggle to assign reliable confidence estimates to their generations. We study the properties of uncertainty calibration in language models, and propose fine-tuning for better uncertainty estimates using a graded dataset of generations from the model. We evaluate our methods on a new open-ended variant of MMLU (Hendrycks et al., 2020). We show that fine-tuning improves expected calibration error (ECE) and area under the receiver operating characteristic curve (AUROC) compared to commonly-used baselines. Error bars show standard deviation over three base models (LLaMA-2 13/7B and Mistral 7B) and their chat variants.
### 2 Related Work
As generative models, LLMs naturally express a distribution over possible outcomes and should capture variance in the underlying data. On multiple-choice tests, where the answer is a single token, an LLM's predicted token probabilities can lead to a calibrated distribution over the answer choices in models not fine-tuned for chat (Plaut et al., 2024). However, when answers consist of entire sentences, language model likelihoods become a less reliable indicator of uncertainty because probabilities must be spread over many phrasings of the same concept. Kuhn et al. (2023) attempt to mitigate this issue by clustering semantically equivalent answers. However, these methods are hindered by their substantial computational overhead: accounting for equivalent phrasings of the same semantic content requires enumerating a large space of sentences and clustering for semantic similarity with an auxiliary model.
Because LLMs are trained on text written by humans, it is possible for them to learn concepts like "correctness" and probabilities, and to express uncertainty through these abstractions. Leveraging this observation, Kadavath et al. (2022) and Tian et al. (2023b) show that careful prompting can produce uncertainty estimates in text that grow more calibrated as model capabilities increase. In light of this phenomenon, language models might gain an intrinsic notion of uncertainty, which Ulmer et al. (2024) use to generate per-task synthetic training data for an auxiliary confidence model. In the same vein, Burns et al. (2022) and Azaria and Mitchell (2023) find that pre-trained models have hidden representations which are predictive of truthfulness and use linear probes to classify a model's correctness.
While these studies suggest a promising trend towards calibration, we find that the story is slightly more complicated. Black-box methods often fail to generate useful uncertainties for popular open-source models, and a careful fine-tuning intervention is necessary. In this way, our findings are closer to those of Xiong et al. (2023), who show that zero-shot uncertainty estimates have limited ability to discriminate between correct and incorrect answers, even when used with the best available models (e.g., GPT-4). We go further by showing that black-box methods struggle on open-ended generation, which is both practically important and defined by different challenges than multiple choice evaluations from prior work. Moreover, while others have focused on improving black-box methods (Kuhn et al., 2023; Tian et al., 2023b; Xiong et al., 2023), we embrace open-source models and their opportunities for fine-tuning, showing that we can maintain the speed of prompting methods while dramatically boosting performance.
Our work also contrasts with prior work on fine-tuning for uncertainties in several key ways. While we build on prior work from Lin et al. (2022) and Zhang et al. (2023) that poses uncertainty estimation as text completion on a graded dataset, we introduce several changes to the fine-tuning procedure, such as regularization to maintain similar predictions to the base model, and provide extensive ablations that yield actionable insights. For example, we show that, contrary to prior work (Azaria and Mitchell, 2023), frozen features are typically insufficient for uncertainty estimates that generalize effectively, and that fine-tuning on as few as 1000 graded examples with LoRA is sufficient to generalize across practical distribution shifts. Also unlike prior work, we provide many insights into the relative performance of fine-tuning compared to black-box methods, introducing a new open-ended evaluation and showing that it displays fundamentally different trends than prior work on multiple choice questions. Although Kadavath et al. (2022) also consider calibration for multiple choice questions, many of our conclusions differ. For example, while Kadavath et al. (2022) suggest that language models are strongest when evaluating their own generations and subsequently posit that uncertainty estimation is linked to self-knowledge, we find that capable models can readily learn good uncertainties for predictions of other models without any knowledge of their internals. Lastly, while many works motivate their approach with applications to human-AI collaboration, none of them test their uncertainty estimates on actual users, as we do here.
### 3 Preliminaries
Question answering evaluations.
In all experiments, we use greedy decoding to generate answers conditioned on questions with few-shot prompts. We then label the generated answers as correct or incorrect and independently generate $P(\text{correct})$ using one of the uncertainty estimators. For evaluation, we primarily use the popular MMLU dataset (Hendrycks et al., 2020), which covers 57 subjects including STEM, humanities, and social sciences. Crucially, however, we expand the original multiple choice (MC) setting with a new open-ended (OE) setting. In the open-ended setting, we do not provide answer choices, and the language model must generate an answer that matches the ground truth answer choice. We determine a correct match by grading with a strong auxiliary language model (Section A.2). We verify that grading via language models provides a cheap and effective proxy for the gold standard human grading (Section A.3), consistent with related findings (Chiang and Lee, 2023).
Metrics. A model that assigns confidence $p$ to an answer is well-calibrated if its answer is correct $p$ percent of the time it assigns that confidence. Calibration is typically measured using expected calibration error (ECE) (Naeini et al., 2015), which compares empirical frequencies with estimated probabilities through binning (Section A.4). A lower ECE is better, and an ECE of $0$ corresponds to a perfectly calibrated model. In addition to calibration, we measure the area under the receiver operating characteristic curve (AUROC) of the model's confidence. High AUROC indicates the ability to separate answers likely to be correct from answers likely to be incorrect, a setting typically called selective prediction.
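As a concrete illustration, the binned ECE described above can be computed as follows. This is a minimal sketch with equal-width bins; the function name, bin count, and weighting are illustrative choices, not necessarily the paper's exact implementation (Section A.4):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins: a weighted average of
    |empirical accuracy - mean confidence| within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

For example, a model that says "80% confident" on ten answers and gets exactly eight of them right contributes zero calibration error, while a model that says "100%" and is always wrong has an ECE of 1.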
Temperature scaling. Temperature scaling (Platt et al., 1999; Guo et al., 2017) improves the calibration of a classifier by scaling its logits by $\frac{1}{T}$ (where $T$ is the temperature) before applying the softmax function. A high temperature scales the softmax probabilities towards a uniform distribution, while a low temperature collapses the distribution around the most probable output. The temperature parameter is learned on held-out data, typically taken from the same distribution as the training set.
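The temperature fit can be sketched in a few lines. This is an illustrative implementation with hypothetical names (`fit_temperature`, `softmax`); a simple grid search over $T$ stands in for the usual gradient-based fit, and the held-out negative log-likelihood is the selection criterion:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Choose T minimizing held-out negative log-likelihood of softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    if grid is None:
        grid = np.linspace(0.1, 5.0, 50)

    def nll(T):
        probs = softmax(logits / T)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

    return min(grid, key=nll)
```

If the held-out logits are overconfident (e.g., accuracy near chance but sharply peaked probabilities), the fitted $T$ is pushed upward, flattening the distribution; a well-calibrated or underconfident model instead receives $T \leq 1$.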
### 4 Do We Get Good Uncertainties Out-of-the-Box?
In this section, we focus on black-box methods for estimating a language model's uncertainty. (Here we consider access to a model's samples and token-level likelihoods as black-box; some models do not expose likelihoods directly, but they can be approximated through sampling.) Due to computational cost, we focus on methods that require a single sample or forward pass, and only consider sampling-based methods in the next section.
For multiple choice tasks, a language modelās distribution over answers is a categorical distribution as each answer choice is a single token. Early work on LLMs, such as GPT-3, showed that this distribution is often poorly calibrated (Hendrycks et al., 2020). Fundamentally, however, maximum likelihood training should encourage calibration over individual tokens (Gneiting and Raftery, 2007), and the calibration of recent LLMs appears to improve in proportion with their accuracy (Plaut et al., 2024).
In open-ended generation, on the other hand, answers are not limited to individual tokens or a prescribed set of possibilities, which introduces multiple sources of uncertainty. The probability assigned to an answer can be low not just because it is unlikely to be conceptually correct, but because probability mass must be spread over many possible phrasings of the same content (and normalization is intractable), or because the answer is an unusual phrasing of the correct information; in these cases the uncertainty concerns a particular sequence of tokens, not correctness. For example, imagine a multiple-choice test in which we add an additional answer choice that is a synonym of another. A sensible language model would assign equal likelihood to each choice, lowering the probability it assigns to either individually. In open-ended generation the situation is similar, but even more challenging because of variable length: adding extra tokens can artificially lower the likelihood of an answer even when it expresses the same concept, as a sequence of tokens becomes less likely with increasing length.
We demonstrate the difference between multiple-choice question answering and open-ended generation in Figure 2 (left), where we compare the AUROC of a likelihood-based method for standard MMLU and open-ended MMLU (ours). For open-ended generations, we use perplexity, $\text{PPL}(s)=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(s_{i}\mid s_{<i})\right)$, where $s$ is the tokenized sequence, because it is a length-normalized metric commonly used when token-level probabilities are exposed by the model (Hills and Anadkat, 2023). From the AUROCs, we observe that while token-level uncertainties often improve in multiple choice as models improve, perplexity is generally not predictive of a language model's correctness in open-ended settings and does not exhibit the same favorable scaling with the language model's underlying ability.
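Given the per-token log-probabilities that many APIs expose, the perplexity above is a one-liner (a minimal sketch; the function name is illustrative):

```python
import math

def perplexity(token_logprobs):
    """Length-normalized perplexity of a generated answer from the
    per-token log-probabilities log p(s_i | s_<i)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Length normalization removes part of the confound but not all of it: a stream of uniformly 0.5-probability tokens has perplexity 2 regardless of length, yet appending tokens that are less likely than the running geometric mean raises the perplexity even when they do not change the answer's meaning.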
Because sequence likelihood (or perplexity) is limited as a confidence measure, prompting methods have become an increasingly popular alternative. Lin et al. (2022) introduced the following formats, which lay the foundation for recent work (Tian et al., 2023b; Zhang et al., 2023):
| Name | Format | Confidence |
| --- | --- | --- |
| Zero-Shot Classifier | "Question. Answer. True/False: True" | P("True") / (P("True") + P("False")) |
| Verbalized | "Question. Answer. Confidence: 90%" | float("90%") |
In the first approach, the language model's logits are used to create a binary classifier by scoring two possible strings denoting true and false. Similarly, in Kadavath et al. (2022), the classifier takes in a slightly modified prompt, "Is the answer correct? (a) Yes (b) No", and confidence is then computed as P("(a)") / (P("(a)") + P("(b)")). In the second approach (also used in Tian et al. (2023b) and Xiong et al. (2023)), uncertainty estimates are sampled as text and then converted into numbers. We provide the extended details in Section B.2.
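The two confidence computations above reduce to simple post-processing of model outputs. The sketch below is illustrative (the function names and the digit-stripping parser are assumptions, not the paper's implementation); note that renormalizing over just the two candidate tokens is a sigmoid of their logit difference:

```python
import math

def classifier_confidence(logit_true, logit_false):
    """P("True") / (P("True") + P("False")) from the two token logits.
    Renormalizing over two candidates reduces to a sigmoid of the difference."""
    return 1.0 / (1.0 + math.exp(logit_false - logit_true))

def verbalized_confidence(completion):
    """Parse a sampled string such as "Confidence: 90%" into a probability,
    i.e. float("90%") after stripping non-numeric characters."""
    digits = "".join(ch for ch in completion if ch.isdigit() or ch == ".")
    return float(digits) / 100.0
```

Equal logits for "True" and "False" yield a confidence of 0.5, and a sampled "Confidence: 90%" parses to 0.9; in practice the verbalized parser also needs to handle refusals and malformed completions.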
<details>
<summary>x2.png Details</summary>

### Visual Description
## Scatter Plot Comparison: Max Softmax Probability vs. Negative Perplexity
### Overview
The image displays two side-by-side scatter plots comparing the relationship between model **Accuracy** (x-axis) and **AUROC** (y-axis) under two different evaluation metrics: **Max Softmax Probability** (left chart) and **Negative Perplexity** (right chart). Each plot contains approximately 15-20 data points (black dots), a linear regression trend line (solid black), and a shaded gray area representing the confidence interval.
### Components/Axes
**Common Elements:**
* **Y-Axis Label (Both Charts):** `AUROC`
* **X-Axis Label (Both Charts):** `Accuracy`
* **Data Representation:** Black circular markers for individual data points.
* **Trend Line:** Solid black line representing a linear fit.
* **Uncertainty Band:** Shaded gray area around the trend line, indicating the confidence interval.
**Left Chart: "Max Softmax Prob"**
* **Title:** `Max Softmax Prob`
* **Y-Axis Scale:** Ranges from 60% to 80%, with major ticks at 60%, 70%, and 80%.
* **X-Axis Scale:** Ranges from 45% to 75%, with major ticks at 45%, 60%, and 75%.
**Right Chart: "Neg. Perplexity"**
* **Title:** `Neg. Perplexity`
* **Y-Axis Scale:** Ranges from 60% to 65%, with major ticks at 60% and 65%.
* **X-Axis Scale:** Ranges from 40% to 60%, with major ticks at 40%, 50%, and 60%.
### Detailed Analysis
**Left Chart (Max Softmax Prob):**
* **Trend Verification:** The data points and trend line show a clear **positive correlation**. As Accuracy increases, AUROC also increases.
* **Data Point Distribution:** Points are scattered around the trend line. The lowest accuracy point is near (45%, ~65% AUROC). The highest accuracy point is near (75%, ~78% AUROC). The cluster is densest between 55%-65% Accuracy and 70%-75% AUROC.
* **Trend Line:** The line has a steep positive slope, starting near (45%, 66%) and ending near (75%, 78%).
* **Confidence Interval:** The shaded band is relatively narrow, suggesting a stronger correlation and more consistent relationship between the variables in this metric.
**Right Chart (Neg. Perplexity):**
* **Trend Verification:** The data points and trend line show a **slight negative correlation**. As Accuracy increases, AUROC shows a very mild decrease.
* **Data Point Distribution:** Points are more widely scattered compared to the left chart. There is a notable outlier at approximately (55%, 57% AUROC), which is the lowest point on the graph. The highest AUROC point is near (50%, 65%).
* **Trend Line:** The line has a shallow negative slope, starting near (40%, 64%) and ending near (60%, 62%).
* **Confidence Interval:** The shaded band is wider, especially at the extremes of the x-axis, indicating greater uncertainty in the trend, likely due to the higher variance and the outlier.
### Key Observations
1. **Divergent Trends:** The most significant observation is the opposing relationship between Accuracy and AUROC under the two metrics. Max Softmax Probability shows a strong positive link, while Negative Perplexity shows a weak negative link.
2. **Scale Difference:** The AUROC range for the "Neg. Perplexity" chart (60-65%) is much narrower than for the "Max Softmax Prob" chart (60-80%), compressing the visual spread of data.
3. **Data Consistency:** The data in the left chart is more tightly clustered around its trend line, suggesting a more predictable relationship. The right chart's data is noisier.
4. **Outlier:** The data point at ~55% Accuracy and ~57% AUROC in the "Neg. Perplexity" chart is a clear outlier, pulling the trend line down and widening the confidence interval.
### Interpretation
This comparison suggests that the choice of evaluation metric fundamentally changes the perceived relationship between a model's classification **Accuracy** and its discriminative ability as measured by **AUROC**.
* **Max Softmax Prob (Left):** This metric likely uses the confidence of the model's top prediction. The strong positive trend indicates that models which are both more accurate *and* more confident in their correct predictions achieve a higher AUROC. This is an intuitive and desirable alignment of metrics.
* **Neg. Perplexity (Right):** Perplexity measures how well a probability model predicts a sample. Using its negative flips the scale. The weak negative trend is counter-intuitive and suggests a potential trade-off or a different aspect of model behavior. It might indicate that models optimized for raw accuracy (perhaps via techniques like label smoothing) could have slightly worse AUROC when evaluated via this specific probabilistic metric. The outlier highlights that this relationship is not stable across all models or training conditions.
**Conclusion:** The data demonstrates that AUROC is not a monolithic metric; its correlation with accuracy is highly dependent on the underlying probabilistic output used for evaluation. For technical reporting, it is crucial to specify the exact method (e.g., max softmax vs. negative perplexity) when presenting AUROC results alongside accuracy, as they can tell very different stories about model performance.
</details>
<details>
<summary>x3.png Details</summary>

### Visual Description
## Scatter Plots: Zero-Shot Classifier vs. Verbal vs. Fine-tune Performance
### Overview
The image displays two side-by-side scatter plots comparing the performance of three classification methods: "Zero-Shot Classifier," "Verbal," and "Fine-tune." The left plot evaluates Expected Calibration Error (ECE) against Accuracy, while the right plot evaluates Area Under the ROC Curve (AUROC) against Accuracy. Each plot includes individual data points, a linear regression trend line with a shaded confidence interval for the first two methods, and a horizontal dashed reference line for the "Fine-tune" method.
### Components/Axes
* **Legend:** Positioned at the top center of the entire figure.
* Pink circle: `Zero-Shot Classifier`
* Blue circle: `Verbal`
* Black dashed line: `Fine-tune`
* **Left Plot (ECE vs. Accuracy):**
* **Y-axis:** Label is `ECE`. Scale ranges from 0% to 60%, with major ticks at 0%, 20%, 40%, 60%.
* **X-axis:** Label is `Accuracy`. Scale ranges from 35% to 50%, with major ticks at 35%, 40%, 45%, 50%.
* **Right Plot (AUROC vs. Accuracy):**
* **Y-axis:** Label is `AUROC`. Scale ranges from 50% to 70%, with major ticks at 50%, 60%, 70%.
* **X-axis:** Label is `Accuracy`. Scale ranges from 35% to 50%, with major ticks at 35%, 40%, 45%, 50%.
* **Data Series & Visual Elements:**
* **Zero-Shot Classifier (Pink):** Individual pink dots scattered across the plot area. A solid pink regression line with a light pink shaded confidence interval is drawn through the data.
* **Verbal (Blue):** Individual blue dots scattered across the plot area. A solid blue regression line with a light blue shaded confidence interval is drawn through the data.
* **Fine-tune (Black Dashed):** A horizontal dashed black line, indicating a constant performance level for this method across the accuracy range shown.
### Detailed Analysis
**Left Plot: ECE (Lower is Better)**
* **Trend Verification:** Both the pink (Zero-Shot) and blue (Verbal) regression lines show a slight upward slope, suggesting a weak positive correlation between Accuracy and ECE for these methods.
* **Data Points (Approximate):**
* **Zero-Shot Classifier (Pink):** Points are widely scattered. Values range from approximately 10% to 60% ECE. Notable points include a cluster near 40% Accuracy/20% ECE and another near 50% Accuracy/50% ECE.
* **Verbal (Blue):** Points are more tightly clustered than Zero-Shot. Values range from approximately 30% to 50% ECE.
* **Fine-tune (Black Dashed Line):** Constant at approximately **5% ECE**, significantly lower than the other two methods across the entire accuracy range.
**Right Plot: AUROC (Higher is Better)**
* **Trend Verification:** Both the pink (Zero-Shot) and blue (Verbal) regression lines show a clear upward slope, indicating a positive correlation between Accuracy and AUROC.
* **Data Points (Approximate):**
* **Zero-Shot Classifier (Pink):** Points range from approximately 50% to 65% AUROC. There is a visible upward trend.
* **Verbal (Blue):** Points range from approximately 55% to 62% AUROC, also showing an upward trend.
* **Fine-tune (Black Dashed Line):** Constant at approximately **72% AUROC**, which is higher than all data points for the other two methods.
### Key Observations
1. **Superior Performance of Fine-tuning:** The "Fine-tune" method (dashed line) demonstrates both the best calibration (lowest ECE ~5%) and the best discriminative performance (highest AUROC ~72%) consistently, independent of the accuracy range plotted.
2. **Calibration vs. Discrimination Trade-off:** For the Zero-Shot and Verbal methods, higher Accuracy is associated with *worse* calibration (higher ECE) but *better* discrimination (higher AUROC).
3. **Variability:** The Zero-Shot Classifier shows significantly higher variance in ECE compared to the Verbal method, suggesting less consistent calibration.
4. **Performance Clustering:** The Verbal method's data points are more tightly clustered than the Zero-Shot method's, indicating more predictable performance.
### Interpretation
This data suggests a fundamental trade-off between model calibration and raw discriminative power when using prompt-based (Zero-Shot, Verbal) methods versus a fully fine-tuned model. The fine-tuned model achieves a superior balance, excelling in both metrics.
The positive correlation between Accuracy and AUROC is expected, as both measure aspects of correct classification. However, the simultaneous positive correlation between Accuracy and ECE for the prompt-based methods is a critical finding. It indicates that as these models become more accurate on this test set, they also become more *overconfident* in their predictions (higher ECE). This is a known issue with large language models used as zero-shot classifiers.
The "Fine-tune" line acts as a gold-standard benchmark. The fact that it is horizontal implies its performance is stable and serves as a target. The gap between the dashed line and the scatter points quantifies the performance cost of using prompt-based methods instead of task-specific fine-tuning for this particular evaluation. The wider scatter of the Zero-Shot method highlights the instability and sensitivity of pure prompting compared to the more structured "Verbal" method (which may involve more engineered prompts or a specific verbalization format).
</details>
Figure 2: (Left) We compare common uncertainty estimates for multiple-choice questions (max softmax probability) and open-ended generation (perplexity). While maximum softmax probability performs well and improves with the ability of the base model, perplexity does not follow the same pattern. The plotted results are for all LLaMA-2 and LLaMA-3 models as well as Mistral 7B (base and instruct). (Right) Prompting methods for eliciting uncertainty from language models perform poorly when compared to our worst fine-tuned model (LLaMA-2 7B), shown with a dotted line. ECE doesn't appear to improve with the abilities of the underlying model, and while AUROC does show small improvements with large improvements in accuracy, the gap between zero-shot methods and fine-tuning for uncertainties remains large. Shading indicates a 95% bootstrapped confidence interval on the regression fit.
The prospects of calibration by learning to model human language. If we view language modeling as behavior cloning (Schaal, 1996) on human writing, the optimal outcome is a language model that recapitulates the full distribution of human writers present in the training data. Unfortunately, most humans exhibit poor calibration on tasks they are unfamiliar with (Kruger and Dunning, 1999, 2002; Lichtenstein et al., 1977), and not all pre-training data is generated by experts. Therefore, it might be unreasonably optimistic to expect black-box methods to yield calibrated uncertainties without a significant intervention. Alignment procedures (e.g. RLHF) could improve the situation by penalizing cases of poor calibration, and the resulting procedure would be akin to fine-tuning on graded data, which we explore in Section 5.
Experiments with open-source models. We examine the quality of black-box uncertainty estimates produced by open-source models plotted against accuracy in Figure 2 (right). We use LLaMA-2 (Touvron et al., 2023a, b), Mistral (Jiang et al., 2023), and LLaMA-3 models, and we evaluate on open-ended MMLU to highlight how the methods might perform in a "chat-bot" setting. Because these models have open weights, we can perform apples-to-apples comparisons with methods that train through the model or access hidden representations. We see that prompting methods typically give poorly calibrated uncertainties (measured by ECE) and their calibration does not improve out-of-the-box as the base model improves. By contrast, AUROC does improve slightly with the power of the underlying model, but even the best model still lags far behind the worst model with fine-tuning for uncertainty.
Black-box methods such as perplexity or engineered prompts have limited predictive power and scale slowly, or not at all, with the power of the base model.
### 5 How Should We Use Labeled Examples?
Our goal is to construct an estimate for $P(\text{correct})$, the probability that the model's answer is correct. Learning to predict a model's correctness is a simple binary classification problem, which we learn from a small labeled dataset of correct and incorrect answers. There are many possible ways to parameterize $P(\text{correct})$, and we study three that vary in their number of trainable parameters and their use of prompting:
- Probe: Following Azaria and Mitchell (2023), we train a small feed-forward neural network on the last-layer features of an LLM that was given the prompt, question, and proposed answer as input. The probe outputs $P(\text{correct})$ while keeping the base LLM frozen.
- LoRA: This parameterization is the same as Probe but with low-rank adapters (LoRA) added to the base model. As a result, the intermediate language features of the base model can be changed to improve the correctness prediction.
- LoRA + Prompt: Following Kadavath et al. (2022), we pose classifying correctness as a multiple choice question with two options, the target tokens "i" and "ii" representing "no" and "yes" respectively. We perform LoRA fine-tuning on strings with this formatting.
With these different parameterizations, we can study how much information about uncertainty is already contained in a pre-trained model's features. Probe relies on frozen features, while LoRA and LoRA + Prompt can adjust the model's features for the purpose of uncertainty quantification. Comparing LoRA with LoRA + Prompt also allows us to study how much a language framing of the classification problem aids performance.
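As a concrete sketch of the Probe parameterization, a single logistic layer below stands in for the small feed-forward network trained on frozen last-layer features; the feature dimension, learning rate, and epoch count are illustrative assumptions, not our actual settings.

```python
import math

def train_probe(features, labels, dim, lr=0.1, epochs=200):
    """Fit a logistic classifier P(correct) on frozen LLM features via SGD.

    `features` is a list of fixed-length feature vectors (the frozen
    last-layer representations), `labels` the 0/1 correctness grades.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    # Return a predictor mapping features to P(correct).
    return lambda x: 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

The base LLM never receives gradients here; only the small head on top of its features is trained, which is what makes Probe cheap relative to LoRA.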
Datasets. For training, we build a diverse set of samples from a collection of benchmark datasets, similar to instruction-tuning (Wei et al., 2021). From the list of 16 benchmark datasets in Section C.2, we use a sampled subset of approximately 20,000 examples. We hold out 2,000 data points to use as a calibration set for temperature scaling (Guo et al., 2017).
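Temperature scaling on the held-out set amounts to a one-dimensional fit: choose the temperature that minimizes the negative log-likelihood of the held-out correctness labels. The grid search below is a simple stand-in for the usual gradient-based fit; the candidate grid is an illustrative assumption.

```python
import math

def fit_temperature(logits, correct, grid=None):
    """Pick the temperature minimizing binary NLL on a held-out set.

    `logits` are the model's pre-sigmoid confidence scores and
    `correct` the 0/1 labels from the calibration split.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]  # candidates 0.25 .. 10.0

    def nll(t):
        total = 0.0
        for z, y in zip(logits, correct):
            p = 1.0 / (1.0 + math.exp(-z / t))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logs
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total

    return min(grid, key=nll)
```

Overconfident models get a temperature above one (softening the probabilities), while underconfident ones get a temperature below one.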
| Method | ECE | AUROC |
| --- | --- | --- |
| w/o KL | 29.9% | 70.2% |
| w/ KL | 10.8% | 71.6% |
Table 1: Regularization improves calibration. Numbers show the mean over six base models. See Section C.1 for discussion.
Training and regularization.
We consider three base models (LLaMA-2 7B, LLaMA-2 13B, and Mistral 7B) and their instruction-tuned variants. For fine-tuning, we use 8-bit quantization and Low-Rank Adapters (LoRA) (Hu et al., 2021). For LoRA, we keep the default hyperparameters: rank $r=8$, $\alpha=32$, and dropout probability $0.1$. Each training run takes approximately 1-3 GPU days on 4 NVIDIA RTX8000 (48GB) GPUs. To keep LoRA and LoRA + Prompt in the neighborhood of the initial model, we introduce a regularization term that encourages low divergence between the predictions of the fine-tuned model and the base model (ablation in Table 1).
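A minimal sketch of such a regularized objective, assuming a simple per-example form: binary cross-entropy on the correctness label plus a weighted KL divergence from the fine-tuned model's token distribution to the base model's. The weight `beta` and the exact form are illustrative assumptions, not our implementation.

```python
import math

def regularized_loss(p_correct, label, tuned_probs, base_probs, beta=0.1):
    """Correctness cross-entropy plus a KL penalty toward the base model.

    `p_correct` is the fine-tuned model's predicted P(correct), `label`
    the 0/1 grade, and `tuned_probs`/`base_probs` the two models'
    distributions over the same token vocabulary.
    """
    bce = -(label * math.log(p_correct) + (1 - label) * math.log(1 - p_correct))
    # KL(tuned || base); zero iff the distributions match.
    kl = sum(q * math.log(q / p) for q, p in zip(tuned_probs, base_probs) if q > 0)
    return bce + beta * kl
```

The KL term vanishes when the fine-tuned model matches the base model, so the penalty only activates as the adapters pull the predictions away from the initialization.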
Sampling baseline. We estimate uncertainty by clustering generations by semantic similarity (Kuhn et al., 2023). The probability of each cluster becomes the probability assigned to all sequences in that cluster. To assign an uncertainty to a prediction, we find the cluster closest to the prediction and use the probability of that cluster as our uncertainty estimate (full details in Section B.1). The clear drawback of this approach to uncertainty estimation is its poor scaling: we draw $K$ samples from the model ($K=10$ in our case), and these samples must then be clustered using $O(K^2)$ comparisons with an auxiliary model of semantic similarity. Sampling methods are also complicated by their dependence on hyperparameters such as temperature or nucleus size. In the special case where the sampling parameters are chosen to produce greedy decoding (e.g. temperature zero), the model will always assign probability one to its answer. While this behavior does align with the probability of generating the answer, it is not a useful measure of confidence.
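The clustering step can be sketched as follows, with a normalized exact-match `equivalent` check standing in for the auxiliary semantic-similarity model (an assumption for illustration): each of the $K$ samples is compared against existing cluster representatives, and the confidence of an answer is its cluster's share of the samples.

```python
def cluster_confidence(samples,
                       equivalent=lambda a, b: a.strip().lower() == b.strip().lower()):
    """Cluster K sampled answers with O(K^2) pairwise equivalence checks.

    Returns a map from each cluster's representative (first member) to
    the fraction of samples in that cluster, used as the confidence.
    """
    clusters = []  # each cluster is a list of equivalent samples
    for s in samples:
        for c in clusters:
            if equivalent(s, c[0]):  # compare against the representative
                c.append(s)
                break
        else:
            clusters.append([s])  # no match: start a new cluster
    k = len(samples)
    return {c[0]: len(c) / k for c in clusters}
```

With greedy decoding all samples fall into one cluster of mass one, illustrating why that setting yields a degenerate confidence estimate.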
Fine-tuning results. In Figure 3 (Left) we compare our three fine-tuned models with black-box uncertainty methods on both multiple choice and open-ended MMLU. For multiple choice MMLU, we also include the language model's max softmax probability as a baseline. Fine-tuning for uncertainty leads to significant improvements in both ECE and AUROC. While frozen features (Probe) are sufficient to outperform baselines on multiple choice MMLU, performing well on open-ended MMLU requires training through the model and prompting. Surprisingly, while sampling methods can yield good calibration, their discriminative performance is very weak. By contrast, verbal elicitation is relatively strong in discriminative performance, on par with weaker fine-tuning methods, but generally has poor calibration, even after temperature scaling.
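For reference, the two evaluation metrics can be sketched directly; the equal-width binning and bin count for ECE are common conventions assumed here, not a statement of our exact evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - avg_conf)
    return ece

def auroc(confidences, correct):
    """Probability a correct answer outranks an incorrect one (ties count half)."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

ECE measures whether stated confidences match empirical accuracy, while AUROC measures whether confidences rank correct answers above incorrect ones; a method can do well on one and poorly on the other, as the sampling and verbal baselines show.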
How much data do we need? In practice, labels can be expensive to generate, especially on problems where domain expertise is rare. Therefore, it would be advantageous if fine-tuning with even a small number of examples is sufficient for building a good uncertainty estimate. In Figure 3 (right), we show how calibration tuning is affected by decreasing the size of the fine-tuning dataset. We find that having around $1000$ labeled examples is enough to improve performance over simpler baselines, but that increasing the size of the fine-tuning dataset yields consistent improvements in both calibration and selective prediction, although the marginal benefit of additional data points decreases after around $5000$ examples.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart Grid: Method Comparison on MMLU Datasets
### Overview
The image displays a 2x2 grid of bar charts comparing the performance of seven different methods on two variants of the MMLU (Massive Multitask Language Understanding) benchmark. The top row charts show Expected Calibration Error (ECE), where lower values are better. The bottom row charts show Area Under the ROC Curve (AUROC), where higher values are better. The left column corresponds to the Multiple Choice (MC) format, and the right column corresponds to the Open-Ended (OE) format.
### Components/Axes
* **Legend:** Positioned at the top of the image, spanning the full width. It defines seven methods with associated colors:
* **Logits:** Green
* **Verbal:** Blue
* **Zero-Shot Classifier:** Red/Maroon
* **Sampling:** Dark Green
* **Probe:** Light Purple/Lavender
* **LoRA:** Medium Purple
* **LoRA + Prompt:** Dark Purple
* **Chart Grid:** A 2x2 arrangement.
* **Top-Left Chart:** Y-axis: "ECE ↓" (0% to 30%). X-axis Group Label: "MMLU (MC)".
* **Top-Right Chart:** Y-axis: "ECE ↓" (0% to 40%). X-axis Group Label: "MMLU (OE)".
* **Bottom-Left Chart:** Y-axis: "AUROC ↑" (50% to 70%). X-axis Group Label: "MMLU (MC)".
* **Bottom-Right Chart:** Y-axis: "AUROC ↑" (50% to 70%). X-axis Group Label: "MMLU (OE)".
* **Data Representation:** Each chart contains seven bars arranged in two clusters. The left cluster in each chart corresponds to the first four methods (Logits, Verbal, Zero-Shot Classifier, Sampling). The right cluster corresponds to the last three methods (Probe, LoRA, LoRA + Prompt). Each bar has a black error bar extending from its top.
### Detailed Analysis
**Trend Verification & Data Extraction (Approximate Values):**
**1. Top-Left Chart: ECE for MMLU (MC)**
* **Trend:** The first cluster (Logits to Sampling) shows generally higher and more variable ECE. The second cluster (Probe to LoRA+Prompt) shows consistently lower ECE.
* **Data Points (Left Cluster):**
* Logits (Green): ~20%
* Verbal (Blue): ~28% (Highest in this chart)
* Zero-Shot Classifier (Red): ~18%
* Sampling (Dark Green): ~10%
* **Data Points (Right Cluster):**
* Probe (Light Purple): ~10%
* LoRA (Medium Purple): ~12%
* LoRA + Prompt (Dark Purple): ~10%
**2. Top-Right Chart: ECE for MMLU (OE)**
* **Trend:** Similar to the MC chart, the first cluster has higher ECE, with Verbal and Zero-Shot Classifier being notably high. The second cluster is lower and more uniform.
* **Data Points (Left Cluster):**
* Logits (Green): ~38% (Highest in the entire figure)
* Verbal (Blue): ~35%
* Zero-Shot Classifier (Red): ~38%
* Sampling (Dark Green): ~15%
* **Data Points (Right Cluster):**
* Probe (Light Purple): ~15%
* LoRA (Medium Purple): ~15%
* LoRA + Prompt (Dark Purple): ~12%
**3. Bottom-Left Chart: AUROC for MMLU (MC)**
* **Trend:** A clear upward trend is visible from left to right across the methods. The first cluster has lower AUROC, while the second cluster, especially LoRA-based methods, shows significantly higher performance.
* **Data Points (Left Cluster):**
* Logits (Green): ~55%
* Verbal (Blue): ~58%
* Zero-Shot Classifier (Red): ~60%
* Sampling (Dark Green): ~62%
* **Data Points (Right Cluster):**
* Probe (Light Purple): ~68%
* LoRA (Medium Purple): ~70%
* LoRA + Prompt (Dark Purple): ~72% (Highest in this chart)
**4. Bottom-Right Chart: AUROC for MMLU (OE)**
* **Trend:** Performance is more mixed in the first cluster. The second cluster again shows strong performance, with LoRA + Prompt being the clear standout.
* **Data Points (Left Cluster):**
* Logits (Green): ~62%
* Verbal (Blue): ~58%
* Zero-Shot Classifier (Red): ~55%
* Sampling (Dark Green): ~52% (Lowest in this chart)
* **Data Points (Right Cluster):**
* Probe (Light Purple): ~62%
* LoRA (Medium Purple): ~65%
* LoRA + Prompt (Dark Purple): ~72% (Highest in this chart and tied for highest overall)
### Key Observations
1. **Consistent Superiority of Tuning Methods:** The methods in the right cluster (Probe, LoRA, LoRA + Prompt) consistently outperform the methods in the left cluster (Logits, Verbal, Zero-Shot Classifier, Sampling) on both metrics. They achieve lower calibration error (ECE) and higher discriminative performance (AUROC).
2. **LoRA + Prompt is Top Performer:** The "LoRA + Prompt" method (dark purple bar) is the top performer or tied for top in three out of four charts (MC ECE, MC AUROC, OE AUROC).
3. **High Calibration Error for Verbal/Zero-Shot on OE:** The "Verbal" and "Zero-Shot Classifier" methods show particularly high Expected Calibration Error (approaching 40%) on the Open-Ended (OE) task, suggesting they are poorly calibrated in that setting.
4. **Sampling is a Middle Ground:** The "Sampling" method (dark green) often performs better than the first three methods (Logits, Verbal, Zero-Shot) but worse than the tuned methods (Probe, LoRA), acting as an intermediate performer.
5. **Task Difficulty:** The ECE values are generally higher for the OE task (right column) than the MC task (left column), suggesting the open-ended format is more challenging for model calibration.
### Interpretation
This figure presents a comparative analysis of inference-time and fine-tuning methods for large language models on the MMLU benchmark. The data strongly suggests that **parameter-efficient fine-tuning methods (LoRA, Probe) yield models that are both more accurate (higher AUROC) and better calibrated (lower ECE) than methods relying on the base model's raw outputs (Logits, Verbal, Zero-Shot Classifier).**
The "LoRA + Prompt" method's consistent top performance indicates a synergistic effect: fine-tuning the model with LoRA and then further guiding it with a task-specific prompt at inference time provides the best results. The high ECE for "Verbal" and "Zero-Shot Classifier" on the OE task is a critical finding, warning that these simple methods can produce confident but incorrect answers in open-ended generation scenarios.
The clear separation between the two clusters of bars visually argues for the value of investing in tuning (even parameter-efficient tuning) over zero-shot or simple prompting strategies when performance and reliability on complex benchmarks like MMLU are priorities. The error bars, while present, do not overlap between the high-performing and low-performing clusters in most cases, reinforcing the statistical significance of the performance gap.
</details>
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Charts: Model Calibration (ECE) and Classification Performance (AUROC) vs. Training Samples
### Overview
The image displays two side-by-side line charts comparing the performance of three large language models (LLMs) as a function of the number of training samples used. The left chart measures Expected Calibration Error (ECE), and the right chart measures Area Under the Receiver Operating Characteristic curve (AUROC). Both charts include baseline performance lines for a "Zero-Shot Classifier" and a "Sampling" method.
### Components/Axes
* **Legend (Top Center):** A shared legend identifies three model series:
* **LLaMA-2 7B Chat:** Dark purple solid line.
* **LLaMA-2 13B Chat:** Blue solid line.
* **Mistral 7B Instruct:** Teal solid line.
* **Baseline Legend (Below Main Legend):** Identifies two horizontal dashed lines:
* **Zero-Shot Classifier:** Red dashed line.
* **Sampling:** Purple dashed line.
* **Left Chart (ECE):**
* **Y-axis:** Label "ECE". Scale ranges from approximately 0.05 to 0.25. Major ticks at 0.1 and 0.2.
* **X-axis:** Label "Samples". Logarithmic scale with major ticks at 10², 10³, and 10⁴.
* **Right Chart (AUROC):**
* **Y-axis:** Label "AUROC". Scale ranges from approximately 0.55 to 0.75. Major ticks at 0.6 and 0.7.
* **X-axis:** Label "Samples". Identical logarithmic scale to the left chart (10², 10³, 10⁴).
* **Data Series:** Each model series is plotted with a shaded region around the central line, indicating confidence intervals or variance.
### Detailed Analysis
**Left Chart - ECE (Lower is Better):**
* **Trend Verification:** All three model lines show a clear downward trend as the number of samples increases, indicating improved calibration (lower error).
* **Data Points (Approximate):**
* **LLaMA-2 7B Chat (Dark Purple):** Starts at ~0.15 (10² samples), decreases to ~0.10 (10³ samples), and ends at ~0.08 (10⁴ samples).
* **LLaMA-2 13B Chat (Blue):** Starts at ~0.14 (10² samples), decreases to ~0.09 (10³ samples), and ends at ~0.07 (10⁴ samples).
* **Mistral 7B Instruct (Teal):** Starts highest at ~0.22 (10² samples), decreases sharply to ~0.12 (10³ samples), and ends at ~0.09 (10⁴ samples). It shows a slight upward bump between 10³ and 10⁴ samples.
* **Baselines (Horizontal Dashed Lines):**
* **Zero-Shot Classifier (Red):** Constant at ~0.15.
* **Sampling (Purple):** Constant at ~0.14.
* **Spatial Grounding:** The baselines are positioned in the upper half of the chart. All model lines start near or above these baselines at 10² samples and fall significantly below them by 10⁴ samples.
**Right Chart - AUROC (Higher is Better):**
* **Trend Verification:** All three model lines show a clear upward trend as the number of samples increases, indicating improved classification performance.
* **Data Points (Approximate):**
* **LLaMA-2 7B Chat (Dark Purple):** Starts at ~0.60 (10² samples), increases to ~0.68 (10³ samples), and ends at ~0.72 (10⁴ samples).
* **LLaMA-2 13B Chat (Blue):** Starts at ~0.58 (10² samples), increases to ~0.66 (10³ samples), and ends at ~0.70 (10⁴ samples).
* **Mistral 7B Instruct (Teal):** Starts at ~0.64 (10² samples), increases to ~0.70 (10³ samples), and ends highest at ~0.74 (10⁴ samples).
* **Baselines (Horizontal Dashed Lines):**
* **Zero-Shot Classifier (Red):** Constant at ~0.60.
* **Sampling (Purple):** Constant at ~0.56.
* **Spatial Grounding:** The baselines are positioned in the lower half of the chart. All model lines start at or above the Zero-Shot baseline and end well above both baselines.
### Key Observations
1. **Inverse Relationship:** There is a clear inverse relationship between ECE and AUROC for all models; as performance (AUROC) improves with more data, calibration error (ECE) decreases.
2. **Model Comparison:** Mistral 7B Instruct starts with the worst calibration (highest ECE) but best initial performance (highest AUROC) at low samples (10²). By 10⁴ samples, LLaMA-2 13B Chat achieves the best calibration (lowest ECE), while Mistral achieves the best performance (highest AUROC).
3. **Data Efficiency:** All models surpass the "Zero-Shot Classifier" baseline in both metrics with as few as 10² samples. They surpass the "Sampling" baseline shortly thereafter.
4. **Convergence:** The performance gap between models narrows as the number of samples increases, particularly for ECE.
### Interpretation
This data demonstrates the critical impact of fine-tuning sample size on both the reliability (calibration) and effectiveness (discriminative power) of LLMs for classification tasks.
* **Calibration vs. Performance:** The charts show that calibration (ECE) and raw performance (AUROC) are related but distinct axes of model quality. A model can be well-performing but poorly calibrated, or vice-versa, especially in low-data regimes.
* **Value of Data:** The consistent trends indicate that increasing the fine-tuning dataset size from 100 to 10,000 samples yields significant, monotonic benefits for both metrics across all tested models. This suggests the models are not yet saturated at 10⁴ samples.
* **Model Selection Implications:** The choice between LLaMA-2 and Mistral may depend on the application's priority. If calibration is paramount (e.g., for risk assessment), LLaMA-2 13B Chat appears superior with sufficient data. If maximizing discriminative power is the sole goal, Mistral 7B Instruct shows a slight edge at high sample counts.
* **Baseline Context:** The "Zero-Shot" and "Sampling" baselines provide a crucial reference point, showing that even minimal fine-tuning (100 samples) provides a substantial boost over these methods. The flat baselines highlight that these methods do not benefit from the additional training data being supplied to the other models.
</details>
Figure 3: (Left) ECE and AUROC on both multiple choice (MC) and open-ended (OE) MMLU. ECE is shown after temperature scaling on a small hold-out set. Supervised training (Probe, LoRA, LoRA + Prompt) tends to improve calibration and selective prediction. Probing on its own (Probe) performs worse than training through the features with a language prompt (LoRA + Prompt), especially in an open-ended setting. Error bars show two standard deviations over six base models. Extended results in Appendix D. (Right) Effect of varying number of labeled datapoints on OE MMLU. In the most extreme case, we train on only 200 examples. Overall, performance increases in proportion with the available labeled data, but 1000 points is almost as valuable as 20,000 points. Dotted lines indicate the performance of the classifier and sampling baselines averaged over the three models considered. Shaded regions show one standard deviation over subsets of MMLU.
Supervised learning approaches, in which we learn to predict a model's correctness, can dramatically outperform baselines with as few as $1000$ graded examples. Updating the model's features with LoRA and using a language prompt are key to good performance.
### 6 When and Why Do These Estimates Generalize?
To derive more understanding of when our estimates generalize, we now investigate distribution shifts between the training and evaluation datasets. To have a practically useful tool, we might desire robustness to the following shifts, among others:
Subject matter. Ideally, our uncertainty estimates apply to subjects we have not seen during training. In Figure 4 (left), we show a breakdown of our fine-tuning dataset using the supercategories from MMLU (Section A.5). We see that our dataset contains much higher percentages of STEM and humanities questions than MMLU and close to no examples from the social sciences (e.g. government, economics, sociology). Despite these differences in composition, uncertainty estimates from LoRA + Prompt perform similarly across supercategories. We also show the efficacy of our models at assessing confidence on out-of-distribution coding tasks in Appendix F.
Format. Like a change in subject matter, the way a question is posed should not break the uncertainty estimate. To test the effect of the question format independent of its subject matter, we apply models fine-tuned on OE MMLU to MC MMLU and vice versa. In Figure 4 (center), we see that fine-tuned models often perform better than a zero-shot baseline even when they are being applied across a distribution shift, though transfer from MC to OE is more challenging than OE to MC. Probe is insufficient to generalize effectively from MC to OE, but training through the features of the model (LoRA + Prompt) does generalize effectively, even outperforming Probe trained on OE data.
Solvability. Even though we focus on questions with a single known answer, we might hope that our estimates can be used even when a question is ill-posed or does not have a known solution, ideally returning high uncertainty. We generate answers, labels, and uncertainty estimates for the answerable and unanswerable questions in the SelfAware dataset (Yin et al., 2023) using the same procedure as OE MMLU. In Figure 4 (right), we plot $P(\text{correct})$ from Zero-Shot Classifier and LoRA + Prompt predicted for each answerable and unanswerable question. Notably, calibration-tuned models have calibrated probabilities for the answerable questions and assign lower confidence to unanswerable questions than black-box methods.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Multi-Panel Bar Chart: Academic Discipline Performance Metrics
### Overview
The image displays a 2x2 grid of bar charts comparing four academic disciplines (STEM, Humanities, Social Sciences, Other) across four different performance or composition metrics. Each bar includes a black vertical error bar indicating variability or uncertainty. A legend at the top center defines the color coding for the disciplines.
### Components/Axes
* **Legend (Top Center):** Four colored squares with labels:
* Light Blue: `STEM`
* Dark Blue: `Humanities`
* Light Green: `Social Sciences`
* Dark Green: `Other`
* **Chart Layout:** Four subplots arranged in a 2x2 grid.
* **X-Axis (All Subplots):** Implicitly represents the four academic disciplines, ordered as per the legend (STEM, Humanities, Social Sciences, Other from left to right within each subplot).
* **Y-Axes (Subplot-Specific):**
1. **Top-Left Subplot:** Label: `% Train`. Scale: 0% to 40%, with ticks at 0%, 20%, 40%.
2. **Top-Right Subplot:** Label: `ECE ↓`. The downward arrow (↓) indicates lower values are better. Scale: 0% to 15%, with ticks at 0%, 5%, 10%, 15%.
3. **Bottom-Left Subplot:** Label: `% MMLU`. Scale: 0% to 40%, with ticks at 0%, 20%, 40%.
4. **Bottom-Right Subplot:** Label: `AUROC ↑`. The upward arrow (↑) indicates higher values are better. Scale: 40% to 80%, with ticks at 40%, 60%, 80%.
### Detailed Analysis
**1. Top-Left: % Train (Training Data Proportion)**
* **Trend:** STEM has the highest proportion, followed by Humanities, then Other, with Social Sciences being drastically lower.
* **Approximate Values & Error Bars:**
* STEM (Light Blue): ~40%. Error bar is small, spanning roughly ±2%.
* Humanities (Dark Blue): ~35%. Error bar is small, spanning roughly ±2%.
* Social Sciences (Light Green): ~2-3%. Error bar is relatively large, spanning roughly 0% to 5%.
* Other (Dark Green): ~20%. Error bar is moderate, spanning roughly ±5%.
**2. Top-Right: ECE ā (Expected Calibration Error - Lower is Better)**
* **Trend:** Social Sciences appears to have the lowest (best) ECE, followed by STEM and Other which are similar, with Humanities having the highest (worst) ECE. All values are below 15%.
* **Approximate Values & Error Bars:**
* STEM (Light Blue): ~10%. Error bar spans roughly 8% to 12%.
* Humanities (Dark Blue): ~12%. Error bar spans roughly 10% to 14%.
* Social Sciences (Light Green): ~8%. Error bar is the largest, spanning roughly 4% to 12%.
* Other (Dark Green): ~10%. Error bar spans roughly 8% to 12%.
**3. Bottom-Left: % MMLU (Performance on MMLU Benchmark)**
* **Trend:** STEM has the highest performance, followed by a cluster where Humanities, Social Sciences, and Other show very similar, slightly lower performance.
* **Approximate Values & Error Bars:**
* STEM (Light Blue): ~35%. Error bar is small, spanning roughly ±2%.
* Humanities (Dark Blue): ~22%. Error bar is small, spanning roughly ±2%.
* Social Sciences (Light Green): ~20%. Error bar is small, spanning roughly ±2%.
* Other (Dark Green): ~22%. Error bar is small, spanning roughly ±2%.
**4. Bottom-Right: AUROC ā (Area Under ROC Curve - Higher is Better)**
* **Trend:** Social Sciences shows the highest performance, followed closely by Humanities and Other, with STEM being slightly lower. All values are clustered between 70% and 75%.
* **Approximate Values & Error Bars:**
* STEM (Light Blue): ~70%. Error bar is small, spanning roughly ±2%.
* Humanities (Dark Blue): ~72%. Error bar is small, spanning roughly ±2%.
* Social Sciences (Light Green): ~75%. Error bar is small, spanning roughly ±2%.
* Other (Dark Green): ~72%. Error bar is small, spanning roughly ±2%.
### Key Observations
1. **Disproportionate Training Data:** The `% Train` chart reveals a severe imbalance, with STEM and Humanities dominating the training data, while Social Sciences is minimally represented.
2. **Performance vs. Data Discrepancy:** Despite having the smallest share of training data (~2-3%), Social Sciences achieves the best (lowest) ECE and the best (highest) AUROC, and competitive MMLU scores. This suggests high model efficiency or data quality for this domain.
3. **Metric-Specific Strengths:** No single discipline leads across all performance metrics. STEM leads in MMLU, Social Sciences leads in ECE and AUROC, and Humanities is mid-range.
4. **Error Bar Significance:** The error bar for Social Sciences in the ECE chart is notably large, indicating high variability or uncertainty in the calibration error measurement for that domain.
### Interpretation
This set of charts likely evaluates the performance of a machine learning model (or models) across different academic knowledge domains. The data suggests a potential misalignment between training data composition and model performance outcomes.
* **The "Social Sciences Paradox":** The most striking finding is the strong performance of the Social Sciences domain despite its minimal representation in the training data. This could indicate that the tasks or knowledge within Social Sciences are more easily learned by the model, that the available data for this domain is of exceptionally high quality, or that the evaluation metrics (ECE, AUROC) are particularly favorable to the model's behavior on this type of data.
* **Calibration vs. Accuracy:** The model is best calibrated (lowest ECE) on Social Sciences data, meaning its confidence scores align most closely with its actual accuracy on that domain. Conversely, it is least calibrated on Humanities data.
* **Benchmark Performance:** The `% MMLU` scores, which likely measure general knowledge and reasoning, show a clear advantage for STEM, which also has the largest training share. This suggests the model's broad knowledge is still heavily influenced by the volume of its training data.
* **Overall Implication:** The charts argue that simply increasing training data volume for a domain (like STEM) does not guarantee superior performance across all metrics (e.g., calibration, AUROC). They highlight the importance of evaluating models on multiple, diverse metrics to understand their strengths and weaknesses across different knowledge areas. The high performance of the underrepresented Social Sciences domain warrants further investigation into the nature of the data and tasks involved.
</details>
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: Model Calibration and Discrimination Performance (ECE & AUROC)
### Overview
The image displays a 2x2 grid of grouped bar charts comparing the performance of five different classification methods across two evaluation metrics (ECE and AUROC) and two distinct conditions or datasets (labeled "MC" and "OE"). The charts include error bars, indicating variability or confidence intervals for each measurement.
### Components/Axes
* **Legend (Top Center):** A horizontal legend identifies five methods by color:
* **Orange Square:** Zero-Shot Classifier
* **Blue Square:** Probe
* **Light Blue Square:** ^ (Transfer) [Associated with Probe]
* **Green Square:** LoRA + Prompt
* **Light Green Square:** ^ (Transfer) [Associated with LoRA + Prompt]
* **Chart Grid:** The charts are arranged in two rows and two columns.
* **Columns:** Labeled "MC" (left column) and "OE" (right column) at the top.
* **Rows:** The top row measures **ECE ↓** (Expected Calibration Error, where lower is better). The bottom row measures **AUROC ↑** (Area Under the Receiver Operating Characteristic Curve, where higher is better).
* **Y-Axes:**
* **Top Row (ECE):** Labeled "ECE ↓". Scale ranges from 10% to 50%, with major ticks at 10%, 30%, and 50%.
* **Bottom Row (AUROC):** Labeled "AUROC ↑". Scale ranges from 40% to 80%, with major ticks at 40%, 60%, and 80%.
* **X-Axis (Implicit):** Within each subplot, five bars are grouped, corresponding to the five methods in the legend order (Zero-Shot, Probe, Probe-Transfer, LoRA+Prompt, LoRA+Prompt-Transfer).
### Detailed Analysis
**Top Row: ECE (Expected Calibration Error) - Lower is Better**
* **MC Condition (Top-Left Chart):**
* **Zero-Shot Classifier (Orange):** Highest ECE, approximately 40% (±~3%).
* **Probe (Blue):** ECE ~20% (±~2%).
* **Probe Transfer (Light Blue):** ECE ~25% (±~2%), slightly worse than Probe.
* **LoRA + Prompt (Green):** ECE ~15% (±~2%), the lowest in this group.
* **LoRA + Prompt Transfer (Light Green):** ECE ~20% (±~2%), slightly worse than its non-transfer counterpart.
* **OE Condition (Top-Right Chart):**
* **Zero-Shot Classifier (Orange):** ECE ~35% (±~3%).
* **Probe (Blue):** ECE ~25% (±~2%).
* **Probe Transfer (Light Blue):** ECE ~30% (±~2%).
* **LoRA + Prompt (Green):** ECE ~15% (±~2%), again the lowest.
* **LoRA + Prompt Transfer (Light Green):** ECE ~20% (±~2%).
**Bottom Row: AUROC (Area Under ROC) - Higher is Better**
* **MC Condition (Bottom-Left Chart):**
* **Zero-Shot Classifier (Orange):** AUROC ~50% (±~3%).
* **Probe (Blue):** AUROC ~60% (±~3%).
* **Probe Transfer (Light Blue):** AUROC ~60% (±~3%), similar to Probe.
* **LoRA + Prompt (Green):** AUROC ~70% (±~3%), the highest in this group.
* **LoRA + Prompt Transfer (Light Green):** AUROC ~68% (±~3%), slightly lower than its non-transfer counterpart.
* **OE Condition (Bottom-Right Chart):**
* **Zero-Shot Classifier (Orange):** AUROC ~55% (±~3%).
* **Probe (Blue):** AUROC ~60% (±~3%).
* **Probe Transfer (Light Blue):** AUROC ~60% (±~3%).
* **LoRA + Prompt (Green):** AUROC ~70% (±~3%), the highest.
* **LoRA + Prompt Transfer (Light Green):** AUROC ~65% (±~3%).
### Key Observations
1. **Consistent Superiority of LoRA + Prompt:** The "LoRA + Prompt" method (green bar) consistently achieves the best performance across all four subplots: the lowest ECE (best calibration) and the highest AUROC (best discrimination) in both MC and OE conditions.
2. **Zero-Shot Classifier Underperformance:** The "Zero-Shot Classifier" (orange bar) consistently performs the worst, showing the highest ECE and the lowest AUROC in all scenarios.
3. **Impact of Transfer Learning:** The effect of transfer learning (light-colored bars) is mixed and generally negative or neutral.
* For **Probe**, transfer learning increases ECE (worsens calibration) in both MC and OE, while having a negligible effect on AUROC.
* For **LoRA + Prompt**, transfer learning slightly increases ECE and slightly decreases AUROC compared to the non-transfer version.
4. **Metric Trends:** The visual trends are clear: lines/bars for ECE slope downward from Zero-Shot to LoRA+Prompt, while lines/bars for AUROC slope upward across the same sequence.
### Interpretation
This chart provides a clear comparative analysis of model adaptation techniques for classification tasks. The data strongly suggests that **fine-tuning with LoRA (Low-Rank Adaptation) combined with prompt engineering ("LoRA + Prompt") is the most effective strategy** among those tested. It yields models that are both better calibrated (lower ECE, meaning their predicted probabilities more accurately reflect true correctness likelihood) and better at distinguishing between classes (higher AUROC).
The poor performance of the Zero-Shot Classifier establishes a baseline, highlighting the significant gains achievable through parameter-efficient fine-tuning (LoRA) and prompt design. The "Probe" method, which likely involves training a simple classifier on top of frozen model features, offers a middle ground.
The **negative or neutral impact of transfer learning** is a notable finding. It implies that for these specific methods and tasks, adapting a model that was previously fine-tuned for a *different* task (the "transfer" scenario) does not improveāand may even harmāperformance compared to fine-tuning directly on the target task. This could be due to negative transfer or misalignment between the source and target tasks.
In summary, the visualization argues for the direct application of LoRA with prompts over zero-shot or probe-based approaches, and cautions against assuming that transfer learning will automatically improve results in this context. The consistent ranking of methods across two different conditions (MC and OE) and two complementary metrics adds robustness to this conclusion.
</details>
<details>
<summary>x8.png Details</summary>

### Visual Description
## Density Plot: Model Confidence by Question Answerability
### Overview
The image displays two vertically stacked density plots comparing the confidence distributions of two models ("Zero-Shot" and "Trained") on questions categorized as "Answerable" and "Unanswerable." The plots visualize the probability of a correct answer, P(correct), on the x-axis against the density of predictions on the y-axis.
### Components/Axes
* **Legend:** Positioned at the top center. Contains two entries:
* **Zero-Shot:** Represented by pink/magenta bars.
* **Trained:** Represented by purple/violet bars.
* **Top Plot Title:** "Answerable"
* **Bottom Plot Title:** "Unanswerable"
* **Shared X-Axis Label:** "P(correct)"
* **Axis Markers/Ticks:** 30%, 50%, 70%, 90%.
* **Shared Y-Axis Label:** "Density"
* **Axis Markers/Ticks:** 1, 3, 5.
### Detailed Analysis
The analysis is segmented by plot region.
**1. Top Plot: "Answerable" Questions**
* **Trend Verification:** Both distributions are skewed toward higher probabilities, indicating higher confidence for answerable questions.
* **Zero-Shot (Pink) Series:** The distribution is relatively narrow and peaks sharply in the high-confidence region. The highest density bars are located approximately between 70% and 80% P(correct). The density falls off rapidly below 60% and above 85%.
* **Trained (Purple) Series:** The distribution is broader and more spread out than the Zero-Shot series. It also peaks in the high-confidence region (around 70-80%), but with a lower maximum density. It shows a more gradual slope, with significant density extending down to the 50-60% range.
**2. Bottom Plot: "Unanswerable" Questions**
* **Trend Verification:** The distributions shift leftward toward lower probabilities compared to the "Answerable" plot, indicating lower confidence for unanswerable questions.
* **Zero-Shot (Pink) Series:** The distribution shows a clear peak in the low-to-mid confidence range. The highest density bars are located approximately between 40% and 50% P(correct). There is a long tail extending into higher probabilities, but density diminishes significantly above 70%.
* **Trained (Purple) Series:** The distribution is flatter and more uniform compared to its counterpart in the "Answerable" plot. It does not have a single sharp peak. Density is relatively consistent across the 30% to 60% range, with a slight concentration around 40-50%. It shows less density in the very high confidence regions (>70%) compared to the Zero-Shot model on unanswerable questions.
### Key Observations
1. **Confidence Calibration by Category:** Both models exhibit higher confidence (higher P(correct)) for "Answerable" questions and lower confidence for "Unanswerable" questions, which is a desirable trait.
2. **Model Behavior Difference:** The "Zero-Shot" model displays more extreme confidence distributionsāsharper peaks at high confidence for answerable questions and at lower confidence for unanswerable questions. The "Trained" model's distributions are more spread out and moderate.
3. **Overconfidence on Unanswerable:** The "Zero-Shot" model retains a notable tail of high-confidence predictions (60-80% P(correct)) even for "Unanswerable" questions, suggesting potential overconfidence. The "Trained" model shows a more subdued tail in this region.
4. **Clarity of Signal:** The separation between the "Answerable" and "Unanswerable" distributions appears more distinct for the "Zero-Shot" model.
### Interpretation
This data suggests that the training process calibrates the model's confidence estimates. While the Zero-Shot model is more decisive (assigning very high or low probabilities), it may be more prone to overconfidence, particularly on difficult (unanswerable) questions. The Trained model, while less decisive, demonstrates more nuanced and potentially more reliable confidence scores across both question types. The plots visually argue that training improves a model's ability to express appropriate uncertainty, which is critical for trustworthy AI systems. The clear shift in distributions between "Answerable" and "Unanswerable" categories for both models indicates that the underlying model architecture is capable of distinguishing between these question types based on its internal representations.
</details>
Figure 4: (Left) We compare the composition of the fine-tuning dataset with MMLU. Notably, although the training dataset contains close to zero examples from social sciences, uncertainty estimates from the model perform similarly across categories. (Center) Testing the generalization of supervised methods by taking models trained on one setting (MCQA or OE) and evaluating them on the other setting. The MCQA or OE labels denote the evaluation setting, with the method labels indicating whether the model was trained on the same or a different setting. Fine-tuning through the model's features (LoRA + Prompt) performs almost as well in transfer as on in-distribution data. Zero-Shot Classifier involves no supervised learning except a temperature-scaling step and is a useful reference point. Error bars show two standard deviations over six fine-tuned models. (Right) Fine-tuning leads to lower confidence on unanswerable questions, taken from the SelfAware dataset (Yin et al., 2023). Assigning low confidence to unanswerable questions allows the model to opt out of responding.
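The Zero-Shot Classifier baseline above is calibrated only by fitting a single temperature on held-out data. A minimal sketch of that temperature-scaling step, using toy logits rather than the paper's data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """Negative log-likelihood of labels under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Find the single scalar T > 0 minimizing held-out NLL."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                          method="bounded")
    return res.x

# Toy data: mostly-correct but overconfident predictions plus some noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
logits = rng.normal(0.0, 1.0, size=(200, 4))
logits[np.arange(200), labels] += 3.0          # confident, usually right
logits[:40, :] = rng.normal(0.0, 1.0, (40, 4)) # 20% of answers are noise
T = fit_temperature(logits, labels)
```

Because the fit adjusts only one scalar, it rescales confidences without changing which answer is ranked highest, which is why it leaves AUROC essentially untouched while improving ECE.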
#### 6.1 What are uncertainty estimates learning?
Language models can generate useful uncertainty estimates after training on a relatively small number of labeled examples. How is this possible? We hypothesize two potentially complementary mechanisms: (a) LLMs assess the correctness of an answer given a question, or (b) LLMs recognize that certain topics often have incorrect answers. To understand the difference, let's explore a useful metaphor. Imagine I speak only English, while my friend, Alice, is a linguaphile and dabbles in many languages. I have a spreadsheet of how often Alice makes mistakes in each language. Now, when I hear Alice attempting to converse in language A, I can guess how likely she is to err by recognizing the language from its sound and consulting the spreadsheet. I can do this without understanding the language at all. Alternatively, I could learn each language myself, which would be more work but would strengthen my predictions.
To disentangle these two possibilities in our setting, we perform an additional experiment in which we replace the language model's answers in the fine-tuning dataset with incorrect answer options. If a language model is simply learning patterns in the errors present in the training data, then we would expect this ablation to perform on par with the original method, because it suffices to learn patterns in the content of the question and answer without needing the true causal relationship between question, answer, and correctness label. The results are shown in Figure 5 (left). The model trained on incorrect answers performs surprisingly well, on par with a Probe model, but significantly worse than a model trained on the original sampled answers. Correlating question content with error rates, while moderately successful, cannot fully account for the LoRA + Prompt estimates.
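The ablation can be sketched as a simple dataset transformation: swap each sampled answer for a random incorrect option while keeping the original correctness label, so that any learnable signal must come from the question alone. The field names below are illustrative, not the paper's schema:

```python
import random

def make_incorrect_answer_ablation(examples, seed=0):
    """Replace each graded answer with a randomly chosen incorrect option,
    keeping the ORIGINAL correctness label. A classifier trained on the
    result can only exploit question content, not the question-answer
    correspondence."""
    rng = random.Random(seed)
    ablated = []
    for ex in examples:
        wrong = [o for o in ex["options"] if o != ex["correct_option"]]
        ablated.append({
            "question": ex["question"],
            "answer": rng.choice(wrong),       # always an incorrect option
            "label": ex["model_was_correct"],  # label from original answer
        })
    return ablated

examples = [
    {"question": "2+2?", "options": ["3", "4", "5"],
     "correct_option": "4", "model_was_correct": True},
    {"question": "Capital of France?", "options": ["Paris", "Rome"],
     "correct_option": "Paris", "model_was_correct": False},
]
ablated = make_incorrect_answer_ablation(examples)
```

If the ablated dataset supported the same uncertainty quality as the original, mechanism (b) alone would suffice; the observed gap indicates mechanism (a) also contributes.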
Self-knowledge. Lastly, we examine whether a language model can be used to model not just its own uncertainties but also the uncertainties of other models. Several prior works argue that models identify correct answers by way of internal representations of truth, which might be unique to a model evaluating its own generations (Azaria and Mitchell, 2023; Burns et al., 2022). In Figure 5 (right), we show that, by contrast, Mistral 7B actually achieves better AUROC values when applied to LLaMA-2 7B than LLaMA-2 7B applied to itself. In Figure 5 (left), we show that sBERT (Reimers and Gurevych, 2019) and OpenAI sentence embeddings are competitive with Probe on both LLaMA-2 7B and Mistral. Together, these results suggest that LLM uncertainties are likely not model-specific. The practical upside of this insight is that one strong base model can be used to estimate the uncertainties of many other models, even closed-source models behind APIs, when a small labeled dataset is available or can be generated.
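The embedding-based probe amounts to a lightweight classifier on frozen features of "question + target model's answer" pairs. A sketch with scikit-learn, using synthetic vectors as stand-ins for sBERT or API embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings: one latent direction
# weakly determines whether the TARGET model answered correctly.
n, d = 1000, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
p_correct = 1.0 / (1.0 + np.exp(-X @ w_true))
y = rng.random(n) < p_correct  # 1 = target model's answer was correct

# The "probe": a linear classifier on frozen features. Nothing here
# depends on which model produced the answers being graded, which is
# why the same recipe works across models and behind APIs.
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
conf = probe.predict_proba(X[800:])[:, 1]
auroc = roc_auc_score(y[800:], conf)
```

In practice the held-out AUROC of such a probe, not its accuracy, is the quantity compared across estimator/target pairs in Figure 5.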
<details>
<summary>x9.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Metrics (ECE and AUROC)
### Overview
The image displays a comparative bar chart with two vertically stacked subplots, evaluating three different methods ("Incorrect", "Sampled", "Probe") across two performance metrics: Expected Calibration Error (ECE) and Area Under the ROC Curve (AUROC). The chart includes a legend and error bars on each bar, indicating variability or uncertainty in the measurements.
### Components/Axes
* **Legend:** Positioned at the top center of the image. It contains three entries:
* A light blue square labeled "Incorrect".
* A dark blue square labeled "Sampled".
* An orange square labeled "Probe".
* **Subplot 1 (Top):**
* **Y-Axis Label:** "ECE" (Expected Calibration Error).
* **Y-Axis Scale:** Percentage, ranging from 0% to 20%, with tick marks at 0%, 10%, and 20%.
* **Bars:** Three bars corresponding to the legend categories.
* **Subplot 2 (Bottom):**
* **Y-Axis Label:** "AUROC" (Area Under the Receiver Operating Characteristic Curve).
* **Y-Axis Scale:** Percentage, ranging from 30% to 70%, with tick marks at 30%, 50%, and 70%.
* **Bars:** Three bars corresponding to the legend categories.
* **X-Axis:** No explicit categorical labels are present on the x-axis. The bars are grouped by the three methods defined in the legend.
### Detailed Analysis
**ECE Subplot (Top):**
* **Trend Verification:** The "Incorrect" (light blue) bar is the tallest, indicating the highest ECE. The "Probe" (orange) and "Sampled" (dark blue) bars are shorter and of similar height.
* **Data Points (Approximate):**
* **Incorrect (Light Blue):** ~15% ECE. Error bar extends from approximately 12% to 18%.
* **Probe (Orange):** ~10% ECE. Error bar extends from approximately 8% to 12%.
* **Sampled (Dark Blue):** ~10% ECE. Error bar extends from approximately 7% to 13%.
**AUROC Subplot (Bottom):**
* **Trend Verification:** The "Sampled" (dark blue) bar is the tallest, indicating the highest AUROC. The "Incorrect" (light blue) and "Probe" (orange) bars are shorter and of similar height.
* **Data Points (Approximate):**
* **Incorrect (Light Blue):** ~50% AUROC. Error bar extends from approximately 48% to 52%.
* **Probe (Orange):** ~50% AUROC. Error bar extends from approximately 45% to 55%.
* **Sampled (Dark Blue):** ~65% AUROC. Error bar extends from approximately 60% to 70%.
### Key Observations
1. **Inverse Relationship:** There is an inverse relationship between the performance of the "Incorrect" method and the "Sampled" method across the two metrics. "Incorrect" has the worst (highest) ECE but ties for the lowest AUROC. "Sampled" has a low ECE (tied with "Probe") and the best (highest) AUROC.
2. **"Probe" Method Consistency:** The "Probe" method shows consistent, moderate performance. It matches the "Sampled" method on the calibration metric (ECE) but performs similarly to the "Incorrect" method on the discrimination metric (AUROC).
3. **Variability:** The error bars suggest the most uncertainty (widest range) is associated with the "Probe" method's AUROC measurement and the "Sampled" method's ECE measurement. The "Incorrect" method's AUROC appears to have the least variability.
### Interpretation
This chart likely compares different strategies for handling or sampling data in a machine learning context, possibly related to model calibration or uncertainty estimation.
* **What the data suggests:** The "Sampled" method appears to be the most effective overall, achieving strong discrimination (high AUROC) while maintaining good calibration (low ECE). The "Incorrect" method, which may represent a baseline or a flawed approach, is poorly calibrated (high ECE) and has poor discriminative ability. The "Probe" method offers a middle ground; it is well-calibrated but does not improve discrimination over the flawed baseline.
* **How elements relate:** The two subplots together provide a more complete picture of model performance than either metric alone. A model can be well-calibrated (low ECE) but have poor discriminative power (low AUROC), or vice-versa. The ideal model minimizes ECE while maximizing AUROC, a position occupied here by the "Sampled" method.
* **Notable anomalies:** The near-identical ECE for "Probe" and "Sampled" is notable, suggesting these two methods are equally effective at reducing calibration error compared to the "Incorrect" baseline. The significant jump in AUROC for "Sampled" is the most striking result, indicating it provides a substantial benefit in classification performance.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
## Heatmap Pair: Model Performance Comparison (Probe vs. LoRA + Prompt)
### Overview
The image displays two side-by-side heatmaps comparing the performance of two machine learning models (Mistral and LLaMA-2) under two different adaptation methods: "Probe" (left) and "LoRA + Prompt" (right). The heatmaps visualize a performance metric (likely accuracy or a similar score) based on which model was used for training and which model is being evaluated.
### Components/Axes
* **Chart Type:** Two 2x2 heatmaps.
* **Y-Axis (Vertical):** Labeled **"Model"**. The two categories are **"Mistral"** (top row) and **"LLaMA-2"** (bottom row). This axis represents the model being evaluated or probed.
* **X-Axis (Horizontal):** Labeled **"Trained On"**. The two categories are **"Mistral"** (left column) and **"LLaMA-2"** (right column). This axis represents the model on which the training or adaptation was performed.
* **Color Scale/Legend:**
* **Left Heatmap (Probe):** A vertical color bar on the right side of the heatmap. The scale ranges from **0.5** (dark purple/black) to **0.8** (bright orange/red). Intermediate markers are at **0.6** and **0.7**.
* **Right Heatmap (LoRA + Prompt):** A vertical color bar on the right side. The scale ranges from **0.65** (dark purple) to **0.80** (bright orange/red). Intermediate markers are at **0.70** and **0.75**.
* **Titles:** The left heatmap is titled **"Probe"**. The right heatmap is titled **"LoRA + Prompt"**.
### Detailed Analysis
**Left Heatmap: "Probe"**
* **Cell (Mistral Model, Trained On Mistral):** Color is a medium-dark purple. Estimated value: **~0.65**.
* **Cell (Mistral Model, Trained On LLaMA-2):** Color is bright orange-red. Estimated value: **~0.78**.
* **Cell (LLaMA-2 Model, Trained On Mistral):** Color is a bright red-pink. Estimated value: **~0.75**.
* **Cell (LLaMA-2 Model, Trained On LLaMA-2):** Color is very dark purple/black. Estimated value: **~0.52**.
**Right Heatmap: "LoRA + Prompt"**
* **Cell (Mistral Model, Trained On Mistral):** Color is dark purple. Estimated value: **~0.68**.
* **Cell (Mistral Model, Trained On LLaMA-2):** Color is bright orange-red. Estimated value: **~0.79**.
* **Cell (LLaMA-2 Model, Trained On Mistral):** Color is dark purple. Estimated value: **~0.67**.
* **Cell (LLaMA-2 Model, Trained On LLaMA-2):** Color is bright red-pink. Estimated value: **~0.76**.
### Key Observations
1. **Cross-Model Training Advantage:** In both adaptation methods, training on a *different* model than the one being evaluated yields significantly higher performance. The brightest cells (highest values) are always in the off-diagonal positions (Mistral model trained on LLaMA-2, and LLaMA-2 model trained on Mistral).
2. **Method Comparison - LLaMA-2 on LLaMA-2:** The most dramatic difference is for the LLaMA-2 model when trained on itself. With the "Probe" method, this is the worst-performing combination (~0.52). With "LoRA + Prompt," it becomes one of the best-performing combinations (~0.76).
3. **Method Comparison - Mistral on Mistral:** The performance for Mistral trained on itself improves slightly from "Probe" (~0.65) to "LoRA + Prompt" (~0.68).
4. **Overall Performance Range:** The "LoRA + Prompt" method appears to have a higher performance floor (minimum ~0.67) compared to the "Probe" method (minimum ~0.52), suggesting it may be a more robust adaptation technique.
### Interpretation
The data suggests a strong **negative transfer or interference** when a model is probed or adapted using only its own pre-trained weights (the diagonal cells in the "Probe" heatmap). This could indicate that the probing method alone is insufficient to elicit good performance from the base model on the target task.
Conversely, the **LoRA + Prompt** method appears to successfully mitigate this issue, especially for LLaMA-2. The technique seems to enable effective **knowledge transfer or adaptation** when applied across different model architectures (the off-diagonal cells), which consistently show high performance. The fact that LLaMA-2's performance on itself jumps so dramatically with LoRA + Prompt implies that this method is particularly effective at unlocking or reorganizing the model's internal knowledge for the given task, whereas simple probing fails to do so.
The consistent high performance of cross-model training (e.g., Mistral model trained on LLaMA-2 data/weights) under both methods is notable. It may suggest that the task benefits from the features or representations learned by a different but related model architecture, or that the training process effectively distills knowledge from one model into another. The "LoRA + Prompt" method seems to refine and stabilize this cross-model transfer, raising the lower bound of performance.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Comparison (ECE and AUROC)
### Overview
The image displays two vertically stacked bar charts comparing the performance of four different models or methods across two evaluation metrics: ECE (Expected Calibration Error) and AUROC (Area Under the Receiver Operating Characteristic Curve). The charts include error bars, indicating variability or confidence intervals for each measurement.
### Components/Axes
* **Legend:** Located at the top-left of the image. It defines four categories with associated colors:
* **Probe** (Dark Blue)
* **LoRA + Prompt** (Light Blue)
* **sBERT** (Orange)
* **OAIEmb** (Purple)
* **Top Chart (ECE):**
* **Y-axis Label:** "ECE"
* **Y-axis Scale:** Percentage, ranging from 0% to 20%, with major ticks at 0%, 10%, and 20%.
* **X-axis:** Implicitly represents the four model categories from the legend. No explicit x-axis labels are present below the bars.
* **Bottom Chart (AUROC):**
* **Y-axis Label:** "AUROC"
* **Y-axis Scale:** Percentage, ranging from 40% to 80%, with major ticks at 40%, 60%, and 80%.
* **X-axis:** Implicitly represents the same four model categories as the top chart.
### Detailed Analysis
**ECE Chart (Top):**
* **Trend Verification:** The bars for "Probe" and "LoRA + Prompt" are visually taller than those for "sBERT" and "OAIEmb". The error bars for "LoRA + Prompt" appear slightly larger than the others.
* **Data Points (Approximate):**
* **Probe (Dark Blue):** ~18%
* **LoRA + Prompt (Light Blue):** ~19%
* **sBERT (Orange):** ~14%
* **OAIEmb (Purple):** ~16%
**AUROC Chart (Bottom):**
* **Trend Verification:** The bar for "LoRA + Prompt" is distinctly the tallest. The bars for "Probe", "sBERT", and "OAIEmb" are of similar, lower height. The error bar for "LoRA + Prompt" is notably larger than the others.
* **Data Points (Approximate):**
* **Probe (Dark Blue):** ~55%
* **LoRA + Prompt (Light Blue):** ~65%
* **sBERT (Orange):** ~50%
* **OAIEmb (Purple):** ~52%
### Key Observations
1. **Performance Trade-off:** The model "LoRA + Prompt" achieves the highest (best) AUROC score but also has the highest (worst) ECE score among the four methods. This suggests a potential trade-off between discrimination ability (AUROC) and calibration (ECE).
2. **Relative Rankings:** The ranking of models is not consistent across metrics. "Probe" is second-best in both metrics. "sBERT" has the lowest ECE (best calibration) but also the lowest AUROC (worst discrimination). "OAIEmb" performs in the middle range for both metrics.
3. **Variability:** The "LoRA + Prompt" method shows the largest error bars, particularly in the AUROC chart, indicating greater variance or uncertainty in its performance estimate compared to the other methods.
### Interpretation
This chart likely comes from a machine learning or natural language processing study evaluating different techniques (probing, fine-tuning with LoRA and prompts, sentence-BERT embeddings, and OpenAI embeddings) on a classification task.
* **ECE (Expected Calibration Error)** measures how well a model's predicted probabilities match the actual correctness likelihood. A lower ECE is better, meaning the model is well-calibrated (e.g., when it predicts 70% confidence, it is correct about 70% of the time). The data suggests that simpler embedding methods (`sBERT`, `OAIEmb`) may be better calibrated than more complex adaptation methods (`LoRA + Prompt`).
* **AUROC** measures the model's ability to distinguish between classes. A higher AUROC is better. Here, `LoRA + Prompt` demonstrates superior discriminative power, which is often the primary goal in many applications.
The key takeaway is that the choice of method involves a balance. If reliable probability estimates are crucial (e.g., for risk assessment), `sBERT` might be preferable despite lower overall accuracy. If maximizing predictive accuracy is the sole objective, `LoRA + Prompt` is the best choice, albeit with less reliable confidence scores and higher performance variance. The `Probe` method offers a middle-ground performance on both metrics.
</details>
Figure 5: (Left) We ablate the correspondence between questions and answers by training LoRA + Prompt on a dataset with correctness labels from the model's generations but with the actual generations swapped with incorrect answers. In this case, the only relationships that can be extracted by the model are between the correctness labels and the questions. The model trained on incorrect answers generalizes surprisingly well but is much worse than a model trained on the original answers. Error bars show two standard deviations over three instruction-tuned models. (Center) We test how well models can learn to predict the correctness of a different model (in terms of AUROC), and we find that Mistral models are often better at estimating the correctness of LLaMA models than LLaMA models are on their own generations. (Right) We show that generic sentence embeddings can also perform on par with frozen language model representations (MMLU-OE), but training through a model is much better. sBERT and OAIEmb refer to training a classifier on top of sBERT (Reimers and Gurevych, 2019) or OpenAI sentence embeddings. Error bars show two standard deviations over tasks in MMLU.
Learned uncertainty estimates generalize to new formatting, subject matter, and even the generations of other models. This generalization appears to stem not simply from judging a question's difficulty based on its subject matter (a shortcut) but also from learning the correspondence between questions and correct answers.
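For reference, the Expected Calibration Error reported throughout these comparisons can be computed with a standard binned estimator (this is the usual definition, not code from the paper):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: bin predictions by confidence, then take the
    bin-weighted mean of |mean confidence - accuracy| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left so conf == 0 is not dropped
        mask = ((conf >= lo) if i == 0 else (conf > lo)) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# An estimator that always reports 90% confidence but is right only
# half the time has ECE = |0.9 - 0.5| = 0.4.
conf = np.full(1000, 0.9)
correct = np.zeros(1000)
correct[:500] = 1.0
ece = expected_calibration_error(conf, correct)  # → 0.4
```

ECE measures calibration only; AUROC complements it by measuring whether confidences rank correct answers above incorrect ones, which is why both metrics appear in every comparison above.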
### 7 Does Calibrated Confidence Improve Collaboration with AI Assistants?
One key motivation for estimating LLM uncertainty is to signal the model's reliability during collaborative decision making. To examine how our uncertainty estimates can be used in this capacity, we perform a preliminary user study (with $N=181$ participants) in which participants complete a multiple choice exam in collaboration with an LLM (Mistral 7B Instruct). For each question, the participant is provided both the LLM's prediction and an uncertainty estimate, which can come from a calibrated or an uncalibrated method. We hope to show that users are more likely to adopt calibrated uncertainty scores as part of their decision process. A more detailed description of the setup of our study is available in Appendix G.
People are sensitive to informed confidence scores.
Figure 6 shows density plots of the model's reported confidence and whether the user chose to agree with the model's prediction. We find that participants are sensitive to the confidence scores and tend to use them when deciding whether to agree or disagree with the model's prediction, provided the uncertainties are reliable. By contrast, participants generally do not modulate their reliance on the output of a random confidence baseline (Figure 6 (c)), in which the displayed uncertainty estimate is generated uniformly at random. We see the strongest discrepancy in reliance choices when LoRA + Prompt confidence scores are presented, highlighting that calibrated confidence does influence user behavior.
We include additional details and results in Appendix G. We find that confidence scores have the biggest effect on improving the lowest-performing users, rather than on average accuracy. However, this is a preliminary result in the nascent field of studying LLM uncertainties in practical collaborative decision making with users. We are still only scratching the surface of this question; for more fine-grained conclusions, a dedicated study would be needed. We outline several limitations and future directions in Appendix G.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Histogram with Density Curves: Model Confidence Distribution by Agreement
### Overview
The image displays a statistical chart comparing the distribution of model confidence percentages for two categories: "Agree" and "Disagree". It combines a histogram (bar chart) with overlaid kernel density estimate (KDE) curves to show the proportion of data points at different confidence levels.
### Components/Axes
* **Chart Type:** Histogram with overlaid density curves.
* **X-Axis:** Labeled **"Model Confidence (%)"**. The scale runs from 30 to 50, with major tick marks at 30, 35, 40, 45, and 50.
* **Y-Axis:** Labeled **"Proportion (%)"**. The scale runs from 0.00 to 0.15, with major tick marks at 0.00, 0.05, 0.10, and 0.15.
* **Legend:** Located in the **top-left corner** of the chart area. It contains two entries:
* An orange rectangle labeled **"Disagree"**.
* A green rectangle labeled **"Agree"**.
* **Data Series:**
1. **"Disagree" (Orange):** Represented by orange histogram bars and a solid orange density curve.
2. **"Agree" (Green):** Represented by green histogram bars and a solid green density curve.
* **Background:** The chart has a light gray grid background.
### Detailed Analysis
**Trend Verification & Data Points:**
1. **"Agree" Series (Green):**
* **Visual Trend:** The distribution is unimodal with a very sharp, high peak in the center and a much smaller secondary peak at the high end. The green curve rises steeply from ~38%, peaks dramatically, and falls sharply by ~46%, with a minor rise again near 50%.
* **Key Data Points (Approximate from visual inspection):**
* **Primary Peak:** The highest green bar and the apex of the green curve are located between 42% and 43% model confidence. The peak proportion is approximately **0.16 (16%)**.
* **Range:** The bulk of the "Agree" data is concentrated between ~38% and ~46% confidence.
* **Secondary Peak:** A small but distinct green bar and a slight bump in the green curve appear at approximately **50-51%** confidence, with a proportion of roughly **0.01-0.02 (1-2%)**.
2. **"Disagree" Series (Orange):**
* **Visual Trend:** The distribution is broader and flatter, with a primary peak at lower confidence and a long tail extending to the right. The orange curve rises from ~32%, peaks, and then gradually declines, with a notable secondary hump around 42%.
* **Key Data Points (Approximate from visual inspection):**
* **Primary Peak:** The highest orange bar and the peak of the orange curve are located around **35-36%** model confidence. The peak proportion is approximately **0.03-0.04 (3-4%)**.
* **Secondary Hump:** A noticeable cluster of orange bars and a rise in the orange curve occur around **41-42%** confidence, with proportions near **0.02-0.03 (2-3%)**.
* **Range:** The "Disagree" data spans from ~32% to ~48% confidence, but with much lower proportions than the "Agree" series across most of the range.
**Spatial Grounding:** The legend is positioned in the upper-left quadrant, clearly associating the orange color with "Disagree" and the green with "Agree". The green "Agree" bars are consistently taller than the orange "Disagree" bars across the central confidence range (38%-46%), confirming the higher proportion of agreement in that region.
### Key Observations
1. **Dominant Central Peak for Agreement:** The most striking feature is the extremely high concentration of "Agree" instances at a very specific model confidence level (~42-43%).
2. **Lower Confidence for Disagreement:** The "Disagree" series peaks at a lower confidence level (~35-36%) compared to the main "Agree" peak.
3. **Overlap in Mid-Range:** There is significant overlap between the two distributions in the 40-44% confidence range, where both "Agree" and "Disagree" instances are present, though "Agree" is far more prevalent.
4. **Anomalous High-Confidence Agreement:** The small, isolated peak for "Agree" at ~50% confidence is an outlier from the main distribution, suggesting a small subset of cases where the model is highly confident and agrees.
### Interpretation
This chart visualizes the calibration or behavior of a classification model. It suggests that:
* The model's confidence is not uniformly distributed. It has a strong tendency to output confidence scores in the low-40s percentage range, particularly when it "Agrees" (likely with a ground truth or another model).
* When the model "Disagrees," it tends to do so with lower confidence scores (mid-30s), indicating higher uncertainty in its disagreement.
* The sharp peak for "Agree" could indicate a systemic bias or a specific subset of data that consistently produces this confidence score. The model appears well-calibrated for agreement in a narrow band but shows more varied, lower-confidence behavior for disagreement.
* The secondary peak at 50% for "Agree" is curious. It might represent a decision boundary case, a specific class, or an artifact of the model's probability output (e.g., a softmax output hitting exactly 0.5).
**In essence, the data demonstrates that the model's confidence is a strong indicator of its agreement tendency, with high confidence being rare and mid-range confidence being highly predictive of agreement.**
</details>
|
<details>
<summary>x13.png Details</summary>

### Visual Description
\n
## Histogram with Density Curves: Model Confidence Distribution
### Overview
The image displays a statistical chart comparing the distribution of model confidence scores for two distinct groups or conditions, represented by green and orange colors. The chart combines histograms (bar charts) with overlaid kernel density estimation (KDE) curves to visualize the frequency distribution and probability density of confidence percentages.
### Components/Axes
* **Chart Type:** Histogram with overlaid density curves.
* **X-Axis:**
* **Label:** "Model Confidence (%)"
* **Scale:** Linear scale ranging from approximately 30% to 70%.
* **Major Ticks:** Labeled at 40, 50, 60, 70.
* **Minor Ticks:** Appear at 5-unit intervals (e.g., 35, 45, 55, 65).
* **Y-Axis:**
* **Label:** "Proportion (%)"
* **Scale:** Linear scale ranging from 0.00 to 0.08 (representing 0% to 8%).
* **Major Ticks:** Labeled at 0.00, 0.02, 0.04, 0.06, 0.08.
* **Data Series (Legend Implied by Color):**
* **Green Series:** Consists of semi-transparent green histogram bars and a solid green density curve.
* **Orange Series:** Consists of semi-transparent orange histogram bars and a solid orange density curve.
* **Spatial Grounding:** The green series is consistently positioned behind the orange series where they overlap. The green bars and curve are generally taller and extend further to the right (higher confidence) than the orange ones.
### Detailed Analysis
**Green Series (Bars and Curve):**
* **Trend:** The distribution is right-skewed, with a peak in the lower-middle confidence range and a long tail extending towards higher confidence values.
* **Peak:** The highest proportion (mode) occurs in the bin centered approximately at **42-43% confidence**, with a proportion value of about **0.085 (8.5%)**.
* **Shape:** The density curve rises steeply from ~30%, peaks around 42%, then declines gradually. It shows a secondary, smaller hump or plateau between **50% and 60% confidence** before tapering off near 70%.
* **Range:** The visible data spans from just below 30% to just above 70% confidence.
**Orange Series (Bars and Curve):**
* **Trend:** The distribution is also right-skewed but more concentrated at lower confidence levels compared to the green series.
* **Peak:** The highest proportion occurs in the bin centered approximately at **37-38% confidence**, with a proportion value of about **0.045 (4.5%)**.
* **Shape:** The density curve peaks earlier (at a lower confidence value) than the green curve and declines more rapidly. It has a much smaller presence beyond 50% confidence.
* **Range:** The visible data spans from just below 30% to approximately 60% confidence, with very low proportions above 55%.
**Comparative Points:**
* At confidence levels below ~45%, the orange series generally has a higher proportion than the green series.
* At confidence levels above ~45%, the green series has a significantly higher proportion than the orange series.
* The green distribution has a much heavier right tail, indicating a non-trivial proportion of predictions with high confidence (55-70%).
### Key Observations
1. **Bimodality Hint:** The green density curve suggests a potential bimodal distribution, with a primary peak near 42% and a secondary, broader mode between 50-60%.
2. **Divergent Distributions:** The two groups have clearly different confidence profiles. The "green" group produces more high-confidence predictions, while the "orange" group's predictions are more concentrated in the low-to-mid confidence range.
3. **Overlap Zone:** The highest overlap and competition between proportions occurs in the 35-45% confidence band.
4. **Uncertainty:** Exact bin heights and curve values are estimated from the visual representation. The y-axis "Proportion (%)" likely represents the relative frequency of predictions falling within each confidence bin.
### Interpretation
This chart is a diagnostic tool for evaluating model calibration or comparing two models/datasets. It answers: "How confident is the model in its predictions, and how is that confidence distributed?"
* **What the data suggests:** The green group appears to be a more "confident" model or a dataset where the model is more certain. However, high confidence does not necessarily equate to high accuracy; without a corresponding accuracy plot, we cannot assess calibration (whether a 70% confidence prediction is correct 70% of the time).
* **Relationship between elements:** The histogram bars show the empirical frequency of predictions in discrete confidence bins. The KDE curves smooth this data to estimate the underlying probability density function, making it easier to compare the shapes of the two distributions.
* **Notable anomalies/investigation:** The secondary hump in the green curve is a critical feature. It indicates a subpopulation of predictions where the model is moderately-to-highly confident (50-60%). An investigator should ask: What features or classes are associated with this secondary group? Are they correct? The stark difference between the green and orange distributions warrants investigation into the underlying causesādifferences in model architecture, training data, or the inherent difficulty of the tasks assigned to each group. The chart reveals that the groups are not just different in average confidence, but in the entire shape of their confidence profiles.
</details>
|
<details>
<summary>x14.png Details</summary>

### Visual Description
## Histogram with Overlaid Density Curves: Model Confidence Distribution
### Overview
The image displays a statistical chart combining a histogram and two overlaid kernel density estimate (KDE) curves. It visualizes the distribution of a model's confidence scores, comparing two distinct groups or datasets. The chart is presented on a white background with a light gray grid.
### Components/Axes
* **Chart Type:** Histogram with overlaid density curves.
* **X-Axis:**
* **Label:** "Model Confidence (%)"
* **Scale:** Linear, ranging from 0 to 100.
* **Major Ticks:** 0, 20, 40, 60, 80, 100.
* **Y-Axis:**
* **Label:** "Proportion (%)"
* **Scale:** Linear, ranging from 0.00 to 0.06.
* **Major Ticks:** 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06.
* **Data Series (Visual Elements):**
1. **Orange Histogram Bars:** Represent the proportion of predictions for one group (likely "incorrect" or "low-confidence" predictions) across confidence bins.
2. **Green Histogram Bars:** Represent the proportion of predictions for a second group (likely "correct" or "high-confidence" predictions) across confidence bins.
3. **Orange Line:** A smoothed KDE curve for the orange histogram data.
4. **Green Line:** A smoothed KDE curve for the green histogram data.
* **Legend:** No explicit legend is present within the chart area. The color coding (orange vs. green) is the primary means of distinguishing the two data series.
### Detailed Analysis
**Histogram Bars (Approximate Proportions per Bin):**
* **Orange Bars (Left-skewed distribution):**
* Highest peak at the 0-5% confidence bin: ~0.048 proportion.
* Significant presence in the 5-10% bin: ~0.038.
* Proportion generally decreases as confidence increases, with minor local peaks around 15-20% (~0.03) and 30-35% (~0.025).
* Very low proportions above 60% confidence, tapering to near zero by 100%.
* **Green Bars (Right-skewed distribution):**
* Very low proportions below 20% confidence.
* Begins to rise significantly around 30-35% confidence.
* Major cluster of high bars between 40% and 95% confidence.
* Highest peak appears in the 80-85% confidence bin: ~0.058 proportion.
* Other notable peaks at ~50% (~0.048), ~65% (~0.045), and ~90% (~0.045).
**Density Curves (Trend Verification):**
* **Orange Line Trend:** Starts high on the left (low confidence), slopes downward with a slight hump around 15-20%, and continues a steady decline towards the right (high confidence). This confirms the left-skewed nature of the orange data.
* **Green Line Trend:** Starts near zero on the left, rises to form a broad, multi-modal distribution across the middle-to-high confidence range. It shows a primary peak around 50% and a secondary, slightly higher peak around 80%, before declining. This confirms the right-skewed, high-confidence nature of the green data.
**Spatial Grounding:**
* The orange data (bars and line) is concentrated on the **left side** of the chart (0-50% confidence).
* The green data (bars and line) is concentrated on the **right side** of the chart (40-100% confidence).
* There is a clear zone of overlap between approximately 30% and 60% confidence where both orange and green bars are present.
### Key Observations
1. **Bimodal Separation:** The chart reveals two distinct, largely non-overlapping populations of model predictions. One group (orange) is characterized by low confidence scores, while the other (green) is characterized by moderate-to-high confidence scores.
2. **Peak Confidence Disparity:** The mode (most common value) for the orange distribution is extremely low (~2.5% confidence), while the mode for the green distribution is high (~82.5% confidence).
3. **Overlap Region:** The area between 30-60% confidence represents a "zone of uncertainty" where predictions from both groups coexist, though the green group begins to dominate as confidence increases.
4. **Absence of Extreme High Confidence for Orange:** The orange series has virtually no representation above 70% confidence, suggesting the model is rarely highly confident about this class of predictions.
### Interpretation
This chart is a classic visualization of **model calibration and discriminative performance**, likely for a binary classification task.
* **What the data suggests:** The two distributions almost certainly represent the model's confidence scores for **incorrect predictions** (orange) versus **correct predictions** (green). A well-calibrated, high-performing model should exhibit this pattern: low confidence when it's wrong and high confidence when it's right.
* **How elements relate:** The separation between the orange and green peaks indicates the model has good **discriminative ability**āit can distinguish between cases it will get right and cases it will get wrong based on its confidence score. The overlap region highlights where the model is less certain and errors are more likely.
* **Notable implications:**
* **Calibration:** The model appears reasonably well-calibrated. Its high-confidence predictions (green peak near 80%) are likely to be correct, and its low-confidence predictions (orange peak near 0%) are likely to be incorrect.
* **Performance:** The clear separation suggests high overall accuracy, as most predictions fall into the distinct high-confidence/correct or low-confidence/incorrect clusters.
* **Actionable Insight:** Predictions falling in the 30-60% confidence overlap zone could be flagged for human review, as the model is less decisive there. The near-zero orange proportion above 70% confidence is a positive sign, indicating the model rarely makes confident mistakes.
</details>
|
| --- | --- | --- |
| (a) Zero-Shot Prompt | (b) LoRA + Prompt | (c) Random (Control) |
Figure 6: We compare the distribution of LLM confidence (for Mistral 7B Instruct) in its answers against whether users ($N=20$ per variant) agree with the answer generated by the model. (a) For the zero-shot prompt, the model provides little signal, since most mass is similarly clustered. (b) Improving the calibration of the model reveals increased reliance on the LLM for more confident answers and decreased reliance for less confident answers; evidently, users are sensitive to calibrated confidence scores. (c) For reference, we verify that uniformly random confidence scores provide no meaningful signal, leaving users unable to modulate their decision to rely on the LLM. All variants are compared at approximately the same average participant accuracy.
Users are sensitive to confidence scores and use their relative magnitude to modulate their decision to rely on an LLM. Lower-performing users benefit the most from access to confidence scores. However, future work is needed to disentangle the effects of calibration from how humans choose to leverage uncertainties.
### 8 Discussion
There is much disagreement about the role of calibrated uncertainty in large language models, how it can best be achieved, and the promise of black-box methods. We hope to have shed light on these questions throughout this paper. In contrast to prior results, we find that out-of-the-box uncertainties from LLMs are unreliable for open-ended generation, and we introduce a suite of fine-tuning procedures that produce calibrated uncertainties with practical generalization properties. In the process, we discovered that fine-tuning is surprisingly sample efficient and does not seem to rely on representations of correctness specific to a model evaluating its own generations, allowing estimators to be applied from one model to another. Moreover, we found that it is possible, at least in the cases we considered, for calibrated uncertainties to be robust to distribution shifts.
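At its core, the fine-tuning recipe summarized above reduces correctness prediction to supervised learning on model features. The sketch below trains a linear probe (logistic regression by full-batch gradient descent) on synthetic stand-ins for frozen hidden states; all names and data are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen hidden features of graded answers;
# in practice these would come from the LLM being probed.
n, d = 1000, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)  # 1 = correct

# Linear probe = logistic regression trained by gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(answer is correct)
    w -= 0.1 * (X.T @ (p - y)) / n      # gradient step on the mean log loss

probe_conf = 1.0 / (1.0 + np.exp(-(X @ w)))
train_acc = ((probe_conf > 0.5) == (y == 1)).mean()
```

The probe's sigmoid output serves directly as a confidence score; swapping the frozen features for trainable LoRA-adapted features corresponds to the "training through the features" variant discussed above.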
There are many exciting questions for future work. Currently, fine-tuning relies on two separate models for question answering and uncertainty estimation. Ideally, we want a single model that can generate both answers and uncertainty estimates without switching between model weights. We anticipate that an uncertainty-aware pre-training or alignment phase might become essential, but implementing such a procedure while maintaining base language modeling abilities will introduce a challenging online learning problem in which the correctness labels evolve during training.
Beyond improving the safety and usefulness of language models, high-quality uncertainties can also be used in active learning procedures, e.g. for sample-efficient fine-tuning (Osband et al., 2022), where data points are selected based on their predicted utility and the model's uncertainty in order to balance the explore-exploit trade-off. Uncertainty estimates can also be used to improve the factuality of language models by increasing the likelihood of generations the model is confident about (i.e., generations judged likely to be correct), for example via an alignment procedure (e.g. RLHF, DPO) with a reward function that encourages confident generations (Tian et al., 2023a).
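The active-learning use mentioned above typically amounts to uncertainty sampling: grade the examples the model is least confident about next. A minimal sketch (function name and scores are ours):

```python
import numpy as np

def select_for_labeling(confidences, k):
    """Return indices of the k least-confident examples (uncertainty
    sampling), i.e. the candidates to grade or fine-tune on next."""
    return np.argsort(np.asarray(confidences))[:k]

conf = np.array([0.95, 0.40, 0.70, 0.55, 0.99])
print(select_for_labeling(conf, 2))  # indices 1 and 3, the lowest scores
```

Calibrated confidences matter here: with unreliable scores, the selected examples need not be the ones the model actually gets wrong.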
We also showed how uncertainty information can influence human decision making. In the end, LLMs will impact society through decision making, and to make reasonable decisions we need uncertainty information, particularly to protect against rare but costly mistakes.
### Acknowledgements
This work is supported by NSF CAREER IIS-2145492, NSF CDS&E-MSS 2134216, NSF HDR-2118310, BigHat Biosciences, Capital One, and an Amazon Research Award.
### References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367. Association for Computational Linguistics, jun 2019. doi: 10.18653/v1/N19-1245.
- Aroyo and Welty (2015) Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24, 2015.
- Azaria and Mitchell (2023) Amos Azaria and Tom M. Mitchell. The internal state of an LLM knows when it's lying. ArXiv, abs/2304.13734, 2023.
- Bhatt et al. (2023) Umang Bhatt, Valerie Chen, Katherine M Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, and Ameet Talwalkar. Learning personalized decision support policies. arXiv preprint arXiv:2304.06701, 2023.
- Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. Springer google schola, 2:1122ā1128, 2006.
- Bisk et al. (2019) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. ArXiv, abs/1911.11641, 2019.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Conference on Empirical Methods in Natural Language Processing, 2015.
- Burns et al. (2022) Collin Burns, Hao-Tong Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. ArXiv, abs/2212.03827, 2022.
- Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Annual Meeting of the Association for Computational Linguistics, 2023.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. ArXiv, abs/1905.10044, 2019.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- Collins et al. (2023) Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human uncertainty in concept-based ai systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869ā889, 2023.
- De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124, 2019.
- Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Gordon et al. (2011) Andrew S. Gordon, Zornitsa Kozareva, and Melissa Roemmele. Semeval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In International Workshop on Semantic Evaluation, 2011.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ArXiv, abs/2009.03300, 2020.
- Hills and Anadkat (2023) James Hills and Shyamal Anadkat. Using logprobs, Dec 2023. URL https://cookbook.openai.com/examples/using_logprobs.
- Hu et al. (2021) J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021.
- Huang et al. (2019) Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Conference on Empirical Methods in Natural Language Processing, 2019.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Janssen et al. (2008) KJM Janssen, KGM Moons, CJ Kalkman, DE Grobbee, and Y Vergouwe. Updating methods improved the performance of a clinical prediction model in new patients. Journal of Clinical Epidemiology, 61(1):76–86, 2008.
- Jiang et al. (2023) Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/2310.06825, 2023.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, T. J. Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, John Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, and Jared Kaplan. Language Models (Mostly) Know What They Know. ArXiv, abs/2207.05221, 2022.
- Keren (1991) Gideon Keren. Calibration and probability judgements: Conceptual and methodological issues. Acta psychologica, 77(3):217ā273, 1991.
- Khashabi et al. (2018) Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In North American Chapter of the Association for Computational Linguistics, 2018.
- Kruger and Dunning (1999) Justin Kruger and David Dunning. Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6):1121, 1999.
- Kruger and Dunning (2002) Justin Kruger and David Dunning. Unskilled and unaware, but why? A reply to Krueger and Mueller (2002). American Psychological Association, 2002.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. ArXiv, abs/2302.09664, 2023.
- Li and Roth (2002) Xin Li and Dan Roth. Learning question classifiers. In International Conference on Computational Linguistics, 2002.
- Lichtenstein et al. (1977) Sarah Lichtenstein, Baruch Fischhoff, and Lawrence D Phillips. Calibration of probabilities: The state of the art. In Decision Making and Change in Human Affairs: Proceedings of the Fifth Research Conference on Subjective Probability, Utility, and Decision Making, Darmstadt, 1–4 September, 1975, pages 275–324. Springer, 1977.
- Lin et al. (2022) Stephanie C. Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022, 2022.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101, 2017.
- MacKay (2004) David John Cameron MacKay. Information theory, inference, and learning algorithms. IEEE Transactions on Information Theory, 50:2544–2545, 2004.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018.
- Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015.
- Nie et al. (2019) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599, 2019.
- Osband et al. (2022) Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, and Geoffrey Irving. Fine-tuning language models via epistemic neural networks. arXiv preprint arXiv:2211.01568, 2022.
- Palan and Schitter (2018) Stefan Palan and Christian Schitter. Prolific.ac: A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17:22–27, 2018.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kƶpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019.
- Platt et al. (1999) John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
- Plaut et al. (2024) Benjamin Plaut, Khanh Nguyen, and Tu Trinh. Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a. arXiv preprint arXiv:2402.13213, 2024.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641, 2019.
- Schaal (1996) Stefan Schaal. Learning from demonstration. Advances in neural information processing systems, 9, 1996.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. ArXiv, abs/1811.00937, 2019.
- Team (2024) Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
- Terwilliger et al. (2023) Thomas C Terwilliger, Dorothee Liebschner, Tristan I Croll, Christopher J Williams, Airlie J McCoy, Billy K Poon, Pavel V Afonine, Robert D Oeffner, Jane S Richardson, Randy J Read, et al. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nature Methods, pages 1–7, 2023.
- Tian et al. (2023a) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401, 2023a.
- Tian et al. (2023b) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023b.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023b.
- Ulmer et al. (2024) Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. Calibrating large language models using their generations only. In Annual Meeting of the Association for Computational Linguistics, 2024.
- Uma et al. (2021) Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021.
- Vodrahalli et al. (2022) Kailas Vodrahalli, Tobias Gerstenberg, and James Y Zou. Uncalibrated models can improve human-ai collaboration. Advances in Neural Information Processing Systems, 35:4004ā4016, 2022.
- Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021.
- Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. ArXiv, abs/1707.06209, 2017.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
- Xiong et al. (2023) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. ArXiv, abs/2306.13063, 2023.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don't know? In Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, 2023. Association for Computational Linguistics.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019.
- Zhang et al. (2023) Hanning Zhang, Shizhe Diao, Yong Lin, Yi R Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677, 2023.
Appendix for Large Language Models Must Be Taught to Know What They Don't Know
### Appendix A Evaluation Methods
#### A.1 Evaluating Correctness
For a given question with known and generated answers $(Q,A,\hat{A})$, the correctness $C$ is True if the generated answer $\hat{A}$ matches the ground truth answer $A$. For multiple-choice question-answering, matching only involves checking the first token generated via greedy decoding.
For open-ended evaluations, determining whether the answer is correct is more complex. One simple approach is to check if the ground truth answer $A$ appears as a substring of the answer $\hat{A}$. However, this does not capture rephrasings that may be essentially equivalent, such as "NYC" for "New York City," or "Daoism" and "Taoism." Conversely, it also has the potential to be over-generous if the model is particularly verbose and emits many incorrect answers along with the correct string. Given the difficulty of writing a rule-based method for evaluating open-ended answer correctness, we instead use a strong auxiliary language model to evaluate correctness. The auxiliary language model is shown the query $Q$, the ground truth answer $A$, and the model's output $\hat{A}$, and is prompted to grade the answer whilst tolerating nuance. For full details of the prompt used see fig. 7. In this paper we utilize GPT 3.5 Turbo as the auxiliary grading model. We compare human grading, substring grading, and GPT 3.5 Turbo grading on select subsets of MMLU in section A.3, finding that humans and GPT 3.5 Turbo agree far more often than humans and the substring method.
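The substring check described above can be sketched in a few lines (the function name is ours); it illustrates both failure modes:

```python
def substring_grade(ground_truth: str, response: str) -> bool:
    """Naive rule-based grading: mark the response correct iff the
    ground-truth answer appears verbatim (case-insensitively) in it."""
    return ground_truth.lower() in response.lower()
```

This grader marks "NYC" incorrect for the ground truth "New York City" despite their equivalence, while a verbose response that happens to contain the true string anywhere is marked correct.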
#### A.2 Grading
Dataset Construction.
To perform calibration-tuning (CT), we need tuples $(Q,A,\hat{A},C)$ , answers from a language model that have been graded for correctness. When calibration-tuning on multiple choice questions, we can use an exact string match to generate $C$ . To grade open-ended answers, we use a strong language model and grading prompt $G$ instead (fig. 7):
- $\bm{G}$ : a prompt used for grading answers $\bm{\hat{A}}$ with $\bm{A}$ .
Compared to alternatives like exact match, language model grading is insensitive to rephrasings that are equivalent in meaning, such as "NYC" and "New York City," or "Daoism" and "Taoism." LLM grading can also penalize answers that are overly verbose or that use a different sense of the same word, and thus might contain incorrect answers alongside the correct string. For example, if the question is "What's it called when you move quickly by foot and both feet aren't always touching the ground?" and the LLM response is "A bank run", the grader should be able to recognize that this is semantically dissimilar to the true answer "run".
In this paper, we utilize GPT 3.5 Turbo as the auxiliary grading model. When comparing many possible grading methods on subsets of MMLU, we find that GPT 3.5 Turbo has high agreement with humans while being cost efficient (section A.3).
Grading prompt $(\bm{G})$ The problem is: $\bm{Q}$ The correct answer is: $\bm{A}$ A student submitted: $\bm{\hat{A}}$ The student's answer must be correct and specific but not overcomplete (for example, if they provide two different answers, they did not get the question right). However, small differences in formatting should not be penalized (for example, "New York City" is equivalent to "NYC"). Did the student provide an equivalent answer to the ground truth? Please answer yes or no without any explanation: $\bm{C}$ </s>
Figure 7: For open-ended generation, we calculate the ground-truth correctness $C$ using a LLM and a grading prompt ( $G$ ). The token </s> is an end-of-sentence token. Blue text is included in the loss function when calibration-tuning.
#### A.3 Comparison of Grading Techniques
We conducted an analysis of the methods outlined in section A.1 for open-ended evaluation. First, the base LLaMA-2 13b-chat model was prompted with questions from the following test subsets of MMLU: World Religions, Philosophy, Anatomy, High School Chemistry and Elementary School Math. The questions were stripped of their multiple-choice options before being supplied to the model.
A response was generated by the model via greedy decoding and this response was compared to the ground truth answer. The grading methods tested were Human, Substring Match, GPT 3.5 Turbo, and GPT 4.
The humans (a subset of our authors) were tasked to judge if the model response was essentially equivalent to the ground truth. For substring match, equivalence was determined by simply checking whether the ground truth answer existed as a substring within the model response. For GPT 3.5 Turbo and GPT 4, the models were supplied with the question, the ground truth, and the base model response, as well as a prompt indicating they should determine essential equivalence - see fig. 7.
| MMLU Subset | Substring Match | GPT 3.5 | GPT 4 |
|---|---|---|---|
| World Religions | 21.6% | 6.4% | 1.8% |
| Philosophy | 22.8% | 2.3% | 14.5% |
| Anatomy | 13.3% | 14.8% | 1.5% |
| Chemistry | 13.8% | 5.4% | 1.0% |
| Math | 12.4% | 14.8% | 3.7% |
| Average | 16.8% | 8.7% | 4.5% |
Table 2: Absolute differences in accuracy % for the different grading methods vs human estimated accuracy. A lower value corresponds to an accuracy estimate closer to the human estimate.
We recorded the binary decision on correctness for each query and response by each of the grading methods above. Taking the human scores as the gold standard of correctness, we computed the model accuracy for each subset, and then derived the absolute error in the estimate of model accuracy by each of the other grading methods. These are displayed in table 2. We see that GPT 4 is a better estimator of human-judged correctness than GPT 3.5 Turbo, which in turn is substantially better than substring match, although there is some variance on a per-subset basis. For expediency of processing time and cost, we chose to use GPT 3.5 Turbo in this paper.
#### A.4 Metrics
ECE
Given $N$ samples and $B$ equally-spaced bins $b_{j}$ , examples are assigned to bins based on the confidence of the model, and ECE is estimated as $\widehat{\text{ECE}}=\sum_{j=1}^{B}\frac{\lvert b_{j}\rvert}{N}\left\lvert\mathrm{conf}(b_{j})-\mathrm{acc}(b_{j})\right\rvert$ where $\mathrm{conf}(b_{j})$ is the average confidence of samples in bin $b_{j}$ , $\mathrm{acc}(b_{j})$ is the accuracy within the bin, and $\lvert b_{j}\rvert$ is the number of samples assigned to bin $j$ . In our experiments $\mathrm{conf}$ is equivalent to $P(\text{correct})$ .
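This estimator can be written as a short sketch (the bin-boundary convention is our choice; the text does not specify it):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Estimate ECE with equally spaced confidence bins.

    confidences: P(correct) for each sample, in [0, 1]
    correct:     graded correctness C for each sample
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # assign samples to bins by confidence; the last bin includes 1.0
        mask = (confidences >= lo) & ((confidences < hi) if hi < 1.0 else (confidences <= hi))
        if mask.sum() == 0:
            continue
        conf_b = confidences[mask].mean()  # conf(b_j)
        acc_b = correct[mask].mean()       # acc(b_j)
        ece += (mask.sum() / n) * abs(conf_b - acc_b)
    return ece
```

A perfectly calibrated bin (average confidence equal to bin accuracy) contributes zero, so ECE is zero exactly when confidence matches accuracy in every bin.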
#### A.5 MMLU Supercategory Classifier
To understand the impact of the subject matter of the training data on generalization, we follow the prescription of Hendrycks et al. [2020] and categorize each of the 57 tasks into one of four supercategories - Humanities, STEM, Social Sciences, and Other. Since we do not have such a categorization for the training set, we must estimate their proportions.
First, we use the OpenAI embeddings (dimension 1536) of the MMLU samples with their ground truth supercategories to train a linear 4-way classifier with 10 samples from each of the 57 tasks. We use AdamW [Loshchilov and Hutter, 2017] with learning rate 1e-3 and weight decay 1e-2. This classifier is then used to estimate the categories of each sample in the training set used for fine-tuning. Subsequently, the breakdown of results in fig. 4 (Left) follows.
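A minimal numpy sketch of this probe, with AdamW written out by hand (the actual implementation uses PyTorch; only the learning rate and weight decay defaults below are taken from the text, everything else is an assumption):

```python
import numpy as np

def train_linear_classifier(X, y, n_classes, lr=1e-3, wd=1e-2, steps=500, seed=0):
    """Linear multi-way probe on embeddings trained with AdamW (sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    m = np.zeros_like(W)
    v = np.zeros_like(W)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        probs[np.arange(len(y)), y] -= 1.0      # gradient of cross-entropy w.r.t. logits
        g = X.T @ probs / len(y)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        W -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * W)  # decoupled weight decay
    return W

def predict(W, X):
    return (X @ W).argmax(axis=1)
```

In our setting, `X` holds the 1536-dimensional OpenAI embeddings, `y` the ground-truth supercategory labels, and `n_classes=4`.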
### Appendix B Baseline Methods
#### B.1 Sampling Methods
We use two baselines that obtain a certainty estimate by sampling $n=10$ answers to the same question and then estimating the proportion of sampled answers that agree with the greedily decoded "main" answer. There are several critical downsides to these approaches: (i) the uncertainty depends on the sampling parameters; for example, in the limit where sampling converges to greedy decoding, the LLM will produce $n$ identical samples, and the certainty will always be 1; (ii) these approaches require $O(n)$ answer generations to provide a certainty estimate for a single generation. This computational burden prevents us from easily searching the space of sampling parameters for the optimal set, so we choose parameters arbitrarily; here we sample with top-$p=0.95$.
Counting
In this baseline, each sampled answer is compared to the greedy answer by prompting an expert LLM with both answers and asking it to judge their equivalence. The proportion of samples that are equivalent to the greedy answer is the certainty estimate. This baseline is similar to Label prob from Tian et al. [2023b]; our method differs by not choosing the argmax semantic group as the final prediction, but instead using the greedy decode for the final prediction, so as to maintain the same accuracy as our uncertainty query method.
Likelihood accumulation
In this baseline, we add up likelihoods of sampled answers to estimate the mass associated with the predicted answer. We begin by prompting an expert LLM to find which sampled answers are equivalent to the greedy answer, as in the counting baseline. The certainty estimate is then produced by adding the length-normalized likelihoods of those sampled answers equivalent to the greedy answer and dividing this quantity by the sum of all sampled answers' length-normalized likelihoods. This procedure of adding likelihoods of samples to estimate the likelihood of an equivalence class is similar to that of Kuhn et al. [2023], although they use it to produce entropy scores rather than certainty estimates. In practice, the scores produced by the two baselines are very similar, so we report only likelihood accumulation numbers in the main text.
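Both sampling baselines can be sketched as follows; the inputs (per-sample summed log-probabilities, token counts, and the expert grader's equivalence judgments) reflect our bookkeeping assumptions rather than the exact pipeline:

```python
import math

def counting_certainty(equiv_to_greedy):
    """Counting baseline: fraction of sampled answers the expert LLM
    judged equivalent to the greedily decoded answer."""
    return sum(equiv_to_greedy) / len(equiv_to_greedy)

def likelihood_accumulation_certainty(samples, equiv_to_greedy):
    """Likelihood accumulation baseline.

    samples:         list of (sum_log_prob, num_tokens) per sampled answer
    equiv_to_greedy: per-sample equivalence judgments from the grader
    """
    norm = [math.exp(lp / n) for lp, n in samples]  # length-normalized likelihoods
    matched = sum(l for l, eq in zip(norm, equiv_to_greedy) if eq)
    return matched / sum(norm)
```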
#### B.2 Verbal Elicitation
Although Tian et al. [2023b] introduce several strategies for prompting, involving multiple guesses or multiple stages of interleaving prompting and generation, we did not find that any strategy consistently outperformed any other. This finding was consistent with the results of Xiong et al. [2023]. Ultimately, for convenience, we adopted a two stage strategy with a single guess because it can be used in tandem with logged datasets of generated answers per model.
The exact prompt we used is essentially the same as in Tian et al. [2023b], but with small modifications that improved the rate of correctly formatted responses:
"Provide the probability that your answer is correct. Give ONLY the probability, no other words or explanation.
For example:
Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just the probability!>
Include probability for the answer below: Probability:"
Verbal elicitation methods typically output complex strings containing both answers and associated probabilities. This means that if any element of parsing fails, it can be challenging to construct partial results. This effect tends to diminish when using large models, which are more responsive to zero-shot prompting.
Parsing Details
The original verbal elicitation prompts are given in the appendix of Tian et al. [2023b]. However, it is not clear how the original authors parse answers from the generations or how failures to parse are handled. When we fail to parse the guess from the generation, we return an empty string and associated probability 0.5. When we fail to parse a probability, we also return probability 0.5. For versions with multiple guesses, if any part of the parsing process fails in an ambiguous way, we default back to an empty string for the answer and 0.5 for the probability. The only unambiguous cases are those which explicitly succeed in generating a valid guess and probability for the first guess but not subsequent ones; in this scenario, we default to using the successfully parsed first guess and its associated probability.
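The probability-parsing fallback can be sketched as follows (the regular expression is illustrative; exact parsing depends on the prompt format):

```python
import re

def parse_verbal_probability(generation: str, default: float = 0.5) -> float:
    """Parse a verbalized probability; on any failure, fall back to 0.5."""
    match = re.search(r"(\d*\.?\d+)", generation)
    if match is None:
        return default
    try:
        p = float(match.group(1))
    except ValueError:
        return default
    # out-of-range numbers are also treated as parse failures
    return p if 0.0 <= p <= 1.0 else default
```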
### Appendix C Fine-tuning Method
#### C.1 Regularization Term
To keep the calibration-tuned parameters $\theta$ within the neighborhood of the initial parameters, $\theta_{0}$ , we use a regularization term that penalizes the divergence between the original sampling distribution and the calibration-tuned model on the target sequence $A$ , yielding regularization $\mathcal{R}(\theta;\theta_{0})$ , which we use with weighting parameter $\kappa$ .
Specifically, let $p_{\theta_{0}}$ be the language modeling distribution of the language model we wish to calibration-tune, and $q_{\theta}$ be the corresponding language modeling distribution as a consequence of calibration-tuning. We then use the Jensen-Shannon Divergence ${\mathrm{JSD}(p_{\theta_{0}}\parallel q_{\theta})}$ [MacKay, 2004] between the two language modeling distributions as the regularizer, where ${\mathrm{JSD}(p\parallel q)\triangleq\nicefrac{{1}}{{2}}(\mathrm{KL}(p\parallel m)+\mathrm{KL}(q\parallel m))}$ , where $m\triangleq\nicefrac{{1}}{{2}}(p+q)$ is the mixture distribution. JSD regularization is applied only to the logits corresponding to the target sequence $A$ .
We note that using either direction of the KL-divergence alone, i.e. the forward KL $\mathrm{KL}(p_{\theta_{0}}\parallel q_{\theta})$ or the reverse KL $\mathrm{KL}(q_{\theta}\parallel p_{\theta_{0}})$, was insufficient for optimal performance with calibration-tuning. The forward KL-divergence encourages zero-avoiding behavior: the mass of $q_{\theta}$ is spread across multiple modes of $p_{\theta_{0}}$ so as to avoid assigning no mass to regions of the probability space. By contrast, the reverse KL-divergence encourages zero-forcing behavior, such that $q_{\theta}$ only needs to cover any one mode of $p_{\theta_{0}}$ [Bishop, 2006]. It is not obvious which of these behaviors one should prefer in the specific case of large language models. Therefore, as a practical choice, we pick the divergence that yields the most performant calibration-tuned model.
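A numpy sketch of the regularizer (array shapes and the masking convention are our assumptions; the real implementation operates on model logits in PyTorch):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def jsd_regularizer(logits_frozen, logits_tuned, target_mask):
    """JSD(p_theta0 || q_theta) averaged over target-sequence positions.

    logits_*:    (seq_len, vocab) logits from the frozen and tuned models
    target_mask: (seq_len,) boolean mask selecting the answer tokens A
    """
    log_p = log_softmax(logits_frozen[target_mask])
    log_q = log_softmax(logits_tuned[target_mask])
    log_m = np.logaddexp(log_p, log_q) - np.log(2.0)  # log of mixture 1/2 (p + q)
    kl_pm = (np.exp(log_p) * (log_p - log_m)).sum(axis=-1)
    kl_qm = (np.exp(log_q) * (log_q - log_m)).sum(axis=-1)
    return 0.5 * (kl_pm + kl_qm).mean()
```

The JSD is zero when the two distributions coincide and is bounded above by $\log 2$ (in nats), which is reached when they have disjoint support, so the penalty is well behaved even when the tuned model drifts far from initialization.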
#### C.2 Training Data
We reserve the following datasets for training.
- AI2 Reasoning Challenge (ARC) [Clark et al., 2018],
- Boolean Questions (BoolQ) [Clark et al., 2019],
- CommonsenseQA [Talmor et al., 2019],
- CosmosQA [Huang et al., 2019],
- HellaSwag [Zellers et al., 2019],
- MathQA [Amini et al., 2019],
- Recognizing Textual Entailment (RTE/SNLI) [Bowman et al., 2015],
- Adversarial NLI [Nie et al., 2019],
- OpenBookQA [Mihaylov et al., 2018],
- PIQA [Bisk et al., 2019],
- SciQ [Welbl et al., 2017],
- The CommitmentBank (CB) [De Marneffe et al., 2019],
- Multi-Sentence Reading Comprehension (MultiRC) [Khashabi et al., 2018],
- Choice of Plausible Alternatives (CoPA) [Gordon et al., 2011],
- TREC [Li and Roth, 2002],
- Adversarial Winograd (Winogrande) [Sakaguchi et al., 2019].
#### C.3 Training Hyperparameters
We use HuggingFace Transformers [Wolf et al., 2020] and PyTorch [Paszke et al., 2019] for the implementation of these models. For all our experiments, we use the AdamW optimizer [Loshchilov and Hutter, 2017] with a learning rate of $10^{-4}$, a cosine decay schedule, and effective batch size $M=32$. The training runs for $G=10000$ steps with an initial linear warmup over the first $1000$ steps.
### Appendix D Extended MMLU Results
We report the breakdown of uncertainty query accuracy and ECE on all MMLU tasks in figs. 8, 9, 10 and 11.
<details>
<summary>x15.png Details</summary>

![Refer to caption](extracted/2406.08391/x15.png)

Grouped horizontal bar chart, one panel per model (LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, Mistral 7B, Mistral 7B Instruct). Each panel plots ECE (lower is better) and AUROC (higher is better) per MMLU subject, from abstract_algebra through high_school_physics, for four methods: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt.
</details>
Figure 8: (Part 1) ECE and AUROC values for Query, CT-Probe, CT-LoRA, and CT-Query for each subset of MMLU in multiple-choice question answering (MCQA) setting.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Grouped Bar Chart: Language Model Performance Across Academic Subjects (ECE & AUROC)
### Overview
This image is a complex, multi-panel grouped bar chart comparing the performance of six different Large Language Models (LLMs) across 30 academic subjects. Performance is measured using two metrics: Expected Calibration Error (ECE) and Area Under the Receiver Operating Characteristic Curve (AUROC). For each model and subject, four different methods are evaluated: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt.
### Components/Axes
* **Chart Type:** Multi-panel grouped horizontal bar chart.
* **Panels (Columns):** Six distinct panels, each representing a different LLM. From left to right:
1. LLaMA-2 7B
2. LLaMA-2 7B Chat
3. LLaMA-2 13B
4. LLaMA-2 13B Chat
5. Mistral 7B
6. Mistral 7B Instruct
* **Y-Axis (Rows):** Lists 30 academic subjects. From top to bottom:
`high_school_psychology`, `high_school_statistics`, `high_school_us_history`, `high_school_world_history`, `human_aging`, `human_sexuality`, `international_law`, `jurisprudence`, `logical_fallacies`, `machine_learning`, `management`, `marketing`, `medical_genetics`, `miscellaneous`, `moral_disputes`, `moral_scenarios`, `nutrition`, `philosophy`, `prehistory`, `professional_accounting`, `professional_law`, `professional_medicine`, `professional_psychology`, `public_relations`, `security_studies`, `sociology`, `us_foreign_policy`, `virology`, `world_religions`.
* **X-Axis (Metrics):** Each panel has two sub-columns at the bottom, labeled:
* **ECE** (Expected Calibration Error): Scale from 0% to approximately 20%. Lower values are better.
* **AUROC** (Area Under the ROC Curve): Scale from 50% to 90%. Higher values are better.
* **Legend (Bottom Center):** Four colored bars define the methods:
* **Maroon/Dark Red:** Zero-Shot Classifier
* **Light Purple/Lavender:** Probe
* **Medium Purple:** LoRA
* **Dark Purple/Indigo:** LoRA + Prompt
* **Spatial Layout:** The legend is positioned at the bottom center of the entire figure. Each of the six model panels is arranged vertically, with subjects listed on the far left. Within each panel, for every subject, four horizontal bars are grouped together, corresponding to the four methods in the legend.
### Detailed Analysis
**General Trends Across All Models:**
1. **Metric Comparison:** For nearly all subjects and models, the AUROC bars (right sub-column) are significantly longer than the ECE bars (left sub-column), indicating that models generally achieve moderate to good discriminative ability (AUROC > 50%) while maintaining relatively low calibration error (ECE < ~15%).
2. **Method Performance Hierarchy:** A consistent pattern is visible across most subjects and models:
* **Zero-Shot Classifier (Maroon):** Typically shows the shortest bars for AUROC and often the longest bars for ECE, indicating the poorest performance.
* **Probe (Light Purple):** Shows a clear improvement over Zero-Shot, with longer AUROC bars and shorter ECE bars.
* **LoRA (Medium Purple) & LoRA + Prompt (Dark Purple):** These two methods consistently perform the best. Their bars are often very close in length, with LoRA + Prompt showing a slight, consistent edge in AUROC (longer bar) and a slight edge in ECE (shorter bar) over LoRA alone in many cases.
**Model-Specific Observations:**
* **LLaMA-2 7B vs. 7B Chat:** The Chat variant generally shows improved AUROC scores (longer bars) across many subjects compared to the base 7B model, particularly for the fine-tuned methods (Probe, LoRA).
* **LLaMA-2 13B vs. 13B Chat:** Similar to the 7B pair, the 13B Chat model often outperforms the base 13B model in AUROC, though the difference appears less dramatic than in the 7B case.
* **Mistral 7B vs. 7B Instruct:** The Instruct variant shows a very strong performance boost over the base Mistral 7B. The AUROC bars for LoRA/LoRA+Prompt in the Instruct model are frequently among the longest in the entire chart, often exceeding 80%.
* **Scale (7B vs. 13B):** Comparing LLaMA-2 7B to 13B, the larger 13B model shows a modest but visible improvement in AUROC for most methods and subjects.
**Subject-Specific Highlights (Approximate Values):**
* **High ECE (Poor Calibration):** The `machine_learning` subject often shows relatively high ECE values (bars extending further left) across models, especially for the Zero-Shot method. `moral_scenarios` also shows notable ECE.
* **High AUROC (Strong Discrimination):** Subjects like `high_school_psychology`, `professional_psychology`, and `sociology` frequently show very high AUROC scores (>80%) for the best-performing methods (LoRA + Prompt) across multiple models.
* **Challenging Subjects:** Subjects like `professional_law`, `international_law`, and `jurisprudence` tend to have shorter AUROC bars overall, suggesting they are more difficult for the models to classify correctly.
### Key Observations
1. **Fine-Tuning is Crucial:** The most striking pattern is the substantial performance gap between the Zero-Shot Classifier and all other methods (Probe, LoRA, LoRA+Prompt). This demonstrates that some form of adaptation or fine-tuning is essential for strong performance on these academic tasks.
2. **Instruction Tuning Matters:** The "Chat" and "Instruct" variants of models consistently outperform their base counterparts, highlighting the value of instruction-following training for these types of knowledge-based QA tasks.
3. **LoRA + Prompt is the Top Performer:** The dark purple bars (LoRA + Prompt) are almost universally the longest for AUROC and the shortest for ECE, indicating this combined approach yields the best calibrated and most accurate models.
4. **Performance Variability:** There is significant variability in model performance across different academic subjects, indicating that model knowledge and reasoning ability are not uniform across domains.
### Interpretation
This chart provides a comprehensive benchmark of LLM capabilities across a wide spectrum of academic knowledge. The data suggests several key insights:
1. **The Necessity of Adaptation:** The poor performance of the Zero-Shot Classifier underscores that simply using a pre-trained LLM's internal knowledge is insufficient for high-accuracy, well-calibrated classification on specialized academic topics. Parameter-efficient fine-tuning methods like LoRA are highly effective.
2. **Synergy of Methods:** The consistent, albeit sometimes small, superiority of "LoRA + Prompt" over "LoRA" alone suggests that combining parameter-efficient fine-tuning with optimized prompting creates a synergistic effect, leading to better model performance than either technique in isolation.
3. **Model Architecture and Training Impact:** The comparison between base and chat/instruct models, and between 7B and 13B parameter models, illustrates that both the training objective (instruction tuning) and model scale contribute positively to performance. The Mistral 7B Instruct model's strong showing suggests its training regimen is particularly effective for this evaluation setup.
4. **Domain-Specific Challenges:** The variation in performance across subjects (e.g., law vs. psychology) implies that the underlying training data of these models may be imbalanced, or that some domains require more complex reasoning that is harder for the models to capture. This has implications for using LLMs as general-purpose knowledge engines.
**In summary, the chart is a detailed map of LLM performance, revealing that while base models have foundational knowledge, achieving high accuracy and reliability on academic tasks requires targeted fine-tuning and prompting strategies, with the combination of LoRA and prompt engineering emerging as the most robust approach among those tested.**
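Both metrics reported throughout these charts can be computed from a list of confidence scores and binary correctness labels. Below is a minimal sketch, using the standard equal-width-bin definition of ECE and the rank-sum (Mann-Whitney) formulation of AUROC; it is an illustration of the metrics, not the paper's evaluation code:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |mean confidence - accuracy| gap per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(confidences)
    return sum(
        (len(b) / total)
        * abs(sum(c for c, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins if b
    )

def auroc(confidences, corrects):
    """AUROC as the probability that a correct answer receives a
    higher confidence than an incorrect one (ties count 1/2)."""
    pos = [c for c, y in zip(confidences, corrects) if y]
    neg = [c for c, y in zip(confidences, corrects) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly discriminative but overconfident estimator would score 1.0 on AUROC while still incurring nonzero ECE, which is exactly the trade-off the charts surface.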
</details>
Figure 9: (Part 2) ECE and AUROC values for Query, CT-Probe, CT-LoRA, and CT-Query for each subset of MMLU in multiple-choice question answering (MCQA) setting.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Grouped Bar Chart: Language Model Performance Across Academic Subjects
### Overview
This image is a complex, multi-panel grouped bar chart comparing the performance of six different Large Language Model (LLM) variants across 28 academic subjects. Performance is measured using two metrics: Expected Calibration Error (ECE) and Area Under the Receiver Operating Characteristic curve (AUROC). Each model's performance is evaluated using four different methods, represented by distinct colors.
### Components/Axes
**1. Top Header (Model Columns):**
Six vertical columns, each representing a different LLM variant. From left to right:
- `LLaMA-2 7B`
- `LLaMA-2 7B Chat`
- `LLaMA-2 13B`
- `LLaMA-2 13B Chat`
- `Mistral 7B`
- `Mistral 7B Instruct`
**2. Left Vertical Axis (Subject Categories):**
A list of 28 academic subjects, ordered from top to bottom. The full list is:
`abstract_algebra`, `anatomy`, `astronomy`, `business_ethics`, `clinical_knowledge`, `college_biology`, `college_chemistry`, `college_computer_science`, `college_mathematics`, `college_medicine`, `college_physics`, `computer_security`, `conceptual_physics`, `econometrics`, `electrical_engineering`, `elementary_mathematics`, `formal_logic`, `global_facts`, `high_school_biology`, `high_school_chemistry`, `high_school_computer_science`, `high_school_european_history`, `high_school_geography`, `high_school_government_and_politics`, `high_school_macroeconomics`, `high_school_mathematics`, `high_school_microeconomics`, `high_school_physics`.
**3. Bottom Horizontal Axis (Metrics):**
Each of the six model columns has its own x-axis at the bottom, labeled with two metrics:
- `ECE` (Expected Calibration Error) - Scale: 20%, 50%, 90% (lower is better).
- `AUROC` (Area Under the ROC Curve) - Scale: 20%, 50%, 90% (higher is better).
**4. Legend (Bottom Center):**
Four colored squares with labels, defining the evaluation methods:
- **Red Square:** `Zero-Shot Classifier`
- **Light Purple Square:** `Probe`
- **Medium Purple Square:** `LoRA`
- **Dark Purple Square:** `LoRA + Prompt`
**5. Chart Structure:**
For each of the 28 subjects within each of the 6 model columns, there are two small grouped bar clusters:
- The left cluster shows the `ECE` value for the four methods.
- The right cluster shows the `AUROC` value for the four methods.
The bars are horizontal, extending from a central baseline. The length of the bar corresponds to the metric value.
### Detailed Analysis
**Trend Verification & Data Extraction (Approximate Values):**
The chart is dense with data. A precise point-by-point extraction is not feasible, but clear trends and representative values can be identified.
*General Pattern Across Models:*
- **ECE (Left Cluster):** The `Zero-Shot Classifier` (red) consistently shows the shortest bars (lowest ECE, best calibration) across most subjects and models. The `Probe` (light purple) often has the longest bars (highest ECE, worst calibration). `LoRA` and `LoRA + Prompt` (medium and dark purple) typically fall in between.
- **AUROC (Right Cluster):** The pattern often reverses. `LoRA + Prompt` (dark purple) frequently shows the longest bars (highest AUROC, best discriminative performance), followed closely by `LoRA` (medium purple). `Zero-Shot Classifier` (red) and `Probe` (light purple) generally have shorter bars (lower AUROC).
*Model-Specific Observations:*
- **LLaMA-2 7B vs. 7B Chat:** The Chat variant generally shows slightly improved AUROC scores (longer dark/medium purple bars) across many subjects, with a similar ECE pattern.
- **LLaMA-2 13B vs. 13B Chat:** Similar to the 7B pair, the Chat variant shows a modest performance boost in AUROC. The 13B models, in general, have slightly longer AUROC bars than their 7B counterparts.
- **Mistral 7B vs. 7B Instruct:** The Instruct variant shows a more pronounced improvement in AUROC over the base Mistral 7B, especially for `LoRA + Prompt`. The ECE for `Zero-Shot Classifier` remains very low.
*Subject-Specific Outliers:*
- **`college_computer_science`:** For `LLaMA-2 13B`, the `Zero-Shot Classifier` (red) AUROC bar is exceptionally long, nearing 90%, significantly outperforming other methods for that model/subject combination.
- **`formal_logic`:** Across most models, the AUROC scores are relatively low (bars are short), suggesting this is a challenging subject for all tested models.
- **`high_school_mathematics`:** Shows a very clear stratification: `LoRA + Prompt` (dark purple) has the highest AUROC, followed by `LoRA`, then `Probe`, with `Zero-Shot` lowest. The ECE shows the inverse order.
### Key Observations
1. **Method Trade-off:** There is a clear and consistent trade-off between calibration (ECE) and discriminative performance (AUROC). The `Zero-Shot Classifier` is best calibrated but has lower AUROC. `LoRA + Prompt` achieves the highest AUROC but often at the cost of worse calibration (higher ECE).
2. **Impact of Instruction Tuning:** For both LLaMA-2 and Mistral, the "Chat" or "Instruct" variants consistently improve AUROC scores, particularly when using the `LoRA + Prompt` method, indicating fine-tuning for instruction following boosts task performance.
3. **Scale Benefit:** The 13B LLaMA-2 models generally outperform the 7B models on AUROC, demonstrating a benefit from increased model scale.
4. **Subject Difficulty:** Subjects like `formal_logic`, `abstract_algebra`, and `econometrics` tend to have shorter AUROC bars across the board, indicating they are more difficult for these models. Subjects like `high_school_biology` and `clinical_knowledge` show higher overall AUROC scores.
### Interpretation
This chart provides a multifaceted evaluation of LLM capabilities, moving beyond simple accuracy to examine calibration and performance across diverse knowledge domains.
**What the Data Suggests:**
- **No Single Best Method:** The choice of evaluation method (`Zero-Shot`, `Probe`, `LoRA`, `LoRA + Prompt`) involves a fundamental trade-off. If reliable confidence estimates are critical (low ECE), a `Zero-Shot Classifier` may be preferable. If maximizing discriminative power (high AUROC) is the goal, `LoRA + Prompt` is superior.
- **The Value of Specialization:** The superior AUROC of `LoRA` and `LoRA + Prompt` methods suggests that adapting the model to the specific task (via parameter-efficient fine-tuning) yields better performance than generic probing or zero-shot approaches. Adding prompt engineering (`LoRA + Prompt`) provides an additional boost.
- **Instruction Tuning is Effective:** The consistent improvement of Chat/Instruct models indicates that alignment training generalizes well, enhancing performance on a wide array of academic multiple-choice questions, not just conversational tasks.
**Underlying Implications:**
The data implicitly argues for a more nuanced approach to model evaluation. A model's "performance" is not a single number but a profile across metrics and methods. For real-world deployment, one must decide whether calibration or raw discriminative ability is more important. Furthermore, the results validate the use of parameter-efficient fine-tuning (LoRA) combined with prompt engineering as a powerful strategy for adapting foundation models to specialized knowledge tasks, outperforming both out-of-the-box usage and simpler probing techniques. The persistent difficulty of formal logic and abstract mathematics subjects highlights a current frontier in LLM reasoning capabilities.
</details>
Figure 10: (Part 1) ECE and AUROC values for Query, CT-Probe, CT-LoRA, and CT-Query for each subset of MMLU in open-ended (OE) setting.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Heatmap-Style Comparative Bar Chart: Language Model Performance Across Academic Subjects
### Overview
This image is a complex, multi-panel comparative bar chart evaluating the performance of six different Large Language Models (LLMs) across 29 academic subjects. The chart uses a grid layout where each cell contains grouped bars representing four different evaluation methods. The primary metrics are Expected Calibration Error (ECE) and Area Under the Receiver Operating Characteristic curve (AUROC), presented as percentages.
### Components/Axes
**Top Header (Model Columns):**
Six main vertical columns, each representing a distinct LLM:
1. LLaMA-2 7B
2. LLaMA-2 7B Chat
3. LLaMA-2 13B
4. LLaMA-2 13B Chat
5. Mistral 7B
6. Mistral 7B Instruct
**Sub-Columns (Metrics):**
Each model column is subdivided into two metric columns:
* **Left Sub-column:** Labeled "ECE" at the bottom. The x-axis scale shows markers at 20% and 50%.
* **Right Sub-column:** Labeled "AUROC" at the bottom. The x-axis scale shows markers at 60% and 90%.
**Y-Axis (Subjects):**
A vertical list of 29 academic subjects, from top to bottom:
`high_school_psychology`, `high_school_statistics`, `high_school_us_history`, `high_school_world_history`, `human_aging`, `human_sexuality`, `international_law`, `jurisprudence`, `logical_fallacies`, `machine_learning`, `management`, `marketing`, `medical_genetics`, `miscellaneous`, `moral_disputes`, `moral_scenarios`, `nutrition`, `philosophy`, `prehistory`, `professional_accounting`, `professional_law`, `professional_medicine`, `professional_psychology`, `public_relations`, `security_studies`, `sociology`, `us_foreign_policy`, `virology`, `world_religions`.
**Legend (Bottom Center):**
Four colored bars define the evaluation methods:
* **Red:** Zero-Shot Classifier
* **Light Purple:** Probe
* **Medium Purple:** LoRA
* **Dark Purple:** LoRA + Prompt
**Spatial Layout:**
The legend is centered at the very bottom. The model names are centered above their respective columns. Subject labels are left-aligned along the entire left edge. Each subject row contains 12 grouped bars (4 methods x 2 metrics x 6 models).
### Detailed Analysis
**General Performance Trends:**
* **AUROC vs. ECE:** Across nearly all models and subjects, the AUROC scores (right sub-column in each cell) are significantly higher than the ECE scores (left sub-column). AUROC values frequently range between 60%-90%, while ECE values are often clustered between 20%-50%.
* **Method Comparison (AUROC):** A consistent hierarchy is visible. The **LoRA + Prompt (dark purple)** method almost universally achieves the highest AUROC bars, often extending near or beyond the 90% mark. **LoRA (medium purple)** is typically second, followed by **Probe (light purple)**. The **Zero-Shot Classifier (red)** generally shows the lowest AUROC performance, with bars often ending near or below the 60% line.
* **Method Comparison (ECE):** For ECE (where lower is better), the pattern is less uniform but **Zero-Shot Classifier (red)** often has the shortest bars (best calibration), while **LoRA + Prompt (dark purple)** sometimes shows longer bars (worse calibration), suggesting a potential trade-off between discrimination (AUROC) and calibration (ECE).
**Model-Specific Observations:**
* **LLaMA-2 7B vs. 7B Chat:** The "Chat" variant generally shows slightly improved AUROC scores across many subjects compared to the base 7B model.
* **LLaMA-2 13B vs. 13B Chat:** A similar, though less pronounced, improvement from base to "Chat" variant is visible.
* **Mistral 7B vs. 7B Instruct:** The "Instruct" variant shows a clear and consistent improvement in AUROC over the base Mistral 7B model across nearly all subjects and methods.
* **Scale (7B vs. 13B):** The 13B LLaMA-2 models (both base and Chat) generally exhibit higher AUROC ceilings than their 7B counterparts, particularly for the LoRA-based methods.
**Subject-Specific Variations:**
* **High-Performing Subjects:** Subjects like `professional_accounting`, `professional_law`, and `professional_medicine` often show very high AUROC scores (>85%) for the LoRA + Prompt method across models.
* **Challenging Subjects:** Subjects such as `moral_scenarios`, `philosophy`, and `world_religions` tend to have lower overall AUROC scores and a smaller performance gap between the Zero-Shot and fine-tuned methods, indicating they may be more difficult for the models to master.
* **Notable Outlier:** In the `machine_learning` row for the `LLaMA-2 13B Chat` model, the Zero-Shot Classifier (red) AUROC bar is exceptionally long, rivaling the LoRA methods. This is an unusual deviation from the typical pattern.
### Key Observations
1. **Consistent Method Hierarchy:** The performance order (LoRA + Prompt > LoRA > Probe > Zero-Shot) for AUROC is remarkably stable across 29 subjects and 6 models.
2. **Calibration-Discrimination Trade-off:** Methods that excel at discrimination (high AUROC, like LoRA + Prompt) often show worse calibration (higher ECE).
3. **Instruction Tuning Benefit:** Both "Chat" (LLaMA-2) and "Instruct" (Mistral) variants outperform their base models, confirming the value of instruction tuning for these tasks.
4. **Domain Sensitivity:** Performance is not uniform; professional and technical subjects yield higher scores than humanities and moral reasoning subjects.
### Interpretation
This chart provides a comprehensive benchmark of LLM performance on academic multiple-choice tasks, evaluating not just accuracy (via AUROC) but also the reliability of confidence scores (via ECE).
**What the Data Suggests:**
* **Fine-Tuning is Highly Effective:** The dramatic superiority of LoRA and LoRA + Prompt methods demonstrates that lightweight fine-tuning is crucial for achieving high performance on specialized academic knowledge, far surpassing zero-shot capabilities.
* **Prompts Amplify Fine-Tuning:** The consistent edge of "LoRA + Prompt" over "LoRA" alone indicates that combining parameter-efficient fine-tuning with carefully crafted prompts yields the best results, suggesting synergy between these techniques.
* **The Cost of Performance:** The worse ECE scores for the best-performing methods imply that as models become more accurate discriminators on these tasks, their confidence scores become less calibrated. This is a critical consideration for real-world applications where reliable uncertainty estimates are needed.
* **Foundation Model Progression:** The performance jump from LLaMA-2 7B to 13B, and from base to instruction-tuned variants, illustrates the scaling and alignment laws in practice. Mistral 7B Instruct's strong performance, often matching or exceeding LLaMA-2 13B Chat, highlights rapid progress in open-weight model efficiency.
**Underlying Implications:**
The chart implicitly argues for a move beyond zero-shot evaluation in AI research. It showcases a methodology for deeply probing model capabilities across diverse domains. The persistent difficulty in subjects like `moral_scenarios` points to ongoing challenges in aligning models with nuanced human reasoning. For developers, the clear message is that to deploy LLMs in educational or professional knowledge domains, investing in domain-specific fine-tuning (like LoRA) is not just beneficial but likely necessary to reach acceptable performance levels.
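The LoRA adaptation referenced throughout can be sketched in a few lines: instead of updating a frozen weight matrix, one learns a low-rank residual added to it. The toy numpy illustration below uses invented dimensions and is not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16        # hidden size, LoRA rank, scaling (illustrative values)

W = rng.normal(size=(d, d))   # frozen pretrained weight, never updated

# LoRA learns a low-rank residual B @ A in place of a full update to W.
A = rng.normal(scale=0.01, size=(r, d))  # small random init
B = np.zeros((d, r))                     # zero init: training starts from the base model

def forward(x):
    return x @ (W + (alpha / r) * (B @ A)).T
```

Because `B` starts at zero, the adapted model initially reproduces the frozen model exactly, and only `A` and `B` (2·d·r parameters rather than d²) receive gradients during fine-tuning.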
</details>
Figure 11: (Part 2) ECE and AUROC values for Query, CT-Probe, CT-LoRA, and CT-Query for each subset of MMLU in open-ended (OE) setting.
### Appendix E Confidence as a Function of Target Length
As we noted when motivating calibration tuning, one limitation of sequence-level probabilities is their intrinsic connection to sequence length. The probability of a sequence decreases with increasing length, regardless of the correctness of the response. By contrast, we wouldn't expect concept-level probabilities to have any discernible relationship with sequence length. In this appendix, we show there is no consistent relationship between the confidence estimated by the calibration-tuned model and target sequence length on MMLU tasks.
A key limitation of using token likelihoods is that they necessarily decay with the length of the generation. In figs. 12, 13 and 14, we confirm over all subsets of MMLU that the length of the target does not strongly correlate with the confidence assigned to the targets. This behavior is an essential ingredient of effective confidence estimation in practice: correct responses should not receive lower confidence merely because they are longer.
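The decay is easy to see numerically: a sequence-level likelihood is a product of per-token probabilities, so it shrinks geometrically with length even when the model is confident in every single token, whereas a length-normalized (per-token) score stays flat. A toy illustration with an assumed per-token probability of 0.9:

```python
import math

p_token = 0.9  # assumed per-token probability, identical at every position

for length in (1, 5, 20, 80):
    seq_prob = p_token ** length                       # raw sequence likelihood
    per_token = math.exp(math.log(seq_prob) / length)  # length-normalized score
    print(f"length={length:3d}  sequence prob={seq_prob:.2e}  normalized={per_token:.2f}")
```

Even at 90% per-token confidence, the raw sequence probability falls below 0.001 by 80 tokens, while the normalized score stays at 0.9 throughout.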
<details>
<summary>x19.png Details</summary>

### Visual Description
## Scatter Plot with Regression: Abstract Algebra Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with an overlaid linear regression line and its confidence interval. It also includes marginal distribution plots (histograms/density plots) on the top and right edges. The chart explores the relationship between "Target Length" and "Confidence" for a category labeled "abstract_algebra."
### Components/Axes
* **Title:** "abstract_algebra" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 75.
* **Major Tick Marks:** 0, 25, 50, 75.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0 to approximately 0.7.
* **Major Tick Marks:** 0, 0.2, 0.4, 0.6.
* **Legend:** Located in the top-left corner of the main plot area. It contains a purple square symbol followed by the text "abstract_algebra," identifying the data series.
* **Data Series:** Represented by purple circular points scattered across the plot.
* **Regression Line:** A solid purple line showing the best linear fit through the data points.
* **Confidence Interval:** A semi-transparent purple shaded region surrounding the regression line, representing the uncertainty of the fit.
* **Marginal Plots:**
* **Top:** A distribution plot (likely a histogram or kernel density estimate) for the "Target Length" variable.
* **Right:** A distribution plot for the "Confidence" variable.
### Detailed Analysis
* **Data Point Distribution:** The purple data points are densely clustered in the lower-left quadrant of the plot, specifically where Target Length is between 0 and 25 and Confidence is between 0 and 0.3. The density of points decreases significantly as both Target Length and Confidence increase.
* **Regression Trend:** The solid purple regression line exhibits a clear positive slope. It originates at a Confidence value of approximately 0.1 when Target Length is 0 and rises to a Confidence value of approximately 0.4 when Target Length is 75.
* **Confidence Interval:** The shaded purple confidence interval is narrowest at the center of the data mass (around Target Length 10-20) and flares outwards, becoming substantially wider at the extremes of the x-axis (near 0 and 75). This indicates greater uncertainty in the regression estimate where data is sparse.
* **Marginal Distributions:**
* The top marginal plot shows the distribution of "Target Length" is right-skewed, with a high peak near 0 and a long tail extending to 75.
* The right marginal plot shows the distribution of "Confidence" is also right-skewed, with a high peak near 0.1-0.2 and a tail extending towards 0.6.
### Key Observations
1. **Positive Correlation:** There is a visible positive linear relationship between Target Length and Confidence. As the target length increases, the confidence score tends to increase.
2. **High Variance/Spread:** The data points show considerable vertical spread (variance in Confidence) for any given Target Length, especially in the 0-25 range. This suggests Target Length is not a strong sole predictor of Confidence.
3. **Data Sparsity:** The majority of observations are concentrated at low Target Lengths. There are very few data points with a Target Length greater than 50, which contributes to the wide confidence interval at the high end.
4. **Potential Outliers:** A few data points exist with relatively high Confidence (>0.4) at low Target Lengths (<25), which deviate from the central cluster.
### Interpretation
The chart suggests that for the `abstract_algebra` subset, longer target answers tend to receive somewhat higher confidence scores, though the relationship is noisy. The high density of points at low Target Length and low Confidence indicates that many targets in this domain are short and yield low confidence, plausibly reflecting the subject's difficulty. The widening confidence interval at higher Target Lengths is a critical caveat: it signals that the observed positive trend is less reliable for longer targets due to a lack of data. While a trend exists in the fit, predictions for longer targets carry high uncertainty, and confidence is generally low across the many short targets.
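A figure of this kind (scatter, least-squares line, marginal histograms) can be reproduced with a standard regression fit. The numpy sketch below uses synthetic data with invented parameters purely to illustrate the mechanics, not the figure's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: right-skewed lengths, weakly length-dependent confidence.
target_length = rng.exponential(scale=15.0, size=200)
confidence = 0.1 + 0.004 * target_length + rng.normal(0.0, 0.08, size=200)

# Ordinary least-squares line through the scatter.
slope, intercept = np.polyfit(target_length, confidence, deg=1)
print(f"fitted line: confidence ~= {slope:.4f} * length + {intercept:.3f}")
```

Plotting libraries can produce the full figure directly, e.g. `seaborn.jointplot(x=..., y=..., kind="reg")` draws the scatter, fit line, confidence band, and both marginal distributions in one call.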
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Anatomy
### Overview
The image is a scatter plot chart titled "anatomy." It displays the relationship between "Target Length" on the horizontal axis and "Confidence" on the vertical axis. The chart includes marginal distribution plots (histograms or density curves) along the top and right edges, summarizing the distribution of each variable independently. The primary data is represented by purple circular points.
### Components/Axes
* **Title:** "anatomy" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear scale from 0 to approximately 120.
* **Major Tick Marks:** 0, 50, 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear scale from 0.0 to approximately 0.7.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6.
* **Data Series:** A single series of data points, all colored a uniform shade of purple (approximately hex #9467bd).
* **Marginal Plots:**
* **Top Marginal (above main plot):** A distribution plot for the "Target Length" variable. It shows a high, sharp peak near 0, indicating a strong concentration of data points with very short target lengths, and a long, low tail extending to the right.
* **Right Marginal (to the right of main plot):** A distribution plot for the "Confidence" variable. It shows a broad, multi-modal distribution with the highest density between 0.0 and 0.4, tapering off above 0.6.
* **Legend:** No explicit legend is present within the chart area. The single data series is implied by the uniform color of all points.
### Detailed Analysis
* **Data Point Distribution & Trend:**
* **Spatial Grounding:** The data points are densely clustered in the bottom-left quadrant of the plot, specifically where Target Length is between 0 and 40 and Confidence is between 0.0 and 0.4.
* **Trend Verification:** There is a clear, strong negative correlation. As the "Target Length" increases, the "Confidence" value generally decreases. The cloud of points slopes downward from the top-left to the bottom-right.
* **Point Density:** The highest density of points occurs at very short Target Lengths (< 20), where Confidence values span a wide range from near 0.0 up to ~0.65.
* **Sparse Region:** For Target Lengths greater than 50, data points become very sparse. The few points in this region (e.g., near Target Length 70, 90, 110) all have low Confidence values, generally below 0.2.
* **Outliers:** A few points with relatively high Confidence (>0.5) exist, but they are all associated with short Target Lengths (< 30). There are no high-confidence points for long target lengths.
### Key Observations
1. **Concentration at Origin:** The vast majority of the data is concentrated where Target Length is short and Confidence is low-to-moderate.
2. **Inverse Relationship:** The fundamental pattern is that longer targets are associated with lower confidence scores.
3. **Confidence Ceiling:** The maximum observed Confidence decreases as Target Length increases. The highest confidence values (~0.65) only appear for Target Lengths under ~10.
4. **Marginal Confirmation:** The top marginal plot confirms the extreme skew in Target Length (most values are very small). The right marginal plot confirms that Confidence scores are most commonly in the lower half of the range (0.0-0.4).
### Interpretation
This chart visualizes the confidence of the model on the MMLU `anatomy` subset as a function of the length of the target answer.
* **What the data suggests:** Confidence is highest for short targets and diminishes for longer ones, though the dense cluster at low length and low-to-moderate confidence shows that many short targets also receive low confidence.
* **Relationship between elements:** The scatter plot shows the direct, instance-level relationship, while the marginal plots summarize each variable's overall distribution: most targets are short, and most confidence scores fall in the lower half of the range.
* **Notable Anomaly/Trend:** The apparent **confidence decay with length** is the key pattern. The absence of high-confidence points for long targets should be read with caution, however, since the negative trend rests on only a handful of long targets in a heavily skewed length distribution.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Astronomy Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "astronomy" and explores the relationship between "Target Length" and "Confidence." The data is presented in a monochromatic purple color scheme against a light grey grid background.
### Components/Axes
* **Title:** "astronomy" (centered at the top).
* **X-Axis (Main Plot):** Labeled "Target Length." The scale runs from 0 to 200, with major tick marks at 0, 100, and 200.
* **Y-Axis (Main Plot):** Labeled "Confidence." The scale runs from 0.00 to 0.75, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **Data Series:** A scatter of individual data points (purple circles) and a fitted regression line (solid purple line) with a shaded confidence interval band (lighter purple).
* **Marginal Plots:**
* **Top Marginal Plot:** Aligned with the X-axis. It shows the distribution of the "Target Length" variable. The shape suggests a right-skewed distribution, with the highest density near 0 and a long tail extending towards 200.
* **Right Marginal Plot:** Aligned with the Y-axis. It shows the distribution of the "Confidence" variable. The shape suggests a distribution peaked around 0.25-0.35, with a tail extending towards higher confidence values.
* **Legend:** No explicit legend is present. The color purple is used consistently for all data elements (points, line, interval, marginal plots).
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains approximately 80-100 data points. The points are most densely clustered in the lower-left quadrant, where both Target Length (0-100) and Confidence (0.00-0.50) are relatively low.
* **Trend Verification:** The fitted regression line shows a clear, positive linear trend. It slopes upward from left to right, indicating that as Target Length increases, Confidence tends to increase as well.
* **Approximate Regression Line Values:**
* The line appears to start near a Confidence value of **~0.25** at a Target Length of **0**.
* It ends near a Confidence value of **~0.45** at a Target Length of **200**.
* **Confidence Interval:** The shaded band around the regression line represents the uncertainty of the fit. The band is narrower in the region of high data density (low Target Length) and widens considerably as Target Length increases beyond ~150, indicating greater uncertainty in the trend for longer targets due to sparser data.
* **Marginal Distribution Details:**
* **Target Length (Top):** The distribution is heavily concentrated between 0 and 50, with a significant drop-off after 100. A very small number of points exist near 200.
* **Confidence (Right):** The distribution is unimodal, with the highest density between approximately 0.20 and 0.40. The density decreases steadily for values above 0.50.
### Key Observations
1. **Positive Correlation:** There is a visually evident positive correlation between the length of a target (in an astronomy context) and the confidence associated with it.
2. **Data Sparsity at Extremes:** The dataset is heavily skewed towards shorter target lengths. Very few observations exist for targets longer than 150 units.
3. **Heteroscedasticity:** The spread (variance) of the Confidence values appears to increase with Target Length. This is visually suggested by the widening scatter of points and confirmed by the flaring confidence interval band on the regression line.
4. **Outliers:** A few notable outliers exist. For example, there are points with relatively high Confidence (>0.60) at moderate Target Lengths (~50-100), and at least one point with very low Confidence (<0.10) at a Target Length near 100.
### Interpretation
The data suggests that on the `astronomy` subset, **longer target answers receive somewhat higher confidence on average.** Two caveats apply:
* **Selection Bias:** The pronounced skew in the Target Length distribution means the positive trend is driven by a relatively small number of long-target data points and must be interpreted with caution.
* **Predictive Uncertainty:** The widening confidence interval for longer targets is a critical finding. While the *average* confidence rises with length, the confidence of any single long target is hard to predict.
**In summary, the chart shows a weak positive relationship with important limitations in data coverage, consistent with the broader finding that confidence has no consistent dependence on target length.**
</details>
|
<details>
<summary>x22.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Clinical Knowledge Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution histograms (or density plots), titled "clinical_knowledge". It displays the relationship between the confidence score of a model or system (y-axis) and the length of a target text or sequence (x-axis) for a dataset related to clinical knowledge. The plot reveals a dense cluster of data points at lower values for both variables, with a general trend of decreasing confidence as target length increases.
### Components/Axes
* **Title:** "clinical_knowledge" (centered at the top).
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.00 to 0.75, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 200, with major tick marks at 0 and 100.
* **Data Series:** A single series of data points represented as small, semi-transparent purple circles. The transparency helps visualize density in overlapping areas.
* **Reference Line:** A faint, horizontal grey line is present at approximately y = 0.20, likely indicating a median, mean, or baseline confidence level.
* **Marginal Distributions:**
* **Top Plot:** A distribution (likely a histogram or kernel density estimate) for the "Target Length" variable. It is right-skewed, with the highest density near 0 and a long tail extending to the right.
* **Right Plot:** A distribution for the "Confidence" variable. It is right-skewed, with the highest density concentrated between 0.00 and 0.25, peaking near 0.1-0.2.
### Detailed Analysis
* **Data Point Distribution:** The vast majority of data points are densely clustered in the bottom-left quadrant of the plot. This corresponds to short target lengths (approximately 0 to 50) and low confidence scores (approximately 0.00 to 0.30).
* **Trend Verification:** There is a visible, general downward trend in the scatter plot. As the "Target Length" increases along the x-axis, the "Confidence" values on the y-axis tend to decrease. The cloud of points becomes sparser and generally lower for target lengths beyond 100.
* **Outliers:** A few scattered points exist with moderate to high confidence (0.40 - 0.70) even at longer target lengths (e.g., near 100 and 150). These are exceptions to the dominant trend.
* **Marginal Plot Details:**
* The **Target Length** distribution confirms the right skew: most targets are short, with frequency dropping off sharply as length increases.
* The **Confidence** distribution confirms the concentration of low scores: the peak is near the 0.1-0.2 range, aligning with the dense cluster in the main plot and the position of the horizontal reference line.
### Key Observations
1. **Inverse Relationship:** There is a clear, albeit noisy, inverse relationship between target length and confidence. Longer clinical knowledge targets are associated with lower confidence scores from the evaluated system.
2. **High-Density Cluster:** The system's performance is most frequently characterized by low confidence on short text sequences.
3. **Performance Ceiling:** Very few data points achieve confidence above 0.50, and none appear to reach 0.75, suggesting an upper limit to the system's confidence for this task.
4. **Reference Line Context:** The horizontal line at ~0.20 sits within the densest part of the confidence distribution, suggesting it may represent a typical or baseline confidence level for this dataset.
### Interpretation
This visualization suggests a fundamental challenge in the evaluated system's handling of clinical knowledge: **confidence degrades as the complexity or length of the target information increases.**
* **Possible Explanations:** The pattern could indicate that the model struggles with longer, more complex clinical statements, perhaps due to limitations in context window, reasoning over extended text, or the inherent difficulty of maintaining high confidence for detailed medical information. The dense cluster at low length/low confidence might represent very short, ambiguous, or highly specific snippets where the model is uncertain.
* **Outlier Significance:** The few high-confidence, long-target points are critical. They may represent well-established, formulaic clinical facts (e.g., standard dosages, clear diagnostic criteria) that the model can handle reliably despite their length. Analyzing these outliers could reveal what types of long-form clinical knowledge the system *can* process effectively.
* **Practical Implication:** For applications relying on this system (e.g., clinical decision support, information extraction), outputs involving longer text passages should be treated with lower inherent confidence. The results may warrant stricter verification or human review, especially for critical information. The plot provides a quantitative basis for setting confidence thresholds or flagging outputs based on target length.
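The practical implication above can be sketched as a simple gating rule. The function and threshold values below are purely illustrative (not from the paper): an output is flagged for review unless its confidence clears a bar that tightens as the target gets longer, reflecting the observed degradation of confidence with length.

```python
def needs_review(confidence: float, target_length: int,
                 base_threshold: float = 0.30,
                 per_token_penalty: float = 0.001) -> bool:
    """Flag an output for human review.

    Longer targets empirically carry lower confidence, so we demand a
    higher confidence bar before accepting them unreviewed.
    (Illustrative thresholds -- tune on held-out graded data.)
    """
    required = base_threshold + per_token_penalty * target_length
    return confidence < required

# Short, fairly confident answer: accepted without review.
print(needs_review(0.45, 20))    # -> False (0.45 >= 0.32)
# Long, low-confidence answer: flagged for review.
print(needs_review(0.25, 120))   # -> True  (0.25 < 0.42)
```

A length-linear threshold is the simplest choice; in practice the threshold curve would be calibrated against graded examples.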
</details>
|
| --- | --- | --- | --- |
|
<details>
<summary>x23.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: College Biology Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "college_biology" and explores the relationship between "Target Length" and "Confidence." The overall aesthetic uses a monochromatic purple color scheme against a light grey grid background.
### Components/Axes
* **Title:** "college_biology" (centered at the top).
* **Main Plot Area:** A scatter plot with a fitted regression line and a shaded confidence interval.
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, with major tick marks and labels at 0, 100, and 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, with major tick marks and labels at 0.0, 0.2, 0.4, and 0.6.
* **Data Series:**
* **Scatter Points:** Represented by small, semi-transparent purple circles. Each point corresponds to an individual observation.
* **Regression Line:** A solid, darker purple line showing the best-fit linear trend.
* **Confidence Interval:** A lighter purple shaded band surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Plots:**
* **Top Marginal Plot:** A density plot (or smoothed histogram) showing the distribution of the "Target Length" variable along the x-axis. It is positioned directly above the main plot.
* **Right Marginal Plot:** A density plot showing the distribution of the "Confidence" variable along the y-axis. It is positioned to the right of the main plot.
* **Legend:** No explicit legend is present. The color (purple) is used consistently for all data elements.
### Detailed Analysis
* **Data Distribution & Trend:**
* The scatter points are densely clustered in the lower-left quadrant of the plot, primarily where "Target Length" is between 0 and 100, and "Confidence" is between 0.0 and 0.4.
* The regression line exhibits a clear **positive slope**, rising from left to right. This indicates a positive correlation: as "Target Length" increases, "Confidence" tends to increase.
* The shaded confidence interval around the regression line is narrowest in the region of highest data density (low Target Length) and widens considerably as Target Length increases, indicating greater uncertainty in the trend for larger values due to sparser data.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is heavily right-skewed. The peak density is at very low values (near 0), with a long tail extending towards 200 and beyond.
* **Confidence (Right):** The distribution is also right-skewed, with the highest density between 0.1 and 0.3, tapering off towards 0.6.
* **Spatial Grounding & Outliers:**
* The majority of data points are concentrated in the region defined by x=[0, 80], y=[0.0, 0.35].
* Several notable outliers exist with high "Target Length" (>150). These points generally have higher "Confidence" values (between ~0.4 and 0.65), which aligns with the positive trend but are sparse.
* The highest "Confidence" value observed is approximately 0.65, associated with a "Target Length" of around 180.
### Key Observations
1. **Positive Correlation:** There is a visible, positive linear relationship between Target Length and Confidence.
2. **Heteroscedasticity:** The variance (spread) of Confidence appears to increase with Target Length. The relationship is tighter for short targets and more variable for long ones.
3. **Skewed Data:** Both variables are not normally distributed; they are right-skewed, meaning most observations involve short target lengths and low-to-moderate confidence levels.
4. **Data Sparsity:** The trend for Target Length > 100 is inferred from a small number of data points, making predictions in that range less reliable.
### Interpretation
The data suggests that in the context of "college_biology," tasks or items with longer "Target Length" (which could refer to the length of a text passage, a sequence, or a problem description) are associated with higher measured "Confidence." This could imply several investigative scenarios:
* **Complexity vs. Assurance:** Longer targets might be more complex, and the confidence metric could reflect a model's or a person's self-assuredness when dealing with more substantial information, even if that assurance isn't necessarily correlated with accuracy.
* **Learning/Exposure Effect:** If "Target Length" correlates with topic familiarity or study time, the trend might indicate that increased engagement leads to higher confidence.
* **Metric Behavior:** The confidence metric itself may be biased or scaled in a way that naturally produces higher values for longer inputs.
The skewed distributions are critical. The strong positive trend is heavily influenced by a cluster of low-length, low-confidence data and a handful of high-length, high-confidence outliers, and the widening confidence interval warns against over-interpreting the strength of the relationship for longer targets. A careful investigation would question the operational definitions: What exactly is "Target Length"? How is "Confidence" quantified? The anomaly is not any single point but the sparsity of data in the upper range, which makes the apparent trend more a hypothesis suggested by outliers than a robustly supported conclusion. The visualization effectively shows both the suggested relationship and the significant uncertainty surrounding it.
</details>
|
<details>
<summary>x24.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: College Chemistry Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It displays the relationship between "Target Length" and "Confidence" for a dataset labeled "college_chemistry". The plot includes a fitted regression line with a confidence interval band.
### Components/Axes
* **Title:** "college_chemistry" (centered at the top).
* **Main Plot Axes:**
* **X-axis (Bottom):** Label is "Target Length". The scale runs from 0 to approximately 150, with major tick marks at 0 and 100.
* **Y-axis (Left):** Label is "Confidence". The scale runs from 0.25 to 0.75, with major tick marks at 0.25, 0.50, and 0.75.
* **Data Series:**
* **Scatter Points:** Numerous purple circular data points are scattered across the plot area.
* **Regression Line:** A solid purple line runs through the data, showing a positive linear trend.
* **Confidence Band:** A semi-transparent, light purple shaded area surrounds the regression line, representing the confidence interval.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) for the "Target Length" variable. It shows a right-skewed distribution, with the highest density near 0 and a long tail extending to the right.
* **Right Marginal Plot:** A distribution plot for the "Confidence" variable. It shows a roughly unimodal distribution centered around 0.4-0.5, with a spread from approximately 0.25 to 0.75.
* **Legend:** There is no separate legend box. The color purple is used consistently for all data elements (points, line, band, marginal plots).
### Detailed Analysis
* **Data Point Distribution:** The purple data points are most densely clustered in the lower-left quadrant of the plot, where "Target Length" is between 0 and 50 and "Confidence" is between 0.25 and 0.50. As "Target Length" increases beyond 50, the points become more sparse but generally trend upward in "Confidence".
* **Trend Line Analysis:** The solid purple regression line has a clear positive slope. It originates at a "Confidence" value of approximately 0.30 when "Target Length" is 0 and rises to a "Confidence" value of approximately 0.60 when "Target Length" is 150. This indicates a positive correlation between the two variables.
* **Confidence Interval:** The light purple shaded band around the regression line is narrower at lower "Target Length" values (where data is dense) and widens significantly as "Target Length" increases (where data is sparse), indicating greater uncertainty in the trend estimate for longer target lengths.
* **Marginal Distributions:**
* The top plot confirms the right-skew of "Target Length": most observed lengths are short (<50), with fewer instances of very long targets.
* The right plot shows "Confidence" values are most commonly found in the 0.35 to 0.55 range.
### Key Observations
1. **Positive Correlation:** There is a visible positive linear relationship between Target Length and Confidence.
2. **Heteroscedasticity:** The spread of data points around the trend line increases as Target Length increases. The relationship appears noisier for longer targets.
3. **Data Density Imbalance:** The vast majority of data points are concentrated at short Target Lengths (<50). The trend for longer lengths is inferred from relatively few data points.
4. **Outliers:** A few data points exist with high Confidence (>0.65) at moderate Target Lengths (50-100), and some points with low Confidence (<0.30) at longer Target Lengths (>100).
### Interpretation
The data suggests that in the context of "college_chemistry," tasks or items with longer target lengths (which could represent longer answers, more complex problems, or extended text passages) are associated with higher measured confidence. This could imply several underlying phenomena:
* **Model Behavior:** If this plots a model's performance, it may be more confident in its predictions or generations when dealing with longer, potentially more detailed or context-rich inputs/outputs.
* **Task Nature:** Longer chemistry problems might contain more clues or structured information, leading to higher confidence in solving them, albeit with greater variability (as shown by the wider scatter).
* **Data Skew:** The strong right-skew in target length is critical. The positive trend is robustly supported by a large amount of data at short lengths but becomes more speculative for longer lengths due to sparse data. The widening confidence band visually communicates this increasing uncertainty.
* **Practical Implication:** One should be cautious when extrapolating this trend to very long target lengths (>150), as the model's confidence behavior in that region is less certain. The core reliable insight is the positive relationship within the densely populated region of the data (Target Length 0-80).
**Language:** All text in the image is in English.
</details>
|
<details>
<summary>x25.png Details</summary>

### Visual Description
## Scatter Plot with Regression Line: College Computer Science
### Overview
The image is a scatter plot titled "college_computer_science" that visualizes the relationship between "Target Length" and "Confidence". It includes a fitted regression line with a shaded confidence interval and marginal histograms showing the distribution of each variable.
### Components/Axes
* **Title:** "college_computer_science" (located at the top-left).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 120. Major tick marks are visible at 0, 50, and 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.2 to 0.8. Major tick marks are visible at 0.2, 0.4, 0.6, and 0.8.
* **Data Series:**
* **Scatter Points:** Numerous purple circular data points are plotted across the graph.
* **Regression Line:** A solid purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent purple shaded area surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Distributions:**
* **Top Histogram:** A horizontal histogram positioned above the main plot, showing the distribution of the "Target Length" (x-axis) data. It is heavily right-skewed, with the highest frequency near 0.
* **Right Histogram:** A vertical histogram positioned to the right of the main plot, showing the distribution of the "Confidence" (y-axis) data. It appears roughly unimodal, centered around 0.4-0.5.
### Detailed Analysis
* **Data Point Distribution:** The scatter points are densely clustered in the lower-left quadrant of the plot, specifically where "Target Length" is between 0-40 and "Confidence" is between 0.2-0.6. The density of points decreases as both values increase.
* **Trend Verification:** The purple regression line has a clear positive slope, indicating a direct correlation. It originates near the coordinate (0, 0.4) and extends to approximately (120, 0.7).
* **Key Data Points (Approximate):**
* Lowest Confidence Point: ~ (5, 0.2)
* Highest Confidence Point: ~ (100, 0.8)
* Highest Target Length Point: ~ (115, 0.65)
* **Marginal Histogram Details:**
* The "Target Length" histogram shows the vast majority of observations have a length less than 50, with a very long tail extending to 120.
* The "Confidence" histogram shows most values fall between 0.3 and 0.6.
### Key Observations
1. **Positive Correlation:** There is a clear, positive linear relationship between Target Length and Confidence. As the target length increases, the confidence score tends to increase.
2. **Heteroscedasticity:** The spread (variance) of the Confidence values appears to increase slightly as Target Length increases. The data is more tightly clustered at low Target Lengths and becomes more dispersed at higher values.
3. **Data Skew:** Both variables are not normally distributed. Target Length is strongly right-skewed, and Confidence is also somewhat right-skewed (more mass on the lower end).
4. **Outliers:** A few data points exist with high Confidence (>0.7) at moderate Target Lengths (~50-80), which sit above the main cluster and the confidence interval band.
### Interpretation
The data suggests that in the context of "college computer science," tasks or items with a longer "Target Length" (which could refer to answer length, code length, or document length) are associated with higher "Confidence" (potentially model confidence, grader confidence, or student confidence). This could imply that more substantial or detailed responses are perceived as more reliable or are generated with higher certainty by an automated system.
The strong skew in Target Length indicates that most tasks are short, but the few long tasks are associated with higher confidence. The increasing variance (heteroscedasticity) suggests that while confidence generally rises with length, predictions for longer targets become less precise. The marginal histograms provide crucial context, showing that the observed positive trend is driven by a minority of data points with high Target Length, as most data is concentrated at the low end. This is a classic example where the summary statistic (the regression line) tells an important story, but the underlying data distribution reveals the full, nuanced picture.
</details>
|
<details>
<summary>x26.png Details</summary>

### Visual Description
## Scatter Plot with Regression Line: College Mathematics Confidence vs. Target Length
### Overview
The image is a scatter plot with an overlaid linear regression line and its confidence interval. It visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset or model evaluation related to "college_mathematics". The plot includes marginal distributions (histograms/density plots) on the top and right edges.
### Components/Axes
* **Title:** `college_mathematics` (centered at the top).
* **Y-Axis:**
* **Label:** `Confidence`
* **Scale:** Linear, ranging from approximately 0.2 to 0.7.
* **Major Ticks:** 0.2, 0.4, 0.6.
* **X-Axis:**
* **Label:** `Target Length`
* **Scale:** Linear, ranging from 0 to approximately 130.
* **Major Ticks:** 0, 50, 100.
* **Legend:** Located in the top-left corner of the main plot area. It is partially obscured/cut off. Visible text includes:
* `Llama-2-70b` (associated with a purple color swatch).
* A second entry is partially visible, likely `Llama-2-13b` or similar, but cannot be confirmed.
* **Data Series:**
* **Scatter Points:** Numerous purple dots representing individual data points.
* **Regression Line:** A solid purple line showing the best linear fit.
* **Confidence Interval:** A semi-transparent purple shaded band around the regression line.
* **Marginal Plots:**
* **Top (above x-axis):** A density plot/histogram showing the distribution of `Target Length`. It is heavily right-skewed, with the highest density near 0.
* **Right (beside y-axis):** A density plot/histogram showing the distribution of `Confidence`. It appears roughly unimodal, centered around 0.3-0.4.
### Detailed Analysis
* **Data Distribution:**
* The vast majority of data points are clustered in the region where `Target Length` is between 0 and 50.
* `Confidence` values for these points range widely from ~0.2 to ~0.6, with a dense cluster between 0.2 and 0.4.
* There is a clear outlier point at approximately `Target Length = 120`, `Confidence = 0.58`.
* **Trend Verification:**
* The purple regression line has a positive slope, indicating a general trend where `Confidence` increases as `Target Length` increases.
* The line starts at approximately (0, 0.32) and ends near (120, 0.55).
* The shaded confidence interval widens as `Target Length` increases, indicating greater uncertainty in the trend estimate for longer target lengths due to sparse data.
* **Marginal Distributions:**
* The top marginal plot confirms the extreme right skew of the `Target Length` data; most samples have very short target lengths.
* The right marginal plot shows the `Confidence` scores are most frequently in the 0.3 to 0.4 range.
### Key Observations
1. **Sparse Data at High Values:** There are very few data points with a `Target Length` greater than 50, making the trend in that region less reliable.
2. **Positive but Noisy Correlation:** While the regression line suggests a positive relationship, the scatter of points is substantial, indicating a weak correlation. Many points with short target lengths have high confidence, and vice-versa.
3. **Notable Outlier:** The single point near (120, 0.58) is influential. It lies close to the regression line but is isolated, pulling the trend upward.
4. **Legend Ambiguity:** The legend is not fully legible, preventing definitive identification of the data series. The color and context suggest it may represent a specific model (e.g., Llama-2-70b) evaluated on college mathematics tasks.
### Interpretation
This chart likely evaluates the performance (measured by `Confidence`) of a language model (possibly Llama-2-70b) on college mathematics problems, plotted against the length of the expected answer or solution (`Target Length`).
The data suggests that **the model's confidence tends to be higher for problems requiring longer answers**, though the relationship is not strong. This could imply several things:
* Longer answers might be associated with more complex, multi-step problems where the model can "show its work," leading to higher calibrated confidence.
* Alternatively, the model might be overconfident on longer generations.
* The heavy skew towards short target lengths indicates the evaluation dataset is dominated by problems with concise answers (e.g., numerical results, short proofs).
The **primary anomaly** is the extreme sparsity of data for long target lengths. This makes the observed positive trend tentative. A robust conclusion would require more data points in the 50-130 range. The outlier at length ~120 is critical; without it, the slope of the regression line would likely be shallower.
**In summary:** The visualization hints at a potential positive link between answer length and model confidence in college mathematics, but the conclusion is limited by data imbalance and high variance. The key takeaway is the need for more evaluation examples with longer target sequences to validate the trend.
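The claim that the outlier near (120, 0.58) drives the slope can be checked with a leave-one-out refit. The sketch below uses hypothetical data shaped like the plot (a dense short-length cluster plus one isolated long-target point), not the paper's actual numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
# Dense cluster of short targets plus one isolated, influential long target.
length = np.append(rng.uniform(0, 50, 60), 120.0)
conf = np.append(0.32 + 0.0005 * length[:-1] + rng.normal(0, 0.06, 60), 0.60)

slope_full = np.polyfit(length, conf, 1)[0]           # fit including the outlier
slope_loo = np.polyfit(length[:-1], conf[:-1], 1)[0]  # fit without it

# Dropping the single high-length point should flatten the fitted slope,
# quantifying its leverage on the regression.
print(slope_full, slope_loo)
```

Comparing the two slopes makes the "influential point" argument concrete: a large drop after removing one observation is direct evidence that the trend rests on that observation.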
</details>
|
|
<details>
<summary>x27.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: College Medicine Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms or density plots) on the top and right sides. The chart is titled "college_medicine" and explores the relationship between "Target Length" and "Confidence" for a dataset or model associated with that label. The primary data is represented by purple circular points, with a trend line overlaid.
### Components/Axes
* **Title:** "college_medicine" (centered at the top).
* **Main Chart Area:**
* **X-Axis:** Labeled "Target Length". The scale runs from 0 to approximately 150, with major tick marks and labels at 0 and 100.
* **Y-Axis:** Labeled "Confidence". The scale runs from 0.00 to 0.75, with major tick marks and labels at 0.00, 0.25, 0.50, and 0.75.
* **Data Series:** A single series of purple circular data points. A legend in the top-right corner of the main chart area confirms the series: a purple circle labeled "college_medicine".
* **Trend Line:** A faint, solid purple line is drawn through the data, showing a general trend.
* **Marginal Plots:**
* **Top Marginal Plot:** Positioned above the main x-axis. It displays the distribution of the "Target Length" variable. It appears to be a density plot or smoothed histogram.
* **Right Marginal Plot:** Positioned to the right of the main y-axis. It displays the distribution of the "Confidence" variable. It also appears to be a density plot.
### Detailed Analysis
* **Data Distribution & Trends:**
* **Trend Verification:** The purple trend line exhibits a clear, gentle downward slope from left to right. This indicates a negative correlation: as "Target Length" increases, "Confidence" tends to decrease.
* **Point Density:** The highest density of purple data points is concentrated in the lower-left quadrant of the chart. Specifically, most points fall within a "Target Length" range of approximately 0 to 80 and a "Confidence" range of 0.00 to 0.50.
* **Outliers:** There are a few scattered points with higher confidence (above 0.50), primarily at lower target lengths (below ~60). There are also points extending to higher target lengths (up to ~140), but these almost exclusively have low confidence (below 0.25).
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The peak density (mode) appears to be at a low target length, roughly between 20 and 40. The density tapers off significantly as length increases beyond 100.
* **Confidence (Right Plot):** The distribution is also right-skewed. The peak density is at a low confidence level, approximately between 0.10 and 0.20. Very few instances have confidence above 0.50.
### Key Observations
1. **Negative Correlation:** The primary observation is the inverse relationship between target length and confidence.
2. **Low-Confidence Bias:** The vast majority of predictions or measurements have a confidence score below 0.50, with the most common scores being quite low (0.10-0.20).
3. **Length Constraint:** High-confidence results are almost exclusively associated with shorter target lengths. As targets become longer, confidence reliably drops.
4. **Data Sparsity at Extremes:** There are very few data points for target lengths beyond 120 or for confidence scores above 0.60.
### Interpretation
This chart suggests a performance characteristic or inherent property of the "college_medicine" model or dataset. The data demonstrates that the system's confidence in its output is negatively impacted by the length of the target it is processing.
* **What it means:** The system is more "sure" of itself when dealing with shorter, likely simpler or more concise targets. As the target length increases, potentially introducing more complexity, noise, or ambiguity, the system's confidence in its corresponding output diminishes predictably.
* **Why it matters:** This is a critical diagnostic insight. It indicates a potential limitation: the model may not be reliable for long-form targets. For practical application, outputs for long targets should be treated with lower inherent trust, or the model may require retraining or architectural adjustments to handle length more robustly.
* **Underlying Pattern:** The marginal distributions reinforce this. The system most frequently encounters (or generates) targets of moderate length (20-40) and assigns them low confidence (0.10-0.20). The combination of these two skewed distributions creates the dense cluster in the lower-left of the scatter plot. The trend line quantifies the penalty that length exacts on confidence.
</details>
|
<details>
<summary>x28.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Computer Security Confidence vs. Target Length
### Overview
The image is a scatter plot titled "computer_security" that visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis). The plot includes a main scatter area and marginal distribution plots (histograms/density curves) along the top and right edges. The data points are represented as semi-transparent purple circles. A horizontal line is drawn across the scatter plot at approximately y=0.4.
### Components/Axes
* **Title:** "computer_security" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to 200.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from approximately 0.1 to 0.7.
* **Major Tick Marks:** 0.2, 0.4, 0.6.
* **Data Series:** A single series of data points (purple circles). There is no separate legend, as only one data category is present.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (appears to be a kernel density estimate or smoothed histogram) showing the distribution of the "Target Length" variable. It is heavily right-skewed, with a high peak near 0 and a long tail extending to 200.
* **Right Marginal Plot:** A distribution plot showing the distribution of the "Confidence" variable. It appears roughly unimodal, with a peak centered near 0.4.
* **Reference Line:** A solid, thin horizontal line is drawn across the main plot at a Confidence value of approximately 0.4.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot shows a high density of data points clustered in the region where Target Length is between 0 and 50. The Confidence values for these points vary widely, spanning from below 0.2 to above 0.6.
* **Trend Verification:** There is no strong, clear linear trend (upward or downward slope) visible in the data. The points form a broad cloud. The horizontal reference line at Confidence ≈ 0.4 appears to pass through the central mass of the data cloud.
* **Spatial Grounding & Outliers:**
* The highest concentration of points is in the bottom-left quadrant (low Target Length, low-to-mid Confidence).
* Several points with Confidence > 0.6 are present, all with Target Length < 50.
* A few points with Target Length > 150 are visible, but they are sparse. Their Confidence values are mostly between 0.2 and 0.5.
* The marginal plots confirm the visual clustering: most data has a short Target Length, and most Confidence scores are centered around 0.4.
### Key Observations
1. **Skewed Target Length:** The vast majority of analyzed "targets" in this computer security context are short (length < 50). Very few are long (length > 150).
2. **Central Tendency in Confidence:** Despite the wide spread, the confidence scores for predictions or assessments tend to cluster around a central value of approximately 0.4, as indicated by both the data cloud and the right marginal distribution.
3. **High Variance at Low Length:** For short targets, confidence is highly variable, ranging from very low (~0.15) to very high (~0.65). This suggests that short target length alone is not a strong predictor of confidence.
4. **No Clear Correlation:** There is no obvious positive or negative correlation between Target Length and Confidence. Knowing the length of a target does not allow for a reliable prediction of the associated confidence score based on this plot.
### Interpretation
This chart likely evaluates the performance of a model or system in a computer security task (e.g., malware detection, vulnerability assessment, or attack classification), where "Target Length" could refer to the size of a code snippet, file, or network packet, and "Confidence" is the model's certainty in its output.
The data suggests that the system is primarily applied to, or trained on, short targets. The lack of correlation implies that the system's confidence is driven by factors other than the sheer size of the input. The high variance in confidence for short targets indicates that some short inputs are very easy for the system to assess (high confidence), while others are very ambiguous (low confidence). The horizontal line at 0.4 may represent a decision threshold, an average confidence, or a baseline performance metric. The plot highlights that while the system handles many short inputs, its confidence in those inputs is not uniformly reliable.
</details>
|
<details>
<summary>x29.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Econometrics Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms (also known as a joint plot). It displays the relationship between two variables: "Target Length" on the horizontal axis and "Confidence" on the vertical axis. The plot is titled "econometrics," which also serves as the label for the single data series shown. The overall aesthetic uses a monochromatic purple color scheme against a white background with a light gray grid.
### Components/Axes
* **Main Chart Area:** A scatter plot with approximately 60-70 data points, each represented by a semi-transparent purple circle.
* **X-Axis (Horizontal):**
* **Label:** "Target Length"
* **Scale:** Linear scale ranging from 0 to 100.
* **Major Tick Marks:** At 0, 50, and 100.
* **Y-Axis (Vertical):**
* **Label:** "Confidence"
* **Scale:** Linear scale ranging from 0.4 to 0.8.
* **Major Tick Marks:** At 0.4, 0.6, and 0.8.
* **Legend:**
* **Position:** Top-left corner of the main chart area.
* **Content:** A single entry with a purple circle icon and the text "econometrics".
* **Marginal Distributions:**
* **Top Histogram:** Aligned with the X-axis ("Target Length"). Shows the frequency distribution of the Target Length variable.
* **Right Histogram:** Aligned with the Y-axis ("Confidence"). Shows the frequency distribution of the Confidence variable.
* **Grid:** A light gray grid is present in the main chart area, with lines corresponding to the major tick marks on both axes.
### Detailed Analysis
* **Data Series Trend:** The scatter plot shows a **weak positive correlation**. As "Target Length" increases, "Confidence" shows a slight tendency to increase. The data points are widely scattered, indicating high variance and a low R-squared value if a linear regression were fitted.
* **Data Point Distribution:**
* **Target Length (X-axis):** Data points are concentrated between approximately 10 and 80. There are very few points near 0 or 100.
* **Confidence (Y-axis):** Data points are concentrated between approximately 0.45 and 0.75. The majority of points lie between 0.5 and 0.7.
* **Marginal Histogram Details:**
* **Top (Target Length):** The distribution appears roughly unimodal and slightly right-skewed. The highest frequency bin is in the range of approximately 25-50.
* **Right (Confidence):** The distribution appears roughly unimodal and symmetric, centered around 0.6-0.65.
* **Spatial Grounding & Outliers:**
* A small cluster of points with low Confidence (~0.45-0.5) exists across a range of Target Lengths (approx. 10-60).
* A few points with high Confidence (>0.7) are scattered, with one notable point at approximately (Target Length=70, Confidence=0.78).
* The point with the highest Target Length (~95) has a moderate Confidence value of ~0.62.
### Key Observations
1. **Weak Positive Relationship:** The primary observation is the faint upward slope of the data cloud, suggesting that longer targets are associated with slightly higher confidence scores, but the relationship is not strong.
2. **High Variability:** For any given Target Length, there is a wide range of Confidence values. For example, at a Target Length of ~50, Confidence values span from below 0.5 to above 0.7.
3. **Central Tendency:** Both variables show central tendencies: Target Length is most common in the 25-50 range, and Confidence is most common around 0.6-0.65.
4. **Absence of Extreme Values:** There are no data points at the extreme ends of the scales (e.g., Target Length <5 or >95; Confidence <0.4 or >0.8).
### Interpretation
This plot likely visualizes the performance or output of an econometric model. "Target Length" could refer to the time horizon of a forecast, the complexity of an economic variable, or the length of an input sequence. "Confidence" likely represents the model's predicted probability or certainty in its output.
The **weak positive correlation** suggests that the model's confidence increases marginally as the target length increases. This could imply the model is slightly more certain about longer-term projections or that it has more data to work with for longer targets. However, the **high variability** is the more dominant feature. It indicates that target length alone is a poor predictor of the model's confidence. Other factors not shown in this two-dimensional plot (e.g., model parameters, input data quality, economic volatility) are likely major drivers of confidence.
The marginal histograms confirm that the dataset used is not uniformly distributed; it is centered around moderate target lengths and moderate confidence levels. The absence of extreme values might indicate data filtering or the inherent limits of the model's operational range.
**In summary, the visualization demonstrates that while there is a faint signal linking longer targets to higher confidence, the model's certainty is highly variable and influenced by factors beyond simple target length.**
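The "weak positive correlation, low R-squared" claim can be quantified with an ordinary least-squares fit. A minimal sketch on synthetic data shaped like the plot (a faint positive slope buried in noise; all names and parameters are assumptions):

```python
import numpy as np
from scipy import stats

# Stand-in data: a small true slope plus dominant noise.
rng = np.random.default_rng(1)
target_length = rng.uniform(10, 90, size=70)
confidence = 0.45 + 0.002 * target_length + rng.normal(0, 0.08, size=70)

fit = stats.linregress(target_length, confidence)
r_squared = fit.rvalue ** 2
print(f"slope = {fit.slope:.4f}, R^2 = {r_squared:.3f}")
# A weak positive relationship appears as a small positive slope with a
# low R^2: most of the variance in confidence is left unexplained.
```

The low R² is exactly what the wide vertical scatter at any given Target Length implies visually.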
</details>
|
<details>
<summary>x30.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: electrical_engineering
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It displays the relationship between two variables, "Target Length" and "Confidence," for a dataset labeled "electrical_engineering." The plot uses a single color (purple) for all data points and includes a fitted trend line.
### Components/Axes
* **Title:** "electrical_engineering" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length" (centered below the axis).
* **Scale:** Linear scale from 0 to approximately 80, with major tick marks labeled at 0 and 50.
* **Y-Axis:**
* **Label:** "Confidence" (rotated 90 degrees, centered to the left of the axis).
* **Scale:** Linear scale from 0.0 to approximately 0.7, with major tick marks labeled at 0.2, 0.4, and 0.6.
* **Data Series:**
* A scatter of purple circular points.
* A solid, darker purple trend line (likely a linear regression fit) running through the data.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (appears to be a histogram with a density curve overlay) for the "Target Length" variable. It is positioned directly above the main scatter plot, sharing the same x-axis.
* **Right Marginal Plot:** A distribution plot for the "Confidence" variable. It is positioned to the right of the main scatter plot, sharing the same y-axis.
* **Legend:** There is no explicit legend box. The color purple is used consistently for all data elements (points, trend line, marginal distributions).
### Detailed Analysis
* **Data Distribution & Range:**
* **Target Length (X-axis):** Data points are densely concentrated between 0 and 50, with a few sparse points extending to around 80. The top marginal plot confirms a right-skewed distribution, with the highest frequency (peak) near the lower end (approximately 0-20) and a long tail extending to the right.
* **Confidence (Y-axis):** Data points are primarily clustered between 0.1 and 0.5, with some outliers reaching up to ~0.65. The right marginal plot shows a distribution that is somewhat right-skewed, with a peak around 0.2-0.3.
* **Trend Line:** The fitted line shows a **slight positive slope**. It starts at a Confidence value of approximately 0.2 when Target Length is 0 and rises to a Confidence of approximately 0.3 when Target Length is 80. This suggests a weak positive correlation between the two variables.
* **Spatial Grounding & Scatter Pattern:** The highest density of points is in the lower-left quadrant of the plot (low Target Length, low-to-moderate Confidence). The scatter is quite broad, indicating high variance in Confidence for any given Target Length, especially in the 0-40 range. There are no distinct clusters or gaps.
### Key Observations
1. **Weak Positive Correlation:** The primary trend is a modest increase in Confidence as Target Length increases, but the relationship is not strong.
2. **High Variance:** For short to medium Target Lengths (0-40), Confidence values vary widely from ~0.1 to ~0.6, indicating that Target Length alone is a poor predictor of Confidence in this range.
3. **Right-Skewed Distributions:** Both variables exhibit right-skewed distributions, meaning most data points have lower values, with fewer instances of very high Target Length or very high Confidence.
4. **Potential Outliers:** A few data points with Confidence > 0.6 are visible, primarily associated with Target Lengths between 10 and 30.
### Interpretation
This chart likely analyzes the performance of a model or system within the domain of "electrical_engineering." "Target Length" could refer to the length of a code sequence, a design specification, or a problem description. "Confidence" likely represents the model's confidence score in its output or prediction.
The data suggests that **longer targets are associated with a slight, but not reliable, increase in model confidence.** The weak correlation and high variance imply that other factors not visualized here (e.g., problem complexity, data quality, specific sub-domain) have a much stronger influence on the model's confidence than just the length of the target. The right-skewed distributions indicate that the dataset is dominated by shorter, less complex targets where the model's confidence is generally low to moderate. The presence of high-confidence outliers for shorter targets warrants investigation to understand what specific characteristics lead to high confidence in those cases.
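A figure of this shape (central scatter plus marginal histograms sharing the axes) can be reproduced with plain matplotlib. A minimal sketch on synthetic data (axis labels match the figure; the data itself is an illustrative assumption):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
target_length = rng.exponential(scale=15, size=200)          # right-skewed
confidence = np.clip(0.2 + 0.0015 * target_length
                     + rng.normal(0, 0.1, size=200), 0.0, 0.7)

# 2x2 grid: big scatter panel, thin marginal panels on top and right.
fig = plt.figure(figsize=(6, 6))
gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                      wspace=0.05, hspace=0.05)
ax_main = fig.add_subplot(gs[1, 0])
ax_top = fig.add_subplot(gs[0, 0], sharex=ax_main)
ax_right = fig.add_subplot(gs[1, 1], sharey=ax_main)

ax_main.scatter(target_length, confidence, color="purple", alpha=0.4)
ax_top.hist(target_length, bins=25, color="purple")              # top marginal
ax_right.hist(confidence, bins=25, color="purple",
              orientation="horizontal")                          # right marginal
ax_main.set_xlabel("Target Length")
ax_main.set_ylabel("Confidence")
```

Sharing the x- and y-axes keeps the marginals aligned with the scatter, which is what makes the skew in each variable directly readable off the figure.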
</details>
|
<details>
<summary>x31.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: elementary_mathematics
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms, titled "elementary_mathematics". It displays the relationship between "Target Length" and "Confidence" for a dataset presumably related to elementary mathematics tasks or problems. The plot includes a trend line and distributions for each variable along the axes.
### Components/Axes
* **Main Chart Area:** A scatter plot with data points represented as purple dots.
* **X-Axis:** Labeled "Target Length". The scale runs from 0 to 100, with major tick marks at 0, 50, and 100.
* **Y-Axis:** Labeled "Confidence". The scale runs from 0.25 to 0.75, with major tick marks at 0.25, 0.50, and 0.75.
* **Legend:** Located in the top-left corner of the main chart area. It consists of a small purple square followed by the text "elementary_mathematics".
* **Marginal Distributions:**
* **Top Histogram:** Positioned above the main chart, aligned with the X-axis. It shows the distribution of the "Target Length" variable. The distribution is heavily right-skewed.
* **Right Histogram:** Positioned to the right of the main chart, aligned with the Y-axis. It shows the distribution of the "Confidence" variable. The distribution appears roughly unimodal, centered near 0.5.
* **Trend Line:** A straight, solid purple line is drawn through the scatter plot data.
### Detailed Analysis
* **Data Point Distribution:** The vast majority of data points are clustered in the lower range of the X-axis (Target Length). The highest density appears between Target Length values of approximately 0 to 20. Within this cluster, Confidence values span the full range from ~0.25 to ~0.75, with a concentration around 0.5.
* **Trend Line Analysis:** The purple trend line exhibits a very slight negative slope. It starts at a Confidence value of approximately 0.52 when Target Length is 0 and decreases to approximately 0.48 when Target Length is 100. This indicates a very weak negative correlation between Target Length and Confidence.
* **Marginal Histogram Details:**
* **Target Length (Top):** The histogram shows a sharp peak at the lowest bin (0-~5), with frequency dropping off rapidly as Target Length increases. There are very few data points with a Target Length greater than 50.
* **Confidence (Right):** The histogram shows a central peak around the 0.5 bin. The distribution is relatively symmetric, with fewer instances of very low (<0.3) or very high (>0.7) confidence.
### Key Observations
1. **Strong Right Skew in Target Length:** The dataset is dominated by tasks or problems with short target lengths. Long targets (Length > 50) are rare outliers.
2. **Wide Confidence Spread for Short Targets:** For the most common short targets (Length < 20), confidence varies dramatically, from very low to very high. This suggests that factors other than length are primary drivers of confidence in this domain.
3. **Weak Overall Correlation:** The nearly flat trend line suggests that knowing the Target Length provides very little predictive power for the Confidence score across the entire dataset.
4. **Central Tendency in Confidence:** Despite the wide spread, the marginal distribution and the trend line both indicate that the average or typical confidence level is around 0.5 (50%).
### Interpretation
This visualization suggests that within the context of "elementary_mathematics," the length of a target (e.g., the length of a solution, answer, or problem statement) is not a strong determinant of confidence. The data implies two potential narratives:
1. **Intrinsic Difficulty Variance:** Short problems can be either very easy (high confidence) or deceptively tricky (low confidence), leading to the wide vertical spread on the left side of the plot. The length itself doesn't signal difficulty.
2. **Task Design:** The overwhelming prevalence of short targets indicates the dataset or task domain is fundamentally composed of concise problems. The few long targets do not systematically reduce confidence, as shown by the flat trend, suggesting they may not be inherently more complex but simply different in format.
The key takeaway is that confidence in this elementary mathematics domain is driven by factors orthogonal to target length, such as problem type, required operation, or familiarity. The marginal histograms confirm that any analysis must account for the severe imbalance in target length, as statistical summaries would be dominated by the short-target cluster.
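The "strong right skew in Target Length / roughly symmetric Confidence" reading can be quantified with a sample skewness statistic. A minimal sketch on synthetic stand-ins for the two marginals (distribution choices are assumptions):

```python
import numpy as np
from scipy import stats

# Stand-in samples mimicking the two marginal distributions described above.
rng = np.random.default_rng(3)
target_length = rng.exponential(scale=10, size=500)              # right-skewed
confidence = np.clip(rng.normal(0.5, 0.1, size=500), 0.25, 0.75)  # ~symmetric

len_skew = stats.skew(target_length)
conf_skew = stats.skew(confidence)
print(f"target length skew = {len_skew:.2f}, confidence skew = {conf_skew:.2f}")
# Skewness well above 0 means the right tail dominates; values near 0
# indicate symmetry, matching the shapes of the two marginal histograms.
```

A skewness this large in the length marginal is also a warning that means and other moment-based summaries will be dominated by the short-target cluster.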
</details>
|
<details>
<summary>x32.png Details</summary>

### Visual Description
## Scatter Plot with Violin Plot Overlay: Formal Logic Confidence vs. Target Length
### Overview
The image is a statistical visualization combining a scatter plot and a violin plot, titled "formal_logic". It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset, likely from a machine learning or computational linguistics context. The plot suggests an analysis of model performance or prediction confidence as a function of input length.
### Components/Axes
* **Title:** "formal_logic" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, with major tick marks and labels at 0, 100, and 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, with major tick marks and labels at 0.2, 0.4, and 0.6.
* **Legend:** Located in the top-left corner of the plot area. It is partially obscured but appears to contain a single entry with a purple square symbol, corresponding to the scatter plot data points.
* **Data Series:**
1. **Scatter Plot:** Numerous purple circular data points distributed across the plot.
2. **Trend Line:** A solid purple line running through the scatter data, accompanied by a semi-transparent purple shaded region representing the confidence interval.
3. **Violin Plot:** A vertical, purple-shaded density plot positioned along the right edge of the chart area, showing the distribution of the "Confidence" values.
### Detailed Analysis
* **Scatter Plot Distribution:**
* Data points are densely clustered in the region where Target Length is between 0 and approximately 50.
* The density of points decreases as Target Length increases beyond 50.
* Confidence values for the majority of points range from approximately 0.25 to 0.55.
* There is one notable outlier point with a high Confidence value of approximately 0.65 at a low Target Length (near 0).
* **Trend Line:**
* The line exhibits a slight negative slope, starting at a Confidence of ~0.42 at Target Length 0 and descending to ~0.38 at Target Length 200.
* The shaded confidence interval around the trend line is relatively narrow, suggesting a statistically stable, though weak, negative correlation.
* **Violin Plot (Right Side):**
* This plot shows the univariate distribution of the "Confidence" variable.
* The widest part of the violin (indicating the highest density of data) is centered around a Confidence value of approximately 0.4.
* The distribution appears slightly skewed, with a longer tail extending towards lower Confidence values (down to ~0.2) and a shorter tail towards higher values (up to ~0.65).
### Key Observations
1. **Negative Correlation:** There is a clear, albeit weak, trend where increasing Target Length is associated with a slight decrease in Confidence.
2. **Data Density:** The vast majority of observations have a Target Length under 100, with very few data points beyond 150.
3. **Concentration of Confidence:** Most Confidence scores are concentrated in the 0.3 to 0.5 range, as evidenced by both the scatter cluster and the violin plot's bulge.
4. **High-Confidence Outlier:** A single data point shows exceptionally high confidence (~0.65) for a very short target length, which may warrant investigation as a special case or potential anomaly.
### Interpretation
This chart likely evaluates the performance of a formal logic reasoning system (e.g., a theorem prover, logic parser, or a language model on logical tasks). "Target Length" probably refers to the complexity or length of the logical statement/proof, while "Confidence" is the model's self-assessed probability of being correct.
The data suggests that the system's confidence in its outputs degrades marginally as the logical problems become longer or more complex. The high density of short-length targets indicates the evaluation dataset may be skewed towards simpler problems. The outlier with high confidence on a short target could represent a trivial or highly familiar logical pattern. The violin plot confirms that the system's confidence is generally moderate (centered at 0.4), rarely reaching high certainty (>0.6), which may reflect the inherent difficulty or ambiguity in the formal logic tasks presented. The weak negative trend implies that while length/complexity is a factor, other variables not shown (e.g., logical depth, rule types) likely play a more significant role in determining confidence.
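The violin along the right edge is a kernel density estimate of the Confidence marginal, and its widest point is the density peak. A minimal sketch locating that peak with `scipy.stats.gaussian_kde` on synthetic stand-in scores (the distribution is an assumption chosen to match the description):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in confidence scores centered near 0.4, as the violin suggests.
rng = np.random.default_rng(4)
confidence = np.clip(rng.normal(0.4, 0.08, size=300), 0.2, 0.65)

kde = gaussian_kde(confidence)        # the smooth curve a violin plot draws
grid = np.linspace(0.2, 0.65, 200)
density = kde(grid)
peak = grid[np.argmax(density)]
print(f"density peak at confidence ~ {peak:.2f}")
# The widest part of the violin corresponds to this peak of the KDE.
```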
</details>
|
<details>
<summary>x33.png Details</summary>

### Visual Description
## Scatter Plot with Regression: Global Facts Confidence vs. Target Length
### Overview
The image is a scatter plot titled "global_facts" that visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis). It includes a fitted regression line with a shaded confidence interval. The data points are densely clustered at the lower end of the Target Length scale, with a few sparse points extending to higher values.
### Components/Axes
* **Title:** "global_facts" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to over 100.
* **Major Ticks:** 0, 50, 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from approximately 0.25 to 0.75.
* **Major Ticks:** 0.25, 0.50, 0.75.
* **Data Series:**
* **Scatter Points:** Purple dots representing individual data observations.
* **Regression Line:** A solid purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent purple shaded area surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Distribution:** A small, faint histogram or density plot is visible along the top edge of the plot area, showing the distribution of the "Target Length" variable. It is heavily right-skewed.
### Detailed Analysis
* **Data Distribution:** The vast majority of data points are concentrated in a dense cluster where "Target Length" is between 0 and approximately 25. Within this cluster, "Confidence" values show high variance, spanning nearly the entire y-axis range from ~0.25 to ~0.75.
* **Regression Trend:** The purple regression line exhibits a clear **positive slope**. It originates at a Confidence value of approximately 0.3 when Target Length is 0 and rises steadily.
* **Key Data Points on Trend Line (Approximate):**
* At Target Length = 0: Confidence ā 0.30
* At Target Length = 50: Confidence ā 0.45
* At Target Length = 100: Confidence ā 0.52
* At Target Length = 125 (end of visible line): Confidence ā 0.55
* **Confidence Interval:** The shaded purple area is narrowest near the center of the data mass (low Target Length) and widens significantly as Target Length increases. At Target Length = 125, the interval spans from approximately 0.40 to 0.70, indicating high uncertainty in the trend prediction for longer targets.
* **Outliers/Sparse Data:** There are a few isolated data points at higher Target Lengths:
* One point near Target Length = 60, Confidence ā 0.25.
* One point near Target Length = 75, Confidence ā 0.35.
* One point near Target Length = 125, Confidence ā 0.55 (this point lies almost exactly on the regression line).
### Key Observations
1. **Positive Correlation:** The primary visual trend is that Confidence tends to increase as Target Length increases.
2. **Heteroscedasticity:** The variance (spread) of Confidence values is not constant. It is very high for short Target Lengths and appears to decrease for the few observed longer targets, though the confidence interval suggests overall model uncertainty grows with length.
3. **Data Sparsity:** The relationship for Target Lengths greater than ~25 is inferred from very few data points, making the trend in that region less reliable.
4. **Marginal Skew:** The distribution of the independent variable (Target Length) is heavily right-skewed, meaning most observations involve short targets.
### Interpretation
The plot suggests a **weak to moderate positive relationship** between the length of a target (e.g., a text sequence, a query) and the confidence score associated with it in the "global_facts" context. This could imply that the system or model being evaluated is more confident when processing or generating longer targets.
However, critical caveats are evident:
* **Correlation vs. Causation:** The trend does not prove that increasing target length *causes* higher confidence. Confounding variables (e.g., topic complexity, data quality for longer targets) may be at play.
* **Reliability at Extremes:** The widening confidence interval and sparse data for longer targets mean the predicted upward trend becomes highly uncertain beyond a Target Length of about 50. The single outlier at (125, 0.55) supports the trend, but one point is insufficient for a robust conclusion.
* **Practical Implication:** If this data informs system design, it might suggest that confidence metrics are more stable and potentially more meaningful for longer inputs/outputs. Conversely, the high variance for short targets indicates that confidence scores in that range should be interpreted with great caution, as they can be both very high and very low.
**In summary, the data indicates that confidence generally rises with target length, but this trend is built on a foundation of dense, highly variable data for short lengths and becomes speculative for longer lengths due to data scarcity.**
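The heteroscedasticity noted in the key observations can be probed by comparing the spread of Confidence across Target Length bins. A minimal sketch on synthetic data built so that the spread shrinks with length (all parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
target_length = rng.uniform(0, 125, size=400)
# Noise scale decreases with length, mimicking wide spread for short
# targets and tighter spread for long ones.
noise_scale = 0.12 - 0.0008 * target_length
confidence = 0.3 + 0.002 * target_length + rng.normal(0.0, 1.0, 400) * noise_scale

short_std = confidence[target_length < 25].std()
long_std = confidence[target_length > 100].std()
print(f"std (length < 25):  {short_std:.3f}")
print(f"std (length > 100): {long_std:.3f}")
# Markedly unequal spread across bins is the numerical signature of
# the heteroscedasticity visible in the plot.
```

In real data the long-length bins would also be sparse, so the binned standard deviations there should be read with the same caution as the widening regression band.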
</details>
|
<details>
<summary>x34.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School Biology Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms/density plots, titled "high_school_biology". It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset presumably related to high school biology. The plot uses a monochromatic purple color scheme.
### Components/Axes
* **Title/Header:** The text "high_school_biology" is displayed in the top-left corner, inside a light purple rectangular box, serving as both a title and a legend for the single data series.
* **Main Chart Area:**
* **X-Axis:** Labeled "Target Length". Major tick marks are visible at 0 and 100. The axis extends slightly beyond 100.
* **Y-Axis:** Labeled "Confidence". Major tick marks are visible at 0.0 and 0.5. The axis extends from 0.0 to approximately 0.7.
* **Data Series:** Represented by numerous semi-transparent purple circles (scatter points). A darker purple trend line (likely a regression line) is drawn through the data.
* **Marginal Distributions:**
* **Top Marginal Plot:** A histogram or density plot aligned with the x-axis ("Target Length"). It shows the distribution of the Target Length variable.
* **Right Marginal Plot:** A histogram or density plot aligned with the y-axis ("Confidence"). It shows the distribution of the Confidence variable.
### Detailed Analysis
* **Data Point Distribution:** The scatter points are densely clustered in the lower-left quadrant of the plot. The highest density appears for Target Length values between approximately 10 and 80, and Confidence values between 0.0 and 0.3.
* **Trend Line:** The dark purple trend line shows a clear, positive linear slope. It starts near a Confidence of ~0.1 at Target Length 0 and rises to a Confidence of ~0.3 at Target Length 150 (estimated).
* **Marginal Histogram Details:**
* **Target Length (Top):** The distribution is right-skewed. The highest frequency (tallest bar) is at the lower end of the scale (near 0-20). The frequency decreases as Target Length increases.
* **Confidence (Right):** The distribution is right-skewed. The highest frequency is at the lower end of the confidence scale (near 0.0-0.1). Frequency drops sharply as confidence increases towards 0.5 and above.
### Key Observations
1. **Positive Correlation:** There is a visible positive correlation between Target Length and Confidence. As Target Length increases, Confidence tends to increase, as confirmed by the upward-sloping trend line.
2. **Data Sparsity:** Data points become significantly sparser for Target Length values greater than ~100 and for Confidence values greater than ~0.4.
3. **Outliers:** A few data points exist with relatively high Confidence (>0.5) across various Target Lengths, but they are not the norm.
4. **Concentration:** The vast majority of the data is concentrated in the region of lower Target Length and lower Confidence.
### Interpretation
This chart suggests that within the context of "high_school_biology" (which could refer to model predictions, student assessments, or content analysis), there is a measurable, positive relationship between the length of a target (e.g., a text passage, a question, a concept) and the associated confidence metric. The data implies that longer targets are, on average, associated with higher confidence scores.
However, the right-skewed distribution of Target Length indicates that most targets in this dataset are relatively short. The strong right skew of Confidence shows that low-confidence outcomes are far more common than high-confidence ones. The positive trend, while clear, operates within a low-confidence regime for the majority of cases. The sparsity of data at higher values means the trend's predictive power may be weaker in those regions. This could indicate that achieving high confidence in this domain is difficult, or that the system/model being measured is generally conservative in its confidence assignments, with length being one positive, but not sole, contributing factor.
</details>
|
<details>
<summary>x35.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Histograms: High School Chemistry Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with a fitted regression line and marginal histograms. It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset labeled "high_school_chemistry". The plot uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "high_school_chemistry" (centered at the top).
* **Y-Axis:**
* **Label:** "Confidence" (rotated vertically on the left).
* **Scale:** Linear scale from approximately 0.25 to 0.75.
* **Major Ticks:** 0.25, 0.50, 0.75.
* **X-Axis:**
* **Label:** "Target Length" (centered at the bottom).
* **Scale:** Linear scale from 0 to slightly beyond 100.
* **Major Ticks:** 0, 100.
* **Data Series:**
* **Scatter Points:** Numerous purple dots representing individual data points.
* **Regression Line:** A solid, darker purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent, lighter purple shaded band around the regression line.
* **Marginal Histograms:**
* **Top Histogram:** Shows the distribution of the "Target Length" variable. It is positioned above the main plot area.
* **Left Histogram:** Shows the distribution of the "Confidence" variable. It is positioned to the left of the main plot area.
* **Legend:** No explicit legend box is present. The color purple is used consistently for all data elements (scatter, line, histograms).
### Detailed Analysis
* **Data Distribution & Density:**
* The scatter points are most densely clustered in the lower-left quadrant of the plot, corresponding to **Target Length values between 0 and ~50** and **Confidence values between 0.25 and 0.50**.
* Data becomes sparser as both Target Length and Confidence increase.
* There are a few outlier points with high Confidence (>0.60) scattered across various Target Lengths.
* **Trend Analysis (Regression Line):**
* The regression line exhibits a **slight positive slope**, indicating a weak positive correlation between Target Length and Confidence.
* The line starts at a Confidence value of approximately **0.35** when Target Length is 0.
* It rises to a Confidence value of approximately **0.45** when Target Length is 100.
* The shaded confidence interval band widens slightly as Target Length increases, suggesting greater uncertainty in the trend estimate for longer target lengths.
* **Marginal Histogram Details:**
* **Target Length Distribution (Top):** The histogram is strongly **right-skewed**. The highest frequency bin is at the far left (Target Length near 0). Frequency drops off sharply as length increases, with a long tail extending past 100.
* **Confidence Distribution (Left):** This histogram is also **right-skewed**. The mode (highest bar) is in the bin just above 0.25. The frequency generally decreases as Confidence increases, with a notable drop-off above 0.50.
### Key Observations
1. **Weak Positive Correlation:** There is a discernible but weak tendency for Confidence to increase as Target Length increases.
2. **Clustered Low Values:** The vast majority of observations have both a short Target Length (<50) and low-to-moderate Confidence (0.25-0.50).
3. **Skewed Distributions:** Both variables are not normally distributed; they are heavily skewed towards lower values.
4. **Increased Uncertainty:** The model's confidence in its predicted trend (the regression line) decreases for longer target lengths, as shown by the widening confidence band.
### Interpretation
This chart suggests that within the context of "high_school_chemistry," tasks or items with shorter target lengths (perhaps shorter answers, simpler problems, or less content) are associated with lower confidence scores. The weak positive slope implies that as the target length increases, there is a slight, but not strong, tendency for confidence to also increase.
The heavily skewed distributions are the most striking feature. They indicate that the dataset is dominated by instances of short target length and low confidence. This could mean that in this chemistry domain, most evaluated items are brief and elicit only moderate confidence, or that the measurement scale for confidence is not being fully utilized. The few high-confidence outliers are interesting exceptions that may warrant separate investigation. The visualization effectively shows that while a general trend exists, the relationship is noisy and the data is not evenly distributed across the range of either variable.
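The endpoint readings above (Confidence ≈ 0.35 at Target Length 0, ≈ 0.45 at Target Length 100) imply a slope of roughly 0.001 confidence per unit of length. A minimal ordinary-least-squares sketch recovers that slope from synthetic points placed on the described line; `ols_fit` is an illustrative helper, not code from the paper:

```python
# Minimal ordinary-least-squares fit (synthetic data; the actual
# per-question confidences from the figure are not reproduced here).
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Points lying exactly on the described trend: 0.35 at x=0, 0.45 at x=100.
xs = [0, 25, 50, 75, 100]
ys = [0.35 + 0.001 * x for x in xs]
slope, intercept = ols_fit(xs, ys)
```

On real, noisy data the fit would of course not be exact, but the slope magnitude is a useful sanity check on how weak a "weak positive correlation" actually is here: a full 100-character increase in target length buys only about 0.10 in confidence.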
</details>
|
<details>
<summary>x36.png Details</summary>

### Visual Description
## Scatter Plot with Regression and Marginal Distributions: High School Computer Science Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with an overlaid linear regression line and its confidence interval. It also includes marginal distribution plots (histograms or density plots) for both variables. The chart explores the relationship between "Target Length" and "Confidence" within the context of "high_school_computer_science."
### Components/Axes
* **Title:** `high_school_computer_science` (positioned at the top center).
* **Y-Axis:**
* **Label:** `Confidence`
* **Scale:** Linear, ranging from 0.00 to 0.75. Major tick marks are visible at 0.00, 0.25, 0.50, and 0.75.
* **X-Axis:**
* **Label:** `Target Length`
* **Scale:** Linear, ranging from 0 to 200. Major tick marks are visible at 0, 100, and 200.
* **Data Series:**
* **Scatter Points:** Numerous purple dots representing individual data points.
* **Regression Line:** A solid purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent purple shaded area surrounding the regression line, representing the uncertainty of the fit.
* **Marginal Plots:**
* **Top (above main chart):** A distribution plot for the `Target Length` variable. It shows a high density of points at very low target lengths, tapering off as length increases.
* **Right (beside main chart):** A distribution plot for the `Confidence` variable. It shows a broad distribution, with a peak around 0.4-0.5 and a long tail extending towards higher confidence values.
* **Legend:** No separate legend is present. The consistent purple color scheme for all elements (points, line, interval, marginal plots) implies they belong to the same dataset or analysis.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot shows a wide dispersion of purple dots. There is a dense cluster of points with low `Target Length` (approximately 0-50) across a broad range of `Confidence` values (from near 0.00 to above 0.75). As `Target Length` increases beyond 100, the points become sparser.
* **Regression Trend:** The solid purple regression line exhibits a clear **positive slope**. It starts at a `Confidence` value of approximately 0.35 when `Target Length` is 0 and rises to approximately 0.55 when `Target Length` is 200. This indicates a general trend where confidence tends to increase with target length.
* **Confidence Interval:** The shaded purple area around the regression line is narrowest at lower `Target Length` values (where data is dense) and widens significantly as `Target Length` increases towards 200, indicating greater uncertainty in the trend estimate for longer targets due to fewer data points.
* **Marginal Distributions:**
* The top marginal plot confirms the observation from the scatter: the vast majority of samples have a short `Target Length`, with a sharp peak near 0.
* The right marginal plot shows that `Confidence` scores are most frequently in the 0.3 to 0.6 range, with a notable number of instances achieving high confidence (>0.7).
### Key Observations
1. **Positive Correlation:** The primary visual trend is the upward-sloping regression line, suggesting a positive relationship between the length of a target (e.g., a code solution, an answer) and the confidence associated with it in this high school computer science context.
2. **High Variance at Low Lengths:** For short targets (`Target Length` < 50), confidence values are extremely variable, spanning almost the entire observed range from ~0.05 to ~0.80.
3. **Data Sparsity:** There is a significant lack of data points for `Target Length` values greater than approximately 150, which contributes to the widening confidence interval and makes the trend less reliable in that region.
4. **Outliers:** Several data points exist with very high confidence (>0.75) across various target lengths, including some at relatively short lengths. Conversely, a few points show near-zero confidence.
### Interpretation
The data suggests that in the analyzed high school computer science setting, there is a modest but discernible tendency for longer responses or solutions (higher `Target Length`) to be associated with higher confidence scores. However, this relationship is not strong or deterministic, as evidenced by the high scatter of points.
The most critical insight comes from the **marginal distributions and point density**. The overwhelming concentration of data at very low target lengths indicates that the typical interaction or task in this dataset involves short outputs. The high variance in confidence for these short outputs is strikingāit implies that for brief computer science tasks, confidence is highly unpredictable and may depend on factors other than length (e.g., problem difficulty, student expertise).
The widening confidence interval for longer targets is a crucial caveat. It signals that the apparent positive trend is based on sparse data and should not be over-interpreted. The model or analysis is less certain about the relationship for longer targets.
**In summary:** While the regression line hints that "longer might be more confident," the more powerful story is in the data's shape: most tasks are short, and for those short tasks, confidence is all over the place. The chart highlights a potential weak correlation but, more importantly, reveals the underlying distribution and limitations of the dataset.
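The "weak positive correlation" wording used throughout these descriptions has a standard numeric counterpart, the Pearson coefficient. A minimal sketch from first principles, on hypothetical length/confidence pairs (both the helper and the data are illustrative assumptions, not values from the figure):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A noisy positive relationship: confidence drifts up with length
# but scatters widely, as in the plot.
xs = [10, 20, 30, 50, 80, 120, 150, 200]
ys = [0.10, 0.60, 0.30, 0.45, 0.35, 0.55, 0.40, 0.60]
r = pearson_r(xs, ys)
```

A value of r well below 1 with the sign still positive is exactly the visual situation here: an upward-sloping line through a cloud of points that individually disagree with it.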
</details>
|
<details>
<summary>x37.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School European History Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with overlaid regression analysis and marginal distribution plots. It examines the relationship between "Target Length" and "Confidence" for a dataset or model labeled "high_school_european_history". The plot includes a main scatter plot, a fitted regression line with a confidence interval, and marginal histograms/density plots on the top and right edges.
### Components/Axes
* **Chart Title/Label:** `high_school_european_history` (located in the top-left corner, serving as the legend label).
* **X-Axis:**
* **Label:** `Target Length`
* **Scale:** Linear, ranging from 0 to 200. Major tick marks are at 0, 100, and 200.
* **Y-Axis:**
* **Label:** `Confidence`
* **Scale:** Linear, ranging from 0.0 to 1.0. Major tick marks are at 0.0, 0.5, and 1.0.
* **Legend:** Positioned in the top-left corner of the main plot area. It consists of a purple square symbol followed by the text `high_school_european_history`. This color corresponds to all data points, the regression line, and the marginal plots.
* **Data Series:** A single series represented by purple circular markers.
* **Regression Line:** A solid purple line showing the best-fit linear trend through the data.
* **Confidence Interval:** A semi-transparent purple shaded region surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Plots:**
* **Top Marginal Plot:** A histogram/density plot aligned with the X-axis (`Target Length`). It shows the distribution of the independent variable.
* **Right Marginal Plot:** A histogram/density plot aligned with the Y-axis (`Confidence`). It shows the distribution of the dependent variable.
### Detailed Analysis
* **Data Distribution:** The scatter plot contains approximately 150-200 data points (purple circles). The points are densely clustered in the region where `Target Length` is between approximately 20 and 150, and `Confidence` is between 0.4 and 1.0.
* **Trend Verification:** The purple regression line exhibits a clear, gentle upward slope from left to right. This indicates a positive correlation: as `Target Length` increases, `Confidence` tends to increase.
* **Numerical Data Points (Approximate):**
* The regression line starts at approximately (`Target Length`=0, `Confidence`=0.65).
* It ends at approximately (`Target Length`=200, `Confidence`=0.80).
* The shaded confidence interval is narrowest around the center of the data mass (`Target Length` ~75-100) and widens towards the extremes (especially near `Target Length`=200), indicating greater uncertainty in the trend at very low or high target lengths.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The highest density of points occurs between `Target Length` values of approximately 50 and 125. The frequency tapers off significantly beyond 150.
* **Confidence (Right):** The distribution is left-skewed. The highest density of points occurs between `Confidence` values of approximately 0.7 and 0.95. There is a long tail extending down to 0.0, but very few points exist below 0.3.
### Key Observations
1. **Positive but Noisy Relationship:** While the overall trend is positive, there is substantial scatter. For any given `Target Length`, there is a wide range of `Confidence` values. For example, at `Target Length` ~50, confidence values span from below 0.2 to nearly 1.0.
2. **Data Sparsity at Extremes:** Very few data points exist for `Target Length` < 20 or > 180. The regression and its confidence interval are less reliable in these sparse regions.
3. **Concentration of High Confidence:** The majority of data points have a `Confidence` score above 0.5, with a particularly dense cluster between 0.7 and 0.95.
4. **Outliers:** Several notable outliers exist, primarily in the lower-left quadrant (low target length, low confidence) and a few in the lower-right (high target length, low confidence). These points pull the regression line down and contribute to the width of the confidence interval.
### Interpretation
This visualization suggests that for the "high_school_european_history" context (likely a model's performance on a specific task or dataset), there is a modest positive association between the length of a target (e.g., a text passage, answer, or sequence) and the model's confidence in its output. However, the relationship is weak and subject to high variance.
The key insight is that **longer targets are not a reliable predictor of high confidence**. While the average confidence increases slightly with length, many long targets still receive low confidence scores, and many short targets receive high confidence. The marginal plots confirm that the evaluation dataset is not uniform; it is dominated by medium-length targets and medium-to-high confidence predictions. The widening confidence interval at high target lengths warns against over-interpreting the trend for very long targets due to insufficient data. This analysis would be crucial for understanding model behavior, identifying potential biases in the evaluation set, and guiding improvementsāperhaps by investigating the causes of low confidence in otherwise long targets.
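The widening of the shaded band away from the center of the data mass is not specific to this figure; it follows from the textbook pointwise standard error of a fitted simple-regression line (a standard identity, not taken from the source):

```latex
% Pointwise standard error of the fitted value \hat{y}(x^*) in simple
% linear regression with n points, residual s.d. s, and mean \bar{x}:
\mathrm{SE}\!\left(\hat{y}(x^*)\right)
  = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The band is narrowest at \(x^* = \bar{x}\) and grows with \((x^* - \bar{x})^2\) under the square root, which matches the narrowing around Target Length ~75-100 (the center of the data) and the flaring near 0 and 200 described above.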
</details>
|
<details>
<summary>x38.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: high_school_geography
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (density curves) on the top and right sides. It displays the relationship between "Target Length" and "Confidence" for a dataset labeled "high_school_geography". The plot includes a fitted trend line and uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "high_school_geography" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 120.
* **Major Tick Marks:** 0, 50, 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.00 to approximately 0.85.
* **Major Tick Marks:** 0.00, 0.25, 0.50, 0.75.
* **Legend:**
* **Placement:** Top-left corner of the main plot area.
* **Content:** A purple square symbol followed by the text "high_school_geography".
* **Data Series:**
* **Type:** Scatter plot (individual data points).
* **Color:** Purple (matching the legend).
* **Trend Line:** A solid, darker purple line showing a linear regression fit.
* **Marginal Plots:**
* **Top Marginal:** A density plot (smoothed histogram) showing the distribution of the "Target Length" variable. It is right-skewed, with the highest density near 0.
* **Right Marginal:** A density plot showing the distribution of the "Confidence" variable. It is also right-skewed, with the highest density near 0.00-0.10.
### Detailed Analysis
* **Data Point Distribution:** The purple data points are densely clustered in the lower-left quadrant of the plot. The highest concentration appears where "Target Length" is between 0 and 50 and "Confidence" is between 0.00 and 0.50.
* **Trend Line:** The regression line has a clear negative slope. It originates at approximately (Target Length=0, Confidence=0.35) and descends to approximately (Target Length=100, Confidence=0.20). This indicates a negative correlation between the two variables.
* **Marginal Distributions:**
* The **Target Length** distribution peaks sharply near 0 and has a long tail extending to the right, indicating most targets are short, with a few very long ones.
* The **Confidence** distribution peaks near 0.05 and declines steadily, indicating most predictions have low confidence scores.
* **Outliers:** There are a few scattered points with high Confidence (>0.75), primarily associated with shorter Target Lengths (<50). There are also points with very long Target Lengths (>100) that have low to moderate Confidence.
### Key Observations
1. **Negative Correlation:** The primary observation is the inverse relationship between Target Length and Confidence. As the length of the target increases, the model's confidence in its prediction tends to decrease.
2. **Data Skew:** Both variables are heavily right-skewed. The dataset is dominated by short targets and low-confidence predictions.
3. **High-Confidence Cluster:** The subset of predictions with high confidence (>0.75) is almost exclusively associated with shorter target lengths.
4. **Variance:** The scatter of points around the trend line is substantial, indicating that Target Length alone is not a perfect predictor of Confidence. There is significant variability in confidence for any given target length.
### Interpretation
This chart suggests that for the "high_school_geography" task or dataset, the model's confidence is negatively impacted by the length of the target it is trying to predict or generate. This is a common pattern in language and information tasks: longer, more complex outputs are harder to produce with high certainty.
The heavy skew in both distributions implies the evaluation set or task is composed mostly of short, simple targets, for which the model often has low confidence. The outliers with high confidence on short targets represent the model's "sweet spot." The presence of long targets with low confidence highlights a potential area for model improvement or indicates inherently difficult examples.
The marginal plots are crucial here. They confirm that the dense cluster of points in the lower-left isn't just a visual artifact; it reflects the fundamental composition of the underlying data. The investigation should focus on why confidence is generally low and why it degrades further with length. Is it a limitation of the model architecture, the quality of the training data for long targets, or an inherent property of the geography task itself?
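"Right-skewed" also has a standard numeric counterpart: the Fisher-Pearson skewness coefficient, which is positive when mass sits at low values with a long tail to the right. A minimal sketch on an illustrative length sample (not the actual evaluation data):

```python
# Fisher-Pearson sample skewness. Positive g1 = right-skew: most values
# small, with a long tail of large values, as in both marginals here.
def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# Mostly-short target lengths with a long right tail, mimicking the
# shape of the top marginal plot.
lengths = [5, 8, 10, 12, 15, 20, 25, 30, 60, 110]
g1 = skewness(lengths)
```

Computing this for both variables would let the qualitative claim "heavily right-skewed" be reported as two numbers alongside each figure.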
</details>
|
<details>
<summary>x39.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School Government and Politics
### Overview
The image is a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It visualizes the relationship between "Target Length" and "Confidence" for a dataset labeled "high_school_government_and_politics". The plot uses a single data series represented by purple points.
### Components/Axes
* **Title:** `high_school_government_and_politics` (positioned at the top center).
* **X-Axis (Main Plot):** Labeled `Target Length`. The scale runs from 0 to 200, with major tick marks at 0, 100, and 200.
* **Y-Axis (Main Plot):** Labeled `Confidence`. The scale runs from 0.25 to 0.75, with major tick marks at 0.25, 0.50, and 0.75.
* **Data Series:** A single series of data points, all colored purple. There is no separate legend, as only one category is plotted.
* **Marginal Plots:**
* **Top Marginal Plot:** Aligned with the X-axis (`Target Length`). It shows the distribution of the Target Length variable. The shape is right-skewed, with the highest density between approximately 50 and 100.
* **Right Marginal Plot:** Aligned with the Y-axis (`Confidence`). It shows the distribution of the Confidence variable. The shape is right-skewed, with the highest density between approximately 0.25 and 0.50.
* **Reference Line:** A faint, horizontal grey line is present at `Confidence = 0.50`, spanning the width of the main plot.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot shows a cloud of purple points. The density is highest in the lower-left quadrant, specifically for `Target Length` values between approximately 20-120 and `Confidence` values between 0.25-0.50.
* **Trend Verification:** There is a weak, negative visual trend. As `Target Length` increases, the cloud of points shows a slight tendency to drift downward on the `Confidence` axis. However, the relationship is very noisy with high variance.
* **Key Data Points & Ranges:**
* **Target Length:** Most data points fall between ~10 and ~180. The marginal plot confirms the mode is around 50-100.
* **Confidence:** Most data points fall between ~0.25 and ~0.65. The marginal plot confirms the mode is around 0.3-0.4.
* **Outliers:** A few points exist with high `Target Length` (>150) and moderate `Confidence` (~0.5). One notable point is near `(Target Length ā 200, Confidence ā 0.5)`.
* **Spatial Grounding:** The highest density of points is in the center-left region of the plot. The marginal plots are positioned directly above and to the right of the main chart area, sharing the same axis scales.
### Key Observations
1. **Inverse Relationship:** There is a general, though weak, inverse relationship between `Target Length` and `Confidence`. Longer targets are associated with slightly lower confidence scores on average.
2. **High Variability:** For any given `Target Length`, especially in the 50-150 range, there is a very wide spread of `Confidence` values (from ~0.25 to ~0.75). This indicates that `Target Length` alone is a poor predictor of `Confidence`.
3. **Distribution Skew:** Both variables are not normally distributed. `Target Length` is right-skewed (many short targets, few very long ones). `Confidence` is also right-skewed (many low-confidence scores, fewer high-confidence scores).
4. **Central Tendency:** The horizontal line at `Confidence = 0.50` serves as a visual midpoint. The bulk of the data lies below this line, indicating that confidence scores for this dataset are generally on the lower side of the 0-1 scale (though the axis shown is truncated to 0.25-0.75).
### Interpretation
The data suggests that within the context of "high_school_government_and_politics" (which could refer to model predictions on educational content, essay grading, or similar tasks), the length of a target output (e.g., an answer, essay) has a modest negative correlation with the confidence of the system generating or evaluating it. The high variance implies that other, unmeasured factors are much stronger determinants of confidence than length alone.
The skewed distributions are significant. The right skew in `Target Length` indicates the task or dataset predominantly involves shorter responses. The right skew in `Confidence` is more critical; it suggests the system is frequently uncertain about its outputs for this domain. This could point to the inherent complexity of the subject matter, limitations in the model's training data for government and politics, or ambiguity in the evaluation criteria.
The outlier at the far right (`Target Length ā 200`) with moderate confidence is interesting. It represents a rare, very long response that the system was reasonably confident about, potentially indicating a well-structured, comprehensive answer that the model recognized as correct.
**In summary, the plot reveals a domain where outputs are typically short and met with low-to-moderate confidence, with length being a minor, negative factor. The primary insight is the need to investigate the causes of low confidence, as length is not a sufficient explanation.**
</details>
|
<details>
<summary>x40.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School Macroeconomics Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It displays the relationship between "Target Length" and "Confidence" for a dataset labeled "high_school_macroeconomics". The overall aesthetic uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "high_school_macroeconomics" (located in the top-left corner, above the plot area).
* **Main Plot Area:** A scatter plot with a fitted trend line.
* **X-Axis (Horizontal):**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 150.
* **Major Tick Marks:** Labeled at 0 and 100.
* **Y-Axis (Vertical):**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.00 to 0.75.
* **Major Tick Marks:** Labeled at 0.00, 0.25, 0.50, and 0.75.
* **Legend:** Positioned in the top-right corner of the main plot area. It contains a purple square symbol and the text "high_school_macroeconomics".
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (appears to be a histogram or kernel density estimate) aligned with the X-axis ("Target Length"). It shows the frequency distribution of the Target Length variable.
* **Right Marginal Plot:** A distribution plot aligned with the Y-axis ("Confidence"). It shows the frequency distribution of the Confidence variable.
### Detailed Analysis
* **Data Series:** A single data series is plotted, represented by semi-transparent purple circles. The legend confirms this series corresponds to "high_school_macroeconomics".
* **Trend Line:** A solid, darker purple line is overlaid on the scatter plot. It represents a linear regression fit to the data.
* **Spatial Distribution & Trend Verification:**
* The data points are densely clustered in the lower-left quadrant of the plot.
* **Trend:** The fitted line has a slight positive slope, indicating a weak positive correlation. It starts near a Confidence value of ~0.20 at Target Length 0 and rises to approximately ~0.30 at Target Length 150.
* **Data Density:** The highest concentration of points occurs where Target Length is between 0 and 80, and Confidence is between 0.00 and 0.50.
* **Outliers:** A few scattered points exist with higher Confidence values (above 0.50), primarily at lower Target Lengths (below 50). One notable point is near (Target Length ~10, Confidence ~0.80).
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The peak density (mode) appears to be at a low Target Length value, likely between 10 and 30. The tail extends towards higher values, consistent with the scatter plot's x-axis range.
* **Confidence (Right):** The distribution is also right-skewed. The peak density is at a low Confidence value, likely between 0.10 and 0.25. The tail extends upwards, but the density drops significantly above 0.50.
### Key Observations
1. **Weak Positive Correlation:** The primary observation is a weak, positive linear relationship between Target Length and Confidence. Longer target lengths are associated with slightly higher confidence scores, but the relationship is not strong.
2. **High Density at Low Values:** The vast majority of data points have both a short Target Length (<80) and low-to-moderate Confidence (<0.50).
3. **Right-Skewed Variables:** Both variables in this dataset are not normally distributed; they are right-skewed, meaning most observations have low values, with fewer instances of high values.
4. **Presence of High-Confidence Outliers:** There are a small number of instances with very high confidence (>0.75), which are almost exclusively associated with very short target lengths.
### Interpretation
This chart likely analyzes the performance or characteristics of a model or system related to "high school macroeconomics." The "Target Length" could refer to the length of a text answer, a sequence, or a task. "Confidence" likely represents a model's predicted probability or certainty score.
The data suggests that for this macroeconomics domain, the system is most frequently dealing with shorter targets and expresses low-to-moderate confidence in its outputs. The weak positive correlation is intriguing: it implies that as the target becomes longer, the system's confidence increases only marginally. This could indicate that longer answers or tasks provide slightly more context, leading to a minor boost in confidence, but the effect is minimal.
The right-skewed distributions are critical. They show that high-confidence predictions are rare events in this dataset. The outliers with high confidence on very short targets might represent clear, unambiguous questions or statements that the model finds easy to classify or process. Conversely, the dense cluster of low-confidence, short-target points could represent ambiguous or difficult items where the model struggles.
The key inference is that **target length is not a strong driver of confidence in this macroeconomics context**. The system's confidence is predominantly low, and factors other than length, such as question clarity, topic familiarity, or data quality, are likely more significant determinants of its confidence level. This visualization would be valuable for diagnosing model behavior, identifying data biases, or guiding improvements in the underlying system.
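One quick way to check that a weak trend like this is not driven by a handful of points is to bin confidence by target length and compare bin means. A minimal sketch with hypothetical pairs (the data and bin edges are illustrative, not from the figure):

```python
# Mean confidence per target-length bin. If the weak positive trend is
# real, the per-bin means should increase with the bin's length range.
def binned_means(pairs, edges):
    """Mean confidence per [edges[i], edges[i+1]) length bin."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for length, conf in pairs:
        for i in range(len(edges) - 1):
            if edges[i] <= length < edges[i + 1]:
                sums[i] += conf
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical (length, confidence) pairs with a gentle upward drift.
pairs = [(10, 0.18), (20, 0.22), (35, 0.20), (60, 0.26),
         (75, 0.24), (100, 0.28), (130, 0.30), (145, 0.27)]
means = binned_means(pairs, edges=[0, 50, 100, 150])
```

If the bin means rise monotonically while individual points scatter, the regression slope reflects a genuine (if small) aggregate effect rather than a few influential outliers.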
</details>
|
<details>
<summary>x41.png Details</summary>

### Visual Description
## Scatter Plot with Regression Line: High School Mathematics Confidence vs. Target Length
### Overview
The image is a scatter plot chart titled "high_school_mathematics". It visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a set of data points. A linear regression trend line with a shaded confidence interval is overlaid on the data.
### Components/Axes
* **Title:** `high_school_mathematics` (positioned at the top center).
* **Y-Axis:**
* **Label:** `Confidence` (positioned vertically along the left side).
* **Scale:** Linear scale ranging from 0.0 to 0.6.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6.
* **X-Axis:**
* **Label:** `Target Length` (positioned horizontally at the bottom).
* **Scale:** Linear scale ranging from 0 to 50.
* **Major Tick Marks:** 0, 25, 50.
* **Data Series:**
* **Scatter Points:** Numerous individual data points represented as small, solid purple circles.
* **Trend Line:** A solid purple line representing a linear regression fit to the data.
* **Confidence Interval:** A semi-transparent, light purple shaded region surrounding the trend line, indicating the uncertainty or confidence band of the regression.
* **Legend/Key:** There is no separate legend box. The y-axis label "Confidence" also serves as the identifier for the data series, which is consistent with the purple color of the points and line.
### Detailed Analysis
* **Data Distribution:** The data points are densely clustered in the lower-left quadrant of the plot, specifically where `Target Length` is between approximately 0 and 20, and `Confidence` is between 0.0 and 0.4. The density of points decreases significantly as `Target Length` increases beyond 25.
* **Trend Line:** The purple regression line has a clear positive slope. It originates near a `Confidence` value of ~0.25 at `Target Length` 0 and rises to approximately 0.45 at `Target Length` 50.
* **Confidence Interval:** The shaded confidence band is narrowest near the center of the data mass (around `Target Length` 10-15) and widens considerably towards the right edge of the plot (`Target Length` > 40), indicating greater uncertainty in the trend prediction where data is sparse.
* **Outliers:** A few data points exist with relatively high `Confidence` (>0.5) at low `Target Length` (<10). Conversely, there are points with very low `Confidence` (<0.1) scattered across the `Target Length` range.
### Key Observations
1. **Positive Correlation:** There is a visible positive correlation between `Target Length` and `Confidence`. As the target length increases, the confidence score generally tends to increase.
2. **Heteroscedasticity:** The variance (spread) of the `Confidence` values appears greater at lower `Target Length` values. The data is more tightly clustered around the trend line at higher `Target Length` values, though this is influenced by the smaller number of data points there.
3. **Data Sparsity:** The dataset is heavily skewed towards shorter target lengths. Very few observations exist for `Target Length` values greater than 30.
4. **Uncertainty Visualization:** The widening confidence interval band explicitly communicates that the model's estimate of the relationship becomes less reliable as we move into the region of sparse data (high target length).
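The positive correlation and trend-line endpoints described above can be checked numerically with an ordinary least-squares fit; a minimal sketch using invented toy points (the actual values are not recoverable from the figure):

```python
import math

# Hypothetical toy (Target Length, Confidence) pairs; values invented
# to mimic the plot's dense short-length cluster with a mild upward drift.
xs = [2, 5, 8, 12, 15, 20, 25, 30, 40, 50]
ys = [0.22, 0.28, 0.25, 0.30, 0.27, 0.33, 0.35, 0.34, 0.40, 0.45]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
syy = sum((y - my) ** 2 for y in ys)

slope = sxy / sxx                 # OLS slope of the trend line
intercept = my - slope * mx       # fitted Confidence at Target Length 0
r = sxy / math.sqrt(sxx * syy)    # Pearson correlation coefficient

print(f"slope={slope:.4f} intercept={intercept:.3f} r={r:.3f}")
```

With these toy numbers the slope is positive and the intercept sits near 0.23, mirroring the visual trend of a line rising from about 0.25 at the origin.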
### Interpretation
The chart suggests that in the context of "high_school_mathematics," tasks or items with longer target lengths (which could refer to the length of a problem, solution, or text passage) are associated with higher measured confidence. This could imply several underlying phenomena:
* **Complexity vs. Confidence:** Longer problems might be perceived as more complex, and the confidence metric might be capturing a different dimension than pure accuracy, perhaps self-assurance or the richness of the response.
* **Metric Behavior:** The "Confidence" metric itself may be inherently biased or scaled in a way that correlates with output length.
* **Data Collection Bias:** The sparse data at high target lengths means the observed positive trend is driven primarily by the dense cluster of short-length items. The relationship for longer lengths is an extrapolation with high uncertainty, as shown by the wide confidence band. Any conclusions about very long targets (e.g., >40) are speculative based on this visualization.
**In summary, the data demonstrates a moderate positive linear relationship between target length and confidence, but this relationship is most reliably observed for shorter targets, and the predictive power diminishes significantly for longer ones.**
</details>
|
<details>
<summary>x42.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: high_school_microeconomics
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (density curves) on the top and right sides. The chart appears to analyze data related to "high_school_microeconomics," likely examining the relationship between a "Target Length" variable and an unnamed performance or score metric. The overall aesthetic uses a monochromatic purple color scheme against a white background with light gray grid lines.
### Components/Axes
* **Main Plot (Center):** A scatter plot.
* **X-axis:** Labeled "Target Length". The scale runs from 0 to approximately 200, with major tick marks at 0 and 100.
* **Y-axis:** **Unlabeled.** The scale runs from 0.00 to 1.00, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **Data Points:** Numerous circular points plotted in a medium purple color.
* **Reference Line:** A solid, horizontal purple line is drawn across the plot at approximately y = 0.35.
* **Top Marginal Plot:** A density plot (smoothed histogram) showing the distribution of the "Target Length" (x-axis) variable. It is filled with a light purple color.
* **Right Marginal Plot:** A density plot showing the distribution of the unnamed y-axis variable. It is also filled with a light purple color and is oriented vertically.
* **Title:** "high_school_microeconomics" is displayed at the top center of the entire graphic.
### Detailed Analysis
* **Data Distribution (Main Scatter Plot):**
* The data points are densely clustered in the lower-left quadrant of the plot.
* The highest concentration of points occurs between `Target Length` values of approximately **0 to 100** and y-axis values of **0.25 to 0.75**.
* There is a noticeable spread of points along the y-axis for any given short `Target Length` (e.g., at length ~50, y-values range from ~0.1 to ~0.9).
* As `Target Length` increases beyond 100, the number of data points decreases significantly. The few points with `Target Length` > 150 are sparse and scattered across a wide range of y-values.
* The horizontal reference line at **y ≈ 0.35** passes through the lower portion of the main data cluster. A significant number of points lie above this line, and a substantial number lie below it.
* **Marginal Distributions:**
* **Top (Target Length):** The density curve is unimodal and right-skewed. The peak (mode) is at a low `Target Length`, approximately **40-60**. The tail extends towards higher values, confirming the scarcity of data points beyond 100.
* **Right (Unnamed Y-variable):** The density curve is roughly unimodal and symmetric, centered near **y = 0.5**. The distribution appears relatively broad, spanning from near 0 to 1, which matches the vertical spread seen in the scatter plot.
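Marginal density curves like the ones described above are typically kernel density estimates; a minimal Gaussian KDE sketch using only the standard library, with hypothetical lengths standing in for the real data:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating density with a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                          for d in data)
    return density

# Hypothetical target lengths: clustered near 50 with a long right tail,
# echoing the right-skewed top marginal described above.
lengths = [30, 40, 45, 50, 55, 60, 70, 90, 140, 200]
kde = gaussian_kde(lengths, bandwidth=15.0)

# Density is high at the cluster and low in the sparse tail.
print(kde(50), kde(180))
```

The bandwidth controls the smoothing; plotting libraries usually pick it automatically (e.g., via Scott's or Silverman's rule).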
### Key Observations
1. **Inverse Density Relationship:** There is a clear inverse relationship between the density of data points and the `Target Length`. Shorter targets are heavily overrepresented in the dataset.
2. **High Variance at Short Lengths:** For short `Target Length` values (0-100), the outcome variable (y-axis) shows very high variance, covering almost its entire possible range (0 to 1).
3. **Central Tendency Marker:** The horizontal line at y ≈ 0.35 serves as a visual benchmark. It does not appear to be a line of best fit, as the data does not show a strong linear trend. It may represent a mean, median, or a predefined threshold.
4. **Missing Critical Label:** The y-axis lacks a descriptive label, which is a significant omission for interpreting the chart's meaning. We can only refer to it as an "unnamed score or metric."
### Interpretation
This chart likely explores the relationship between the length of a learning target or assignment ("Target Length") and some measure of student performance or engagement (the unnamed y-axis) in a high school microeconomics context.
* **What the data suggests:** The data suggests that most learning targets or assignments are relatively short (clustered below length 100). For these common, shorter tasks, student outcomes are highly variable: some students achieve very high scores (near 1.0), while others score very low (near 0.0). This implies that for short tasks, factors other than length (e.g., student preparation, topic difficulty, instruction quality) are the primary drivers of the outcome.
* **The role of the reference line:** The line at y ≈ 0.35 could represent a passing threshold, a historical average, or a target benchmark. The fact that the data is widely dispersed around it indicates that meeting this benchmark is not strongly correlated with target length alone.
* **The outlier region:** The sparse data points for very long targets (Length > 150) are interesting. Their scattered y-values suggest that when very long assignments are given, outcomes are unpredictable: they can be high or low. The lack of data here makes it difficult to draw firm conclusions about the effect of extremely long targets.
* **Underlying question:** The visualization prompts an investigation into what causes the high variance in outcomes for short targets. It argues against using target length as a simple predictor of the measured outcome and points to the need for more nuanced factors in understanding student performance in microeconomics.
</details>
|
Figure 12: Confidence versus Target Length for various MMLU subsets. A horizontal regression line indicates weak correlation of confidence with the target length. See figs. 13 and 14 for other subsets.
|
<details>
<summary>x43.png Details</summary>

### Visual Description
## Scatter Plot with Regression: High School Physics Confidence vs. Target Length
### Overview
The image is a scatter plot titled "high_school_physics" that visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis). It includes a fitted regression line with a shaded confidence interval. The plot appears to be generated by a statistical or data visualization library (e.g., Seaborn in Python).
### Components/Axes
* **Title:** "high_school_physics" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length" (centered below the axis).
* **Scale:** Linear scale from 0 to approximately 250.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** "Confidence" (centered to the left, rotated 90 degrees).
* **Scale:** Linear scale from approximately 0.1 to 0.7.
* **Major Tick Marks:** 0.2, 0.4, 0.6.
* **Data Series:**
* **Scatter Points:** Numerous small, semi-transparent purple circles representing individual data points.
* **Trend Line:** A solid, darker purple line representing a linear regression fit.
* **Confidence Interval:** A lighter purple shaded region surrounding the trend line, indicating the uncertainty of the fit.
* **Marginal Distributions:** Faint, semi-transparent purple density plots (histograms/KDEs) are visible along the top (for Target Length) and right side (for Confidence) of the main plot area, showing the distribution of each variable independently.
### Detailed Analysis
* **Data Distribution:**
* The majority of data points are clustered in the lower-left quadrant of the plot, specifically where **Target Length is between 0 and 100** and **Confidence is between 0.2 and 0.4**.
* There is a high density of points near the origin (low Target Length, low Confidence).
* Data becomes sparser as Target Length increases beyond 100.
* **Trend Line & Relationship:**
* The regression line shows a **slight positive slope**. It originates at approximately **(Target Length=0, Confidence=0.25)** and ends near **(Target Length=200, Confidence=0.4)**.
* This indicates a **weak positive correlation**: as the Target Length increases, the Confidence score tends to increase slightly.
* The shaded confidence interval around the line is narrow at low Target Lengths (where data is dense) and widens significantly as Target Length increases (where data is sparse), indicating greater uncertainty in the trend for longer targets.
* **Outliers & Spread:**
* Several points exhibit **high Confidence (>0.5)**, primarily occurring at **Target Lengths between ~30 and 120**.
* The highest visible Confidence value is approximately **0.65**, occurring at a Target Length of roughly 50.
* For any given Target Length, there is substantial vertical spread in Confidence values, suggesting other factors beyond length influence confidence.
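The widening of the band away from the data mass follows from the standard error of the fitted mean, which grows with distance from the mean of x; a hedged sketch with invented values:

```python
import math

# Hypothetical toy (Target Length, Confidence) pairs; values invented.
xs = [10, 20, 30, 40, 50, 60, 70, 80]
ys = [0.25, 0.30, 0.28, 0.33, 0.31, 0.35, 0.34, 0.38]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
intercept = my - slope * mx

# Residual variance of the fit (n - 2 degrees of freedom).
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
s2 = sum(e * e for e in residuals) / (n - 2)

def se_mean(x0):
    """Standard error of the fitted mean at x0; grows away from the x mean."""
    return math.sqrt(s2 * (1.0 / n + (x0 - mx) ** 2 / sxx))

# Narrow near the data's center, wide where data is sparse.
print(se_mean(mx), se_mean(200))
```

The `(x0 - mx)**2 / sxx` term is what makes the shaded band flare out for long targets, exactly the pattern described in the figure.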
### Key Observations
1. **Cluster Dominance:** The visual narrative is dominated by a dense cluster of points representing short-to-medium length targets with low-to-moderate confidence.
2. **Positive but Noisy Trend:** While the overall trend is positive, the relationship is noisy with high variance, especially at longer target lengths.
3. **Asymmetric Uncertainty:** The model's confidence in its own trend (the shaded CI) is much lower for predictions about longer target lengths due to sparse data.
4. **Absence of Very Low Confidence:** There are virtually no data points with Confidence below ~0.15, suggesting a baseline level of confidence in the model or measurement.
### Interpretation
This plot likely analyzes the performance of an AI model or a scoring system on high school physics problems. "Target Length" probably refers to the length of the expected answer or solution step, while "Confidence" is the model's self-assessed probability of being correct.
The data suggests that the system is slightly **more confident when producing longer answers**. This weak positive correlation is counter-intuitive at first glance: one might expect longer, more complex answers to correlate with lower confidence. However, this could indicate that:
* Longer answers are required for more straightforward, procedural problems where the model can follow a known algorithm step-by-step, leading to higher confidence.
* The model's confidence calibration is imperfect, as it shows high confidence even for some medium-length answers where the actual accuracy might be variable (evidenced by the vertical spread).
* The sparse data for very long answers (>200) makes any conclusion about the model's behavior on extremely complex problems unreliable.
The marginal distributions confirm that most problems in this dataset have short answer lengths and moderate confidence scores. The plot serves as a diagnostic tool, highlighting that while a general trend exists, the model's confidence is highly variable and should not be solely relied upon, especially for longer-form responses where data is limited.
</details>
|
<details>
<summary>x44.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School Psychology Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It displays the relationship between "Target Length" and "Confidence" for a dataset labeled "high_school_psychology". The plot includes a fitted trend line with a confidence interval.
### Components/Axes
* **Title:** `high_school_psychology` (positioned at the top center).
* **Main Chart Area:**
* **X-Axis:** Labeled `Target Length`. The axis has major tick marks at `0`, `100`, and `200`.
* **Y-Axis:** Labeled `Confidence`. The axis has major tick marks at `0.00`, `0.25`, `0.50`, and `0.75`.
* **Data Series:** Individual data points are represented as semi-transparent purple circles.
* **Trend Line:** A solid purple line runs through the data, showing the general trend. It is surrounded by a lighter purple shaded area representing the confidence interval for the trend.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) for the `Target Length` variable, aligned with the x-axis.
* **Right Marginal Plot:** A distribution plot for the `Confidence` variable, aligned with the y-axis. This plot is oriented vertically.
* **Legend:** There is no explicit legend box. The color purple is used consistently for all data points, the trend line, and the marginal distributions, indicating they belong to the same dataset.
### Detailed Analysis
* **Data Distribution & Trend:**
* **Trend Verification:** The purple trend line slopes upward from left to right, indicating a positive correlation between `Target Length` and `Confidence`. As the target length increases, the confidence score tends to increase.
* **Data Point Spread:** The data points (purple circles) are widely scattered, showing high variance. For any given `Target Length`, there is a broad range of `Confidence` values.
* **Density:** The highest concentration of data points appears in the lower-left quadrant, where `Target Length` is between approximately 0-100 and `Confidence` is between 0.00-0.50. The density of points decreases as both values increase.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The highest density is near 0, with a long tail extending towards 200 and beyond. This indicates most targets are short, with fewer long targets.
* **Confidence (Right):** The distribution appears relatively uniform or slightly left-skewed across the range from 0.00 to 0.75, with a possible minor peak near 0.50. There are very few data points with confidence near the maximum of 1.00 (the axis limit is 0.75, but the trend line extends slightly above it).
### Key Observations
1. **Positive but Noisy Relationship:** There is a clear positive trend, but the relationship is not strong or precise due to the high scatter of points.
2. **Asymmetric Data Range:** The `Target Length` variable has a much wider observed range (0 to >200) compared to the `Confidence` variable, which is bounded between 0.00 and approximately 0.80 in this sample.
3. **Clustering at Low Values:** A significant cluster of data exists for short target lengths with low-to-moderate confidence.
4. **Absence of High-Confidence Data:** There is a notable lack of data points in the upper region of the plot (e.g., Confidence > 0.75), suggesting that high confidence scores are rare in this dataset, regardless of target length.
### Interpretation
This chart suggests that in the context of "high_school_psychology," there is a general tendency for confidence to be higher when dealing with longer targets (e.g., longer texts, questions, or tasks). However, this relationship is weak and heavily influenced by other factors, as evidenced by the wide dispersion of data points.
The marginal plots reveal that the dataset is dominated by short targets, and confidence scores are generally moderate, rarely reaching high levels. The positive trend line, while statistically present, may not be practically reliable for prediction due to the high variance. An investigator might conclude that while target length is a factor, it is not a primary or sole determinant of confidence in this domain. The analysis prompts further questions: What other variables (e.g., topic difficulty, student preparation) might explain the large spread in confidence for similar target lengths? Why are high-confidence outcomes so scarce?
</details>
|
<details>
<summary>x45.png Details</summary>

### Visual Description
## Scatter Plot with Regression: High School Statistics Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with an overlaid linear regression line and marginal distribution plots. It explores the relationship between "Target Length" and "Confidence" within a context labeled "high_school_statistics". The plot suggests a positive correlation between the two variables.
### Components/Axes
* **Title:** `high_school_statistics` (centered at the top).
* **X-Axis:**
* **Label:** `Target Length` (centered below the axis).
* **Scale:** Linear scale ranging from 0 to approximately 250.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** `Confidence` (rotated 90 degrees, centered to the left of the axis).
* **Scale:** Linear scale ranging from approximately 0.25 to 0.75.
* **Major Tick Marks:** 0.25, 0.50, 0.75.
* **Data Series:**
* **Scatter Points:** Numerous purple circular markers representing individual data points.
* **Regression Line:** A solid, darker purple line showing the best linear fit through the data.
* **Confidence Interval:** A semi-transparent purple shaded band surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Distributions:**
* **Top Marginal Plot:** A horizontal density plot or histogram showing the distribution of the `Target Length` variable. It is positioned directly above the main plot area.
* **Right Marginal Plot:** A vertical density plot or histogram showing the distribution of the `Confidence` variable. It is positioned directly to the right of the main plot area.
### Detailed Analysis
* **Data Distribution & Trend:**
* The scatter points are densely clustered in the lower-left quadrant of the plot, specifically where `Target Length` is between 0-100 and `Confidence` is between 0.50-0.75.
* The data becomes sparser as `Target Length` increases beyond 100.
* The regression line has a clear **positive slope**, rising from left to right. This indicates a positive correlation: as `Target Length` increases, `Confidence` tends to increase.
* The shaded confidence interval around the regression line is narrower in the region with dense data (low Target Length) and widens slightly as Target Length increases, reflecting greater uncertainty where data is sparse.
* **Marginal Distributions:**
* The **top marginal plot** shows a right-skewed distribution for `Target Length`. The highest density is near 0, with a long tail extending towards 250.
* The **right marginal plot** shows a roughly symmetric, unimodal distribution for `Confidence`, centered around 0.60-0.65.
### Key Observations
1. **Positive Correlation:** The primary observation is the positive linear relationship between Target Length and Confidence.
2. **Data Concentration:** The majority of observations have a short Target Length (<100) and a moderate to high Confidence (>0.50).
3. **Outliers/Sparse Data:** There are relatively few data points with a Target Length greater than 150, making the trend in that region less certain.
4. **Variable Ranges:** Confidence values are bounded between approximately 0.25 and 0.80, while Target Length spans a wider relative range from 0 to over 200.
### Interpretation
This chart suggests that in the context of "high school statistics," tasks or items with a longer "Target Length" (which could refer to the length of a problem, answer, or study material) are associated with higher reported or measured "Confidence." The positive slope of the regression line quantifies this relationship.
The concentration of data at lower Target Lengths implies that most items in this dataset are relatively short. The widening confidence interval for longer targets indicates that predictions become less reliable for these less common cases. The marginal distributions confirm that while Confidence is normally distributed around a moderate value, Target Length is heavily skewed towards shorter items.
Without additional context, this correlation could be interpreted in several ways: longer targets might provide more information, leading to higher confidence; they might be associated with more advanced topics where confidence is naturally higher; or there could be a methodological bias where confidence is overestimated for longer tasks. The chart establishes a relationship but does not reveal causation.
</details>
|
<details>
<summary>x46.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: High School US History Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal density plots, titled "high_school_us_history". It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset likely related to educational performance or assessment in US History. The plot includes a fitted regression line with a confidence interval and shows the distribution of each variable independently along the top and right margins.
### Components/Axes
* **Title:** "high_school_us_history" (located at the top center).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to over 200.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.5, 1.0.
* **Data Series:**
* **Scatter Points:** Numerous purple circular markers representing individual data points.
* **Trend Line:** A solid purple line showing the best-fit linear relationship.
* **Confidence Band:** A semi-transparent purple shaded area around the trend line, representing the uncertainty of the fit.
* **Marginal Plots:**
* **Top (for X-axis):** A density plot (smoothed histogram) showing the distribution of "Target Length". It is right-skewed, with a peak between 0 and 100.
* **Right (for Y-axis):** A density plot showing the distribution of "Confidence". It is left-skewed, with a peak between 0.5 and 1.0.
* **Legend/Color:** All data elements (points, line, bands, marginal plots) use the same purple color scheme. There is no separate legend box; the color is consistent for the single data series shown.
### Detailed Analysis
* **Data Distribution & Density:**
* The highest concentration of data points is in the region where **Target Length is between 0 and 100** and **Confidence is between 0.5 and 1.0**.
* The marginal density plot for **Target Length** confirms this, showing a sharp peak near the lower end (approx. 25-75) and a long tail extending to the right (up to ~250).
* The marginal density plot for **Confidence** shows a broad peak centered around 0.7-0.8, with a steep drop-off towards 1.0 and a more gradual decline towards 0.0.
* **Trend & Correlation:**
* The purple **trend line** exhibits a **slight downward slope** from left to right.
* It starts at a Confidence value of approximately **0.75** when Target Length is 0 and decreases to approximately **0.65** when Target Length is 200.
* The **confidence band** is narrowest near the center of the data mass (Target Length ~50-100) and widens at the extremes, especially for Target Length > 150, indicating greater uncertainty in the trend where data is sparse.
* **Outliers & Spread:**
* There are several data points with **high Target Length (>150)** but **low Confidence (<0.5)**, which pull the trend line downward.
* Conversely, there are points with **low Target Length (<50)** and **very high Confidence (>0.9)**.
* The overall spread of Confidence values is wide for any given Target Length, particularly in the 0-100 range, indicating high variability.
### Key Observations
1. **Weak Negative Correlation:** The primary visual trend is a weak negative relationship between Target Length and Confidence. As the target length increases, confidence tends to decrease slightly.
2. **Data Concentration:** The vast majority of observations involve relatively short target lengths (under 100) and moderate-to-high confidence (above 0.5).
3. **Asymmetric Distributions:** The two variables have opposite skewness. Target Length is right-skewed (most values are low), while Confidence is left-skewed (most values are high).
4. **Increased Uncertainty at Extremes:** The widening confidence band at high Target Length values suggests the model is less certain about the trend in that region due to fewer data points.
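The opposite skewness noted in observation 3 can be quantified with the moment coefficient of skewness; a sketch with hypothetical stand-in values:

```python
import math

def sample_skewness(data):
    """Sample skewness: positive for right-skewed data, negative for left-skewed."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum(((x - m) / s) ** 3 for x in data) / n

# Hypothetical stand-ins: lengths pile up low with a long right tail,
# confidences pile up high with a left tail, as in the two marginals.
lengths = [20, 30, 35, 40, 50, 55, 60, 80, 150, 220]
confidences = [0.2, 0.5, 0.6, 0.7, 0.75, 0.8, 0.8, 0.85, 0.9, 0.95]

print(sample_skewness(lengths))       # positive (right-skewed)
print(sample_skewness(confidences))   # negative (left-skewed)
```

Only the sign matters for the observation above; magnitudes depend on the (invented) sample.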
### Interpretation
This chart likely analyzes student performance or self-assessment data in a US History context. "Target Length" could refer to the length of an essay, a set of questions, or a study assignment. "Confidence" likely measures a student's self-reported confidence or a model's confidence score in predicting an outcome (like a correct answer).
The data suggests that **students (or the model) are most confident when dealing with shorter tasks**. The weak negative trend implies that as tasks become longer or more extensive, confidence diminishes. This could be due to increased cognitive load, fatigue, or the perception that longer tasks are more complex and challenging.
The high density of points in the low-length, high-confidence quadrant indicates that the typical experience in this dataset involves manageable task lengths associated with good confidence. The outliers with high length and low confidence are critical; they represent cases where extensive work correlates with low confidence, potentially flagging difficult topics, student struggles, or assessment design issues. The marginal distributions reinforce that short tasks and high confidence are the norm, making the negative trend, while weak, a notable deviation from the central pattern.
</details>
|
|
<details>
<summary>x47.png Details</summary>

### Visual Description
## Scatter Plot with Trend Line: High School World History Confidence vs. Target Length
### Overview
The image is a scatter plot chart titled "high_school_world_history". It visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for two data series: "Train" and "Test". A trend line with a shaded confidence interval is overlaid on the data points.
### Components/Axes
* **Title:** `high_school_world_history` (positioned at the top center).
* **Y-Axis:**
* Label: `Confidence`
* Scale: Linear, ranging from 0 to 1.0.
* Major Tick Marks: 0, 0.5, 1.0.
* **X-Axis:**
* Label: `Target Length`
* Scale: Linear, ranging from 0 to approximately 150.
* Major Tick Marks: 0, 100.
* **Legend:** Positioned in the top-right corner of the plot area.
* `Train`: Represented by purple circles (●).
* `Test`: Represented by light purple squares (■).
* **Data Series:**
* **Train (Purple Circles):** A dense cloud of points scattered across the plot.
* **Test (Light Purple Squares):** A smaller set of points, primarily clustered in the central region of the plot.
* **Trend Line:** A solid, dark purple line running through the data, showing a slight downward slope from left to right.
* **Confidence Interval:** A semi-transparent, light purple shaded area surrounding the trend line, indicating the uncertainty or variance of the trend.
### Detailed Analysis
* **Data Distribution:**
* The **Train** data points (purple circles) are widely dispersed. They span nearly the full range of Confidence (from near 0 to 1.0) and a broad range of Target Length (from near 0 to ~150). The density appears highest in the central region (Target Length ~20-80, Confidence ~0.3-0.8).
* The **Test** data points (light purple squares) are fewer in number and more concentrated. They are primarily located within a Target Length range of approximately 40 to 120 and a Confidence range of approximately 0.4 to 0.8.
* **Trend Line Analysis:**
* The trend line exhibits a **gentle negative slope**. It starts at a Confidence value of approximately 0.65 when Target Length is 0 and decreases to approximately 0.55 when Target Length is 150.
* The shaded confidence interval is relatively narrow, suggesting the modeled trend has moderate certainty, though the underlying data points show high variance.
### Key Observations
1. **High Variance in Training Data:** The Train series shows extremely high variance in Confidence scores for any given Target Length, indicating a weak direct correlation between these two variables in the training set.
2. **Test Data Clustering:** The Test data is not uniformly distributed but forms a loose cluster in the middle of the plot, suggesting the test examples may have been selected from a specific subset of the problem space (e.g., medium-length answers).
3. **Weak Negative Trend:** Despite the high scatter, the overall modeled trend suggests a slight decrease in Confidence as Target Length increases.
4. **Overlap Region:** The majority of the Test data points fall within the dense central region of the Train data distribution, indicating the test set is representative of the core training data.
### Interpretation
This chart likely evaluates the performance of a model (e.g., a question-answering or grading model) on a "high school world history" task. "Target Length" probably refers to the length of a student's answer or a reference answer, while "Confidence" is the model's confidence score in its prediction or assessment.
The data suggests that the model's confidence is not strongly determined by answer length alone, given the massive scatter. The slight negative trend could imply the model becomes marginally less confident when evaluating very long answers, possibly due to increased complexity or noise. The clustering of test data highlights a potential limitation: the model's performance is primarily validated on medium-length answers, and its behavior on very short or very long answers (where training data is also sparse) is less certain. The high variance in the training data underscores that other factors beyond length, such as answer content, specificity, or quality, are likely the primary drivers of the model's confidence score.
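The claim that length alone is a weak driver of confidence can be made precise with the coefficient of determination: for a univariate OLS fit, R² equals the squared Pearson correlation, and a small R² means length explains little of the confidence variance. A toy sketch (values invented):

```python
# Hypothetical noisy (Target Length, Confidence) pairs; not from the figure.
xs = [10, 30, 50, 70, 90, 110, 130, 150]
ys = [0.9, 0.3, 0.8, 0.4, 0.7, 0.5, 0.2, 0.6]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

# R^2 = r^2 for a single-predictor least-squares fit.
r2 = sxy ** 2 / (sxx * syy)
print(f"R^2 = {r2:.3f}")  # small: length is a poor predictor here
```

An R² this small says that most of the spread in confidence must come from other variables, which is the conclusion the scatter itself suggests.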
</details>
|
<details>
<summary>x48.png Details</summary>

### Visual Description
## Scatter Plot with Trend Line: Human Aging Confidence vs. Target Length
### Overview
The image is a scatter plot titled "human_aging" that visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis). It includes a fitted trend line with a shaded confidence interval and marginal distributions (density plots) along the top and right edges. The plot uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "human_aging" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to 100.
* **Major Ticks:** 0, 50, 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.00 to approximately 0.85 (based on data point placement).
* **Major Ticks:** 0.00, 0.25, 0.50, 0.75.
* **Data Series:**
* **Scatter Points:** Represented by small, semi-transparent purple circles. There is no explicit legend for the points.
* **Trend Line:** A solid, darker purple line running through the data cloud.
* **Confidence Interval:** A lighter purple shaded area surrounding the trend line.
* **Marginal Distributions:**
* **Top (above x-axis):** A density plot showing the distribution of "Target Length" values. It peaks between 0 and 50.
* **Right (beside y-axis):** A density plot showing the distribution of "Confidence" values. It peaks between 0.25 and 0.50.
### Detailed Analysis
* **Data Point Distribution:** The majority of data points are densely clustered in the lower-left quadrant of the plot, specifically where "Target Length" is between 0 and 50 and "Confidence" is between 0.00 and 0.50. The density of points decreases as both "Target Length" and "Confidence" increase.
* **Trend Line & Correlation:** The trend line exhibits a clear **positive slope**, rising from left to right. This indicates a positive correlation between the variables: as "Target Length" increases, "Confidence" tends to increase.
* **Approximate Start Point:** The line begins near (Target Length ā 0, Confidence ā 0.25).
* **Approximate End Point:** The line ends near (Target Length ā 100, Confidence ā 0.50).
* **Confidence Interval:** The shaded confidence interval around the trend line is relatively narrow at low "Target Length" values and **widens significantly** as "Target Length" increases towards 100. This indicates greater uncertainty in the trend estimate for higher target lengths.
* **Outliers:** Several data points exist with high "Confidence" values (>0.75), primarily located at lower "Target Length" values (<50). One notable outlier is near (Target Length ā 10, Confidence ā 0.85).
### Key Observations
1. **Positive Correlation:** The primary visual trend is a positive relationship between Target Length and Confidence.
2. **Heteroscedasticity:** The spread (variance) of Confidence values appears to increase with Target Length, as suggested by the widening confidence interval and the scatter of points.
3. **Data Sparsity:** There are very few data points with a "Target Length" greater than 75, which contributes to the increased uncertainty in the trend line at the high end.
4. **Marginal Insights:** The density plots confirm that most observations have a Target Length under 50 and a Confidence score between 0.25 and 0.50.
### Interpretation
The chart suggests that within the context of "human_aging," tasks or items with a longer "Target Length" are associated with higher model "Confidence." However, this relationship is not perfectly reliable, especially for longer targets, as evidenced by the wide confidence interval and sparse data in that region.
The presence of high-confidence outliers at short target lengths indicates that some short targets are predicted with very high certainty, which may represent specific, easy-to-classify cases. The overall pattern could imply that the model's confidence is somewhat calibrated to task difficulty (where longer length may correlate with more information or context, leading to higher confidence), but the increasing uncertainty suggests the model's performance or the data's nature becomes less predictable for longer targets. The investigation would benefit from more data points at the higher end of the Target Length scale to confirm the trend and reduce uncertainty.
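The widening band described above has a simple algebraic source. A minimal numpy sketch on synthetic data (all values here are assumptions chosen to mimic the described shape, not the paper's data) shows that the pointwise standard error of a fitted regression mean grows with the squared distance from the center of the predictors, which is exactly why the shaded band balloons where Target Length observations are sparse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data mimicking the plot: right-skewed target lengths,
# confidence weakly increasing with length (all values are assumptions).
x = rng.gamma(shape=2.0, scale=15.0, size=200)          # mostly < 50
y = 0.25 + 0.0025 * x + rng.normal(0.0, 0.1, size=200)  # weak positive trend

# OLS fit y = a + b*x
b, a = np.polyfit(x, y, deg=1)
resid = y - (a + b * x)
s2 = resid @ resid / (len(x) - 2)  # residual variance

def band_halfwidth(x0):
    """Pointwise std. error of the fitted mean at x0; this is what the
    shaded band around the trend line visualizes (up to a t multiplier)."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    return np.sqrt(s2 * (1.0 / n + (x0 - x.mean()) ** 2 / sxx))

# The band is narrowest near the data mass and widens in sparse regions.
print(band_halfwidth(float(x.mean())), band_halfwidth(100.0))
```

Because the `(x0 - mean)²` term dominates far from the bulk of the data, the half-width at Target Length 100 is necessarily larger than at the sample mean (around 30 here), matching the visual widening.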
</details>
|
<details>
<summary>x49.png Details</summary>

### Visual Description
## Scatter Plot: Human Sexuality Confidence vs. Target Length
### Overview
The image is a scatter plot with marginal distribution plots (histograms/density curves) on the top and right sides. The chart visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset labeled "human_sexuality". The data points are represented as purple dots, and a linear trend line with a shaded confidence interval is overlaid.
### Components/Axes
* **Title:** `human_sexuality` (centered at the top).
* **Y-Axis:**
* **Label:** `Confidence`
* **Scale:** Linear, ranging from `0.0` to `0.6`. Major tick marks are at `0.0`, `0.2`, `0.4`, and `0.6`.
* **X-Axis:**
* **Label:** `Target Length`
* **Scale:** Linear, ranging from `0` to `100`. Major tick marks are at `0` and `100`.
* **Data Series:**
* **Points:** Numerous purple circular markers representing individual data points.
* **Trend Line:** A solid purple line showing a linear regression fit.
* **Confidence Band:** A semi-transparent purple shaded area around the trend line, representing the uncertainty of the fit.
* **Marginal Distributions:**
* **Top (for Target Length):** A density curve/histogram showing the distribution of the x-axis variable. It is heavily right-skewed, with the highest density near 0.
* **Right (for Confidence):** A density curve/histogram showing the distribution of the y-axis variable. It is also right-skewed, with the highest density between approximately 0.1 and 0.3.
### Detailed Analysis
* **Data Point Distribution:** The majority of data points are densely clustered in the lower-left quadrant of the plot. Specifically:
* **Target Length:** Most points fall between `0` and `~50`, with the highest concentration between `0` and `20`.
* **Confidence:** Most points fall between `0.0` and `0.4`, with the highest concentration between `0.1` and `0.3`.
* **Trend Line:** The purple trend line exhibits a **slight positive slope**. It starts at a Confidence value of approximately `0.2` when Target Length is `0` and rises to a Confidence value of approximately `0.25` when Target Length is `100`.
* **Confidence Interval:** The shaded confidence band is narrowest near the center of the data mass (low Target Length) and widens considerably as Target Length increases, indicating greater uncertainty in the trend for larger target lengths due to sparse data.
* **Outliers:** There are a few notable outliers:
* One point with a very high Confidence value of approximately `0.65` at a Target Length of about `40`.
* Several points with Target Lengths approaching or exceeding `100`, but with relatively low Confidence values (below `0.3`).
### Key Observations
1. **Weak Positive Correlation:** The data suggests a very weak positive relationship between Target Length and Confidence. As the target length increases, confidence shows a slight tendency to increase.
2. **Data Sparsity:** The dataset is heavily skewed towards shorter target lengths. There are very few data points with a Target Length greater than `50`, making any conclusions about that range highly uncertain.
3. **Concentration of Confidence:** The bulk of the confidence scores are low to moderate, clustered below `0.4`. High confidence scores (above `0.5`) are rare.
4. **Distribution Shape:** Both variables have right-skewed distributions, as shown by the marginal plots. This means most observations have low values for both Target Length and Confidence.
### Interpretation
The chart demonstrates that for the "human_sexuality" dataset, the length of a target (e.g., a text passage, a query) has only a marginal and statistically weak influence on the model's confidence in its output. The primary takeaway is that **confidence is generally low to moderate regardless of target length**.
The weak positive trend might suggest that slightly longer contexts provide a tiny bit more information for the model to be confident, but the effect is minimal. The more significant finding is the data sparsity and skew: the evaluation or dataset is dominated by short targets, and the model rarely exhibits high confidence. The widening confidence band for longer targets is a critical visual cue that we cannot reliably infer the relationship in that region. The outlier with very high confidence at a moderate target length is an interesting case that might warrant individual inspection to understand what made that instance different. Overall, the chart indicates that target length is not a strong predictor of confidence for this specific task or model.
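To make "weak positive correlation" concrete, one could compute the Pearson correlation for data shaped like this. The sketch below uses synthetic numbers chosen only to mimic the described distribution (right-skewed lengths, near-flat trend); none of it is the paper's actual data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the plotted data (all values assumed): short,
# right-skewed target lengths and confidence that barely depends on them.
x = rng.gamma(shape=1.5, scale=12.0, size=150)
y = np.clip(0.2 + 0.0005 * x + rng.normal(0.0, 0.08, size=150), 0.0, 1.0)

# Pearson r and the share of confidence variance explained by length
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

With a trend this shallow relative to the scatter, r² stays far below what would make length a useful predictor, which is the quantitative version of the chart's takeaway.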
</details>
|
<details>
<summary>x50.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: International Law Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. It explores the relationship between "Target Length" and "Confidence" within a context labeled "international_law". The overall aesthetic uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "international_law" (centered at the top).
* **Main Plot Axes:**
* **X-axis (Horizontal):** Labeled "Target Length". The axis has major tick marks at 0 and 200. The data points and axis extend slightly beyond 200, suggesting a range from approximately 0 to 250.
* **Y-axis (Vertical):** Labeled "Confidence". The axis has major tick marks at 0.25, 0.50, and 0.75. The visible range is from approximately 0.20 to 0.80.
* **Data Series:**
* **Scatter Points:** Numerous individual data points plotted as small, semi-transparent purple circles. Their density is highest in the lower-left quadrant of the plot.
* **Regression Line:** A solid, darker purple line showing the best-fit linear trend through the data.
* **Confidence Interval:** A semi-transparent, shaded purple band surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Plots:**
* **Top Marginal Plot:** A density plot (smoothed histogram) showing the distribution of the "Target Length" (X-axis) variable. It is positioned directly above the main plot, sharing the same X-axis scale.
* **Right Marginal Plot:** A density plot showing the distribution of the "Confidence" (Y-axis) variable. It is positioned to the right of the main plot, sharing the same Y-axis scale.
* **Legend:** There is no explicit legend box. The color coding is consistent: all elements (points, line, intervals, marginal plots) use shades of purple, implying they belong to the same dataset or analysis.
### Detailed Analysis
* **Data Distribution & Trend:**
* The scatter points show a wide dispersion. There is a high concentration of points where "Target Length" is between approximately 0 and 150, and "Confidence" is between 0.25 and 0.60.
* The regression line exhibits a clear **downward (negative) slope**. It starts at a Confidence value of approximately 0.45 when Target Length is 0 and declines to a Confidence value of approximately 0.30 when Target Length is 250.
* The shaded confidence interval around the regression line is relatively narrow, suggesting the negative trend is statistically discernible despite the scatter.
* **Marginal Distributions:**
* The **Target Length distribution (top)** is right-skewed. The peak density (mode) appears to be at a low Target Length value, approximately between 20 and 50. The tail extends towards higher values.
* The **Confidence distribution (right)** appears roughly unimodal and slightly left-skewed. The peak density is around a Confidence value of 0.40 to 0.45.
### Key Observations
1. **Negative Correlation:** The primary observation is a negative relationship between Target Length and Confidence. As the length of the target increases, the associated confidence tends to decrease.
2. **Data Density:** Most data points are clustered in the region of shorter target lengths and moderate confidence levels.
3. **Variability:** There is significant vertical scatter (variability in Confidence) for any given Target Length, especially at lower lengths. This indicates that while the trend is negative, Target Length alone does not perfectly predict Confidence.
4. **Outliers:** A few data points exist with relatively high Confidence (>0.70) at low-to-moderate Target Lengths. Conversely, several points show low Confidence (<0.25) across a range of lengths.
### Interpretation
The data suggests that within the "international_law" domain, items characterized by longer "Targets" are associated with lower "Confidence." This could imply several investigative hypotheses:
* **Complexity:** Longer targets (e.g., longer legal documents, more complex case descriptions) may be inherently more difficult to analyze or classify, leading to lower model or human confidence.
* **Ambiguity:** Increased length might introduce more ambiguity or nuanced information, making definitive judgments harder.
* **Data Sparsity:** The right-skewed distribution of Target Length means there are fewer examples of very long targets in the dataset. Models often perform with lower confidence on sparse or out-of-distribution data.
The marginal plots reinforce this: the most common cases (short targets) are also where confidence is most variable but centers around a moderate value. The overall pattern is not one of a strong, tight correlation but a general, statistically visible trend amidst considerable noise. This visualization would be crucial for understanding model performance or data characteristics in a legal AI system, highlighting that length is a factor impacting reliability.
</details>
|
<details>
<summary>x51.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Jurisprudence Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (likely kernel density estimates or histograms) on the top and right sides. The chart explores the relationship between "Target Length" and "Confidence" within a context labeled "jurisprudence." The overall aesthetic is minimalist, using a monochromatic purple color scheme against a white background with light gray grid lines.
### Components/Axes
* **Title:** "jurisprudence" is centered at the top of the chart in a sans-serif font.
* **Main Plot (Center):**
* **X-Axis:** Labeled "Target Length." The axis has numerical tick markers at 0, 100, and 200. The scale appears linear.
* **Y-Axis:** Labeled "Confidence." The axis has numerical tick markers at 0.2, 0.4, and 0.6. The scale appears linear.
* **Data Series:** A scatter of purple circular points. A solid purple trend line (likely a linear regression fit) runs through the data, accompanied by a semi-transparent purple shaded region representing the confidence interval for the fit.
* **Marginal Distribution Plots:**
* **Top Marginal Plot:** Positioned directly above the main plot, sharing the "Target Length" x-axis. It displays a density curve (or smoothed histogram) showing the distribution of the "Target Length" variable. The distribution is right-skewed, with a peak between approximately 20 and 80.
* **Right Marginal Plot:** Positioned to the right of the main plot, sharing the "Confidence" y-axis. It displays a density curve showing the distribution of the "Confidence" variable. The distribution is roughly unimodal, centered around 0.3-0.4.
* **Legend/Color Key:** There is no separate legend box. The color purple is used consistently for all data elements (points, trend line, marginal plots), indicating they belong to the same dataset or category.
### Detailed Analysis
* **Data Point Distribution:** The scatter points are most densely clustered in the lower-left quadrant of the plot, where "Target Length" is between 0 and 100 and "Confidence" is between 0.2 and 0.5. As "Target Length" increases beyond 100, the points become more sparse and show greater vertical spread (variance in "Confidence").
* **Trend Line:** The solid purple trend line has a positive slope. It originates at approximately (Target Length ā 0, Confidence ā 0.25) and terminates at approximately (Target Length ā 200, Confidence ā 0.5). This indicates a positive correlation between the two variables.
* **Confidence Interval:** The semi-transparent shaded area around the trend line is narrower at lower "Target Length" values (where data is dense) and widens significantly as "Target Length" increases, reflecting greater uncertainty in the trend estimate where data is sparse.
* **Marginal Distributions:**
* The **Target Length** distribution (top) shows that the majority of data points have a length less than 100, with a long tail extending towards 200 and beyond.
* The **Confidence** distribution (right) shows that most confidence values fall between 0.2 and 0.5, with the highest density near 0.35.
### Key Observations
1. **Positive Correlation:** There is a clear, visually apparent positive linear relationship between Target Length and Confidence.
2. **Heteroscedasticity:** The variance (spread) of Confidence values appears to increase as Target Length increases. The relationship is tighter for shorter targets.
3. **Data Sparsity:** The dataset contains far more observations with short target lengths (<100) than with long target lengths (>150).
4. **Outliers:** A few data points exist with relatively high Confidence (>0.6) at moderate Target Lengths (50-100). Conversely, some points with very long Target Lengths (>200) have Confidence values near the lower end of the scale (~0.2).
### Interpretation
The chart suggests that in the domain of "jurisprudence," there is an association where longer "targets" (which could refer to legal documents, case files, or textual arguments) are correlated with higher model or analyst "confidence." However, this relationship is not deterministic.
The **positive trend** implies that length may be a proxy for complexity, thoroughness, or precedent, which in turn allows for greater confidence in assessment or classification. The **increasing variance** (heteroscedasticity) is critical: while confidence tends to be higher on average for long targets, it also becomes much less predictable. A long target could yield very high or surprisingly low confidence.
The **marginal distributions** contextualize this: most analyzed items are of moderate length and yield moderate confidence. The findings are most reliable for targets under 100 units in length. The sparsity of data for very long targets means the trend line's projection in that region is an extrapolation with high uncertainty, as visually confirmed by the wide confidence interval shading.
**In essence, the data suggests a general tendency for length to accompany confidence, with the important caveat that for exceptional cases (very long targets), confidence becomes highly variable and less assured.**
</details>
|
<details>
<summary>x52.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: logical_fallacies
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms and density curves) on the top and right sides. The chart explores the relationship between two variables: "Target Length" and "Confidence." A linear regression line with a shaded confidence interval is overlaid on the scatter plot.
### Components/Axes
* **Main Chart Area:**
* **X-Axis:** Labeled "Target Length". The scale runs from 0 to 200, with major tick marks at 0, 100, and 200.
* **Y-Axis:** Labeled "Confidence". The scale runs from 0.00 to 0.75, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **Data Series:** Individual data points are represented as purple circles.
* **Trend Line:** A solid purple line representing a linear regression fit.
* **Confidence Band:** A semi-transparent purple shaded area surrounding the trend line, indicating the confidence interval for the regression.
* **Marginal Plots:**
* **Top Marginal Plot (above X-axis):** A histogram and density curve showing the distribution of the "Target Length" variable.
* **Right Marginal Plot (right of Y-axis):** A histogram and density curve showing the distribution of the "Confidence" variable.
* **Legend:** Located in the top-left corner of the main chart area. It contains a single entry: a purple circle labeled "Data".
* **Title:** The text "logical_fallacies" is centered at the very top of the image.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains approximately 80-100 data points. The points are most densely clustered in the region where "Target Length" is between 0 and 100 and "Confidence" is between 0.25 and 0.60.
* **Trend Line Analysis:** The purple regression line shows a clear, gentle upward slope from left to right. It originates at approximately (Target Length: 0, Confidence: ~0.35) and terminates at approximately (Target Length: 200, Confidence: ~0.55). This indicates a positive correlation between the two variables.
* **Confidence Interval:** The shaded confidence band is narrowest in the center of the data range (around Target Length 50-150) and widens noticeably at the extremes, particularly for Target Length values greater than 150, indicating greater uncertainty in the trend estimate where data is sparse.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The highest frequency (tallest bar) is in the first bin (0-20). The density curve peaks sharply near 0 and tails off gradually towards 200.
* **Confidence (Right):** The distribution appears roughly unimodal and slightly left-skewed. The peak of the density curve is near a Confidence value of 0.45. The data spans from near 0.00 to just above 0.75.
### Key Observations
1. **Positive Correlation:** There is a visible, albeit modest, positive linear relationship. As "Target Length" increases, "Confidence" tends to increase as well.
2. **Data Sparsity at High Values:** There are very few data points with a "Target Length" greater than 150, which contributes to the widening confidence interval in that region.
3. **Outliers:** Several data points exist with high "Confidence" (>0.65) across various "Target Length" values. A few points also have very low "Confidence" (<0.10).
4. **Variable Distributions:** The two variables have distinctly different distributions. "Target Length" is heavily concentrated near zero, while "Confidence" is more centrally distributed.
### Interpretation
The chart suggests that within the dataset labeled "logical_fallacies," there is a weak to moderate positive association between the length of a target (presumably a text or argument) and a confidence metric. This could imply that longer targets are associated with slightly higher confidence scores, or that the system/model being evaluated exhibits this bias.
However, the relationship is not strong, as evidenced by the wide scatter of points around the trend line. The significant number of points with high confidence at low target lengths indicates that short targets can also yield high confidence, which may be an important finding or potential anomaly depending on the context. The right-skew of "Target Length" is a critical characteristic of the dataset, meaning most samples are short, and conclusions about very long targets are based on limited data. The marginal plots are essential for understanding that the apparent trend is derived from a dataset where one variable ("Target Length") is not uniformly distributed.
</details>
|
<details>
<summary>x53.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Machine Learning Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms (or density plots), titled "machine_learning". It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset or model evaluation. The plot includes a fitted trend line and marginal distributions showing the univariate spread of each variable.
### Components/Axes
* **Title:** "machine_learning" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 120. Major tick marks are visible at 0, 50, and 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from approximately 0.25 to 0.75. Major tick marks are visible at 0.25, 0.50, and 0.75.
* **Legend:** Located in the top-left corner of the main plot area. It consists of a small purple square followed by the text "machine_learning". This indicates all data points belong to this single series.
* **Data Series:** Represented by purple circular markers scattered across the plot.
* **Trend Line:** A solid, horizontal purple line running across the plot at a Confidence value of approximately 0.45.
* **Marginal Distributions:**
* **Top (for X-axis):** A histogram/density plot showing the distribution of "Target Length". It is heavily right-skewed, with the highest frequency (peak) in the lowest bin (approximately 0-20).
* **Right (for Y-axis):** A histogram/density plot showing the distribution of "Confidence". It appears roughly unimodal, with a peak around 0.4-0.6.
### Detailed Analysis
* **Data Point Distribution:** The majority of data points are clustered in the region where Target Length is between 0 and 50, and Confidence is between 0.25 and 0.75. The density of points decreases as Target Length increases beyond 50.
* **Trend Line Analysis:** The fitted trend line is essentially horizontal, indicating a **near-zero slope**: the model's average confidence does not change as the target length increases.
* **Marginal Histogram Details:**
* **Target Length (Top):** The distribution is highly concentrated near zero. The tallest bar is in the first bin (0-~10). Frequency drops sharply for lengths greater than ~20, with a long, thin tail extending to ~120.
* **Confidence (Right):** The distribution is centered around 0.45-0.50. It has a relatively broad spread, with significant density from ~0.30 to ~0.65, tapering off towards the extremes of 0.25 and 0.75.
### Key Observations
1. **No Correlation:** The primary observation is the lack of any visible linear relationship between Target Length and Confidence; the flat trend line reflects this absence of association.
2. **Data Skew:** The dataset is heavily skewed towards short target lengths. Very few examples have a target length greater than 50.
3. **Confidence Spread:** Despite the lack of trend, there is substantial variance in confidence scores for any given short target length, ranging from ~0.25 to ~0.75.
4. **Potential Outliers:** A few data points exist at higher target lengths (e.g., near 100) but maintain confidence values within the main cluster's range (around 0.4-0.5). These are not outliers in the y-dimension but are sparse in the x-dimension.
### Interpretation
This plot suggests that for the machine learning model or task being evaluated, the **length of the target output has no systematic influence on the model's confidence** in its predictions. The model is neither more nor less confident when generating longer sequences compared to shorter ones.
The heavy skew in target length indicates the evaluation dataset is dominated by short targets. This could mean the model is primarily tested on brief outputs, or the task inherently involves short targets. The wide spread of confidence scores for short targets implies that factors other than length (e.g., input complexity, ambiguity, or model uncertainty) are driving the confidence metric.
The horizontal trend is a significant finding. In many sequence generation tasks (like translation or summarization), one might expect confidence to decrease with length due to error propagation. The absence of this trend here could indicate a robust model, a specific task characteristic, or that the confidence metric is not sensitive to sequence length in this context. The marginal distributions provide crucial context, warning that conclusions about longer targets are based on very sparse data.
</details>
|
<details>
<summary>x54.png Details</summary>

### Visual Description
## Scatter Plot with Regression: Management Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with an overlaid linear regression line and marginal density plots. It examines the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) under the title "management." The plot suggests a positive correlation between the two variables.
### Components/Axes
* **Title:** "management" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear. Visible tick marks at `0` and `100`. The data range appears to extend from approximately 0 to 150.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear. Visible tick marks at `0`, `0.2`, `0.4`, and `0.6`.
* **Data Series:**
* **Scatter Points:** Numerous purple circular markers representing individual data points.
* **Regression Line:** A solid, darker purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent purple shaded region surrounding the regression line, indicating the uncertainty of the fit.
* **Marginal Distributions:**
* **Top (X-axis distribution):** A density plot (filled area) showing the distribution of the "Target Length" variable. It is heavily right-skewed, with a high peak near 0 and a long tail extending to the right.
* **Right (Y-axis distribution):** A density plot showing the distribution of the "Confidence" variable. It appears roughly unimodal, with a peak between 0.2 and 0.3, and a tail extending towards higher values.
### Detailed Analysis
* **Data Point Distribution:** The majority of data points are densely clustered in the lower-left quadrant of the plot, corresponding to **Target Length values between 0 and ~50** and **Confidence values between 0 and ~0.4**. There is a high density of points near the origin (0,0).
* **Trend Direction:** The regression line has a clear **positive slope**, rising from left to right. This indicates that as "Target Length" increases, "Confidence" tends to increase as well.
* **Regression Line & Uncertainty:** The line originates at a Confidence value slightly above 0 when Target Length is 0. It passes through approximately (100, 0.35). The shaded confidence interval is narrow near the center of the data mass (low Target Length) and **widens significantly as Target Length increases**, indicating greater uncertainty in the trend prediction for higher Target Length values where data is sparse.
* **Outliers:** Several data points exist as outliers with **Target Length > 100** and **Confidence > 0.4**, with one point reaching near the maximum Confidence of ~0.6. These points pull the regression line upward.
* **Marginal Plot Details:**
* The **top density plot** confirms the visual clustering: the highest density of "Target Length" values is near 0, with frequency dropping off sharply as length increases.
* The **right density plot** shows the most common "Confidence" values are centered around 0.25, with a secondary, smaller concentration near 0.5.
### Key Observations
1. **Positive Correlation:** There is a visible, positive linear relationship between Target Length and Confidence.
2. **Heteroscedasticity:** The spread of Confidence values appears to increase with Target Length (the "fan shape"), which is also reflected in the widening confidence interval of the regression line.
3. **Data Sparsity:** The relationship is primarily defined by a dense cluster of points at low Target Length. The trend for higher Target Lengths is inferred from a much smaller number of data points.
4. **Non-Normal Distributions:** Both variables show skewed distributions, as evidenced by the marginal density plots.
### Interpretation
The chart suggests that in the context of "management," longer targets are associated with higher confidence levels. However, this interpretation requires caution due to the data structure.
* **Causality vs. Correlation:** The plot shows correlation, not causation. It's unclear if longer targets *cause* higher confidence, if higher confidence leads to setting longer targets, or if a third factor influences both.
* **Reliability of Trend:** The positive trend is most reliable for Target Lengths under 50, where data is abundant. The extrapolation to longer targets (100+) is based on sparse data and carries high uncertainty, as shown by the wide confidence interval.
* **Underlying Pattern:** The dense cluster near the origin suggests that most management scenarios involve relatively short targets and low-to-moderate confidence. The outliers with high length and high confidence may represent special cases, ambitious projects, or potentially different underlying conditions.
* **Investigative Insight:** A key question arising from this visualization is: *What characterizes the outlier cases with both high target length and high confidence?* Analyzing these specific instances could reveal factors that successfully couple ambitious goals with assuredness. The heteroscedasticity (increasing variance) also suggests that as targets become longer, other unmeasured factors play a larger role in determining confidence.
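The fan-shaped spread (heteroscedasticity) noted above can be checked numerically by comparing the dispersion of Confidence across Target Length bins. A minimal sketch, assuming NumPy and synthetic data standing in for the plotted points (the figure's actual values are not recoverable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: confidence noise grows with target length (fan shape).
length = rng.uniform(0, 150, size=500)
confidence = 0.2 + 0.001 * length + rng.normal(0, 0.02 + 0.002 * length)

# Compare the spread of confidence in a short-length bin vs. a long-length bin.
short_spread = confidence[length < 50].std()
long_spread = confidence[length > 100].std()

print(short_spread < long_spread)  # spread grows with length: heteroscedastic
```

A larger spread in the long-length bin is the numerical counterpart of the widening fan visible in the plot.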
</details>
<details>
<summary>x55.png Details</summary>

### Visual Description
## Scatter Plot with Regression: Marketing Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with an overlaid linear regression line and its confidence interval. It also includes marginal distribution plots (density curves) for both variables. The chart is titled "marketing" and explores the relationship between "Target Length" and "Confidence."
### Components/Axes
* **Title:** "marketing" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to 200. Major tick marks are visible at 0, 100, and 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.0 to 0.6. Major tick marks are visible at 0.0, 0.2, 0.4, and 0.6.
* **Data Series:**
* **Scatter Points:** Numerous purple dots representing individual data points.
* **Regression Line:** A solid purple line showing the best-fit linear trend.
* **Confidence Interval:** A semi-transparent purple shaded area surrounding the regression line, representing the uncertainty of the fit.
* **Marginal Plots:**
* **Top (X-axis distribution):** A purple density curve showing the distribution of "Target Length." It is heavily right-skewed, with a high peak near 0 and a long tail extending to the right.
* **Right (Y-axis distribution):** A purple density curve showing the distribution of "Confidence." It is also right-skewed, with the highest density between 0.1 and 0.3.
### Detailed Analysis
* **Data Distribution & Density:**
* The vast majority of data points are clustered in the lower-left quadrant of the plot, where "Target Length" is between 0 and 50 and "Confidence" is between 0.0 and 0.4.
* The density of points decreases significantly as "Target Length" increases beyond 50.
* There are a few scattered points with higher "Target Length" (up to ~200) and varying "Confidence" levels.
* **Regression Trend:**
* The solid purple regression line shows a **positive slope**. It starts at approximately y=0.2 when x=0 and rises to approximately y=0.3 when x=200.
* The shaded confidence interval is narrow at low "Target Length" values (where data is dense) and widens considerably as "Target Length" increases (where data is sparse), indicating greater uncertainty in the trend for longer targets.
* **Marginal Distributions:**
* The top density plot confirms the strong right skew of "Target Length," with a mode near 0.
* The right density plot confirms the right skew of "Confidence," with a mode around 0.2.
### Key Observations
1. **Strong Clustering:** The data is not uniformly distributed. There is a dense cluster of observations for short target lengths (0-50) with low to moderate confidence (0.0-0.4).
2. **Weak Positive Correlation:** The regression line suggests a slight positive relationship: as target length increases, confidence tends to increase marginally. However, the relationship appears weak given the high scatter of points.
3. **Heteroscedasticity:** The spread (variance) of the "Confidence" values appears to increase with "Target Length," as indicated by the widening confidence interval of the regression line.
4. **Data Sparsity:** There are relatively few data points for target lengths greater than 100, making any trend in that region less reliable.
### Interpretation
This chart suggests that within the "marketing" context analyzed, most measured items have short target lengths. There is a tentative, weak positive association between the length of a target (e.g., perhaps a piece of content, a campaign, or a keyword) and the confidence associated with it (e.g., confidence in its performance, targeting accuracy, or success prediction).
The key insight is not a strong predictive relationship, but rather a **descriptive pattern of the data landscape**: the domain is dominated by short targets, and for these, confidence is typically low to moderate. The few long targets show more variable confidence levels. The widening confidence interval cautions against over-interpreting the positive trend for longer targets due to insufficient data. An analyst might conclude that efforts to increase confidence should focus on the common short-target segment, or investigate why data for longer targets is so sparse.
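The widening band around the fit follows from the standard error of the fitted mean in simple linear regression, $\mathrm{se}(x_0) = s\sqrt{1/n + (x_0 - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}$, which grows as $x_0$ moves away from the data mass. A minimal sketch with synthetic right-skewed lengths (assumed, since the plotted values are unknown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Right-skewed lengths: most mass near 0, sparse tail (as in the marginal plot).
x = rng.exponential(scale=30, size=300)
y = 0.2 + 0.0005 * x + rng.normal(0, 0.05, size=300)

# Ordinary least squares fit.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
s = np.sqrt(resid @ resid / (len(x) - 2))  # residual standard error

def se_mean(x0):
    """Standard error of the fitted mean at x0; widens far from the data mass."""
    return s * np.sqrt(1 / len(x) + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())

print(se_mean(20) < se_mean(200))  # band is wider where data is sparse
```

This is why the interval is narrow near the dense cluster of short targets and flares out beyond Target Length 100.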
</details>
<details>
<summary>x56.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Medical Genetics Confidence vs. Target Length
### Overview
The image is a scatter plot titled "medical_genetics" that visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis). The plot includes a fitted trend line with a confidence interval and marginal distribution plots (histograms/density plots) on the top and right edges. The primary data is represented by purple points.
### Components/Axes
* **Title:** "medical_genetics" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 120.
* **Major Tick Marks:** 0, 50, 100.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.0 to approximately 0.9.
* **Major Tick Marks:** 0.25, 0.50, 0.75.
* **Legend:** Located in the top-left corner of the main plot area. It contains a single entry: a purple circle symbol followed by the text "medical_genetics". This identifies the data series.
* **Main Plot Area:** Contains a scatter of purple data points and a solid purple trend line with a semi-transparent purple shaded region representing the confidence interval.
* **Marginal Plots:**
* **Top Marginal Plot:** A density plot (or smoothed histogram) showing the distribution of the "Target Length" variable. It is positioned above the main x-axis.
* **Right Marginal Plot:** A density plot showing the distribution of the "Confidence" variable. It is positioned to the right of the main y-axis.
### Detailed Analysis
* **Data Series (medical_genetics):**
* **Visual Trend:** The data points show a weak to moderate positive correlation. As "Target Length" increases, "Confidence" tends to increase slightly. The fitted trend line has a clear, gentle upward slope from left to right.
* **Data Point Distribution:**
* The points are densely clustered in the lower-left quadrant, specifically where "Target Length" is between 0-50 and "Confidence" is between 0.15-0.50.
* There is a significant spread of "Confidence" values for any given "Target Length," indicating high variance.
* Several outlier points exist with high "Confidence" (>0.75) at various "Target Lengths," and a few points with very low "Confidence" (<0.15).
* **Trend Line & Confidence Interval:** The solid purple regression line starts at approximately (0, 0.25) and ends near (120, 0.40). The shaded confidence interval band widens slightly as "Target Length" increases, suggesting greater uncertainty in the trend estimate for larger target lengths.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The highest density is for shorter target lengths (0-30), with a long tail extending towards 120.
* **Confidence (Right):** The distribution is roughly unimodal and slightly left-skewed. The peak density is around a confidence value of 0.30-0.35, with a tail extending towards higher confidence values.
### Key Observations
1. **Positive Correlation:** There is a discernible, albeit noisy, positive relationship between the length of a genetic target and the confidence metric.
2. **High Variance:** Confidence values are highly variable, especially for shorter target lengths, suggesting factors other than length significantly influence confidence.
3. **Right-Skewed Target Length:** The dataset is dominated by genetic targets of shorter length.
4. **Outliers:** A subset of data points achieves high confidence (>0.75) across the range of target lengths, which may represent particularly well-characterized or unambiguous genetic targets.
### Interpretation
This plot from the "medical_genetics" domain suggests that, on average, longer genetic targets are associated with slightly higher confidence scores. This could imply that longer sequences provide more information or context, leading to more reliable analysis or predictions. However, the high scatter indicates that target length is not the primary determinant of confidence. The right-skew in target length distribution means the trend is more heavily informed by shorter targets. The outliers with high confidence at short lengths are particularly interesting, as they may represent ideal, high-confidence markers or regions that defy the general trend. The marginal plots confirm the concentration of data in the low-length, moderate-confidence region, which is the most common scenario in this dataset.
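The right skew of Target Length and the slight skew of Confidence described above can be quantified with the sample skewness (third standardized moment). A sketch on synthetic stand-in distributions, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(a):
    """Sample skewness: third standardized moment (positive => right-skewed)."""
    a = np.asarray(a, dtype=float)
    d = a - a.mean()
    return (d ** 3).mean() / a.std() ** 3

# Synthetic stand-ins for the two marginal distributions described above.
target_length = rng.exponential(scale=25, size=1000)  # strongly right-skewed
confidence = rng.beta(5, 3, size=1000)                # mildly left-skewed

print(skewness(target_length) > 0)  # right skew
print(skewness(confidence) < 0)     # left skew
```

A positive value confirms a long right tail; a negative value, a longer left tail.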
</details>
<details>
<summary>x57.png Details</summary>

### Visual Description
## Scatter Plot with Regression Line: Miscellaneous Confidence vs. Target Length
### Overview
The image is a statistical scatter plot with an overlaid linear regression line and confidence interval. The chart visualizes the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a category labeled "miscellaneous." The plot includes marginal distributions (histograms/density plots) on the top and right edges.
### Components/Axes
* **Title:** "miscellaneous" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 250.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.5, 1.0.
* **Legend:** Located in the top-left corner of the main plot area. It is partially obscured but indicates the data series is for "miscellaneous," represented by purple circular markers.
* **Data Series:** A scatter of semi-transparent purple circular points.
* **Regression Line:** A solid, darker purple line showing the best linear fit through the data.
* **Confidence Band:** A shaded, lighter purple area surrounding the regression line, representing the confidence interval (likely 95%).
* **Marginal Plots:**
* **Top:** A density plot or histogram showing the distribution of the "Target Length" variable. It is heavily right-skewed, with the highest density near 0.
* **Right:** A density plot or histogram showing the distribution of the "Confidence" variable. It appears roughly unimodal, centered around 0.7-0.8.
### Detailed Analysis
* **Data Distribution & Density:**
* The vast majority of data points are clustered in the region where **Target Length is between 0 and 100**.
* There is a high density of points with **Confidence values between 0.5 and 1.0** across the entire range of Target Lengths.
* A smaller number of points are scattered with **Target Length > 100**, extending to just beyond 200.
* There are very few points with **Confidence < 0.5**.
* **Trend Verification (Regression Line):**
* The regression line exhibits a **very slight upward slope** from left to right.
* It starts at a Confidence value of approximately **0.65** when Target Length is 0.
* It ends at a Confidence value of approximately **0.75** when Target Length is 200.
* The **confidence band widens significantly as Target Length increases**, indicating greater uncertainty in the trend estimate for longer target lengths due to sparser data.
* **Marginal Distribution Details:**
* The top marginal plot confirms the extreme right skew of Target Length, with a sharp peak near 0 and a long tail to the right.
* The right marginal plot shows Confidence is most frequently observed in the upper half of the scale (0.5-1.0).
### Key Observations
1. **Strong Right Skew:** The "Target Length" variable is not normally distributed; most instances have a short target length.
2. **Weak Positive Correlation:** The data suggests a very weak positive relationship between Target Length and Confidence. As target length increases, confidence shows a slight tendency to increase.
3. **High Baseline Confidence:** Regardless of target length, confidence scores are predominantly high (>0.5).
4. **Increasing Uncertainty:** The model's estimate of the trend (the regression line) becomes much less certain for target lengths beyond 100, as shown by the flaring confidence band.
5. **Potential Outliers:** A few data points exist with relatively low confidence (<0.3) at various target lengths, but they are not numerous.
### Interpretation
This chart analyzes the performance or output of a system (likely a machine learning model) on a "miscellaneous" category of tasks or data. The key insight is that **the system's confidence in its outputs is generally high, but shows only a marginal, statistically weak improvement as the length of the target (e.g., a generated text, a sequence) increases.**
Read through a Peircean (semiotic) lens:
* **Icon:** The scatter plot itself is an icon of the data distribution.
* **Index:** The upward-sloping regression line is an index pointing to a causal or correlational relationship: that longer targets might be associated with slightly higher confidence.
* **Symbol:** The labels "Confidence" and "Target Length" are symbolic, requiring domain knowledge to interpret. "Confidence" likely symbolizes the model's self-assessed probability of being correct.
The most critical takeaway is the **sparsity of data for longer targets**. The system's behavior and the reliability of the observed trend for Target Length > 100 are highly uncertain. This could indicate that the "miscellaneous" category contains mostly short-form content, or that the system is rarely applied to or tested on long-form targets. The high baseline confidence might suggest the system is well-calibrated for short targets or is inherently confident, but the lack of data at longer lengths is a significant limitation for drawing robust conclusions about its performance scaling.
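The smooth marginal curves in these figures are typically kernel density estimates, $\hat{f}(x) = \frac{1}{nh}\sum_i K\!\big(\frac{x - x_i}{h}\big)$. A minimal Gaussian-kernel KDE in NumPy (a sketch on assumed synthetic data, not the tool that produced the figure):

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.exponential(scale=30, size=400)  # stand-in for target lengths

def gaussian_kde(samples, grid, bandwidth):
    """f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (len(samples) * bandwidth)

grid = np.linspace(0, 200, 201)
density = gaussian_kde(samples, grid, bandwidth=10.0)

# The estimate peaks near 0, mirroring the sharp peak in the top marginal plot.
print(grid[density.argmax()] < 50)
```

The bandwidth controls the smoothing: too small reproduces noise, too large blurs the sharp peak near zero.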
</details>
<details>
<summary>x58.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: moral_disputes
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms or density plots) on the top and right sides. The chart is titled "moral_disputes" and explores the relationship between two variables: "Target Length" and "Confidence." The data points are represented as purple circles, and a horizontal reference line is drawn across the main plot area.
### Components/Axes
* **Chart Title:** `moral_disputes` (centered at the top).
* **Main Plot (Scatter Plot):**
* **X-Axis Label:** `Target Length`
* **X-Axis Scale:** Linear scale. Visible major tick marks and labels at `0` and `100`. The axis extends to approximately 200 based on the data spread.
* **Y-Axis Label:** `Confidence`
* **Y-Axis Scale:** Linear scale. Visible major tick marks and labels at `0.00`, `0.25`, `0.50`, and `0.75`.
* **Data Series:** A single series of data points, all colored a medium purple (approximately hex #9467bd). There is no explicit legend, as only one data series is present.
* **Reference Line:** A solid, thin, horizontal line in a darker shade of purple or gray, positioned at approximately `y = 0.35`.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) aligned with the X-axis ("Target Length"). It shows the frequency/density distribution of the "Target Length" variable.
* **Right Marginal Plot:** A distribution plot aligned with the Y-axis ("Confidence"). It shows the frequency/density distribution of the "Confidence" variable. Both marginal plots are filled with a lighter shade of the same purple used for the scatter points.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains approximately 150-200 data points. The points are densely clustered in the lower-left to middle region of the plot.
* **X-Axis (Target Length) Range:** Data points span from near `0` to approximately `180`. The highest density appears between `20` and `120`.
* **Y-Axis (Confidence) Range:** Data points span from near `0.00` to approximately `0.80`. The highest density appears between `0.20` and `0.60`.
* **Trend Verification:** There is no strong, clear linear correlation (upward or downward slope) visible between "Target Length" and "Confidence." The cloud of points is somewhat amorphous, though a very weak positive trend might be inferred, as points with higher "Target Length" (>100) are less common at the very lowest "Confidence" values (<0.10).
* **Reference Line:** The horizontal line at `y ≈ 0.35` cuts through the central mass of the data cloud. It likely represents a measure of central tendency for the "Confidence" variable, such as the mean or median.
* **Marginal Distributions:**
* **Top (Target Length):** The distribution is right-skewed. It peaks sharply between `Target Length` values of approximately `40-80` and has a long tail extending towards higher values (up to ~180).
* **Right (Confidence):** The distribution appears roughly unimodal and slightly left-skewed. It peaks around a `Confidence` value of `0.3-0.4`, aligning with the horizontal reference line in the main plot.
### Key Observations
1. **Cluster Density:** The highest concentration of data points occurs for `Target Length` between 20-120 and `Confidence` between 0.20-0.60.
2. **Absence of Strong Correlation:** The primary observation is the lack of a definitive relationship between the two variables. Knowing the "Target Length" does not allow for a precise prediction of "Confidence," and vice versa.
3. **Central Tendency:** The horizontal reference line at `~0.35` and the peak of the right marginal plot both suggest that the typical or average "Confidence" level in this dataset is moderately low, around 35%.
4. **Outliers:** A few data points exist with relatively high "Confidence" (>0.70) across various "Target Lengths." Similarly, a few points have very low "Confidence" (<0.10). There are no extreme outliers in "Target Length" beyond ~180.
### Interpretation
This visualization suggests that within the context of "moral_disputes," the length of a target (perhaps a text, argument, or case description) is not a strong predictor of the confidence level associated with it. The data does not support a hypothesis that longer targets systematically lead to higher or lower confidence.
The key finding is the central clustering around a modest confidence level of approximately 35%. This could indicate that, on average, judgments or measurements related to these moral disputes are made with relatively low certainty. The right-skewed distribution of "Target Length" implies that most disputes involve shorter targets, with fewer cases involving very long or complex targets.
The chart effectively uses marginal distributions to provide crucial context that the scatter plot alone would miss: it reveals the underlying shape of each variable's distribution, highlighting the common ranges and the skew in "Target Length." The absence of a clear pattern in the scatter plot is itself a significant piece of information, directing inquiry away from a simple linear relationship between these two specific metrics.
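The "absence of strong correlation" reading can be made precise with the Pearson correlation coefficient, where $|r| < 0.3$ is conventionally read as weak. A sketch on synthetic data mimicking the amorphous cloud (assumed values, since the figure's data are unavailable):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the amorphous cloud: a very weak positive relationship.
length = rng.gamma(shape=3, scale=25, size=300)
confidence = 0.35 + 0.0002 * length + rng.normal(0, 0.15, size=300)

r = np.corrcoef(length, confidence)[0, 1]
print(abs(r) < 0.3)  # weak: length alone does not predict confidence
```

A small $|r|$ is consistent with the visual impression that target length is not a useful predictor of confidence here.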
</details>
<details>
<summary>x59.png Details</summary>

### Visual Description
## Scatter Plot with Violin Distribution: Moral Scenarios Confidence vs. Target Length
### Overview
The image displays a statistical chart combining a scatter plot with violin plots, titled "moral_scenarios." It compares confidence scores between "Human" and "Model" across two distinct "Target Length" categories. The chart suggests an analysis of performance or certainty in moral reasoning tasks of varying lengths.
### Components/Axes
* **Title:** "moral_scenarios" (top center).
* **Y-Axis:** Labeled "Confidence." Scale ranges from 0.2 to 0.6, with major tick marks at 0.2, 0.4, and 0.6.
* **X-Axis:** Labeled "Target Length." Two categorical points are explicitly marked: "15" and "20."
* **Legend:** Positioned on the right side of the chart.
* **Human:** Represented by a dark purple color.
* **Model:** Represented by a lighter, pinkish-purple color.
* **Plot Elements:**
* **Scatter Points:** Individual data points are plotted for each category (Human/Model) at each Target Length (15/20).
* **Violin Plots:** Semi-transparent, shaded distributions are overlaid for each of the four groups (Human@15, Model@15, Human@20, Model@20), showing the probability density of the confidence scores.
### Detailed Analysis
**Data Series & Trends:**
1. **Human Series (Dark Purple):**
* **Trend:** Shows a wide spread of confidence scores at both target lengths, with a slight visual suggestion of a lower average at length 20 compared to 15.
* **Data Points (Approximate):**
* At Target Length 15: Points are scattered from ~0.25 to ~0.60. A cluster exists between 0.35 and 0.50.
* At Target Length 20: Points are scattered from ~0.20 to ~0.55. A cluster exists between 0.30 and 0.45.
* **Distribution (Violin):** The violin shape for Human is wider at the lower confidence end (0.3-0.4) for both lengths, indicating a higher density of responses in that range.
2. **Model Series (Light Purple/Pink):**
* **Trend:** Consistently shows higher confidence scores than the Human series at both target lengths. The scores are more tightly clustered.
* **Data Points (Approximate):**
* At Target Length 15: Points are clustered between ~0.50 and ~0.60.
* At Target Length 20: Points are clustered between ~0.50 and ~0.60, with a very similar range to length 15.
* **Distribution (Violin):** The violin shapes for the Model are narrower and positioned higher on the y-axis, peaking around 0.55-0.60, indicating a concentrated, high-confidence distribution.
### Key Observations
* **Clear Performance Gap:** The Model's confidence scores are systematically higher and less variable than Human scores at both target lengths. There is minimal overlap between the two distributions.
* **Stability of Model Confidence:** The Model's confidence distribution appears nearly identical for Target Length 15 and 20, suggesting its certainty is not significantly affected by this change in length.
* **Human Variability:** Human confidence shows much greater variance, with scores spanning almost the entire plotted y-axis range (0.2 to 0.6).
* **Potential Length Effect on Humans:** There is a subtle visual trend that Human confidence may decrease slightly as Target Length increases from 15 to 20, though the data overlap is significant.
### Interpretation
This chart likely comes from a study comparing human and AI (Model) performance on moral reasoning tasks. "Target Length" could refer to the complexity, word count, or number of steps in a moral scenario.
The data suggests a significant **calibration difference** between humans and the model. The model exhibits high and consistent confidence, which could indicate either superior capability or, more critically, **overconfidence**. Humans display natural uncertainty and variability in moral judgment. The model's flat trend across lengths implies its confidence is insensitive to scenario length, whereas humans might find longer scenarios more challenging or ambiguous, leading to slightly lower confidence.
The key investigative question raised is whether the model's high confidence correlates with accuracy. Without an accuracy metric, this chart alone cannot determine if the model is reliably correct or confidently wrong. It highlights a fundamental behavioral disparity: the model operates with high certainty, while human moral reasoning is inherently more variable and context-dependent.
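The gap in spread between the human and model distributions can be summarized by each group's mean and standard deviation. A sketch with hypothetical values matching the ranges read off the figure (the underlying data are not given):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-ins matching the ranges described above.
human = rng.uniform(0.20, 0.60, size=100)  # wide spread, lower center
model = rng.uniform(0.50, 0.60, size=100)  # tight, high-confidence cluster

print(model.std() < human.std())    # model scores are far less variable
print(model.mean() > human.mean())  # and systematically higher
```

Lower variance with a higher mean is exactly the "concentrated, high-confidence" pattern the violin plots show, and is what raises the overconfidence question posed above.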
</details>
<details>
<summary>x60.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Nutrition Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "nutrition" and explores the relationship between two variables: "Target Length" on the horizontal axis and "Confidence" on the vertical axis. The data is represented by purple points and distributions.
### Components/Axes
* **Chart Title:** "nutrition" (centered at the top).
* **Main Plot Axes:**
* **X-axis (Horizontal):** Labeled "Target Length". The scale runs from 0 to 200, with major tick marks at 0, 100, and 200.
* **Y-axis (Vertical):** Labeled "Confidence". The scale runs from 0.00 to 0.75, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **Legend:** Located in the top-left corner of the main plot area. It contains a single entry: a purple circle symbol followed by the text "nutrition".
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (appears to be a kernel density estimate or smoothed histogram) aligned with the X-axis ("Target Length"). It is positioned above the main scatter plot.
* **Right Marginal Plot:** A distribution plot aligned with the Y-axis ("Confidence"). It is positioned to the right of the main scatter plot.
* **Data Series:** A single data series represented by purple scatter points within the main plot area. A faint, solid horizontal line is visible within the scatter cloud, positioned at approximately y = 0.40.
### Detailed Analysis
* **Data Point Distribution (Scatter Plot):**
* The purple data points are densely clustered in the lower-left quadrant of the plot.
* The highest density of points occurs for **Target Length** values between approximately 0 and 100, and **Confidence** values between 0.25 and 0.75.
* The spread of **Confidence** values appears wider for shorter **Target Lengths** (0-50) and narrows slightly as **Target Length** increases.
* There are a few outlier points with very low confidence (< 0.10) scattered across the target length range.
* The faint horizontal line at **Confidence ≈ 0.40** runs through the central mass of the data cloud, potentially representing a median, mean, or baseline confidence level.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The peak density (mode) appears to be around a Target Length of 50-70. The density tapers off significantly as Target Length approaches 200.
* **Confidence (Right Plot):** The distribution is roughly unimodal and slightly left-skewed. The peak density is centered around a Confidence value of approximately 0.40-0.45, aligning with the horizontal line in the main plot. The distribution shows that confidence values are most commonly between 0.25 and 0.60.
### Key Observations
1. **Inverse Variability Relationship:** Confidence shows high variability at low Target Lengths, suggesting predictions or measurements for shorter targets are less consistent.
2. **Central Tendency:** Both the scatter plot's horizontal line and the peak of the Confidence marginal distribution point to a central confidence value near 0.40.
3. **Data Sparsity:** There is a notable lack of data points for Target Lengths greater than 150, indicating either a sampling bias or that such long targets are rare in this dataset.
4. **Consistent Color Coding:** All visual elements (scatter points, marginal plots, legend marker) use the same shade of purple, ensuring clear association with the "nutrition" data series.
### Interpretation
This chart visualizes the performance or reliability (Confidence) of a model or measurement system related to "nutrition" as a function of the length of the target being analyzed (Target Length).
The data suggests that **confidence is not strongly correlated with target length in a linear fashion**, but the *variability* in confidence is. For short targets (length < 50), confidence can be very high or very low, indicating unstable or context-dependent results. As target length increases beyond 50, confidence values become more tightly clustered around the 0.40-0.50 range, suggesting more consistent, though not necessarily higher, performance.
The right-skewed distribution of Target Length implies the dataset is dominated by shorter nutrition-related targets. The system's average confidence (~0.40) is moderate, and the analysis would benefit from investigating why confidence is highly variable for short targets and what factors besides length might influence the confidence score. The horizontal line serves as a critical reference point, showing that a significant portion of predictions fall below this moderate confidence threshold.
</details>
<details>
<summary>x61.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Philosophy
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "philosophy" and appears to analyze the relationship between "Target Length" and "Confidence" for a dataset related to that domain. The primary data is represented as a cloud of purple points.
### Components/Axes
* **Title:** "philosophy" (centered at the top).
* **Main Plot (Scatter):**
* **X-Axis:** Labeled "Target Length". Major tick marks are visible at 0 and 100. The axis extends slightly beyond 100.
* **Y-Axis:** Labeled "Confidence". Major tick marks are visible at 0.25, 0.50, and 0.75.
* **Data Series:** A single series of data points, all colored a medium purple (approximately hex #9467bd).
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) for the "Target Length" variable. It is positioned directly above the main scatter plot, sharing the same x-axis.
* **Right Marginal Plot:** A distribution plot for the "Confidence" variable. It is positioned to the right of the main scatter plot, sharing the same y-axis. This plot is oriented vertically.
### Detailed Analysis
* **Data Distribution & Trends:**
* The scatter plot shows a dense cluster of data points. The highest concentration appears in the region where "Target Length" is between approximately 10 and 80, and "Confidence" is between 0.30 and 0.60.
* **Trend Verification:** There is no strong, clear linear trend (upward or downward slope) visible in the main cluster. The cloud of points is somewhat amorphous, suggesting a weak or non-linear correlation between "Target Length" and "Confidence" within the central mass of data.
* **Outliers:** Several data points lie outside the main cluster. Notably:
* One point is located at approximately (Target Length ≈ 150, Confidence ≈ 0.75).
* A few points have very low "Confidence" (< 0.25) across various "Target Length" values.
* A few points have "Target Length" > 100, mostly with "Confidence" values between 0.25 and 0.50.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The peak (mode) is at a low "Target Length" value (likely < 50), with a long tail extending to the right towards higher values.
* **Confidence (Right Plot):** The distribution appears roughly unimodal and slightly left-skewed. The peak is around a "Confidence" value of 0.4 to 0.5, with fewer instances of very high or very low confidence.
### Key Observations
1. **Weak Correlation:** The primary observation is the lack of a strong, obvious relationship between the length of a target (e.g., a text, an argument) and the confidence metric in this "philosophy" dataset.
2. **Central Tendency:** Most data points have a "Target Length" under 100 and a "Confidence" score between 0.3 and 0.6.
3. **Skewed Lengths:** The "Target Length" variable is not normally distributed; it is heavily skewed towards shorter lengths.
4. **Concentrated Confidence:** Confidence scores are more centrally clustered than target lengths, with the bulk of the data not reaching the extremes of the scale (0 or 1).
### Interpretation
This chart suggests that within the analyzed philosophical dataset, the length of the target item (e.g., a philosophical proposition, a student's answer, a text excerpt) is not a strong predictor of the associated confidence score. The confidence metric appears to be influenced by other factors not visualized here.
The right-skew in "Target Length" indicates that most items in the dataset are relatively short, with a few very long items acting as outliers. The concentration of "Confidence" around moderate values suggests that achieving very high confidence scores is uncommon, with most assessments landing in a mid range.
The outlier at (≈150, ≈0.75) is particularly interesting. It represents a case where a very long target still received a high confidence score, contradicting any potential hypothesis that length might negatively impact confidence. Investigating such outliers could provide insights into what characteristics, besides length, lead to high-confidence assessments in philosophy.
**Language:** All text in the image is in English.
</details>
|
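A layout like the one described above (a central scatter plot with aligned marginal histograms) can be sketched with matplotlib's gridspec. The snippet below is a minimal, hypothetical reproduction using synthetic stand-in data, since the per-subject values behind these figures are not given; the purple `#9467bd` matches the hue noted in the description.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: right-skewed lengths, mid-range confidences.
target_length = rng.gamma(shape=2.0, scale=25.0, size=300)
confidence = np.clip(rng.normal(0.45, 0.12, size=300), 0.0, 1.0)

fig = plt.figure(figsize=(5, 5))
gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                      wspace=0.05, hspace=0.05)
ax = fig.add_subplot(gs[1, 0])                   # main scatter
ax_top = fig.add_subplot(gs[0, 0], sharex=ax)    # marginal for x
ax_right = fig.add_subplot(gs[1, 1], sharey=ax)  # marginal for y

ax.scatter(target_length, confidence, color="#9467bd", alpha=0.5, s=12)
ax.set_xlabel("Target Length")
ax.set_ylabel("Confidence")
ax_top.hist(target_length, bins=30, color="#9467bd")
ax_right.hist(confidence, bins=30, color="#9467bd",
              orientation="horizontal")
ax_top.set_title("philosophy")
ax_top.axis("off")   # hide the marginal axes, as in the figure
ax_right.axis("off")
```

Seaborn's `jointplot` produces the same layout in one call; the explicit gridspec version is shown only to make the structure of the figure concrete.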
<details>
<summary>x62.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Prehistory Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "prehistory" and explores the relationship between "Target Length" and "Confidence." The primary data is represented by purple points, with corresponding marginal distributions shown in the same color.
### Components/Axes
* **Main Plot (Center):**
* **X-Axis:** Labeled "Target Length." The scale runs from 0 to approximately 200, with major tick marks at 0 and 100.
* **Y-Axis:** Labeled "Confidence." The scale runs from 0.00 to 1.00, with major tick marks at 0.00, 0.25, 0.50, 0.75, and 1.00.
* **Data Series:** A single series of purple circular points scattered across the plot area.
* **Marginal Plot (Top):**
* Positioned above the main plot, aligned with the X-axis.
* Displays the distribution of the "Target Length" variable. It appears as a right-skewed density plot or histogram.
* **Marginal Plot (Right):**
* Positioned to the right of the main plot, aligned with the Y-axis.
* Displays the distribution of the "Confidence" variable. It appears as a bimodal density plot or histogram.
* **Color/Legend:** All data elements (scatter points, marginal plots) use a consistent shade of purple. There is no separate legend box; the color coding is implicit and consistent across all components.
### Detailed Analysis
* **Scatter Point Distribution:**
* The points are most densely clustered in the lower-left quadrant of the plot, corresponding to **Target Lengths between approximately 0-100** and **Confidence values between 0.00-0.50**.
* There is a wide vertical spread of Confidence values for shorter Target Lengths (0-50), ranging from near 0.00 to above 0.75.
* As Target Length increases beyond ~100, the density of points decreases, and the Confidence values appear to trend lower, mostly below 0.50.
* A few outlier points exist with high Confidence (>0.75) at moderate Target Lengths (~50-100).
* **Marginal Distribution - Target Length (Top):**
* The distribution is **right-skewed**. The peak density (mode) appears to be at a relatively low Target Length, approximately in the **30-70 range**.
* The tail extends to the right, indicating fewer instances of very long targets (approaching 200).
* **Marginal Distribution - Confidence (Right):**
* The distribution is **bimodal**. There are two distinct peaks.
* The primary, larger peak is at a **low Confidence level, approximately 0.20-0.30**.
* A secondary, smaller peak is at a **moderate-to-high Confidence level, approximately 0.65-0.75**.
* A faint horizontal line is visible within this marginal plot at approximately **Confidence = 0.50**, which may represent a median or a reference threshold.
### Key Observations
1. **Inverse Relationship Trend:** There is a general, weak inverse trend visible. Points with higher Target Lengths (>100) tend to have lower Confidence scores, clustering below 0.50.
2. **High Variance at Low Length:** For short Target Lengths (<50), Confidence is highly variable, spanning almost the entire possible range (0.00 to >0.75).
3. **Bimodal Confidence:** The Confidence metric is not normally distributed. The data suggests two predominant groups: one with low confidence (~0.25) and another with moderately high confidence (~0.70).
4. **Data Sparsity:** The plot becomes sparse for Target Lengths greater than ~150, suggesting fewer data points in this range.
### Interpretation
This chart likely analyzes the performance or reliability of a system (e.g., a machine learning model, an archaeological dating method) applied to "prehistory" data. "Target Length" could refer to the temporal span, sequence length, or complexity of the historical subject being analyzed. "Confidence" represents the system's certainty in its output.
The data suggests that **the system is most frequently applied to, or performs with, moderate-to-low confidence on shorter, perhaps more granular historical targets.** The bimodal Confidence distribution is critical: it indicates the system's outputs are not uniformly uncertain but tend to cluster around two distinct certainty levels. This could imply the existence of two different classes of problems or data quality within the "prehistory" domain. The trend of decreasing confidence for longer targets might indicate that the system struggles with broader, more complex temporal scopes, or that longer targets are inherently more ambiguous. The high variance at short lengths suggests that for concise subjects, other factors beyond length heavily influence the system's confidence.
</details>
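The "bimodal" reading above comes from eyeballing the smoothed marginal; a crude numeric cross-check is to count well-separated local maxima in a histogram. A minimal sketch on a toy two-cluster sample (hypothetical, not the paper's data) mimicking the described peaks near 0.25 and 0.70:

```python
import numpy as np

def count_modes(values, bins=20, min_frac=0.05):
    """Count local maxima in a histogram whose height is at least
    min_frac of the tallest bin -- a crude bimodality probe."""
    counts, _ = np.histogram(values, bins=bins)
    floor = min_frac * counts.max()
    modes = 0
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right and c >= floor:
            modes += 1
    return modes

# Toy sample: two tight clusters at 0.25 and 0.70.
conf = np.concatenate([np.full(50, 0.25), np.full(30, 0.70)])
print(count_modes(conf))  # → 2
```

A formal alternative would be Hartigan's dip test, but for a visual sanity check on a marginal like this, peak counting on the histogram is usually sufficient.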
|
<details>
<summary>x63.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: professional_accounting
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms or density plots) on the top and right sides. The chart is titled "professional_accounting" and explores the relationship between "Target Length" and "Confidence". The data is represented by purple points and distributions.
### Components/Axes
* **Title:** "professional_accounting" (located at the top center).
* **Main Plot Area:** A scatter plot.
* **X-Axis:**
* **Label:** "Target Length" (located at the bottom center).
* **Scale:** Linear scale ranging from 0 to approximately 150. Major tick marks are visible at 0 and 100.
* **Y-Axis:**
* **Label:** "Confidence" (located on the left side, rotated vertically).
* **Scale:** Linear scale ranging from 0.0 to approximately 0.7. Major tick marks are visible at 0.2, 0.4, and 0.6.
* **Data Series:** A single series of data points, all rendered in a medium purple color.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) aligned with the X-axis ("Target Length"). It is positioned above the main scatter plot.
* **Right Marginal Plot:** A distribution plot aligned with the Y-axis ("Confidence"). It is positioned to the right of the main scatter plot. Both marginal plots use the same purple color as the scatter points.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains a high density of points, likely numbering in the hundreds.
* **Spatial Pattern & Trend:** The data shows a strong concentration in the lower-left quadrant of the plot. The highest density of points occurs where "Target Length" is between approximately 0 and 50, and "Confidence" is between 0.1 and 0.4. As "Target Length" increases beyond 50, the points become more sparse and show a wider spread in "Confidence" values, though the overall trend suggests a slight negative correlation: higher target lengths are associated with slightly lower confidence on average.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is heavily right-skewed. The peak (mode) is near 0, with a long tail extending towards higher values (up to ~150). This indicates most samples have a short target length.
* **Confidence (Right Plot):** The distribution is unimodal and roughly symmetric, centered around a confidence value of approximately 0.25 to 0.3. The spread ranges from near 0.0 to about 0.65.
### Key Observations
1. **Cluster Dominance:** The vast majority of data points are clustered in a region of low target length and low-to-moderate confidence.
2. **Outliers:** There are a few outlier points with very high confidence (>0.6) at low target lengths (<20). There are also points with very high target length (>120) but with confidence values mostly below 0.4.
3. **Inverse Relationship Hint:** While noisy, the cloud of points suggests that as the target length increases, the maximum observed confidence tends to decrease. It is rare to see high confidence for long target lengths.
4. **Data Sparsity:** The plot becomes significantly sparser in the region where Target Length > 80 and Confidence > 0.4.
### Interpretation
This chart likely analyzes the performance of a model or system in the domain of "professional_accounting." The "Target Length" could represent the complexity, length of a document, or number of steps in an accounting task. "Confidence" likely represents the model's predicted probability or certainty in its output.
The data suggests that the system is most frequently applied to (or performs best on) tasks with short target lengths, where it achieves moderate confidence. The strong right-skew in target length indicates the dataset or evaluation is dominated by shorter tasks. The negative trend implies that as task complexity (length) increases, the system's confidence in its solutions tends to decrease and become more variable. The presence of high-confidence outliers on short tasks may represent easy, routine cases. The overall pattern could indicate a limitation in the system's ability to maintain high certainty when handling longer, more complex accounting problems.
</details>
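Several of these marginals are described as "right-skewed"; the sign of the Fisher-Pearson skewness coefficient is a quick numeric check (positive for a long right tail). A stdlib-only sketch on hypothetical target lengths:

```python
def sample_skewness(xs):
    """Fisher-Pearson moment coefficient g1 = m3 / m2**1.5;
    positive when the distribution has a long right tail."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed lengths: mostly short, with a long tail.
lengths = [10, 12, 15, 18, 20, 22, 25, 30, 40, 60, 90, 150]
print(sample_skewness(lengths) > 0)  # True: right-skewed
```

`scipy.stats.skew` computes the same quantity; the hand-rolled version is shown only to make the definition explicit.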
|
<details>
<summary>x64.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Professional Psychology Confidence vs. Target Length
### Overview
The image displays a scatter plot analyzing the relationship between "Target Length" and "Confidence" within the domain of "professional_psychology." The plot includes a fitted trend line and marginal distribution plots (histograms/density plots) for both variables, providing a comprehensive view of the data's distribution and correlation.
### Components/Axes
* **Title:** "professional_psychology" (located at the top center).
* **Main Chart Area:** A scatter plot with a fitted linear trend line.
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to over 200.
* **Major Tick Marks:** 0, 100, 200.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.0 to approximately 0.7.
* **Major Tick Marks:** 0.0, 0.5.
* **Data Series:** A single series represented by purple circular points.
* **Trend Line:** A solid purple line indicating the linear regression fit.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) for the "Target Length" variable, aligned with the x-axis.
* **Right Marginal Plot:** A distribution plot for the "Confidence" variable, aligned with the y-axis.
* **Legend:** No explicit legend is present. The consistent purple color for all points and the trend line implies they belong to a single data category.
### Detailed Analysis
* **Data Distribution (Scatter Plot):**
* The data points (purple dots) are densely clustered in the lower-left quadrant of the plot.
* The highest density of points occurs approximately within the range of **Target Length: 0 to 150** and **Confidence: 0.0 to 0.4**.
* There is a visible, positive linear trend. As "Target Length" increases, "Confidence" shows a general tendency to increase.
* **Trend Line:** The purple regression line starts at approximately **(Target Length ≈ 0, Confidence ≈ 0.1)** and slopes upward to approximately **(Target Length ≈ 200, Confidence ≈ 0.4)**.
* **Outliers/Variability:** Several data points exist with relatively high confidence (>0.5) across various target lengths. Conversely, some points with long target lengths (>150) have low confidence (<0.2). The spread of points around the trend line indicates moderate variability.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The peak (mode) appears to be in the range of approximately **50-100**. The tail extends beyond 200.
* **Confidence (Right Plot):** The distribution appears roughly unimodal, with its peak centered around **0.2-0.3**. The density tapers off towards 0.0 and 0.6.
### Key Observations
1. **Positive Correlation:** The primary observation is a positive association between target length and confidence in the context of professional psychology.
2. **Data Concentration:** Most analyzed instances involve relatively short target lengths (under 150) and low-to-moderate confidence scores (under 0.4).
3. **Variability:** The relationship is not deterministic. For any given target length, there is a substantial range of confidence values, indicating other factors are at play.
4. **Distribution Shapes:** The target length data is not normally distributed but is concentrated at lower values with a long tail. Confidence scores are more centrally clustered.
### Interpretation
The data suggests that within the sampled professional psychology materials or tasks, there is a measurable, positive relationship between the length of a target (which could refer to a text passage, a case description, or a therapeutic goal) and the confidence associated with it (which could be a clinician's confidence in a diagnosis, a prediction, or an intervention outcome).
* **What it Demonstrates:** Longer targets may provide more information or context, leading to slightly higher confidence. However, the moderate slope and significant scatter indicate that length alone is a weak-to-moderate predictor of confidence. Expertise, target complexity, or data quality are likely stronger, unmeasured factors.
* **Notable Anomaly:** The presence of high-confidence points at short target lengths is interesting. These could represent clear-cut, prototypical cases where little information is needed for high confidence. Conversely, low-confidence points at long lengths might represent complex, ambiguous, or contradictory information that does not resolve uncertainty.
* **Underlying Pattern:** The right-skewed distribution of target length implies that most professional psychology targets in this dataset are concise, with fewer very long ones. The analysis focuses on the relationship within this common range. The plot effectively argues against a simplistic "more information always equals more confidence" model, highlighting the nuanced nature of judgment in the field.
</details>
|
<details>
<summary>x65.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Public Relations Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms (or density plots), titled "public_relations". It displays the relationship between "Target Length" (x-axis) and "Confidence" (y-axis) for a dataset. A linear regression trend line with a shaded confidence interval is overlaid on the scatter data.
### Components/Axes
* **Title:** "public_relations" (centered at the top).
* **X-Axis:**
* **Label:** "Target Length"
* **Scale:** Linear, ranging from 0 to approximately 150. Major tick marks are at 0, 50, 100, and 150.
* **Y-Axis:**
* **Label:** "Confidence"
* **Scale:** Linear, ranging from 0.00 to approximately 0.85. Major tick marks are at 0.00, 0.25, 0.50, and 0.75.
* **Data Series:**
* **Scatter Points:** Individual data points are represented as small, solid purple circles.
* **Trend Line:** A solid, dark purple line representing a linear regression fit.
* **Confidence Interval:** A semi-transparent, light purple shaded region surrounding the trend line, indicating the uncertainty of the fit.
* **Marginal Distributions:**
* **Top (for X-axis):** A histogram/density plot showing the distribution of "Target Length". It is positioned above the main plot area.
* **Right (for Y-axis):** A histogram/density plot showing the distribution of "Confidence". It is positioned to the right of the main plot area.
* **Legend:** No explicit legend is present. The color purple is used consistently for all data elements (points, line, interval, histograms).
### Detailed Analysis
* **Data Point Distribution:**
* The majority of data points are clustered in the lower-left quadrant, where "Target Length" is between 0 and 75 and "Confidence" is between 0.00 and 0.50.
* There is a high density of points with very low "Target Length" (0-25) spanning a wide range of "Confidence" values (0.00 to ~0.70).
* As "Target Length" increases beyond 75, the number of data points decreases significantly.
* Several outlier points exist with high "Confidence" (>0.60) at various "Target Length" values.
* **Trend Line & Correlation:**
* The dark purple regression line has a **positive slope**, rising from left to right.
* It starts at approximately (Target Length=0, Confidence=0.25) and ends near (Target Length=150, Confidence=0.45).
* This indicates a **weak to moderate positive correlation**: as "Target Length" increases, "Confidence" tends to increase slightly.
* The light purple confidence interval is narrowest near the center of the data mass (around Target Length=50) and widens considerably at the extremes (especially near Target Length=150), indicating greater uncertainty in the trend where data is sparse.
* **Marginal Histograms:**
* **Target Length Distribution (Top):** The distribution is right-skewed. The highest frequency (peak) appears to be in the bin around 25-50. Frequency drops off steadily as length increases.
* **Confidence Distribution (Right):** The distribution appears roughly unimodal, with peak density in the range of approximately 0.25 to 0.40.
### Key Observations
1. **Positive Trend:** The primary observation is the positive relationship between target length and confidence.
2. **Data Sparsity at High Values:** The relationship is inferred from a much smaller number of data points at higher target lengths (>100), making the trend less reliable in that region.
3. **High Variance at Low Length:** For short target lengths (0-25), confidence values are highly variable, spanning almost the entire observed range.
4. **Concentration of Data:** The bulk of the analyzed "public relations" items have a target length under 75 and a confidence score under 0.50.
### Interpretation
The data suggests that in the context of "public relations," there is a tendency for longer target messages (e.g., press releases, statements) to be associated with slightly higher confidence scores. This could imply that more comprehensive or detailed communications are perceived with greater assurance, or that entities producing longer communications are more confident in their messaging.
However, the relationship is not strong, and significant noise exists. The wide spread of confidence for short messages indicates that length alone is not a primary driver of confidence; other factors (content, source credibility, context) likely play a major role. The sparsity of data for very long targets (>100) means the observed upward trend should be interpreted with caution: it may not hold or may be influenced by a few specific cases. The marginal histograms confirm that the analysis is based primarily on moderately short targets with mid-range confidence values.
</details>
|
<details>
<summary>x66.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Security Studies Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distributions (histogram on top, density plot on the right). It displays the relationship between "Target Length" and "Confidence" for a dataset or category labeled "security_studies". The plot uses a monochromatic purple color scheme.
### Components/Axes
* **Title:** "security_studies" (centered at the top).
* **Main Plot Area:** A scatter plot with data points represented as purple circles.
* **X-Axis:**
* **Label:** "Target Length" (centered below the axis).
* **Scale:** Linear, ranging from 0 to approximately 700.
* **Major Tick Marks:** 0, 250, 500.
* **Y-Axis:**
* **Label:** "Confidence" (centered to the left, rotated 90 degrees).
* **Scale:** Linear, ranging from 0.0 to approximately 0.8.
* **Major Tick Marks:** 0.2, 0.4, 0.6.
* **Legend:** Located in the top-left corner of the main plot area. Contains a purple square symbol followed by the text "security_studies".
* **Marginal Distributions:**
* **Top (X-axis distribution):** A histogram showing the frequency distribution of "Target Length". It is heavily right-skewed, with the highest bar between 0-100.
* **Right (Y-axis distribution):** A density plot (smoothed histogram) showing the distribution of "Confidence". It is unimodal, peaking between 0.2 and 0.3.
* **Reference Line:** A faint, horizontal, dashed purple line is present at approximately y = 0.25, likely representing the median or mean confidence level.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains several hundred data points (purple dots).
* **Spatial Grounding & Trend Verification:**
* The highest density of points is concentrated in the lower-left quadrant, where **Target Length is between 0-250** and **Confidence is between 0.1-0.4**.
* The overall visual trend shows a **weak negative correlation**. As Target Length increases, the cloud of points trends slightly downward, suggesting a potential decrease in Confidence.
* **Outliers:** There are a few notable outliers:
* A small cluster of points with very high Confidence (>0.6) at low Target Lengths (<100).
* A few points with moderate Confidence (~0.4-0.5) at higher Target Lengths (400-600).
* **Marginal Plot Details:**
* The **top histogram** confirms the right-skew: the vast majority of "Target Length" values are below 250, with a long tail extending to ~700.
* The **right density plot** confirms the concentration of "Confidence" values: the distribution peaks sharply around 0.25 and tapers off, with very few instances above 0.6.
### Key Observations
1. **Primary Cluster:** The core of the data lies in the region of short target lengths and low-to-moderate confidence.
2. **Inverse Relationship:** There is a visual suggestion that longer targets are associated with slightly lower confidence scores, though the relationship is not strong.
3. **High-Confidence Exceptions:** The presence of high-confidence points at very short lengths indicates that some "security_studies" items are deemed highly confident despite (or because of) their brevity.
4. **Central Tendency:** The horizontal reference line at ~0.25 aligns with the peak of the confidence density plot, indicating this is the most common confidence level.
### Interpretation
This chart likely analyzes the performance or characteristics of a model or system related to "security_studies". The "Target Length" could refer to the length of a text, a code snippet, or a sequence being analyzed, while "Confidence" represents the model's certainty in its output (e.g., classification, detection, or summary).
The data suggests that **the system is most frequently applied to, or performs most consistently on, shorter targets**, where it exhibits a baseline confidence around 0.25. The weak negative trend might imply that as the complexity or length of the target increases, the system's confidence in its assessment slightly diminishes, which is a common pattern in many AI tasks. The high-confidence outliers at short lengths could represent clear-cut, unambiguous cases that are easy for the system to evaluate. The overall distribution indicates that high-confidence assessments (>0.6) are rare for this category. This visualization would be crucial for understanding the operational envelope and reliability of a security-focused analytical tool.
</details>
|
Figure 13: Continuing from fig. 12. See also fig. 14.
|
<details>
<summary>x67.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Sociology
### Overview
The image is a statistical visualization titled "sociology." It is a scatter plot with marginal distribution plots (density curves) on the top and right sides. The plot displays the relationship between "Target Length" on the horizontal axis and "Confidence" on the vertical axis for a dataset. The data points are rendered as semi-transparent purple circles, and a horizontal reference line is present.
### Components/Axes
* **Title:** "sociology" (centered at the top).
* **Main Chart Area:** A scatter plot.
* **X-Axis (Horizontal):**
* **Label:** "Target Length"
* **Scale:** Linear.
* **Visible Tick Markers:** "0" and "100". The axis extends slightly beyond 100.
* **Y-Axis (Vertical):**
* **Label:** "Confidence"
* **Scale:** Linear.
* **Visible Tick Markers:** "0.25", "0.50", "0.75".
* **Data Series:** A single series represented by purple circles. No explicit legend is present, as there is only one category.
* **Reference Line:** A solid, thin, dark purple horizontal line is drawn at approximately `Confidence = 0.25`.
* **Marginal Plots:**
* **Top Marginal Plot:** A density curve (smoothed histogram) showing the distribution of the "Target Length" variable. It is positioned directly above the main scatter plot, sharing the same x-axis.
* **Right Marginal Plot:** A density curve showing the distribution of the "Confidence" variable. It is positioned to the right of the main scatter plot, sharing the same y-axis. This plot is oriented vertically.
### Detailed Analysis
* **Data Point Distribution:**
* The data points are heavily concentrated in the region where `Target Length` is between approximately 0 and 50.
* Within this dense cluster, `Confidence` values show high variance, ranging from near 0.0 to above 0.75.
* As `Target Length` increases beyond 50, the density of points decreases significantly. Points become sparse.
* For `Target Length` values between 50 and ~150, the `Confidence` values appear to cluster more tightly, primarily between 0.25 and 0.50, with a few outliers above 0.50.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The peak density (mode) is at a very low `Target Length` (near 0). The density drops sharply as length increases, with a long tail extending to the right.
* **Confidence (Right Plot):** The distribution appears roughly unimodal with a peak near `Confidence = 0.25`. The density is highest around 0.25-0.35 and tapers off towards higher confidence values.
* **Reference Line:** The horizontal line at `Confidence ≈ 0.25` aligns closely with the peak of the Confidence marginal distribution and passes through the densest part of the scatter plot cluster.
### Key Observations
1. **Inverse Density Relationship:** There is a strong inverse relationship between the density of data points and `Target Length`. Short targets are abundant; long targets are rare.
2. **Variance Reduction:** The variance (spread) of `Confidence` appears to decrease as `Target Length` increases. Short targets are associated with highly variable confidence, while the few long targets show more consistent, moderate confidence.
3. **Central Tendency:** The horizontal reference line at 0.25 and the peak of the Confidence marginal plot suggest that the central tendency (likely the median or mode) for confidence in this dataset is around 0.25.
4. **Outliers:** A small number of data points exist with `Confidence` > 0.50 at `Target Length` > 50, which are outliers relative to the main cluster of long-target points.
### Interpretation
This visualization suggests a potential pattern in a sociological dataset (the specific context is not provided). The data indicates that **shorter "targets" (which could represent text length, interaction duration, or another metric) are far more common but are evaluated with highly variable confidence.** This could imply that short events or items are easier to encounter but harder to judge consistently.
Conversely, **longer "targets" are rare but are assessed with more consistent, moderate confidence (centered around 0.25).** This might suggest that when longer events or items do occur, they provide more stable cues for evaluation, leading to less variable confidence scores, albeit not necessarily high confidence.
The horizontal line at 0.25 may represent a baseline, chance level, or a significant threshold for confidence within this specific sociological model or measurement tool. The overall takeaway is that the length of the target variable is strongly associated with both the frequency of observation and the reliability (consistency) of the confidence measure.
</details>
|
<details>
<summary>x68.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: US Foreign Policy Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal distribution plots (histograms/density plots) on the top and right sides. The chart is titled "us_foreign_policy" and explores the relationship between two variables: "Target Length" and "Confidence." The data points are plotted in a purple hue, and a linear regression trend line is overlaid on the main scatter plot.
### Components/Axes
* **Main Chart Area:**
* **X-Axis (Horizontal):** Labeled "Target Length." The scale runs from 0 to approximately 150, with major tick marks at 0, 50, and 100.
* **Y-Axis (Vertical):** Labeled "Confidence." The scale runs from 0.00 to 0.75, with major tick marks at 0.00, 0.25, 0.50, and 0.75.
* **Data Series:** A single series of data points represented by purple circles. There is no explicit legend, as only one data series is present.
* **Trend Line:** A solid, darker purple line representing a linear regression fit to the data.
* **Marginal Plots:**
* **Top Marginal Plot:** A distribution plot (likely a histogram or kernel density estimate) for the "Target Length" variable, aligned with the x-axis of the main plot.
* **Right Marginal Plot:** A distribution plot for the "Confidence" variable, aligned with the y-axis of the main plot.
### Detailed Analysis
* **Data Point Distribution:** The scatter plot contains approximately 100-150 data points. The points are densely clustered in the lower-left quadrant of the plot, specifically where "Target Length" is between 0 and 75 and "Confidence" is between 0.00 and 0.50. The density of points decreases as both "Target Length" and "Confidence" increase.
* **Trend Line Analysis:** The linear regression line has a very slight positive slope. It originates at a y-intercept of approximately 0.25 (when Target Length is 0) and rises to a value of approximately 0.30 at a Target Length of 150. This indicates a weak positive correlation between the two variables.
* **Marginal Distributions:**
* **Target Length (Top Plot):** The distribution is right-skewed. The highest density (peak) occurs at a low Target Length, approximately between 10 and 30. The frequency tapers off significantly as Target Length increases beyond 50.
* **Confidence (Right Plot):** The distribution is also right-skewed. The highest density occurs at a low Confidence level, approximately between 0.10 and 0.30. The frequency drops sharply for Confidence values above 0.50.
### Key Observations
1. **Weak Correlation:** The nearly flat trend line suggests a very weak relationship between the length of a target and the confidence associated with it in this dataset.
2. **Concentration of Data:** The vast majority of observations involve relatively short targets (length < 75) and low to moderate confidence scores (< 0.50).
3. **Skewed Distributions:** Both variables exhibit right-skewed distributions, meaning most data points have low values, with fewer instances of high target length or high confidence.
4. **Potential Outliers:** A small number of data points exist with high Confidence (>0.60) and/or high Target Length (>100), but they are sparse and do not strongly influence the overall trend.
### Interpretation
This visualization suggests that within the context of the "us_foreign_policy" dataset, there is no strong evidence that longer targets are associated with higher confidence. The data is dominated by instances of short targets paired with low-to-moderate confidence levels.
The weak positive slope of the trend line, while statistically present, may not be practically significant. The primary insight is the clustering of data in the low-value region for both metrics. This could imply that in the analyzed domain of US foreign policy, most evaluated targets are of limited scope or duration, and assessments of confidence in outcomes or actions related to them are generally cautious. The skew in distributions highlights that high-confidence assessments or engagements with very long-term targets are exceptions rather than the norm. The marginal plots effectively reinforce this by showing the concentration of data at the lower ends of both scales.
</details>
<details>
<summary>x69.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: Virology Confidence vs. Target Length
### Overview
The image is a statistical chart, specifically a scatter plot with marginal distribution plots (density plots) on the top and right edges. It visualizes the relationship between "Target Length" and "Confidence" within a virology context. The overall aesthetic is minimalist, using a single purple color scheme against a white background with light gray grid lines.
### Components/Axes
* **Title:** "virology" (centered at the top).
* **Main Plot Area:**
* **X-Axis:** Labeled "Target Length". The scale runs from 0 to 100, with major tick marks at 0, 50, and 100.
* **Y-Axis:** Labeled "Confidence". The scale runs from 0.00 to 1.00, with major tick marks at 0.00, 0.25, 0.50, 0.75, and 1.00.
* **Data Series:** Represented by purple circles (scatter points). There is no explicit legend, as only one data series is present.
* **Trend Line:** A solid purple line runs through the data, showing a fitted model (likely a regression line). It is accompanied by a semi-transparent purple shaded area, representing the confidence interval or standard error band around the trend.
* **Marginal Plots:**
* **Top Marginal Plot:** A density plot (smoothed histogram) aligned with the X-axis ("Target Length"). It shows the distribution of data points along the target length dimension.
* **Right Marginal Plot:** A density plot aligned with the Y-axis ("Confidence"). It shows the distribution of data points along the confidence dimension.
### Detailed Analysis
* **Data Point Distribution:** The scatter points are densely clustered in the lower-left quadrant of the plot. The highest concentration appears where "Target Length" is between approximately 0 and 40, and "Confidence" is between 0.10 and 0.40.
* **Trend Line Analysis:** The solid purple trend line exhibits a very slight downward slope from left to right. It starts at a confidence value of approximately 0.30 when Target Length is 0 and decreases to approximately 0.25 when Target Length is 100. The shaded confidence band around this line is relatively narrow, suggesting some certainty in this weak negative trend.
* **Marginal Distributions:**
* **Target Length (Top):** The distribution is right-skewed. The peak density is at the lower end (near 0), with a long tail extending towards 100. This indicates most data points have short target lengths.
* **Confidence (Right):** The distribution is also right-skewed, with the highest density around a confidence value of 0.20-0.30. The density drops off sharply above 0.50, with very few points above 0.75.
### Key Observations
1. **Inverse Relationship:** There is a weak negative correlation between Target Length and Confidence. As the target length increases, the confidence score shows a slight tendency to decrease.
2. **Data Sparsity:** The data is highly uneven. The vast majority of observations are concentrated at low target lengths and low-to-moderate confidence scores. The upper-right quadrant (high length, high confidence) is virtually empty.
3. **Outliers:** A few scattered points exist with confidence scores above 0.50, primarily at lower target lengths (below ~50). One notable outlier appears near (Target Length ≈ 80, Confidence ≈ 0.80).
4. **Uncertainty:** The model's confidence (as shown by the shaded band) is fairly consistent across the range of target lengths, though it appears slightly wider at the extreme right (lengths near 100), where data is sparse.
### Interpretation
This chart suggests that in the context of this virology dataset, longer genetic or protein targets are associated with slightly lower confidence in the model's predictions or measurements. The strong clustering at low lengths and low confidence could indicate several possibilities:
* The model or assay performs best on shorter, perhaps more conserved or well-understood, target regions.
* The dataset itself is imbalanced, containing many more examples of short targets, which are typically easier to analyze with high confidence.
* The "Confidence" metric might be inherently lower for longer targets due to increased complexity, potential for errors, or higher variability in the underlying biological data.
The marginal distributions confirm the imbalance in the dataset. The investigation should focus on why confidence drops with length: is it a fundamental limitation of the method, an artifact of the data collection, or a true biological signal? The outlier with high confidence at a long target length is particularly interesting and warrants individual examination to understand what makes that case successful.
</details>
<details>
<summary>x70.png Details</summary>

### Visual Description
## Scatter Plot with Marginal Distributions: World Religions Confidence vs. Target Length
### Overview
The image is a statistical visualization, specifically a scatter plot with marginal histograms and density plots. It displays the relationship between "Target Length" and "Confidence" for a dataset or model output labeled "world_religions". The plot includes a fitted regression line with a confidence interval.
### Components/Axes
* **Title:** `world_religions` (located at the top center).
* **Main Chart Area:**
* **X-Axis:** Labeled `Target Length`. The scale runs from 0 to approximately 70, with major tick marks visible at 0 and 50.
* **Y-Axis:** Labeled `Confidence`. The scale runs from 0.00 to 1.00, with major tick marks at 0.00, 0.25, 0.50, 0.75, and 1.00.
* **Data Series:** A single series represented by purple circular points. A legend in the top-left corner confirms this series is `world_religions`.
* **Trend Line:** A solid purple line representing a linear regression fit, surrounded by a semi-transparent purple shaded area representing the confidence interval for the fit.
* **Marginal Plots:**
* **Top (above X-axis):** A histogram and density plot for the `Target Length` variable. It shows a right-skewed distribution, with most data points clustered at lower target lengths (0-30).
* **Right (beside Y-axis):** A histogram and density plot for the `Confidence` variable. It shows a left-skewed distribution, with most data points clustered at higher confidence values (0.4-0.8).
### Detailed Analysis
* **Data Distribution & Trend:**
* **Trend Verification:** The purple regression line has a clear **downward slope** from left to right, indicating a negative correlation between Target Length and Confidence.
* **Data Point Cluster:** The highest density of purple data points is concentrated in the region where `Target Length` is between 0 and 30 and `Confidence` is between 0.25 and 0.75.
* **Estimated Trend Line Values:** The line appears to start at a Confidence of approximately 0.50 when Target Length is 0, and declines to a Confidence of approximately 0.25 when Target Length is 50.
* **Confidence Interval:** The shaded confidence interval around the trend line is relatively narrow at low Target Lengths but widens significantly as Target Length increases beyond 40, indicating greater uncertainty in the trend for longer targets.
* **Marginal Distributions:**
* **Target Length (Top):** The histogram bars are tallest on the left (0-10 range) and decrease in height to the right. The overlaid density curve confirms a strong positive (right) skew.
* **Confidence (Right):** The histogram bars are tallest in the 0.5-0.75 range and decrease towards both 0.0 and 1.0. The overlaid density curve confirms a negative (left) skew.
### Key Observations
1. **Negative Correlation:** There is a clear inverse relationship: as the `Target Length` increases, the model's `Confidence` tends to decrease.
2. **Data Skew:** The dataset is heavily skewed towards shorter target lengths and mid-to-high confidence scores.
3. **Increased Uncertainty:** The model's predictive trend (the regression line) becomes much less certain for longer target lengths, as shown by the widening confidence band.
4. **Outliers:** A few data points exist with very high confidence (>0.8) at low target lengths, and a few with very low confidence (<0.2) across various lengths, but they are not extreme outliers.
### Interpretation
This chart likely evaluates the performance of a machine learning or NLP model on a "world religions" task. The "Target Length" probably refers to the length (e.g., word count, token count) of an input query or text passage, while "Confidence" is the model's self-assessed probability or certainty in its output.
The data suggests the model is **less confident when processing longer, more complex inputs** related to world religions. This could indicate that the model's knowledge or reasoning ability degrades with input length for this specific domain, or that longer inputs introduce more ambiguity. The high density of data at shorter lengths implies the evaluation dataset or typical use case involves brief queries. The widening confidence interval for longer targets is a critical finding, signaling that the model's behavior becomes unpredictable and unreliable as input length grows, which is an important limitation to consider for real-world applications.
</details>
Figure 14: Continuing from figs. 12 and 13.
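The trend lines in these panels are described as linear regression fits of Confidence on Target Length. As a minimal sketch of how such a fit is computed, an ordinary least-squares line on illustrative data (the points below roughly mimic the weak positive slope of the `us_foreign_policy` panel and are not taken from the figure):

```python
def ols_fit(xs, ys):
    """Slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Illustrative points: confidence rising slightly with target length,
# mimicking a nearly flat trend line (y-intercept ~0.25, slope ~0).
xs = [0, 50, 100, 150]
ys = [0.25, 0.27, 0.28, 0.30]
slope, intercept = ols_fit(xs, ys)
```

A slope this close to zero is what the panel descriptions call a "weak positive correlation": statistically present but of little practical significance.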
### Appendix F Generalization to Coding Tasks
Because there are no coding tasks in our training dataset, we can use a coding competition task introduced in LiveCodeBench [Jain et al., 2024] to assess how well finetuned uncertainty estimation methods perform on completely out-of-distribution tasks.
To conduct the analysis in table 3, we evaluate several base models on the 62 LeetCode easy questions from the livecodebench_generation_lite task. We ask the model to write a Python solution and grade the solution using test cases, marking it as correct iff it passes all test cases. We then apply the LoRA + Prompt and Zero-Shot Classifier uncertainty estimation methods, with these methods using only training and temperature-scaling data from our main dataset mixture (section C.2), which notably does not include any coding tasks. Accuracy is shown to contextualize the model's overall level of performance on the task. On Mistral-7B, the best-performing model on the coding task, the supervised LoRA + Prompt approach dramatically improves calibration and selective prediction compared to the Zero-Shot Classifier; on the worse-performing Mistral-7B-Instruct and LLaMa-2-7B, selective prediction improves but calibration slightly degrades.
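The grading rule above (a solution is correct iff it passes every test case) can be sketched as follows; the `(args, expected)` test-case format and the example function are illustrative assumptions, not the LiveCodeBench harness:

```python
def grade_solution(solution_fn, test_cases):
    """Mark a candidate solution correct iff it passes all test cases.

    `test_cases` is a list of (args, expected) pairs; this format is an
    illustrative assumption, not the LiveCodeBench harness.
    """
    for args, expected in test_cases:
        try:
            if solution_fn(*args) != expected:
                return False
        except Exception:  # runtime errors also count as failures
            return False
    return True

# Hypothetical example: a trivial "add two numbers" problem.
def add(a, b):
    return a + b

cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
```

Under this rule, a single failing test case marks the whole solution incorrect, matching the binary correctness labels used for uncertainty estimation.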
| Model | Method | Acc | ECE | AUROC |
| --- | --- | --- | --- | --- |
| LLaMa-2-7B | Zero-Shot Classifier | 3.2% | 41.0% | 56.9% |
| LLaMa-2-7B | LoRA + Prompt | 3.2% | 46.4% | 80.0% |
| Mistral-7B | Zero-Shot Classifier | 27.4% | 70.2% | 66.2% |
| Mistral-7B | LoRA + Prompt | 27.4% | 21.4% | 85.1% |
| Mistral-7B-Instruct | Zero-Shot Classifier | 21.0% | 52.7% | 47.1% |
| Mistral-7B-Instruct | LoRA + Prompt | 21.0% | 56.1% | 70.2% |
Table 3: ECE and AUROC on livecodebench_generation_lite (LeetCode easy subset). ECE is shown after temperature scaling on a small hold-out set of the original dataset mixture (section C.2). Acc is task accuracy (proportion of coding solutions that are correct). Supervised training (LoRA + Prompt) consistently improves selective prediction, but it only substantially improves calibration for Mistral-7B and in fact slightly degrades calibration for the two other models.
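As a reference for the calibration numbers in table 3, a minimal expected calibration error (ECE) computation with equal-width confidence bins (the bin count and equal-width binning are standard choices, not confirmed details of the paper's evaluation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted average gap between mean confidence and
    accuracy within each equal-width confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins [lo, hi); confidence 1.0 falls in the last bin.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - avg_acc)
    return ece
```

For example, a model that answers with 95% confidence but is right only half the time in that bin contributes a gap of 0.45, the kind of overconfidence reflected in the high pre-finetuning ECE values above.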
### Appendix G User Studies
#### G.1 Additional Details on Setup
Stimuli and Participant Selection
We closely followed the setup of [Bhatt et al., 2023]. We used the same 180 MMLU questions, which were pre-batched into three sets of 60 MMLU questions. Within each variant, we randomly assigned participants to one of the three batches. In total, we recruited $181$ participants (20 per variant, with one extra participant due to random batching allocation effects). All participants were recruited through the crowdsourcing platform Prolific [Palan and Schitter, 2018]; we restricted our participant pool to those based in the United States who speak English as a first language.
Compensation
Participants were told that the study would take approximately 30 minutes, were paid at a base rate of $9/hr, and were informed that they would receive an optional bonus of up to $10 for answering questions correctly. We applied the bonus to all participants.
LLM Answers and Uncertainty Elicitation
Bhatt et al. originally used GPT-3.5 as their LLM. We first explored user performance when provided with confidence scores modulated over the original GPT-3.5 responses that the authors had collected; however, the authors had filtered LLM responses to ensure the LLM achieved high performance on biology, computer science, and foreign policy and poor performance on mathematics. As a result, we noticed that participants overwhelmingly adopted the LLM's answer (rational behaviour, given the model's high performance). To explore a more nuanced performance profile, we regenerated LLM answers using Mistral 7B Instruct via greedy decoding and then generated confidence scores on top of the LLM responses. For our random baseline, we sample a confidence score uniformly between 0 and 100% for each question.
#### G.2 Important considerations
There are many reasons to exercise caution in interpreting our results as definitive indications of the utility of displaying confidence to users in LLM-assistive settings. In particular: (i) users are presented with feedback after each trial, as in [Bhatt et al., 2023]; as such, they can determine (potentially rapidly) whether or not a model is reliable, even without confidence scores. In practical settings, however, users may not know whether or not the model was truly correct, and therefore confidence scores could have an even larger impact; (ii) MMLU questions can be challenging for non-experts; we see the biggest differences in performance for the no-LLM vs. any-LLM-assistance condition, and we may see a wider range of reliance behaviors in settings wherein people have more confidence in their own abilities; (iii) we present users with numeric confidence, but humans are not always able to reliably process confidence estimates nor appropriately calibrate uncertainty estimates themselves [Keren, 1991, Vodrahalli et al., 2022, Collins et al., 2023, Lichtenstein et al., 1977]. It may be that alternate modes of communicating confidence improve users' ability to appropriately leverage the confidence scores in their decision-making process. We see targeted exploration of each component, through interdisciplinary collaboration across AI, behavioral science, and human-computer interaction, as ripe for future work.
#### G.3 Extended Results
Task Accuracy and Reliance Sensibility
We depict average user task accuracy and reliance sensibility across variants in Figure 15. Following Bhatt et al., we compute reliance sensibility as the proportion of times the user appropriately sided with the model's prediction when the model was correct and did not respond with the model's prediction when the model was incorrect.
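Reliance sensibility as defined above can be computed directly from per-question interaction records; the dictionary keys used here are illustrative, not the study's actual data schema:

```python
def reliance_sensibility(records):
    """Proportion of questions where the user relied appropriately:
    sided with the model when it was correct, and overrode it when it
    was incorrect.

    Each record is a dict with illustrative keys:
      model_correct - whether the model's answer was right
      user_followed - whether the user gave the model's answer
    """
    appropriate = sum(
        1 for r in records
        if r["user_followed"] == r["model_correct"]
    )
    return appropriate / len(records)

# Hypothetical example: 3 of 4 decisions show appropriate reliance.
logs = [
    {"model_correct": True,  "user_followed": True},   # sensible
    {"model_correct": False, "user_followed": False},  # sensible
    {"model_correct": False, "user_followed": True},   # over-reliance
    {"model_correct": True,  "user_followed": True},   # sensible
]
```

Both failure modes lower the score: following a wrong model (over-reliance) and ignoring a correct one (under-reliance).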
<details>
<summary>x71.png Details</summary>

### Visual Description
## Violin Plot: Accuracy Distribution Across Five Methods
### Overview
The image displays a violin plot comparing the distribution of accuracy scores for five different methods or conditions. Each "violin" represents the probability density of the data at different values, with a wider section indicating a higher frequency of data points at that accuracy level. The plot is set against a white background with black axes.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale ranging from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis (Horizontal):**
* **Categories (from left to right):**
1. `No LLM` (Purple violin)
2. `LLM` (Red violin)
3. `LLM + Conf (Rand)` (Teal/Green violin)
4. `LLM + Conf (Query)` (Grey violin)
5. `LLM + Conf (CT)` (Blue violin)
* **Legend:** There is no separate legend box. The category labels are placed directly below each corresponding violin on the x-axis.
### Detailed Analysis
Each violin plot shows the distribution shape, with horizontal lines inside likely representing quartiles (the middle line being the median).
1. **No LLM (Purple, far left):**
* **Shape:** Symmetrical, widest around the median, tapering sharply towards both ends. It has the longest vertical span, indicating the highest variance.
* **Estimated Median:** ~0.65
* **Estimated Interquartile Range (IQR):** Roughly from 0.55 to 0.75.
* **Range:** Extends from approximately 0.1 to 1.0.
2. **LLM (Red, second from left):**
* **Shape:** Slightly asymmetrical, with a bulge above the median. Narrower overall than the "No LLM" plot.
* **Estimated Median:** ~0.70
* **Estimated IQR:** Roughly from 0.60 to 0.80.
* **Range:** Extends from approximately 0.3 to 0.95.
3. **LLM + Conf (Rand) (Teal/Green, center):**
* **Shape:** Relatively symmetrical and compact. The distribution is concentrated around the median.
* **Estimated Median:** ~0.75
* **Estimated IQR:** Roughly from 0.65 to 0.85.
* **Range:** Extends from approximately 0.4 to 0.95.
4. **LLM + Conf (Query) (Grey, second from right):**
* **Shape:** Symmetrical and the most compact (narrowest vertical spread) of all plots. The distribution is highly concentrated.
* **Estimated Median:** ~0.80 (The highest median of the five groups).
* **Estimated IQR:** Roughly from 0.70 to 0.90.
* **Range:** Extends from approximately 0.5 to 0.95.
5. **LLM + Conf (CT) (Blue, far right):**
* **Shape:** Symmetrical, with a shape similar to the "LLM + Conf (Rand)" plot but positioned slightly higher.
* **Estimated Median:** ~0.78
* **Estimated IQR:** Roughly from 0.68 to 0.88.
* **Range:** Extends from approximately 0.45 to 0.95.
### Key Observations
* **Trend in Central Tendency:** The median accuracy increases progressively from left to right: `No LLM` < `LLM` < `LLM + Conf (Rand)` < `LLM + Conf (CT)` < `LLM + Conf (Query)`.
* **Trend in Variance (Spread):** The spread (variance) of accuracy scores generally decreases from left to right. The `No LLM` condition shows the widest spread (most inconsistent results), while `LLM + Conf (Query)` shows the narrowest spread (most consistent results).
* **Overlap:** There is significant overlap between the distributions of all five methods, indicating that while trends exist, individual results from one method can fall within the range of another.
* **Outliers:** The violin plot format does not explicitly show individual outlier points. The tapered ends suggest the presence of some extreme values, particularly in the `No LLM` condition.
### Interpretation
This chart demonstrates the impact of using a Large Language Model (LLM) and various confidence-based mechanisms ("Conf") on the accuracy of a task.
1. **Baseline vs. LLM:** Simply adding an LLM (`LLM` vs. `No LLM`) improves the median accuracy and reduces the variability of results, suggesting the LLM provides a more reliable and better-performing baseline.
2. **Effect of Confidence Mechanisms:** All three methods incorporating a confidence mechanism (`Rand`, `Query`, `CT`) outperform the plain `LLM` in terms of median accuracy. This suggests that adding a layer of confidence estimation or selection refines the LLM's outputs.
3. **Best Performing Method:** `LLM + Conf (Query)` achieves the highest median accuracy and the most consistent performance (smallest spread). This implies that a confidence mechanism based on "Query" is the most effective among those tested for both boosting average performance and ensuring reliability.
4. **Trade-off Insight:** The progression from `No LLM` to `LLM + Conf (Query)` shows a clear pattern: as methods become more sophisticated (adding LLM, then adding confidence mechanisms), they tend to yield both higher average accuracy and more predictable, consistent outcomes. The reduction in variance is as important as the increase in median score for practical applications where reliability is critical.
**In summary, the data suggests that integrating an LLM with a query-based confidence mechanism provides the optimal balance of high accuracy and low performance variance for the evaluated task.**
</details>
<details>
<summary>x72.png Details</summary>

### Visual Description
## Violin Plot: Reliance Sensibility Across Four Model Configurations
### Overview
The image displays a violin plot comparing the distribution of a metric called "Reliance Sensibility" across four different model configurations. A violin plot combines a box plot with a kernel density plot, showing the data's probability density at different values, mirrored symmetrically.
### Components/Axes
* **Chart Type:** Violin Plot (mirrored density plot with embedded box plot elements).
* **Y-Axis:**
* **Label:** "Reliance Sensibility"
* **Scale:** Linear, ranging from 0.3 to 1.0.
* **Major Ticks:** 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.
* **X-Axis (Categories):** Four distinct model configurations, labeled from left to right:
1. **LLM** (Violin color: Red)
2. **LLM + Conf (Rand)** (Violin color: Teal/Dark Cyan)
3. **LLM + Conf (Query)** (Violin color: Gray)
4. **LLM + Conf (CT)** (Violin color: Blue)
* **Legend:** The categories are defined by their x-axis labels and corresponding violin colors. There is no separate legend box; the labels are placed directly beneath each violin.
* **Embedded Box Plot Elements:** Each violin contains three horizontal lines. The central, longest line likely represents the median. The two shorter lines above and below it likely represent the interquartile range (IQR: 25th and 75th percentiles).
### Detailed Analysis
The analysis is segmented by the four model configurations, processed from left to right.
**1. LLM (Red Violin, Leftmost)**
* **Shape & Trend:** The distribution is widest (highest density) in the upper-middle range, approximately between 0.7 and 0.8. It tapers significantly towards both the upper (1.0) and lower (0.4) bounds, with a long, thin tail extending down to about 0.4.
* **Central Tendency (Estimated):**
* Median (central line): ~0.75
* IQR (upper/lower lines): ~0.70 to ~0.80
* **Spread:** Shows a relatively wide spread, with a notable concentration of data between 0.65 and 0.85, but with a long lower tail.
**2. LLM + Conf (Rand) (Teal Violin, Second from Left)**
* **Shape & Trend:** Similar overall shape to the LLM violin but appears slightly more concentrated. The widest section is also around 0.7-0.8. The lower tail is less pronounced than the LLM's, ending around 0.5.
* **Central Tendency (Estimated):**
* Median: ~0.76 (Marginally higher than LLM)
* IQR: ~0.71 to ~0.81
* **Spread:** Slightly tighter than LLM, with most data between 0.65 and 0.85.
**3. LLM + Conf (Query) (Gray Violin, Third from Left)**
* **Shape & Trend:** This distribution is more symmetric and "plump" in the middle compared to the first two. Its widest point is centered around 0.75. The tails are shorter and more balanced, extending from roughly 0.55 to 0.95.
* **Central Tendency (Estimated):**
* Median: ~0.76 (Similar to Rand)
* IQR: ~0.72 to ~0.80 (Slightly tighter IQR than Rand)
* **Spread:** More concentrated around the median, with less extreme values at the tails.
**4. LLM + Conf (CT) (Blue Violin, Rightmost)**
* **Shape & Trend:** This violin is the most concentrated and has the highest central density. Its widest section is clearly above 0.75, peaking near 0.8. The distribution is compact, with short tails extending from about 0.6 to 0.95.
* **Central Tendency (Estimated):**
* Median: ~0.78 (Appears to be the highest of the four)
* IQR: ~0.74 to ~0.82 (The highest and tightest IQR)
* **Spread:** The narrowest spread of the four, indicating the most consistent performance in the "Reliance Sensibility" metric.
### Key Observations
1. **Central Cluster:** All four distributions are primarily clustered in the 0.7 to 0.8 range on the "Reliance Sensibility" scale.
2. **Progressive Tightening:** Moving from left to right (LLM -> Rand -> Query -> CT), the distributions generally become more compact (narrower spread) and their central tendency (median) shifts slightly upward.
3. **Highest Performer:** The **LLM + Conf (CT)** configuration exhibits the highest median Reliance Sensibility and the most consistent results (tightest distribution).
4. **Lowest Tail Risk:** The **LLM** baseline shows the longest lower tail, indicating a higher probability of very low Reliance Sensibility scores compared to the other methods.
5. **Similarity of Rand and Query:** The "LLM + Conf (Rand)" and "LLM + Conf (Query)" distributions are quite similar in median and spread, though "Query" appears slightly more symmetric.
### Interpretation
This chart demonstrates the impact of different "Confidence" (Conf) mechanisms added to a base Large Language Model (LLM) on a metric termed "Reliance Sensibility." Assuming "Reliance Sensibility" is a desirable trait (higher is better), the data suggests:
* **Adding any confidence mechanism improves consistency** over the base LLM, as seen by the reduction in the lower tail and the tightening of the distributions for Rand, Query, and CT.
* **The type of confidence mechanism matters.** The "CT" variant (the specific meaning of "CT" is not defined in the image) yields the best overall performance, pushing the median score higher and making the model's output most reliably fall within a high-scoring band.
* **The "Rand" and "Query" mechanisms offer moderate, similar improvements** over the baseline, primarily by reducing the risk of very poor performance (low scores) without dramatically shifting the central tendency.
* **The base LLM, while capable of high scores, is also the most volatile,** with a significant chance of producing outputs with low Reliance Sensibility.
In essence, the plot provides visual evidence that integrating confidence estimationāparticularly the "CT" methodāinto an LLM system leads to more reliable and consistently higher "Reliance Sensibility" outcomes.
</details>
Figure 15: (Left) User accuracy on 60 MMLU questions per variant ($N=20$ users per variant); violin plots show quartiles as dashed lines. (Right) Average reliance sensibility (proportion of instances where the user sided with the model when the model was correct, and overrode the model's prediction when the model was incorrect); higher indicates better reliance calibration.
We depict per-topic accuracy, with the LLM's average performance, in Figure 16.
<details>
<summary>x73.png Details</summary>

### Visual Description
## Violin Plot: High School Biology Accuracy by LLM Configuration
### Overview
The image is a violin plot comparing the distribution of accuracy scores for five different configurations of a Large Language Model (LLM) system on a "High School Biology" task. A violin plot combines a box plot (showing median and interquartile range) with a kernel density plot (showing the probability density of the data at different values). The plot visualizes how the central tendency and spread of accuracy vary across the configurations.
### Components/Axes
* **Chart Title:** "High School Biology" (centered at the top).
* **Y-Axis:**
* **Label:** "Accuracy" (rotated vertically on the left side).
* **Scale:** Linear scale from 0.0 to 1.0.
* **Tick Marks:** Major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Categories (from left to right):**
1. "No LLM"
2. "LLM"
3. "LLM + Conf (Rand)"
4. "LLM + Conf (Query)"
5. "LLM + Conf (CT)"
* **Reference Line:** A horizontal red dashed line is drawn across the plot at the y-axis value of **0.7**.
* **Legend:** There is no separate legend box. The five categories on the x-axis serve as the legend, each associated with a distinct colored violin plot.
* "No LLM": Blue
* "LLM": Orange
* "LLM + Conf (Rand)": Green
* "LLM + Conf (Query)": Red
* "LLM + Conf (CT)": Purple
### Detailed Analysis
Each violin plot shows the distribution of accuracy scores for its category. The width of the violin at a given y-value represents the density (frequency) of data points at that accuracy level. Inside each violin, a thin black line represents the interquartile range (IQR), and a white dot or small horizontal line represents the median.
1. **No LLM (Blue):**
* **Trend/Shape:** The distribution is heavily skewed towards lower accuracy. It is widest (has the highest density) between approximately 0.2 and 0.5, with a long, thin tail extending up to ~0.9.
* **Key Values (Approximate):** Median appears to be around **0.45**. The bulk of the data (IQR) lies between ~0.3 and ~0.65. The range spans from near 0.0 to ~0.9.
2. **LLM (Orange):**
* **Trend/Shape:** The distribution is more symmetric and centered higher than "No LLM". It is widest around 0.6-0.8, indicating most scores cluster in this range.
* **Key Values (Approximate):** Median is approximately **0.7**. The IQR is roughly between 0.6 and 0.8. The range is from ~0.2 to ~0.95.
3. **LLM + Conf (Rand) (Green):**
* **Trend/Shape:** This distribution is tall and relatively narrow, indicating high variance but with a central peak. It is widest around 0.6-0.8, similar to "LLM", but with a more pronounced peak and longer tails.
* **Key Values (Approximate):** Median is near **0.7**. The IQR spans from ~0.55 to ~0.8. The range is very wide, from ~0.1 to nearly 1.0.
4. **LLM + Conf (Query) (Red):**
* **Trend/Shape:** The distribution is wide in the middle (0.6-0.8) but has a very long, thin tail extending down to low accuracy scores, suggesting a subset of poor performances.
* **Key Values (Approximate):** Median is around **0.7**. The IQR is between ~0.6 and ~0.8. The range is extensive, from ~0.1 to ~0.95.
5. **LLM + Conf (CT) (Purple):**
* **Trend/Shape:** This is the most compact and highest-performing distribution. It is widest between 0.7 and 0.85, with a shorter tail extending downward compared to the other "Conf" methods.
* **Key Values (Approximate):** Median is the highest, approximately **0.78**. The IQR is tight, between ~0.7 and ~0.85. The range is from ~0.4 to ~0.9.
### Key Observations
* **Performance Hierarchy:** The "LLM + Conf (CT)" configuration shows the highest median accuracy and the most consistent performance (narrowest IQR). "No LLM" has the lowest median and a distribution skewed toward failure.
* **Effect of LLM:** Simply adding an LLM ("LLM" category) dramatically shifts the entire distribution upward compared to "No LLM".
* **Effect of Confidence Calibration ("Conf"):** All three "Conf" methods maintain a median accuracy around or above the 0.7 reference line. However, they introduce greater variance (wider ranges) compared to the base "LLM" configuration, particularly in the lower tails.
* **Variance Comparison:** "LLM + Conf (Rand)" and "LLM + Conf (Query)" exhibit the largest spreads, with minimum scores near 0.1. "LLM + Conf (CT)" has a higher floor (~0.4).
* **Reference Line:** The red dashed line at 0.7 serves as a visual benchmark. The medians of all LLM-based configurations are at or above this line, while the "No LLM" median is well below it.
### Interpretation
This chart demonstrates the impact of different AI assistance strategies on high school biology task accuracy.
* **Baseline vs. AI Assistance:** The stark contrast between "No LLM" and all other categories provides strong evidence that using an LLM significantly improves performance on this task. The "No LLM" distribution suggests that without AI, performance is highly variable and often poor.
* **Calibration Trade-offs:** Adding confidence calibration ("Conf") to the LLM does not consistently improve the *median* score over the base "LLM" but appears to alter the *distribution* of outcomes. The "Rand" and "Query" methods seem to introduce instability, leading to both high and very low scores. This could indicate that these calibration methods are sometimes helpful but other times detrimental, perhaps due to overconfidence or misguidance.
* **Superior Method:** The "LLM + Conf (CT)" method appears most effective. It not only achieves the highest median accuracy but also reduces the risk of very low scores (higher minimum), suggesting it is a more robust and reliable calibration technique for this domain.
* **The 0.7 Benchmark:** The red line likely represents a target proficiency threshold (e.g., a passing grade or human expert baseline). The data shows that while an unaided LLM often meets this threshold, adding the right calibration ("CT") makes meeting it more consistent.
In summary, the data suggests that for high school biology tasks, employing an LLM is highly beneficial, and using a specific confidence calibration method ("CT") can further optimize both the average performance and the reliability of the system.
</details>
<details>
<summary>x74.png Details</summary>

### Visual Description
## Violin Plot: High School CS Accuracy Comparison
### Overview
The image is a statistical visualization (violin plot) titled "High School CS" that compares the distribution of "Accuracy" scores across five different experimental conditions related to the use of Large Language Models (LLMs). The plot displays the probability density of the data at different values, with the width of each "violin" representing the frequency of data points at that accuracy level.
### Components/Axes
* **Chart Title:** "High School CS" (centered at the top).
* **Y-Axis:**
* **Label:** "Accuracy" (rotated vertically on the left side).
* **Scale:** Linear scale from 0.2 to 1.0, with major tick marks at 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Represents five categorical conditions. The labels are positioned below each corresponding violin plot.
1. **No LLM** (far left, blue violin)
2. **LLM** (orange violin)
3. **LLM + Conf (Rand)** (green violin)
4. **LLM + Conf (Query)** (red violin)
5. **LLM + Conf (CT)** (far right, purple violin)
* **Reference Line:** A horizontal red dashed line is drawn across the entire chart at an accuracy value of approximately **0.7**. This likely serves as a benchmark or baseline for comparison.
### Detailed Analysis
Each violin plot shows the distribution of accuracy scores for its condition. The internal horizontal lines within each violin typically represent quartiles (e.g., median, 25th, 75th percentiles).
1. **No LLM (Blue):**
* **Trend/Shape:** The distribution is heavily skewed towards lower accuracy. It is widest (most dense) between approximately 0.2 and 0.5, with a long, thin tail extending up to ~0.9.
* **Key Values:** The median appears to be around **0.45**. The bulk of the data (interquartile range) lies between ~0.3 and ~0.6.
2. **LLM (Orange):**
* **Trend/Shape:** The distribution is more symmetric and centered higher than "No LLM." It is widest around 0.6-0.7.
* **Key Values:** The median is approximately **0.65**. The main density is concentrated between ~0.5 and ~0.8.
3. **LLM + Conf (Rand) (Green):**
* **Trend/Shape:** The distribution is similar in shape to the "LLM" condition but shifted slightly upward. It is widest between 0.7 and 0.8.
* **Key Values:** The median is near **0.72**, sitting just above the red reference line. The dense region spans from ~0.6 to ~0.85.
4. **LLM + Conf (Query) (Red):**
* **Trend/Shape:** This distribution shows a clear upward shift. It is widest in the high-accuracy region between 0.8 and 0.9, indicating a high concentration of top scores.
* **Key Values:** The median is the highest among all groups, at approximately **0.82**. The interquartile range is roughly 0.7 to 0.9.
5. **LLM + Conf (CT) (Purple):**
* **Trend/Shape:** Very similar in profile to the "Query" condition, with a high-density peak between 0.8 and 0.9. It may be slightly narrower, suggesting marginally less variance.
* **Key Values:** The median is also very high, around **0.81**. The distribution is concentrated between ~0.7 and ~0.9.
### Key Observations
* **Clear Performance Hierarchy:** There is a visible, stepwise improvement in accuracy distributions from left to right: `No LLM` < `LLM` < `LLM + Conf (Rand)` < `LLM + Conf (Query)` ≈ `LLM + Conf (CT)`.
* **Impact of Confidence Mechanisms:** All conditions using an LLM with a confidence mechanism ("Conf") outperform the plain "LLM" condition. The "Query" and "CT" methods show the most significant gains.
* **Benchmark Comparison:** The red dashed line at ~0.7 accuracy is exceeded by the median of the top three conditions (`Rand`, `Query`, `CT`). The `No LLM` and plain `LLM` conditions have medians below this line.
* **Variability:** The "No LLM" condition shows the greatest spread (from ~0.1 to ~0.9), indicating highly inconsistent performance. The top-performing conditions (`Query`, `CT`) have a tighter spread in the high-accuracy range, indicating more reliable high performance.
### Interpretation
This chart presents compelling evidence for the efficacy of using LLMs, particularly when augmented with confidence estimation strategies, in the context of high school computer science tasks.
* **The Core Finding:** Relying on no LLM (`No LLM`) leads to low and highly variable accuracy. Introducing a base LLM (`LLM`) provides a substantial and consistent boost.
* **The Value of Confidence:** The key insight is that not all LLM integrations are equal. Adding a confidence mechanismāwhether random (`Rand`), query-based (`Query`), or using a method abbreviated as `CT`āfurther improves both the median accuracy and the consistency of high performance. The `Query` and `CT` methods appear to be the most sophisticated and effective.
* **Practical Implication:** For educational technology or assessment tools in high school CS, employing an LLM with an advanced confidence-filtering system (like `Query` or `CT`) is likely to yield the most accurate and reliable results, consistently surpassing the ~70% accuracy benchmark shown. The data suggests these systems can help students achieve higher accuracy outcomes more reliably.
</details>
<details>
<summary>x75.png Details</summary>

### Visual Description
## Violin Plot: US Foreign Policy Accuracy Comparison
### Overview
The image is a violin plot comparing the accuracy distributions of five different models or conditions related to "US Foreign Policy." The chart displays the probability density of accuracy scores for each category, with a horizontal reference line indicating a benchmark accuracy of 0.8.
### Components/Axes
* **Chart Title:** "US Foreign Policy" (centered at the top).
* **Y-Axis:**
* **Label:** "Accuracy" (rotated vertically on the left side).
* **Scale:** Linear scale ranging from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Displays five categorical groups. From left to right:
1. "No LLM"
2. "LLM"
3. "LLM + Conf (Rand)"
4. "LLM + Conf (Query)"
5. "LLM + Conf (CT)"
* **Reference Line:** A horizontal red dashed line is drawn across the entire plot at the y-axis value of 0.8.
* **Legend/Color Mapping:** Each category is represented by a distinct colored violin plot. The mapping is positional (left to right) and color-coded:
* "No LLM": Blue
* "LLM": Orange
* "LLM + Conf (Rand)": Green
* "LLM + Conf (Query)": Red
* "LLM + Conf (CT)": Purple
### Detailed Analysis
Each violin plot shows the distribution of accuracy scores. The wider the violin at a given y-value, the more data points are concentrated around that accuracy. Internal horizontal lines likely represent quartiles (median and interquartile range).
1. **No LLM (Blue, far left):**
* **Trend/Shape:** The distribution is broad and relatively symmetric, peaking around the median. It has a wide base, indicating a significant number of lower accuracy scores.
* **Key Values (Approximate):**
* Median (central line): ~0.55
* Interquartile Range (IQR): Spans roughly from 0.40 to 0.70.
* Full Range: Extends from near 0.0 to just below 1.0.
2. **LLM (Orange, second from left):**
* **Trend/Shape:** This distribution is highly skewed. It has a very long, thin tail extending down to near 0.0, but the bulk of the data (the widest part) is concentrated at a higher accuracy level.
* **Key Values (Approximate):**
* Median: ~0.75 (visibly higher than "No LLM").
* IQR: Concentrated between ~0.65 and ~0.85.
* Full Range: From ~0.0 to ~0.95.
3. **LLM + Conf (Rand) (Green, center):**
* **Trend/Shape:** This distribution is more compact and centered higher than the first two. It is somewhat bimodal or has a flattened top, with the widest section just below the 0.8 reference line.
* **Key Values (Approximate):**
* Median: ~0.78.
* IQR: Spans from ~0.70 to ~0.85.
* Full Range: From ~0.30 to ~0.95.
4. **LLM + Conf (Query) (Red, second from right):**
* **Trend/Shape:** This is the most concentrated and highest-performing distribution. It is narrow and tall, with the vast majority of its mass above the 0.8 reference line.
* **Key Values (Approximate):**
* Median: ~0.85 (the highest median of all groups).
* IQR: Very tight, roughly from 0.80 to 0.90.
* Full Range: From ~0.60 to ~0.98.
5. **LLM + Conf (CT) (Purple, far right):**
* **Trend/Shape:** Similar in shape to the "LLM + Conf (Rand)" plot but appears slightly more concentrated around its median. The bulk of the data is also centered near the 0.8 line.
* **Key Values (Approximate):**
* Median: ~0.80 (right on the reference line).
* IQR: From ~0.72 to ~0.86.
* Full Range: From ~0.25 to ~0.95.
### Key Observations
* **Performance Hierarchy:** There is a clear progression in median accuracy from left to right: "No LLM" < "LLM" < "LLM + Conf (Rand)" ≈ "LLM + Conf (CT)" < "LLM + Conf (Query)".
* **Impact of Confidence Methods:** All three "LLM + Conf" methods show higher median accuracy and less downward spread (fewer very low scores) compared to the base "LLM" model.
* **Benchmark Comparison:** The "LLM + Conf (Query)" model is the only one where the median and the majority of the distribution lie clearly above the 0.8 accuracy benchmark. The "LLM + Conf (Rand)" and "LLM + Conf (CT)" models have medians very close to this line.
* **Variability:** The "LLM" model shows the greatest variability, with an extremely long tail towards low accuracy. The "LLM + Conf (Query)" model shows the least variability, indicating more consistent performance.
### Interpretation
The data suggests that augmenting a Large Language Model (LLM) with some form of confidence scoring ("Conf") significantly improves its accuracy and reliability on the task of US Foreign Policy analysis. The base "LLM" alone, while better than having "No LLM," is highly inconsistent, as evidenced by its long tail of poor performance.
Among the confidence methods, the "Query" variant appears most effective, yielding the highest and most consistent accuracy. The "Rand" (likely random) and "CT" (method unspecified) confidence methods also provide substantial benefits over the base LLM, bringing median performance to the benchmark level of 0.8. This implies that the mechanism for estimating or applying confidence is crucial, with structured querying being superior to random or other (CT) approaches in this context. The chart effectively argues for the value of confidence-aware mechanisms in deploying LLMs for sensitive or knowledge-intensive domains like foreign policy.
</details>
<details>
<summary>x76.png Details</summary>

### Visual Description
## Violin Plot: Elementary Math Accuracy by LLM Configuration
### Overview
The image is a violin plot titled "Elementary Math," comparing the distribution of accuracy scores across five different configurations involving Large Language Models (LLMs). The plot visualizes the probability density of the data at different values, with internal horizontal lines indicating quartiles. A horizontal red dashed line serves as a reference baseline.
### Components/Axes
* **Title:** "Elementary Math" (centered at the top).
* **Y-Axis:**
* **Label:** "Accuracy" (rotated vertically on the left side).
* **Scale:** Linear scale from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Categories (from left to right):**
1. "No LLM"
2. "LLM"
3. "LLM + Conf (Rand)"
4. "LLM + Conf (Query)"
5. "LLM + Conf (CT)"
* **Reference Line:** A horizontal red dashed line at y = 0.3, spanning the full width of the plot.
* **Legend:** Implicit in the x-axis category labels. Each category is represented by a uniquely colored violin plot:
* "No LLM": Blue
* "LLM": Orange
* "LLM + Conf (Rand)": Green
* "LLM + Conf (Query)": Red
* "LLM + Conf (CT)": Purple
### Detailed Analysis
The analysis proceeds by isolating each violin plot (category) from left to right.
1. **"No LLM" (Blue, far left):**
* **Trend/Shape:** The distribution is broad and somewhat bimodal, with a wider section in the upper half (0.6-0.9) and a narrower tail extending down to ~0.1.
* **Key Values (Approximate):**
* Median (middle horizontal line): ~0.60
* Interquartile Range (IQR, distance between upper and lower quartile lines): ~0.45 to ~0.80
* Full Range: ~0.10 to ~1.00
2. **"LLM" (Orange, second from left):**
* **Trend/Shape:** Similar broad shape to "No LLM," but the central mass appears slightly higher and the lower tail is less pronounced.
* **Key Values (Approximate):**
* Median: ~0.65
* IQR: ~0.50 to ~0.85
* Full Range: ~0.15 to ~1.00
3. **"LLM + Conf (Rand)" (Green, center):**
* **Trend/Shape:** This distribution is notably different. It is more concentrated in the middle-lower range, with a pronounced bulge around 0.4-0.6 and a long, thin tail extending down to near 0.0.
* **Key Values (Approximate):**
* Median: ~0.50 (visibly lower than the first two)
* IQR: ~0.35 to ~0.70
* Full Range: ~0.00 to ~1.00 (widest range, with the lowest minimum value)
4. **"LLM + Conf (Query)" (Red, second from right):**
* **Trend/Shape:** The distribution is more compact and shifted upward. The bulk of the data is concentrated between 0.6 and 0.9, with a shorter lower tail.
* **Key Values (Approximate):**
* Median: ~0.75
* IQR: ~0.60 to ~0.85
* Full Range: ~0.30 to ~1.00
5. **"LLM + Conf (CT)" (Purple, far right):**
* **Trend/Shape:** This is the most compact and highest-performing distribution. It has a tight concentration in the upper range (0.7-0.95) and the shortest lower tail.
* **Key Values (Approximate):**
* Median: ~0.80
* IQR: ~0.70 to ~0.90
* Full Range: ~0.40 to ~1.00 (highest minimum value)
### Key Observations
1. **Performance Hierarchy:** There is a clear visual hierarchy in median accuracy: `LLM + Conf (CT)` > `LLM + Conf (Query)` > `LLM` ≈ `No LLM` > `LLM + Conf (Rand)`.
2. **Variability:** The "LLM + Conf (Rand)" configuration shows the highest variability (widest range, long lower tail), while "LLM + Conf (CT)" shows the lowest variability (most compact shape).
3. **Baseline Comparison:** All distributions, including their lower tails, are predominantly above the red dashed reference line at 0.3, suggesting this line may represent a baseline like random chance or a minimal acceptable threshold.
4. **Impact of Confidence Methods:** The "Conf" (confidence calibration) methods have divergent effects. The "(Rand)" variant appears detrimental, lowering median accuracy and increasing variance. The "(Query)" and "(CT)" variants are beneficial, increasing median accuracy and reducing variance compared to the base "LLM" and "No LLM" conditions.
### Interpretation
This chart demonstrates the impact of different LLM augmentation strategies on performance consistency and accuracy in an elementary math context.
* **Core Finding:** Simply using an LLM ("LLM") provides a marginal accuracy boost over no LLM ("No LLM"), but with similar high variability. The critical factor is *how* the LLM is configured.
* **The Role of Confidence Calibration:** The data suggests that naive or random confidence calibration ("LLM + Conf (Rand)") is counterproductive, harming both average performance and reliability. In contrast, structured confidence calibration methods ("Query" and especially "CT") significantly improve outcomes.
* **"CT" as the Optimal Strategy:** The "LLM + Conf (CT)" configuration is superior, yielding the highest typical accuracy and the most predictable performance (lowest spread). This implies that the "CT" method effectively aligns the model's confidence with its correctness, reducing both errors and uncertainty.
* **Practical Implication:** For deploying LLMs in educational or scoring systems for elementary math, implementing a robust confidence calibration technique like "CT" is crucial for achieving high and reliable accuracy, far more so than just using a base LLM. The red line at 0.3 likely signifies that all tested methods perform meaningfully above a trivial baseline.
</details>
Figure 16: User accuracies per topic for the Mistral variants. Red line indicates the model's average accuracy.
GPT-3.5 Confidence Generalization
As noted, we ran variants using the same GPT-3.5 generations as [Bhatt et al., 2023]. We show aggregate and per-topic accuracy in fig. 17, as well as reliance sensibility in fig. 18.
<details>
<summary>x77.png Details</summary>

### Visual Description
## Violin Plot: High School Biology Accuracy Comparison
### Overview
The image displays a violin plot comparing the accuracy distributions of five different models or conditions on a "High School Biology" task. The chart visualizes the probability density of accuracy scores for each group, showing both the range and concentration of results.
### Components/Axes
- **Title**: "High School Biology" (centered at the top).
- **Y-axis**: Labeled "Accuracy". The scale runs from 0.2 to 1.0, with major tick marks at 0.2, 0.4, 0.6, 0.8, and 1.0.
- **X-axis**: Contains five categorical labels, each corresponding to a colored violin plot:
1. **No LLM** (Blue)
2. **LLM** (Orange)
3. **LLM + Conf (Rand)** (Green)
4. **LLM + Conf (Query)** (Red)
5. **LLM + Conf (CT)** (Purple)
- **Reference Line**: A horizontal red dashed line is positioned at an accuracy value of approximately 0.9.
- **Legend**: The categories are labeled directly on the x-axis; there is no separate legend box. The color of each violin corresponds to its x-axis label.
### Detailed Analysis
Each violin plot shows the distribution of accuracy scores for its group. The width of the violin at any given y-value represents the density of data points at that accuracy level. Horizontal lines within each violin likely represent quartiles (median, 25th, and 75th percentiles).
1. **No LLM (Blue, far left)**:
- **Trend/Shape**: The distribution is widest at the bottom (low accuracy) and tapers sharply towards the top. It has the lowest median and the largest spread in the lower accuracy range.
- **Key Points**: The bulk of the data is concentrated between ~0.2 and 0.6 accuracy. The median appears to be around 0.45. The distribution extends from near 0.0 to just above 0.8.
2. **LLM (Orange, second from left)**:
- **Trend/Shape**: Shows a bimodal or wide-shouldered distribution. It is dense in two regions: one around 0.6-0.7 and another, larger concentration around 0.8-0.95.
- **Key Points**: The median is significantly higher than the "No LLM" group, sitting at approximately 0.8. The distribution is tighter overall, ranging from about 0.5 to 1.0.
3. **LLM + Conf (Rand) (Green, center)**:
- **Trend/Shape**: The distribution is fairly symmetric and pear-shaped, with the widest point (highest density) around the median.
- **Key Points**: The median is slightly above 0.8. The data is concentrated between ~0.6 and 0.95, with a long, thin tail extending down to about 0.3.
4. **LLM + Conf (Query) (Red, second from right)**:
- **Trend/Shape**: Similar in shape to the "Randi" plot but shifted slightly upward. It is dense around the median and tapers smoothly.
- **Key Points**: The median is close to 0.85. The main body of data lies between ~0.7 and 0.95, with a tail reaching down to ~0.4.
5. **LLM + Conf (CT) (Purple, far right)**:
- **Trend/Shape**: This plot has the highest and most concentrated distribution. It is widest just below the 0.9 reference line.
- **Key Points**: The median is the highest of all groups, at approximately 0.88-0.89, very close to the red dashed line. The data is tightly clustered between ~0.75 and 0.95, with a tail extending down to ~0.5.
### Key Observations
- **Performance Hierarchy**: There is a clear progression in median accuracy from left to right: No LLM < LLM < LLM+Conf (Rand) < LLM+Conf (Query) < LLM+Conf (CT).
- **Variability**: The "No LLM" group shows the highest variability (widest range). The "LLM + Conf (CT)" group shows the lowest variability, with scores tightly packed near the top.
- **Benchmark**: The red dashed line at 0.9 appears to be a target or benchmark score. Only the "LLM + Conf (CT)" group has a median that approaches this line, and a significant portion of its distribution lies above it.
- **Impact of Confidence Mechanisms**: All three "LLM + Conf" variants outperform the base "LLM" model, suggesting that adding a confidence mechanism improves both median accuracy and consistency (reduces low-end outliers).
### Interpretation
This chart demonstrates the effectiveness of using Large Language Models (LLMs) and, more specifically, LLMs augmented with confidence estimation techniques for a high school biology task.
- **Core Finding**: The data suggests that raw LLM performance ("LLM") is a substantial improvement over no LLM ("No LLM"). However, integrating confidence mechanisms ("+ Conf") provides a further, meaningful boost in both average accuracy and reliability (reduced variance).
- **Mechanism Comparison**: Among the confidence methods, "CT" appears most effective, followed by "Query," then "Rand." This implies the specific method of confidence estimation is a critical design choice.
- **Practical Implication**: The red line at 0.9 likely represents a "mastery" or "production-ready" threshold. The "LLM + Conf (CT)" model is the only one where the central tendency (median) nears this threshold, indicating it may be the only candidate suitable for deployment where high accuracy is critical. The other models, while improved, still have medians below this target and greater risk of low-accuracy outputs.
- **Underlying Pattern**: The progression from a wide, low distribution to a narrow, high one illustrates a common pattern in model development: initial gains come from adopting a better base model (LLM), while subsequent gains come from refining its outputs and managing its uncertainty (Confidence mechanisms).
</details>
<details>
<summary>x78.png Details</summary>

### Visual Description
## Violin Plot: High School CS Accuracy by Model Configuration
### Overview
The image is a violin plot comparing the distribution of accuracy scores for five different model configurations on a "High School CS" (Computer Science) task. The plot visualizes the probability density of the data at different values, showing both the median and the spread of results for each configuration.
### Components/Axes
* **Chart Title:** "High School CS" (centered at the top).
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Contains five categorical labels, each corresponding to a model configuration. From left to right:
1. `No LLM`
2. `LLM`
3. `LLM + Conf (Rand)`
4. `LLM + Conf (Query)`
5. `LLM + Conf (CT)`
* **Legend/Color Mapping:** The color of each violin corresponds directly to its x-axis label. There is no separate legend box; the labels are positioned directly beneath their respective plots.
* `No LLM`: Blue
* `LLM`: Orange
* `LLM + Conf (Rand)`: Green
* `LLM + Conf (Query)`: Red
* `LLM + Conf (CT)`: Purple
* **Reference Line:** A horizontal red dashed line is drawn across the plot at the `Accuracy = 0.9` mark.
### Detailed Analysis
The analysis proceeds from left to right, isolating each component.
1. **`No LLM` (Blue, far left):**
* **Trend/Shape:** The distribution is very broad and skewed towards lower accuracy. It has a wide base near 0.0 and tapers to a point near 1.0, indicating a high variance with many low scores.
* **Key Values:** The median (indicated by the central horizontal line within the violin) is approximately **0.5**. The bulk of the data (the widest part of the violin) is concentrated between ~0.2 and ~0.6.
2. **`LLM` (Orange, second from left):**
* **Trend/Shape:** The distribution is more concentrated and shifted upward compared to `No LLM`. It is widest in the upper half, indicating a cluster of higher scores.
* **Key Values:** The median is approximately **0.8**. The distribution spans roughly from 0.6 to 0.95, with the highest density around 0.8-0.9.
3. **`LLM + Conf (Rand)` (Green, center):**
* **Trend/Shape:** This distribution has a distinctive shape: a very narrow, long tail extending down to ~0.1, and a wide, dense bulb in the upper region. This suggests most runs perform very well, but a few have catastrophically low accuracy.
* **Key Values:** The median is approximately **0.85**. The main cluster of data is between ~0.75 and ~0.95. The long downward tail is a notable outlier pattern.
4. **`LLM + Conf (Query)` (Red, second from right):**
* **Trend/Shape:** Similar in overall shape to the green (`Rand`) plot but appears slightly more compact in its upper region and has a less extreme downward tail.
* **Key Values:** The median is approximately **0.87**. The dense region is between ~0.8 and ~0.95. The lower tail extends to about 0.5.
5. **`LLM + Conf (CT)` (Purple, far right):**
* **Trend/Shape:** This distribution is the most compact and highest-shifted of all. It is widest near the top, indicating high consistency in top-tier performance.
* **Key Values:** The median is the highest, approximately **0.88**. The data is tightly clustered between ~0.8 and ~0.95, with a very short lower tail ending around 0.7.
### Key Observations
* **Clear Performance Hierarchy:** There is a visible, step-wise improvement in median accuracy from left to right: `No LLM` < `LLM` < `LLM + Conf (Rand)` < `LLM + Conf (Query)` ≈ `LLM + Conf (CT)`.
* **Variance Reduction:** As models become more sophisticated (moving right), the distributions generally become narrower (except for the specific outlier tail in `Rand`), indicating more consistent and reliable performance.
* **The 0.9 Benchmark:** The red dashed line at 0.9 serves as a high-performance threshold. Only the medians of the three rightmost configurations (`Rand`, `Query`, `CT`) approach this line, with their upper distributions clearly exceeding it.
* **Outlier Pattern:** The `LLM + Conf (Rand)` configuration shows a unique failure mode, with a long tail of very low accuracy scores not seen in the `Query` or `CT` variants.
### Interpretation
This chart demonstrates the significant impact of using a Large Language Model (LLM) and, more importantly, applying confidence calibration techniques for a High School Computer Science task.
* **Baseline vs. LLM:** The `No LLM` model performs poorly and unreliably. Simply adding an `LLM` provides a massive boost in both average accuracy and consistency.
* **Value of Confidence Calibration:** All three `+ Conf` methods outperform the base `LLM`, suggesting that calibrating the model's confidence in its answers is crucial for maximizing accuracy on this task.
* **Method Comparison:** While all calibration methods help, `Query` and `CT` appear superior to `Rand`. They achieve similar high median accuracy but with more stable distributions (lacking the extreme low-end failures of `Rand`). This implies that the strategy for gathering confidence information (random vs. query-based vs. the "CT" method) meaningfully affects robustness.
* **Practical Implication:** For reliable deployment on this type of problem, using an LLM with a sophisticated confidence calibration method like `Query` or `CT` is recommended. The `Rand` method, while effective on average, carries a higher risk of occasional severe failure. The data suggests that near-human-level accuracy (approaching 0.9) is achievable with the right model configuration.
</details>
<details>
<summary>x79.png Details</summary>

### Visual Description
## Violin Plot Chart: US Foreign Policy Model Accuracy
### Overview
The image displays a violin plot comparing the accuracy distributions of five different models or configurations related to "US Foreign Policy." The chart visualizes the probability density of accuracy scores for each category, showing their median, interquartile range, and overall distribution shape.
### Components/Axes
* **Chart Title:** "US Foreign Policy" (centered at the top).
* **Y-Axis:** Labeled "Accuracy." The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Contains five categorical labels for the models being compared:
1. No LLM
2. LLM
3. LLM + Conf (Rand)
4. LLM + Conf (Query)
5. LLM + Conf (CT)
* **Reference Line:** A horizontal red dashed line is drawn across the chart at an accuracy value of approximately 0.85.
* **Legend:** There is no separate legend box. The categories are identified by their labels on the x-axis and are distinguished by color (blue, orange, green, red, purple).
### Detailed Analysis
The chart presents five violin plots, each showing the distribution of accuracy scores. The internal horizontal lines within each violin represent the quartiles (median and interquartile range).
1. **No LLM (Blue):**
* **Trend/Shape:** This distribution is very wide at the bottom (low accuracy) and tapers sharply towards the top. It has the largest spread and the lowest median.
* **Key Points:** The median accuracy is approximately 0.5. The bulk of the data (interquartile range) lies between roughly 0.35 and 0.65. The distribution extends down to near 0.0 and up to about 0.95.
2. **LLM (Orange):**
* **Trend/Shape:** This distribution is more symmetric and centered higher than "No LLM." It has a classic violin shape, wider in the middle.
* **Key Points:** The median accuracy is approximately 0.7. The interquartile range spans from about 0.6 to 0.8. The distribution ranges from ~0.45 to ~0.95.
3. **LLM + Conf (Rand) (Green):**
* **Trend/Shape:** This distribution is skewed, with a high concentration of scores near the top but a long, thin tail extending downwards.
* **Key Points:** The median accuracy is high, approximately 0.85, aligning with the red dashed reference line. The interquartile range is relatively narrow, between ~0.75 and ~0.9. However, the long tail indicates some runs resulted in very low accuracy, down to ~0.1.
4. **LLM + Conf (Query) (Red):**
* **Trend/Shape:** Similar in shape to the "Randi" configuration but slightly less skewed. It has a high median and a pronounced lower tail.
* **Key Points:** The median accuracy is slightly below the red line, approximately 0.82. The interquartile range is between ~0.7 and ~0.9. The lower tail extends to about 0.2.
5. **LLM + Conf (CT) (Purple):**
* **Trend/Shape:** This is the most compact and symmetric distribution of the five. It is concentrated around a high median with minimal spread.
* **Key Points:** The median accuracy is approximately 0.85, matching the red reference line. The interquartile range is very tight, between ~0.8 and ~0.9. The overall range is the smallest, from ~0.7 to ~0.95.
### Key Observations
* **Performance Hierarchy:** There is a clear progression in median accuracy from left to right: "No LLM" < "LLM" < "LLM + Conf (Query)" < "LLM + Conf (Rand)" ≈ "LLM + Conf (CT)".
* **Stability vs. Peak Performance:** While "LLM + Conf (Rand)" and "LLM + Conf (CT)" share a similar high median (~0.85), the "CT" variant is far more stable (compact distribution), whereas "Rand" has high variance with a risk of very poor performance (long lower tail).
* **Benchmark Line:** The red dashed line at ~0.85 appears to represent a target or benchmark accuracy. Only the three "LLM + Conf" variants have medians at or near this line.
* **Impact of LLM:** The addition of an LLM ("LLM" vs. "No LLM") significantly raises the median accuracy and reduces the extreme low-end performance.
### Interpretation
This chart demonstrates the effectiveness of different approaches for a US Foreign Policy-related task, measured by accuracy. The data suggests that:
1. Using a Large Language Model (LLM) alone provides a substantial improvement over not using one.
2. Augmenting the LLM with a confidence-based method ("Conf") further improves median accuracy to a benchmark level (~0.85).
3. The choice of confidence method critically impacts reliability. The "CT" method yields the most consistent high performance, making it the most robust choice. The "Rand" and "Query" methods can achieve high accuracy but are prone to occasional catastrophic failures (very low scores), as indicated by their long lower tails.
4. The "No LLM" baseline shows that without this technology, the task is highly unreliable, with a wide spread of outcomes and a low median score.
The visualization effectively argues for the use of an LLM with the "CT" confidence method for this application, as it optimizes for both high median accuracy and low variance.
</details>
<details>
<summary>x80.png Details</summary>

### Visual Description
## Violin Plot: Elementary Math Accuracy by LLM Configuration
### Overview
The image displays a violin plot titled "Elementary Math," comparing the distribution of accuracy scores across five different experimental conditions involving Large Language Models (LLMs). The plot visualizes the probability density of the data at different values, with wider sections representing a higher frequency of data points.
### Components/Axes
* **Chart Title:** "Elementary Math" (centered at the top).
* **Y-Axis:** Labeled "Accuracy." The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Contains five categorical labels corresponding to the experimental conditions:
1. No LLM
2. LLM
3. LLM + Conf (Rand)
4. LLM + Conf (Query)
5. LLM + Conf (CT)
* **Data Series:** Five distinct colored violin plots, one for each category on the x-axis. The colors are (from left to right): blue, orange, green, red, and purple.
* **Reference Line:** A horizontal red dashed line is drawn across the plot at approximately y = 0.3.
* **Internal Markers:** Each violin contains three horizontal black lines, representing the quartiles (25th, 50th/median, and 75th percentiles) of the distribution.
### Detailed Analysis
The analysis proceeds from left to right, matching the x-axis order.
1. **No LLM (Blue Violin):**
* **Trend/Shape:** The distribution is broad and somewhat symmetric, with the widest section (highest density) centered around the median. It tapers smoothly towards both the high and low accuracy extremes.
* **Key Values:** The median (middle black line) is approximately **0.70**. The interquartile range (IQR, distance between the top and bottom black lines) spans roughly from **0.55 to 0.85**. The full range extends from near **0.15** to **1.0**.
2. **LLM (Orange Violin):**
* **Trend/Shape:** This distribution is narrower and more concentrated than the "No LLM" case. It is slightly skewed towards lower accuracy, with a longer tail extending downward.
* **Key Values:** The median is lower, at approximately **0.60**. The IQR is tighter, spanning from about **0.50 to 0.75**. The range extends from approximately **0.25** to **0.95**.
3. **LLM + Conf (Rand) (Green Violin):**
* **Trend/Shape:** This distribution is distinctly **bimodal**, with two clear peaks (widest sections). One peak is in the lower accuracy range (~0.4), and another is in the higher range (~0.8). This suggests two subgroups within the data.
* **Key Values:** The median line sits between the two modes, at approximately **0.65**. The IQR is wide, from about **0.45 to 0.85**. The overall range is very broad, from near **0.10** to **1.0**.
4. **LLM + Conf (Query) (Red Violin):**
* **Trend/Shape:** This is the tallest and narrowest violin, indicating a highly concentrated distribution. The density is sharply peaked around the median, with very thin tails.
* **Key Values:** The median is high, at approximately **0.70**. The IQR is very narrow, spanning from about **0.65 to 0.75**. The range is also constrained, from roughly **0.30** to **0.90**.
5. **LLM + Conf (CT) (Purple Violin):**
* **Trend/Shape:** This distribution is broad and appears slightly right-skewed (longer tail towards higher accuracy). It has a wide, dense central region.
* **Key Values:** This condition shows the highest median, at approximately **0.80**. The IQR spans from about **0.70 to 0.90**. The range extends from near **0.20** to **1.0**.
### Key Observations
* **Baseline Reference:** The red dashed line at **Accuracy ≈ 0.3** likely represents a baseline performance level, such as random guessing or a simple heuristic. All distributions are predominantly above this line.
* **Impact of Confidence Calibration:** Adding confidence calibration ("Conf") generally shifts the median accuracy upward compared to the base "LLM" condition, except for the "Rand" variant which shows high variance.
* **Highest & Most Consistent Performance:** The **"LLM + Conf (CT)"** condition achieves the highest median accuracy (~0.80) and maintains a broad, high-performing distribution. The **"LLM + Conf (Query)"** condition shows the most consistent (least variable) performance, with the narrowest spread.
* **Bimodal Anomaly:** The **"LLM + Conf (Rand)"** condition is a clear outlier in shape, exhibiting a bimodal distribution. This indicates that the random confidence calibration method produces highly inconsistent results, splitting performance into a low-accuracy group and a high-accuracy group.
* **Performance Hierarchy (by Median):** LLM + Conf (CT) > LLM + Conf (Query) ≈ No LLM > LLM + Conf (Rand) > LLM.
### Interpretation
This chart demonstrates the effect of different LLM augmentation strategies on performance in elementary math tasks. The core finding is that **structured confidence calibration methods (Query and CT) can improve both the median accuracy and the reliability (consistency) of LLM outputs** compared to using a base LLM alone or using an unstructured (Random) calibration method.
* The **"No LLM"** baseline performs surprisingly well, suggesting the task may have patterns that are accessible without model assistance, or that the "LLM" being tested is not specialized for this domain.
* The poor and inconsistent performance of **"LLM + Conf (Rand)"** highlights that the *method* of confidence calibration is critical; a naive random approach introduces noise and bifurcates outcomes.
* The tight distribution of **"LLM + Conf (Query)"** suggests it makes the model's performance highly predictable, which is valuable for deployment where reliability is key.
* The high median of **"LLM + Conf (CT)"** suggests it is the most effective method for boosting raw accuracy, though with slightly more variability than the Query method.
* The red baseline at 0.3 provides crucial context, showing that even the worst-performing condition (the lower mode of the Rand method) still generally outperforms a minimal baseline.
In summary, the data argues for the implementation of specific, structured confidence calibration techniques (particularly CT and Query variants) to enhance and stabilize LLM performance on elementary math problems, moving beyond both unaugmented models and simplistic calibration approaches.
</details>
Figure 17: User accuracies per topic for the GPT-3.5 variants (with generalization confidence computed for the CT and Query cases). Red line indicates the model's average accuracy.
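The quartile lines reported throughout these violin-plot descriptions (median and interquartile range) summarize per-user accuracy samples. A minimal sketch of how such summaries are computed, using Python's standard library and purely hypothetical accuracy values (not data from the study):

```python
import statistics

# Hypothetical per-user accuracy scores for one experimental condition
# (illustrative only; not taken from the paper's results).
accuracies = [0.55, 0.62, 0.70, 0.71, 0.74, 0.78, 0.80, 0.85, 0.90]

# statistics.quantiles with n=4 returns the three cut points that a violin
# plot draws as its internal lines: Q1, the median (Q2), and Q3.
q1, median, q3 = statistics.quantiles(accuracies, n=4)
iqr = q3 - q1

print(f"median={median:.2f}, IQR=[{q1:.2f}, {q3:.3f}] (width {iqr:.3f})")
```

The dashed reference line in each panel is simply the model's mean accuracy on that topic, drawn on the same axis for comparison.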
<details>
<summary>x81.png Details</summary>

### Visual Description
## Violin Plot: Reliance Sensibility Across Four Model Variants
### Overview
The image displays a violin plot comparing the distribution of a metric called "Reliance Sensibility" across four different model configurations. The plot visualizes the probability density of the data at different values, with wider sections indicating a higher frequency of data points.
### Components/Axes
* **Y-Axis:** Labeled "Reliance Sensibility". The scale runs from 0.3 to 1.0, with major tick marks at 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0.
* **X-Axis:** Contains four categorical labels, each corresponding to a colored violin plot:
1. **LLM** (Red violin)
2. **LLM + Conf (Rand)** (Teal/Green violin)
3. **LLM + Conf (Query)** (Gray violin)
4. **LLM + Conf (CT)** (Blue violin)
* **Legend:** The x-axis labels serve as the legend, directly associating each model name with its corresponding color and plot.
### Detailed Analysis
Each violin plot shows the distribution shape, with internal horizontal lines indicating summary statistics. The solid central line represents the median, and the dashed lines above and below it represent the upper and lower quartiles (75th and 25th percentiles).
1. **LLM (Red, Leftmost):**
* **Shape:** Symmetrical, bulbous shape concentrated in the upper range.
* **Median:** Approximately 0.83.
* **Interquartile Range (IQR):** Roughly from 0.78 to 0.88.
* **Full Range:** Extends from approximately 0.60 to 1.0. The distribution tapers sharply below 0.7.
2. **LLM + Conf (Rand) (Teal, Second from Left):**
* **Shape:** Highly asymmetrical with a very long, thin tail extending downward.
* **Median:** Approximately 0.78.
* **IQR:** Roughly from 0.73 to 0.83.
* **Full Range:** The widest of all plots, extending from approximately 0.38 to 1.0. The bulk of the data is between 0.7 and 0.9, but a significant tail reaches much lower values.
3. **LLM + Conf (Query) (Gray, Third from Left):**
* **Shape:** Somewhat asymmetrical, with a tail extending downward but less severe than the "Rand" variant.
* **Median:** Approximately 0.81.
* **IQR:** Roughly from 0.76 to 0.85.
* **Full Range:** Extends from approximately 0.52 to 1.0.
4. **LLM + Conf (CT) (Blue, Rightmost):**
* **Shape:** The most compact and symmetrical distribution, concentrated tightly in the upper range.
* **Median:** Approximately 0.84 (the highest median).
* **IQR:** Roughly from 0.80 to 0.88.
* **Full Range:** The narrowest range, extending from approximately 0.68 to 1.0.
### Key Observations
* **Variability:** The "LLM + Conf (Rand)" model shows the highest variability and the lowest potential scores (long downward tail). The "LLM + Conf (CT)" model shows the lowest variability and most consistent high performance.
* **Central Tendency:** The median "Reliance Sensibility" is highest for "LLM + Conf (CT)" (~0.84), followed closely by the base "LLM" (~0.83). "LLM + Conf (Query)" is slightly lower (~0.81), and "LLM + Conf (Rand)" is the lowest (~0.78).
* **Distribution Shape:** All distributions are left-skewed (negatively skewed), meaning the tails extend toward lower values. This skew is most extreme for the "Rand" variant and least pronounced for the "CT" variant.
### Interpretation
The data suggests that adding a confidence mechanism ("Conf") to a base Large Language Model (LLM) has a nuanced effect on the "Reliance Sensibility" metric, which likely measures appropriate trust or calibration.
* The **base LLM** performs robustly with a high median and moderate spread.
* Adding a **random confidence method ("Rand")** introduces significant instability. While it can achieve high scores, it also produces many more instances of very low sensibility, dragging down the median and increasing risk.
* The **query-based confidence method ("Query")** offers a middle ground, slightly reducing the median compared to the base LLM but also reducing the extreme low-end outliers seen in the "Rand" method.
* The **"CT" confidence method** appears to be the most effective refinement. It yields the highest median sensibility and, crucially, the most consistent performance (narrowest distribution), effectively eliminating the very low scores seen in other variants. This implies that the "CT" method successfully calibrates or enhances the model's reliance in a reliable manner.
**In summary:** Not all confidence-enhancing methods are equal. While a poorly designed method ("Rand") can harm consistency, a well-designed one ("CT") can improve both the average performance and the reliability of the model's "Reliance Sensibility." The choice of confidence mechanism is critical.
</details>
Figure 18: Reliance sensibility for the variants based on GPT-3.5.
**Freeform User Responses**
We permitted users to provide freeform responses at the end of the study. Some users were sensitive to confidence scores being reported and came up with their own heuristics for whether to rely on the modelās output. We include a sampling of comments across confidence variants:
- āif it had a confidence of less than 50% it made me very skeptical.ā
- "The model's confidence indeed helped me choose and select my answer as I trusted in them most of the time."
- "I didn't really rely on the confidence level. If I had 0 confidence in the answer myself I relied on the AI regardless."
- āif the models confidence fell below 45 I decided to investigate it myself by remembering pieces of information. and also reasoning the question. If it was above 45 I would automatically agree to its prediction but there were some few cases I challenged it even though it was above 45ā
- āAt first I was hesistant to trust the model much because of the lower confidence levels but I still trusted it enough on topics I struggled with. As it went on, I was comfortable with confidence levels above 40.ā
- "If the model's confidence was low and I thought I knew the answer (and it was different) I chose my answer"
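Several of these comments describe the same informal decision rule: defer to the model only when its reported confidence clears a personal cutoff. A minimal sketch of that heuristic (the 45% cutoff mirrors the number one participant mentioned; the function itself is illustrative, not part of the study interface):

```python
def rely_on_model(model_confidence: float, threshold: float = 0.45) -> bool:
    """Threshold heuristic described by participants: trust the model's
    answer only when its stated confidence meets or exceeds the cutoff.
    model_confidence is expected in [0, 1]."""
    return model_confidence >= threshold

# A 40% confident prediction would be second-guessed under this rule,
# while a 60% confident one would be accepted.
print(rely_on_model(0.40), rely_on_model(0.60))
```

Participants of course applied such rules inconsistently, occasionally overriding high-confidence predictions with their own knowledge, as the quotes above indicate.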
#### G.4 Interface and Instructions
We show a sample interface of our extension of Modiste with user confidence in Figure 19, and present the full set of instructions provided to users in Figures 20 and 21. Note, for the LLM-only and no-LLM conditions, we followed the instruction text from [Bhatt et al., 2023] directly, i.e., participants who saw only the LLM did not see the instruction page about model confidence, and participants in the "No-LLM" variant were not instructed about any model variant and were simply instructed to answer the questions as best they could by themselves. Participants also responded to a post-survey questionnaire after completing the user study, which we depict in Figure 22.
<details>
<summary>user_study_figs/instructions/page_with_feedback.png Details</summary>

### Visual Description
## Screenshot: Biology Quiz Interface with AI Prediction
### Overview
This image is a screenshot of an online quiz or testing interface. It displays a multiple-choice biology question about homology, along with an AI model's predicted answer and confidence score. The interface includes a progress bar, instructional text, the question with four options, a score display, and a submit button.
### Components/Axes
The interface is structured as follows:
1. **Top Bar (Header Region):**
* **Element:** A horizontal progress bar.
* **Label:** "Completion Progress" (text located to the left of the bar).
* **Visual:** A gray bar indicating partial completion (approximately 15-20% filled).
2. **Main Content Box (Center Region):**
* A large rectangular box with a light blue border and white background.
* **Instructional Text Block (Top of box):**
* "Please answer the question about **biology** by selecting exactly one of the answers below. An AI model's predicted answer is marked in **yellow** and its confidence in its prediction is in **blue**."
* "The model's **confidence in its answer is 40%**." (The "40%" is highlighted with a light blue background).
* **Score Box (Top-right of main box):**
* A smaller, light gray box with the text: "Your Score: 1 out of 2".
* **Question:**
* "Which of the following pairs of structures is least likely to represent homology?"
* **Answer Options (Listed vertically, each with a radio button):**
1. "The wings of a bat and the arms of a human"
2. "The hemoglobin of a baboon and that of a gorilla"
3. "The mitochondria of a plant and those of an animal" (This entire line is highlighted with a yellow background, indicating the AI's predicted answer).
4. "The wings of a bird and those of an insect"
* **Submit Button (Bottom-center of main box):**
* A gray rectangular button with the text "SUBMIT".
### Detailed Analysis
* **Text Transcription:** All text is in English. The complete transcription is provided in the Components section above.
* **AI Prediction Details:**
* **Predicted Answer:** Option 3: "The mitochondria of a plant and those of an animal".
* **Confidence Level:** 40% (displayed in blue).
* **User State:** The score "1 out of 2" suggests this is the second question in a set, and the user has answered one correctly so far.
* **Spatial Layout:** The instructional text and question are left-aligned within the main box. The score box is positioned in the top-right corner, separate from the main flow. The submit button is centered at the bottom.
### Key Observations
1. The AI model's confidence (40%) is relatively low for a multiple-choice question, suggesting uncertainty.
2. The highlighted answer (plant vs. animal mitochondria) is being presented as the model's prediction, not necessarily the correct answer.
3. The interface is designed to show the user both the AI's guess and its confidence level before they submit their own answer.
### Interpretation
This screenshot captures a moment in a human-AI collaborative or evaluative testing scenario. The system is not just presenting a quiz; it's providing meta-information about an AI's performance on the same question. The low confidence score (40%) for the predicted answer about mitochondria is notable. Biologically, mitochondria in plants and animals are homologous structures (both derived from an ancestral alpha-proteobacterium via endosymbiosis). Therefore, the AI's prediction that this pair is "least likely to represent homology" is likely incorrect, which aligns with its low confidence. The other options include clear examples of homology (bat wing/human arm, hemoglobin sequences) and one clear example of analogy (bird wing/insect wing). The interface seems designed to test or demonstrate the AI's reasoning capabilities and to allow a human to compare their own judgment against the model's.
</details>
Figure 19: Example interface from Modiste. Participants are informed of the question (and topic), as well as the LLM prediction and confidence. Participants are informed of their running score throughout the experiment.
<details>
<summary>user_study_figs/instructions/starter_inst.png Details</summary>

### Visual Description
## Screenshot: Experiment Welcome Page
### Overview
The image is a screenshot of a simple, text-based welcome screen for a research experiment. The page has a white background with black text and two navigation buttons at the bottom. It serves as an introductory consent and information page for participants.
### Components/Axes
The page contains the following textual components, listed from top to bottom:
1. **Heading:** "Welcome!"
2. **Paragraph 1:** "We are conducting an experiment to understand how people make decisions with and without AI support. Your answers will be used to inform machine learning, cognitive science, and human-computer interaction research."
3. **Paragraph 2:** "This experiment should take at most **30 minutes**." (The phrase "30 minutes" is in bold).
4. **Paragraph 3:** "You will be compensated at a base rate of $9/hour for a total of **$4.50**, which you will receive as long as you complete the study." (The amount "$4.50" is in bold).
5. **Navigation Buttons:** Two buttons are centered at the bottom of the visible area.
* Left Button: `< Previous` (grayed out, indicating it is likely disabled on this first page).
* Right Button: `Next >` (active).
### Detailed Analysis
* **Text Transcription:** All text is in English. The exact transcription is provided in the Components section above.
* **Visual Hierarchy:** The text is presented in a single, left-aligned column with standard paragraph spacing. The bolded text ("30 minutes" and "$4.50") draws attention to the key logistical details of time commitment and payment.
* **UI Elements:** The interface is minimal. The `< Previous` button appears in a lighter gray font and background compared to the `Next >` button, visually indicating its inactive state. Both buttons have a simple rectangular shape with a light gray border.
### Key Observations
1. **Purpose-Driven Text:** Every sentence serves to inform the participant about the study's goal, duration, compensation, or to provide navigation.
2. **Clarity of Incentive:** The compensation structure is explicitly stated: a base rate of $9/hour, resulting in a total payment of $4.50 for the estimated 30-minute task, contingent on completion.
3. **Minimalist Design:** The page contains no logos, images, or decorative elements. The focus is entirely on the informational text and the call to action to proceed.
4. **Navigation State:** The disabled "Previous" button confirms this is the initial screen in a sequence.
### Interpretation
This screenshot captures the first step in a human-subjects research study interface. Its primary function is **informed consent and expectation setting**. It transparently communicates the study's academic purpose (machine learning, cognitive science, HCI research), the time required from the participant, and the exact financial compensation. The design prioritizes clarity and compliance with ethical research standards by ensuring participants understand the terms before proceeding. The simple "Next" button is the sole interactive element, guiding the user into the experimental workflow. The page establishes a formal, transactional tone for the interaction that follows.
</details>
<details>
<summary>user_study_figs/instructions/likely_answer_inst.png Details</summary>

### Visual Description
## Screenshot: Experiment Instruction Interface
### Overview
The image is a screenshot of a simple, text-based user interface, likely from a web-based experiment or survey platform. It displays instructional text for a participant and navigation controls. The background is plain white, and the text is black, creating a high-contrast, minimal design focused solely on conveying instructions.
### Components/Axes
* **Main Content Area:** A block of left-aligned text centered vertically and horizontally on the screen.
* **Navigation Controls:** Two rectangular buttons located at the bottom center of the screen.
* **Button 1 (Left):** Labeled `< Previous`
* **Button 2 (Right):** Labeled `Next >`
### Content Details
The textual content is presented in two paragraphs:
**Paragraph 1:**
> In this experiment, you will be seeing *multiple choice questions*, from various topics, such as those that you may find in school (e.g., biology, mathematics, foreign policy, computer science).
* **Formatting Note:** The phrase "multiple choice questions" is italicized in the original text.
**Paragraph 2:**
> Your task is to determine the **most likely answer** for each question. You can select this category by clicking on the radio button associated with your answer.
* **Formatting Note:** The phrase "most likely answer" is in bold in the original text.
### Key Observations
1. **Instructional Focus:** The interface is purely functional, designed to inform the user of their task without any decorative elements.
2. **Clear Task Definition:** The instructions are explicit: the user must evaluate multiple-choice questions from diverse academic and policy topics and select the single "most likely" answer.
3. **Interaction Method:** The specified interaction method is clicking a radio button, indicating a standard single-select multiple-choice format.
4. **Navigation State:** The presence of both `< Previous` and `Next >` buttons suggests this is an intermediate screen within a sequence, likely following an introduction and preceding the first question.
### Interpretation
This screenshot captures the **onboarding or instruction phase** of a cognitive or knowledge-based experiment. The design prioritizes clarity and minimizes distraction to ensure the participant understands the core task.
* **Purpose:** The interface serves to set clear expectations and rules of engagement for the participant. The emphasis on "most likely answer" (bolded) is crucial, as it frames the task not necessarily as finding a definitively correct answer, but as making a reasoned judgment under uncertainty, which is a common paradigm in psychological or decision-making studies.
* **Scope:** The listed topics (biology, mathematics, foreign policy, computer science) indicate the experiment aims to test a broad range of knowledge or reasoning abilities, rather than expertise in a single domain.
* **User Flow:** The navigation buttons imply a linear, step-by-step progression through the experiment. The user is expected to read these instructions and then proceed forward to begin the task.
* **Data Content:** **This image contains no charts, graphs, or quantitative data.** It is purely a textual and interactive UI element. Therefore, no numerical trends, data points, or statistical analysis can be extracted. The "information" is the set of instructions themselves.
</details>
<details>
<summary>user_study_figs/instructions/ai_pred_inst.png Details</summary>

### Visual Description
## Screenshot: Instructional Text with Navigation Buttons
### Overview
The image is a screenshot of a user interface, likely from a web-based task or survey platform. It displays a brief instructional message to the user about an AI-based prediction feature that will appear during subsequent tasks. Below the text are two navigation buttons.
### Components/Axes
This is not a chart or diagram. The components are:
1. **Text Block:** A two-sentence instructional message centered horizontally in the upper portion of the frame.
2. **Navigation Buttons:** Two rectangular buttons positioned side-by-side below the text, centered horizontally.
### Content Details
**Text Transcription:**
The text is in English. The exact wording is:
> During the tasks, you will also see the **prediction of an AI-based model.**
>
> The model's prediction will show up as yellow highlighting over that answer choice. If shown, you are free to use or ignore the information when selecting your answer however you wish.
* **Formatting Note:** The phrase "prediction of an AI-based model." is rendered in **bold** text.
**Button Labels:**
* Left Button: `< Previous`
* Right Button: `Next >`
### Key Observations
* The text is purely informational and procedural, setting expectations for the user's upcoming interaction.
* The interface is minimal, with no other visual elements, logos, or decorative features present.
* The navigation buttons suggest this is part of a multi-step process or sequence of instructions.
### Interpretation
This screenshot captures a standard consent or information disclosure step common in human-AI interaction studies or platforms that incorporate AI assistance. The text explicitly defines the visual cue (yellow highlighting) for the AI's prediction and grants the user full autonomy in how to utilize that information. This design prioritizes user agency and transparency about the AI's role. The presence of "Previous" and "Next" buttons indicates this is one screen within a larger flow, likely preceding the actual task interface where the described AI predictions will appear.
</details>
<details>
<summary>user_study_figs/instructions/confidence_inst.png Details</summary>

### Visual Description
## Screenshot: Model Confidence Display Instruction
### Overview
The image is a screenshot of a simple, centered user interface element on a plain white background. It contains a single instructional sentence and two navigation buttons below it. The content appears to be part of an onboarding tutorial, help guide, or informational modal within a software application.
### Components/Axes
* **Primary Text Element:** A single line of instructional text, centered horizontally and vertically within the frame.
* **Navigation Buttons:** Two rectangular buttons with rounded corners, positioned side-by-side below the text. They are also centered horizontally.
* **Left Button:** Labeled `< Previous`
* **Right Button:** Labeled `Next >`
* **Background:** A uniform, solid white (`#FFFFFF`) background with no texture or gradient.
### Content Details
**Text Transcription:**
The exact text displayed is:
`You will also see the model's confidence in its prediction (which will be shown in blue) for each question.`
**Button Labels:**
* `< Previous`
* `Next >`
**Visual Styling:**
* **Font:** A standard, clean sans-serif typeface (e.g., Arial, Helvetica, or system default).
* **Color:** The text and button borders/labels are in a dark gray or black color. The text explicitly states that the "confidence" metric will be displayed in **blue** in the actual application interface, but no blue color is present in this specific screenshot.
* **Layout:** The composition is minimal and centered, with significant negative space surrounding the text and buttons.
### Key Observations
1. **Instructional Context:** The text is informing the user about a future visual cue (blue color) for a specific data point (model confidence) they will encounter.
2. **Navigation Pattern:** The presence of `< Previous` and `Next >` buttons strongly indicates this is one screen in a sequential series, such as a multi-step tutorial or a paginated help document.
3. **Absence of Data:** This image contains no charts, graphs, data tables, or quantitative information. It is purely a textual instruction and navigation UI.
### Interpretation
This screenshot captures a moment of user education within an application. Its purpose is to prepare the user to interpret a specific visual elementācolor-coded confidence scoresāthat they will see later in the workflow. The design is intentionally minimal to focus attention on the instruction.
The mention of a "model's prediction" and "confidence" suggests the application involves some form of machine learning or AI-assisted analysis, where the system provides answers or classifications along with a measure of its certainty. The instruction to look for this information "in blue" indicates the application uses color as a key visual encoding channel to differentiate this metadata from primary results.
The navigation buttons frame this information as part of a larger learning sequence, implying the application has enough complexity to warrant a guided introduction. The user is currently at a step explaining how to read the output, having likely passed steps on how to input questions or data, and will proceed to steps on how to act on the predictions.
</details>
<details>
<summary>user_study_figs/instructions/seconds_per.png Details</summary>

### Visual Description
## Screenshot: Timed Quiz/Test Interface Instruction Page
### Overview
The image is a screenshot of a simple, text-based user interface, likely from an online quiz, test, or learning module. It displays instructional text to the user regarding the timing and navigation mechanics of the assessment. The interface is minimal, with a white background and two navigation buttons at the bottom.
### Components/Axes
This is not a chart or diagram. The components are textual instructions and interactive UI elements.
**Main Text Content (Centered, spanning most of the width):**
* **Paragraph 1:** "We encourage you to try to work through each problem. You will not be able to continue to the next question until at least **10 seconds** have passed. The SUBMIT button will change from grey to blue when you are able to click to move to the next page whenever you are ready to answer."
* **Paragraph 2:** "Of course you can take longer than 10 seconds on any question if needed! It may be very challenging to determine the answer for some questions. Others may be easy. **Please try your best** regardless."
**Interactive Elements (Bottom Center):**
* A button labeled `< Previous`
* A button labeled `Next >`
* *Spatial Grounding:* These two buttons are positioned side-by-side in the bottom-center of the screen. The `< Previous` button is on the left, and the `Next >` button is on the right.
### Content Details
The text provides explicit rules and encouragement for the user:
1. **Forced Delay:** There is a mandatory 10-second wait period after viewing a question before the user can proceed to the next one.
2. **Visual Feedback:** The "SUBMIT" button (not shown in this screenshot) provides visual feedback by changing color from grey (inactive) to blue (active) once the 10-second period has elapsed.
3. **User Encouragement:** The text explicitly states that users are allowed and encouraged to take more than 10 seconds if needed, acknowledging varying question difficulty. The core instruction is to "try your best."
4. **Navigation:** The presence of `< Previous` and `Next >` buttons indicates this is part of a multi-page sequence.
### Key Observations
* **No Data or Charts:** The image contains no quantitative data, charts, graphs, or diagrams. It is purely instructional text and navigation controls.
* **Emphasis through Bold Text:** Key phrases are bolded for emphasis: "**10 seconds**" and "**Please try your best**".
* **Minimalist Design:** The interface uses a plain white background with black text, focusing entirely on the instructions. There are no decorative elements, logos, or additional UI chrome visible in this frame.
* **Button State:** The `< Previous` and `Next >` buttons appear to be in their default, clickable state (likely grey or standard button color), as the described "SUBMIT" button is not present on this instructional page.
### Interpretation
This screenshot captures a user experience (UX) design pattern for a timed assessment. The primary purpose is to set clear expectations and reduce user anxiety.
* **Peircean Investigation:** The sign (the text) represents an interpretant (the rules of the system) for the user. The 10-second rule is an iconic sign of a "thinking pause," while the color-changing button is an indexical sign linked to the system's internal timer.
* **Design Intent:** The forced 10-second delay likely serves two purposes: 1) To prevent rapid, thoughtless guessing, and 2) To ensure the system has adequate time to process or load the next question. The accompanying text softens this rule by emphasizing that taking longer is acceptable, which is a considerate design choice to avoid pressuring the user.
* **Contextual Clue:** This is almost certainly an interstitial or introductory page shown before the quiz begins or between sections. The user is being briefed on the mechanics before encountering the first question and its associated "SUBMIT" button.
* **Missing Element:** The described "SUBMIT" button is a crucial component of the interaction but is not visible here. Its behavior (grey to blue) is explained in anticipation of its appearance on subsequent question pages.
</details>
<details>
<summary>user_study_figs/instructions/bonus.png Details</summary>

### Visual Description
## Screenshot: Bonus Information and Navigation Interface
### Overview
The image is a screenshot of a simple, text-based user interface, likely from a web application or digital survey/task platform. It displays informational text about a bonus payment structure and provides navigation buttons. The background is plain white, and the text is black, creating a high-contrast, minimalist design.
### Components/Axes
The interface consists of two primary components:
1. **Informational Text Block:** Two lines of centered text.
2. **Navigation Buttons:** Two rectangular buttons centered below the text.
### Detailed Analysis
**Text Content (Transcribed Precisely):**
* **Line 1:** "You will receive a **bonus** of up to a rate of $10/hour (+$0.50) based on how many questions you correctly answer."
* The word "bonus" is rendered in **bold** font weight.
* The text specifies a variable bonus rate with a base of up to $10 per hour and an incremental value of +$0.50.
* **Line 2:** "You will be informed whether or not you are correct after each trial."
* This line describes the feedback mechanism for the task.
**Button Elements:**
* **Left Button:** Labeled "< Previous". The label includes a left-pointing angle bracket (<).
* **Right Button:** Labeled "Next >". The label includes a right-pointing angle bracket (>).
* Both buttons have a light gray border and a white background, typical of default or unstyled HTML button elements.
**Spatial Grounding:**
* The entire text block is positioned in the vertical and horizontal center of the upper portion of the frame.
* The two navigation buttons are positioned side-by-side, centered horizontally, and located in the lower portion of the frame, below the text block. The "< Previous" button is to the left of the "Next >" button.
### Key Observations
* The interface is purely informational and functional, with no decorative elements, logos, or branding visible.
* The language is direct and instructional, outlining a clear incentive (bonus) and process (feedback after each trial).
* The presence of "Previous" and "Next" buttons strongly indicates this is one screen within a multi-step sequence, such as a tutorial, onboarding flow, or a series of task instructions.
### Interpretation
This screen serves as an **incentive and instruction panel** within a larger task-oriented application. Its primary functions are:
1. **Motivation:** To clearly communicate the performance-based bonus structure to the user, linking financial reward directly to accuracy ("how many questions you correctly answer").
2. **Process Clarification:** To set expectations about the immediate feedback loop ("after each trial"), which is a key element in learning or task-performance systems.
3. **Navigation:** To allow the user to move forward to the next step of the process or return to a previous one.
The design prioritizes clarity and comprehension over aesthetics. The use of bold on "bonus" draws the eye to the key incentive. The simple, unadorned buttons suggest a focus on utility. The overall context implies the user is about to engage in a series of questions or trials where their performance will be measured and rewarded.
</details>
Figure 20: Experiment instructions for the confidence variants.
<details>
<summary>user_study_figs/instructions/questions.png Details</summary>

### Visual Description
## Screenshot: Quiz Introduction Screen
### Overview
The image is a screenshot of a simple, minimalist user interface screen, likely from an online quiz, test, or survey platform. It serves as an informational or transitional screen, informing the user about the total number of questions in the assessment and providing navigation controls.
### Components/Axes
The interface consists of two primary components:
1. **Informational Text:** A single line of centered text.
2. **Navigation Buttons:** Two rectangular buttons positioned horizontally below the text.
**Text Content:**
* **Primary Text:** "You will see a total of **60 questions**."
* The phrase "60 questions" is rendered in **bold** font weight for emphasis.
* The text is centered horizontally on the screen.
* Language: English.
**Button Labels:**
* **Left Button:** "< Previous"
* **Right Button:** "Next >"
* Both buttons have a light gray border and a white background. The text inside is standard weight.
### Detailed Analysis
* **Text Transcription:** The exact text on the screen is: "You will see a total of **60 questions**."
* **Spatial Grounding:**
* The informational text is positioned in the upper-middle region of the screen.
* The two navigation buttons are placed directly below the text, centered as a group. The "< Previous" button is on the left, and the "Next >" button is on the right, with a small gap between them.
* **Visual Style:** The design is very clean and sparse, using a sans-serif font on a plain white background. There are no other visual elements, icons, or decorative features.
### Key Observations
1. **Emphasis on Quantity:** The bold formatting of "60 questions" is the only stylistic variation, clearly designed to draw the user's attention to the total scope of the task ahead.
2. **Standard Navigation Pattern:** The button labels ("< Previous" and "Next >") follow a universal convention for sequential navigation, indicating this screen is part of a multi-step process.
3. **Minimalist Design:** The interface contains only the essential information and controls required for this step, with no extraneous details.
### Interpretation
This screen functions as a **preparatory checkpoint**. Its primary purpose is to set clear expectations for the user regarding the length of the upcoming activity. By stating the total number of questions upfront, it helps manage user anxiety and allows them to mentally prepare for the task.
The presence of the "Previous" button suggests this screen is not the absolute starting point; the user has likely navigated from an earlier screen (e.g., instructions, a login page). The "Next >" button is the primary call-to-action, prompting the user to proceed to the first question.
The design's minimalism focuses user attention entirely on the key message (the number 60) and the immediate action (proceeding). There is no data, chart, or diagrammatic information to analyze beyond the presented text and UI controls. The "fact" provided is the fixed total of 60 questions for the assessment.
</details>
<details>
<summary>user_study_figs/instructions/next.png Details</summary>

### Visual Description
## Screenshot: Instructional Interface for Experiment Navigation
### Overview
The image is a screenshot of a simple, text-based user interface screen. It serves as an instructional prompt within a digital experiment or survey, guiding the user on the next steps. The screen contains two paragraphs of instructional text and two navigation buttons. There are no charts, diagrams, data tables, or complex visual elements. The primary language is English.
### Components/Axes
* **Layout:** The content is centered horizontally on a plain white background. The text paragraphs are stacked vertically in the upper-middle portion of the screen. The navigation buttons are centered horizontally below the text.
* **Text Elements:**
* **Paragraph 1:** "When you are ready, please click **"Next"** to complete a quick comprehension check, before moving on to the experiment." (Note: The word "Next" is rendered in bold font weight).
* **Paragraph 2:** "Please make sure to window size is in full screen, or substantially large enough, to properly view the questions."
* **Interactive Elements (Buttons):**
* **Button 1 (Left):** A rectangular button with a light gray border. The label reads "< Previous".
* **Button 2 (Right):** A rectangular button with a light gray border. The label reads "Next >".
### Content Details
The textual content is transcribed verbatim below:
1. **Instructional Text Block 1:** "When you are ready, please click **"Next"** to complete a quick comprehension check, before moving on to the experiment."
2. **Instructional Text Block 2:** "Please make sure to window size is in full screen, or substantially large enough, to properly view the questions."
3. **Button Labels:**
* Left Button: "< Previous"
* Right Button: "Next >"
### Key Observations
* **Purpose-Driven Design:** The interface is purely functional, designed to convey instructions and provide navigation. There is no decorative imagery or complex styling.
* **Clear Call to Action:** The primary action ("click **Next**") is emphasized through bold text, creating a clear visual hierarchy for the user.
* **Practical Guidance:** The second paragraph provides a technical requirement (window size) to ensure the subsequent content (questions) is displayed correctly, indicating the next phase involves a form or quiz.
* **Navigation State:** The presence of a "< Previous" button suggests this screen is part of a multi-step sequence, and the user has the option to return to a prior step.
### Interpretation
This screen functions as a **gateway or checkpoint** within a structured digital process, likely a research experiment, online survey, or training module. Its purpose is twofold:
1. **To Manage User Flow:** It explicitly instructs the user to proceed to a "comprehension check," which acts as a validation step to ensure understanding before advancing to the core "experiment." This is a common design pattern to improve data quality or learning outcomes.
2. **To Set Technical Expectations:** The note about window size is a proactive measure to prevent user experience issues. It implies that the upcoming "questions" may involve complex layouts, scales, or interactive elements that require a minimum screen real estate to function or be interpreted correctly.
The minimalist design focuses user attention entirely on the instructions and the decision to proceed (Next) or go back (Previous). The lack of any branding, titles, or progress indicators suggests this is an internal screen within a larger application flow, where context is already established. The slight grammatical awkwardness in "Please make sure to window size is..." may indicate the text was written quickly or by a non-native English speaker, but the meaning remains perfectly clear.
</details>
<details>
<summary>user_study_figs/instructions/mc_check.png Details</summary>

### Visual Description
## Screenshot: Pre-Task Knowledge Check Interface
### Overview
The image displays a clean, minimalist web interface for a pre-task knowledge check. It consists of instructional text, two multiple-choice questions with radio button options, and a "Continue" button. The design is functional, using a white background with dark gray text and standard form elements.
### Components/Axes
* **Header/Instructional Text:** Located at the top of the content area.
* **Question 1:** "What will you be asked to determine in this task?*" followed by three radio button options.
* **Question 2:** "How will you select your answer?*" followed by three radio button options.
* **Interactive Elements:** Six circular radio buttons (three per question) and one rectangular "Continue" button.
* **Layout:** Elements are left-aligned within a centered content block. The "Continue" button is centered at the bottom.
### Detailed Analysis
**Textual Content (Transcribed Precisely):**
1. **Instructional Header:** "Check your knowledge before you begin. If you don't know the answers, don't worry; we will show you the instructions again."
2. **First Question & Options:**
* Question: "What will you be asked to determine in this task?*" (The asterisk likely denotes a required field).
* Option 1: "The answer to a mutliple choice question." (Note: "mutliple" is a typo for "multiple").
* Option 2: "The least likely answer to a multiple choice question."
* Option 3: "The most likely categories of an image."
3. **Second Question & Options:**
* Question: "How will you select your answer?*"
* Option 1: "Typing in a text box."
* Option 2: "Clicking on a radio button."
* Option 3: "Selecting from a dropdown menu."
4. **Button:** "Continue" (centered at the bottom).
**UI Element Details:**
* Radio buttons are unselected (empty circles).
* The "Continue" button has a light gray border and dark text, suggesting it may be inactive until the required questions are answered.
### Key Observations
* The interface is designed to confirm user understanding before proceeding to a main task.
* The questions are meta-cognitive, asking the user to predict the nature of the upcoming task and the interaction method.
* There is a clear typo ("mutliple") in the first option of the first question.
* The design is sparse, with no decorative elements, focusing solely on function and clarity.
### Interpretation
This interface serves as a **procedural checkpoint**. Its primary purpose is not to test knowledge, but to ensure the user has read and understood the task instructions they were presumably just shown. By forcing the user to actively recall the task's goal ("determine the answer to a multiple choice question") and method ("clicking on a radio button"), it aims to reduce errors and confusion in the subsequent workflow.
The inclusion of the reassuring phrase "If you don't know the answers, don't worry; we will show you the instructions again" indicates a user-centered design approach. It lowers the stakes for this check, framing it as a helpful guide rather than a test, which can improve user experience and compliance. The typo, while minor, is a notable flaw in an otherwise clean interface and could slightly undermine perceived professionalism. The structure suggests this is the first step in a longer, possibly data-labeling or annotation, task pipeline.
</details>
Figure 21: Experiment instructions for the confidence variants (continued).
<details>
<summary>user_study_figs/instructions/postsurvey_questionarre.png Details</summary>

### Visual Description
## Screenshot: Post-Experiment Feedback Form
### Overview
The image is a screenshot of a simple, clean web form presented to a participant after completing a research study or experiment. The form's purpose is to collect feedback on the participant's experience, specifically regarding the difficulty of questions, the impact of a model's confidence, and areas of struggle or confidence. The interface is minimal, with a white background and black text.
### Components/Axes
The form consists of the following elements, listed in order from top to bottom:
1. **Header Text:**
* "Thank you for participating in our study!"
* "Click **"Finish"** to complete the experiment and receive compensation. If you have any comments about the experiment, please let us know in the form below."
2. **Feedback Questions & Input Fields:**
* **Question 1:** "How challenging did you find the questions? (On a scale of 1-10, with 10 being very challenging)"
* **Input Field:** A single-line text input box. It is currently focused, indicated by a blue border.
* **Question 2:** "Did the model's confidence impact your response? In what way if so, please be as specific as possible (1-3 sentences)"
* **Input Field:** A multi-line text area (textarea).
* **Question 3:** "Were there any question topics you struggled with?"
* **Input Field:** A multi-line text area (textarea).
* **Question 4:** "Were there any question topics you were always very confident in?"
* **Input Field:** A multi-line text area (textarea).
* **Question 5:** "Do you have any additional comments to share with us?"
* **Input Field:** A multi-line text area (textarea).
3. **Action Button:**
* A button labeled "Finish", centered at the bottom of the form.
### Detailed Analysis
* **Text Transcription:** All text is in English. The complete transcription is provided in the Components section above.
* **UI Element Details:**
* The first input field is a standard `<input type="text">` element.
* The subsequent four input fields are `<textarea>` elements, designed for multi-line responses. Each has a small, diagonal resize handle in the bottom-right corner.
* The "Finish" button is a standard HTML button element.
* **Layout & Spatial Grounding:** The form uses a left-aligned layout for the question labels. The input fields are centered horizontally beneath their respective questions. The "Finish" button is centered at the very bottom of the visible form area. There is generous vertical spacing between each question-and-answer pair.
### Key Observations
* The form is designed to gather both quantitative (scale 1-10) and qualitative (open-ended text) feedback.
* The questions are specifically tailored to an experiment involving an AI or computational "model," as indicated by the question about "the model's confidence."
* The instruction mentions "compensation," suggesting this is part of a paid research study, likely in a field like Human-Computer Interaction (HCI), AI evaluation, or psychology.
* The interface is utilitarian and focuses solely on data collection, with no decorative elements.
### Interpretation
This feedback form is a critical tool for researchers to understand the participant's subjective experience. It moves beyond simple task performance metrics to capture:
1. **Perceived Difficulty:** The 1-10 scale quantifies the overall challenge level.
2. **Model Influence:** The question about the model's confidence probes whether the AI's presented certainty (a key aspect of explainable AI) affected human decision-making, which is a significant topic in AI alignment and trust research.
3. **Knowledge Gaps & Strengths:** Questions about struggled and confident topics help identify specific areas where the experimental material or the AI model may be confusing, misleading, or particularly effective.
4. **Open Feedback:** The final comment box allows for unexpected insights or technical issues to be reported.
The form's structure suggests the experiment likely involved participants answering questions with the assistance or oversight of an AI model that communicated its confidence level. The collected data would be used to refine the model, adjust the experiment's difficulty, or publish findings on human-AI collaboration.
</details>
Figure 22: Sample post-survey questionnaire for users who were allocated to a variant wherein they saw model confidence.
### Appendix H Broader Impact and Implications
The goal of this work is to produce LLM outputs with better-calibrated confidence values. With successful, calibrated confidence values, machine systems ultimately become more interpretable and trustworthy to a user [Janssen et al., 2008]. When applied correctly, our advancements will help users make decisions based on LLM outputs in a more informed way. Similar examples in other domains, like AlphaFold [Terwilliger et al., 2023], have shown how well-calibrated confidence scores can be useful in complex decision-making settings. Our hope is to replicate those broad findings in LLMs.
We acknowledge the ongoing debate over the appropriateness, limitations, and harms of LLMs. We note that the development of more confident, interpretable, and trustworthy LLMs can also encourage continued techno-solutionism in unintended applications. In particular, our work is limited to use cases with fact-based questions. Many applications of text-based LLMs are generative, meaning that our paradigm cannot be applied appropriately, and the use of confidences from calibration-tuned models could be misleading or damaging without checks and guardrails. Additionally, even within the fact-based paradigm, what is true can be subjective, with ground truth in machine learning being a contested topic [Aroyo and Welty, 2015, Uma et al., 2021].
The philosophical debate on these topics is beyond the expertise of the authors; nonetheless, we believe that the ongoing debate over the appropriateness of LLMs should be considered in context with the benefits of our approach in making LLMs more interpretable and useful.
### Appendix I NeurIPS Paper Checklist
1. Claims
1. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
1. Answer: [Yes]
1. Justification: We describe and link all claims in section 1.
1. Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
1. Limitations
1. Question: Does the paper discuss the limitations of the work performed by the authors?
1. Answer: [Yes]
1. Justification: We provide a discussion on the limitations in section 8.
1. Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
1. Theory Assumptions and Proofs
1. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
1. Answer: [N/A]
1. Justification: [N/A]
1. Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
1. Experimental Result Reproducibility
1. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
1. Answer: [Yes]
1. Justification: We provide the complete code, and the complete list of datasets used for all experiments in section 5 to reproduce all our experiments with instructions. All hyperparameters are described in section 5.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
1. If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
1. If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
1. If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
1. We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
1. Open access to data and code
1. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
1. Answer: [Yes]
1. Justification: We provide the complete code and the complete list of datasets used for all experiments in section C.2, with instructions to reproduce all our experiments. All hyperparameters are described in section 5.
1. Guidelines:
- The answer NA means that paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
1. Experimental Setting/Details
1. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
1. Answer: [Yes]
1. Justification: We provide the complete code and the complete list of datasets used for all experiments in section C.2, with instructions to reproduce all our experiments. All hyperparameters are described in section 5.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
1. Experiment Statistical Significance
1. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
1. Answer: [Yes]
1. Justification: All figures are appropriately labeled with the error bars.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
1. Experiments Compute Resources
1. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
1. Answer: [Yes]
1. Justification: We provide an estimate of the compute resources required in section 5.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU), internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
1. Code Of Ethics
1. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
1. Answer: [Yes]
1. Justification: We have read the NeurIPS Code of Ethics, and our research conforms to it.
1. Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
1. Broader Impacts
1. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
1. Answer: [Yes]
1. Justification: We provide a broader impact statement in appendix H.
1. Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
1. Safeguards
1. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
1. Answer: [N/A]
1. Justification: We train on open-access models with open-source datasets. We do not change their generation behavior, and all existing safeguards (if any) remain.
1. Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
1. Licenses for existing assets
1. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
1. Answer: [Yes]
1. Justification: We explicitly cite all models in section 5. All datasets used are listed and cited in section C.2.
1. Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the assetās creators.
1. New Assets
1. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
1. Answer: [Yes]
1. Justification: We release our trained models for easy use via Hugging Face.
1. Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
1. Crowdsourcing and Research with Human Subjects
1. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
1. Answer: [Yes]
1. Justification: We provide screenshots of our instructions, as well as details of compensation in appendix G.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
1. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
1. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
1. Answer: [Yes]
1. Justification: We received prior approval from our respective institutional ethics review body for our user study. All users provided consent before partaking in the study.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.