2503.17523v3
# Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
**Authors**: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
> Google DeepMind, University of British Columbia, Vector Institute, Google Research, New York University
linluqiu@mit.edu, svansteenkiste@google.com, linzen@google.com
## Abstract
Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user's preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
## 1 Introduction
Humans interact with the world based on our beliefs about it. To effectively support decision making, our beliefs need to correspond to the structure of the world as much as possible; in other words, our beliefs need to be supported by appropriate "world models" [Johnson-Laird, 1980, Ha and Schmidhuber, 2018, LeCun, 2022, Wong et al., 2023]. We typically do not have perfect knowledge about the outside world; to the extent that we are uncertain about our environment, our beliefs need to be probabilistic, reflecting this uncertainty. And for these beliefs to remain relevant as the world changes, or as new information about the world becomes available, we need to update our beliefs to reflect the new information. The framework of Bayesian inference describes the normative way in which new information should trigger a change in one's beliefs so as to maximize the effectiveness of these beliefs as a foundation for acting in the world [Chater et al., 2006]. The Bayesian framework has informed a substantial body of work in cognitive science, which has identified both areas where humans act as the framework predicts, as well as deviations from it [Griffiths et al., 2024, Jern et al., 2017, Tenenbaum et al., 2011, Xu and Tenenbaum, 2007, Baker et al., 2011, Tenenbaum et al., 2006, Chater and Manning, 2006, Griffiths et al., 2007, Chaigneau et al., 2025, Rehder, 2018, Rottman and Hastie, 2016, Sloman and Lagnado, 2015].
In the last few years, artificial intelligence systems based on large language models (LLMs) have become dramatically more capable than in the past [Team, 2024a, Achiam et al., 2023, Anthropic, 2024, Team, 2024b, Touvron et al., 2023, Guo et al., 2025]. Far outgrowing their original motivation (as methods to estimate the probabilities of different word sequences), these systems are now being used for applications where they interact with users and with the outside world. As with humans, for the LLMs' interactions with users to be effective, the LLMs' beliefs need to reflect their experience with the user and to be continuously updated as more information becomes available. Here, we ask: do LLMs act as if they have probabilistic beliefs that are updated as expected from normative Bayesian inference? To the extent that the LLMs' behavior deviates from the normative Bayesian strategy, how can we minimize these deviations?
We begin to study these questions using a simple controlled setting: a flight recommendation task [Lin et al., 2022], illustrated in Fig. 1. This task involves multiple rounds of interactions between a simulated user and an LLM, where the LLM is acting as a flight booking assistant. In each round, the assistant is given a small number of flight options, and is expected to recommend one of them to the user, based on the user's preferences. The user's preferences are not directly communicated to the LLM: it only observes the choices the user makes among the flight options. To make optimal recommendations, then, the LLM must construct an implicit model of the factors that shape the user's preferences, and must reason probabilistically about those factors as it learns about the user's choices across multiple sets of flight options.
We compare the LLMs' behavior to that of a model that follows the normative Bayesian strategy, which we refer to as the Bayesian Assistant. This model maintains a probability distribution that reflects its beliefs about the user's preferences, and uses Bayes' rule to update this distribution as new information about the user's choices becomes available. Unlike many real-life scenarios, where it is difficult to specify and implement the Bayesian strategy computationally, in this controlled setting this strategy can be computed exactly, allowing us to precisely estimate the extent to which LLMs deviate from it.
We use this framework to evaluate a range of LLMs and find that they all perform significantly worse than the normative Bayesian Assistant (Fig. 2). Most importantly, in contrast to the Bayesian Assistant, which gradually improves its recommendations as it receives additional information about the user's choices, LLMs' performance often plateaus after a single interaction, pointing to a limited ability to adapt to new information.
We then introduce Bayesian teaching, a strategy to teach an LLM to approximate Bayesian reasoning. We provide the LLM with examples of interactions between the user and the Bayesian Assistant, and have the LLM mimic those interactions. We find that, by leading the LLMs to gradually adapt to the user over the course of the interactions, this method substantially improves the LLMs' performance on the flight recommendation task. Crucially, teaching the LLMs to mimic the Bayesian Assistant in one task allows them to generalize to other tasks that similarly require making decisions under uncertainty; those include not only different variants of the flight recommendation task, but also a related hotel recommendation task, as well as a web shopping task with real-world products (Fig. 1), a much more complex task for which it is difficult to specify and implement a fully Bayesian model.
Notably, while the Bayesian Assistant often makes incorrect predictions as it reasons under uncertainty, especially in the early rounds of interaction, we find that it is a more effective teacher than a teacher that directly provides the LLMs with users' choices (which we refer to as an oracle teacher); in other words, the Bayesian model's educated guesses make for a stronger learning signal than the correct answers. Overall, we conclude that through observing the Bayesian Assistant perform a particular task, the LLMs are able to approximate transferable probabilistic reasoning skills.
To summarize our contributions: we first identify significant limitations of off-the-shelf LLMs in tasks that require forming and updating probabilistic beliefs. We then demonstrate that, by having the LLMs mimic a normative Bayesian model, we can teach them effectively to approximate probabilistic belief updates, and show that these skills can generalize to new environments. These findings suggest that LLMs can be used in interactive settings where information is provided gradually, including complex application domains where implementing an exact Bayesian model is difficult. More generally, our results highlight a unique strength of deep learning models such as LLMs: they can learn to mimic a symbolic model and generalize its strategy to domains that are too complex to specify in a classic symbolic model.
## 2 Evaluating Belief Updates via Flight Recommendations
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Bayesian Teaching Process for Multi-Domain Recommendation Systems
### Overview
The image is a flowchart illustrating a "Bayesian teaching" process where a user provides feedback on an AI's recommendations, and this feedback is used to improve recommendations across multiple domains (flights, hotels, web shopping). The diagram shows two iterative cycles of user interaction for flight selection, followed by the application of learned preferences to other recommendation tasks.
### Components/Axes
The diagram is organized into three main regions:
1. **Left Column (User Interaction Loop):** Contains user queries (blue speech bubbles), system recommendations (yellow bubbles), and user feedback (red/green bubbles). Each interaction is accompanied by a small bar chart icon representing the features considered.
2. **Central Node:** A robot icon labeled **"Bayesian teaching"** acts as the learning hub, connecting the user feedback to the recommendation domains.
3. **Right Column (Recommendation Domains):** Three distinct application areas, each with an icon, a label, and a bar chart icon showing relevant features.
* **Top:** **Flight Recommendation** (airplane icons). Feature chart labels: `#Stops`, `Arrival Time`.
* **Middle:** **Hotel Recommendation** (building icons). Feature chart labels: `Distance`, `Amenities`, `Rating`.
* **Bottom:** **Web Shopping** (clothing icons). Feature chart labels: `Machine washable`, `Size: XL`, `Color: Black`, `Easy assemble`, `eco-friendly`.
### Detailed Analysis
**User Interaction Cycle 1 (Top-Left):**
* **User Query:** "Help me select the best flights for my trips... Which flight is the best option?"
* **Flight 1:** duration: 10 hr 15 min, # stops: 2, price: $100
* **Flight 2:** duration: 4 hr 24 min, # stops: 0, price: $750
* **Flight 3:** duration: 7 hr 13 min, # stops: 1, price: $370
* **System Recommendation (Yellow Bubble):** "The best option is Flight 1." Accompanied by a bar chart icon with three bars (blue, orange, gray) labeled `Duration`, `#Stops`, `Price`.
* **User Feedback (Red Bubble):** "Your option Flight 1 is incorrect. I prefer Flight 2."
**User Interaction Cycle 2 (Bottom-Left):**
* **User Query:** "Which flight is the best option?"
* **Flight 1:** duration: 5 hr 20 min, # stops: 1, price: $290
* **Flight 2:** duration: 10 hr 45 min, # stops: 2, price: $150
* **Flight 3:** duration: 5 hr 5 min, # stops: 1, price: $370
* **System Recommendation (Yellow Bubble):** "The best option is Flight 3." Accompanied by a bar chart icon.
* **User Feedback (Green Bubble):** "Your option flight 3 is correct."
**Central Process:**
* Arrows from both user feedback bubbles point to the central **"Bayesian teaching"** node.
* From this node, arrows point to the three recommendation domains on the right, indicating the learned preferences are applied here.
**Web Shopping Details (Bottom-Right):**
* Contains a block of text with mixed English and Chinese.
* **English Text:** `Machine washable`, `Size: XL`, `Color: Black`, `Easy assemble`, `eco-friendly`
* **Chinese Text (Transcribed):** `可机洗` (kě jī xǐ), `尺寸: XL` (chǐcùn: XL), `颜色: 黑色` (yánsè: hēisè), `易于组装` (yìyú zǔzhuāng), `环保` (huánbǎo)
* **English Translation of Chinese Text:** `Machine washable`, `Size: XL`, `Color: Black`, `Easy to assemble`, `Eco-friendly`
### Key Observations
1. **Feedback-Driven Learning:** The core mechanism is explicit user correction. The system makes a recommendation, the user provides binary feedback (correct/incorrect) and sometimes a preferred alternative, which is used for Bayesian updating.
2. **Feature-Based Representation:** Every decision point (user query, system recommendation) is paired with a bar chart icon, emphasizing that recommendations are based on weighted features (e.g., Duration, Price, #Stops).
3. **Cross-Domain Transfer:** The diagram explicitly shows that learning from flight selection interactions is transferred via "Bayesian teaching" to improve Hotel and Web Shopping recommendations.
4. **Multilingual Content:** The Web Shopping domain includes product attributes in both English and Chinese, suggesting the system operates in or supports a multilingual context.
5. **Visual Feedback Coding:** User feedback is color-coded: red for incorrect, green for correct, providing immediate visual cues about the learning signal.
### Interpretation
This diagram illustrates a human-in-the-loop machine learning framework, specifically **Bayesian teaching**, where user feedback directly shapes the model's understanding of preferences. The process is not just about optimizing a single task (flight selection) but about extracting generalizable preference rules that can be applied to novel domains (hotels, shopping).
The two flight examples show the system learning from its mistakes. In the first, it prioritized low price (Flight 1), but the user preferred the faster, non-stop option (Flight 2). In the second, after presumably updating its model, it correctly recommends the shortest flight (Flight 3). The bar chart icons suggest the system is learning the relative importance (weights) of features like duration versus price.
The transfer to other domains implies the learned preferences are abstract. For instance, a user's demonstrated preference for "convenience" (shorter duration, fewer stops) in flights might translate to prioritizing "proximity" (Distance) or "Amenities" in hotels, or "Easy assemble" in products. The inclusion of Chinese text in the shopping domain highlights the practical application of such a system in real-world, multilingual e-commerce platforms. The overall message is that interactive feedback is a powerful tool for building adaptable, multi-purpose recommendation agents.
</details>
Figure 1: Evaluating and improving LLMs' probabilistic belief updates. The flight recommendation task (left) involves multi-round interactions between a user and a flight booking assistant. In each round, the assistant is asked to recommend to the user one of three available flight options. The assistant is then shown the flight that was in fact chosen by the user (based on the user's reward function, which characterizes the user's preferences). To make good recommendations, the assistant needs to infer the user's preferences from the user's choices. To teach the LLM to reason probabilistically, we fine-tune the LLM on interactions between users and a Bayesian Assistant, which represents the normative way to update beliefs about the user's preferences. We then evaluate the fine-tuned model on the flight recommendation task as well as two new tasks (right).
We first describe the simplified flight recommendation task, derived from Lin et al. [2022], that we use to evaluate the LLMs. In this task, we have the LLMs interact with a simulated user for five rounds. In each round, three flight options are presented to both the user and the assistant. Each flight is defined by a departure time, a duration, a number of stops, and a cost (see Fig. 1). Each simulated user is characterized by a set of preferences: for each feature, they can have a strong or weak preference for high or low values of the feature (e.g., they may prefer longer or shorter flights), or no preference regarding this feature. We refer to this set of preferences as the user's reward function. We have 624 possible users in total (see Appendix Section A). These preferences, which determine the flights that the user chooses, are not directly revealed to the assistant. The goal of the assistant is to recommend the flight that matches the user's choice. At the end of each round, the user indicates to the assistant whether or not it chose correctly, and provides it with the correct answer.
After each round, we evaluate the accuracy of the assistant's recommendations for 100 new sets of three flights that differ from the ones on which the assistant has received feedback. We do not provide any feedback to the assistant for these new flight option sets (see Appendix Fig. 7 for the evaluation workflow).
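As a concrete illustration, the simulated user's decision rule can be sketched as a weighted scoring of flight features. The feature names and weight values below are hypothetical, chosen only to illustrate the setup; the paper's exact parameterization of strong/weak/no preferences may differ.

```python
# Sketch of the task setup: a user is a reward function (one weight per
# flight feature), and the user picks the flight with the highest reward.
FEATURES = ["duration", "num_stops", "price"]

def user_choice(weights, flights):
    """Return the index of the flight the simulated user prefers."""
    rewards = [sum(w * flight[k] for k, w in weights.items()) for flight in flights]
    return max(range(len(flights)), key=lambda i: rewards[i])

# A user with a strong preference for short flights, a weak dislike of
# stops, and little sensitivity to price (illustrative weights).
weights = {"duration": -2.0, "num_stops": -1.0, "price": -0.002}
flights = [
    {"duration": 10.25, "num_stops": 2, "price": 100},
    {"duration": 4.4, "num_stops": 0, "price": 750},
    {"duration": 7.2, "num_stops": 1, "price": 370},
]
print(user_choice(weights, flights))  # -> 1 (the fast, non-stop option)
```

The assistant never sees `weights`; it only observes which index the user picked, round after round, and must infer the weights from those choices.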
### 2.1 The Bayesian Assistant
Because the users' preferences are only revealed gradually, through their choices among flight options, we cannot expect the LLMs to reach perfect accuracy immediately after a single round of interaction. As an upper bound on the LLMs' performance, we define a Bayesian Assistant, which implements the strategy that optimally takes into account the evidence about the user's preferences that accumulates over rounds of interaction. This entails maintaining uncertainty about those preferences when the evidence is partial: instead of committing to a single most likely reward function, which could turn out to be incorrect in future rounds, the assistant maintains a probability distribution over possible reward functions. After each round, the Bayesian Assistant updates its distribution over reward functions using Bayes' rule: the probability of each reward function after the round (the posterior) is computed based on its probability before the round (the prior) and whether or not it was compatible with the user's choice (the likelihood). This normative model represents the best performance that we can possibly expect from any system. Because the number of possible reward functions is small, we are able to perform exact Bayesian inference (see Appendix Section A).
This method requires us to define the Bayesian Assistant's initial prior distribution, that is, its probabilistic assumptions about which user preferences are more likely, in advance of any interaction with the user. We use an uninformed prior, where all possible sets of user preferences are equally likely (for experiments with alternative priors, see Appendix Section D.4).
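A minimal sketch of this exact inference, assuming for illustration a deterministic likelihood in which a reward function is compatible with a choice if and only if it ranks the chosen flight highest (the paper's precise likelihood may differ). The hypothesis names and weights are hypothetical:

```python
# Exact Bayesian update over a small discrete set of candidate reward
# functions, starting from a uniform prior. Incompatible hypotheses get
# zero likelihood; the rest are renormalized.

def best_flight(weights, flights):
    rewards = [sum(w * flight[k] for k, w in weights.items()) for flight in flights]
    return max(range(len(flights)), key=lambda i: rewards[i])

def bayes_update(posterior, flights, chosen):
    """posterior maps hypothesis id -> (weights, probability)."""
    unnorm = {
        rid: (w, p * (1.0 if best_flight(w, flights) == chosen else 0.0))
        for rid, (w, p) in posterior.items()
    }
    z = sum(p for _, p in unnorm.values())
    return {rid: (w, p / z) for rid, (w, p) in unnorm.items()}

def recommend(posterior, flights):
    """Recommend the flight most likely to be chosen under the posterior."""
    scores = [0.0] * len(flights)
    for weights, prob in posterior.values():
        scores[best_flight(weights, flights)] += prob
    return max(range(len(flights)), key=lambda i: scores[i])

# Two candidate reward functions with a uniform prior.
posterior = {
    "short_flights": ({"duration": -1.0, "price": 0.0}, 0.5),
    "low_price": ({"duration": 0.0, "price": -1.0}, 0.5),
}
flights = [{"duration": 2.0, "price": 500}, {"duration": 8.0, "price": 100}]
# The user picks the fast, expensive flight: only "short_flights" survives.
posterior = bayes_update(posterior, flights, chosen=0)
print(posterior["short_flights"][1])  # -> 1.0
```

With 624 possible reward functions, the same loop over hypotheses remains trivially cheap, which is what makes exact inference feasible in this controlled setting.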
<details>
<summary>x2.png Details</summary>

### Visual Description
## Grouped Bar Chart: AI Model and Human Accuracy Comparison
### Overview
The image displays a grouped bar chart comparing the accuracy percentages of various large language models (LLMs), a human baseline, and a "Bayesian Assistant" across two evaluation rounds. The chart assesses performance on an unspecified task, with a random-guess baseline indicated.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **Y-Axis:** Labeled "Accuracy (%)". Scale runs from 0 to 100 in increments of 20.
* **X-Axis:** Lists 10 categories (9 AI models/systems + Human). From left to right:
1. Gemma 2 9B
2. Gemma 2 27B
3. Llama 3 8B
4. Llama 3 70B
5. Qwen 2.5 7B
6. Qwen 2.5 32B
7. GPT-4.1 Mini
8. Gemini 1.5 Pro
9. Human
10. Bayesian Assistant
* **Legend (Top-Left Corner):**
* **Hatched Pattern Bar:** "After 1st Round"
* **Solid Color Bar:** "Final Round"
* **Dashed Horizontal Line:** "Random"
* **Baseline:** A dashed horizontal line at approximately 33% accuracy, labeled "Random" in the legend, representing chance-level performance.
### Detailed Analysis
Each category on the X-axis has two bars. The left (hatched) bar represents accuracy after the first round, and the right (solid) bar represents final round accuracy. Numerical values are annotated above each bar.
**Data Series & Values (Accuracy %):**
| Category | After 1st Round (Hatched) | Final Round (Solid) | Trend (1st to Final) |
| :--- | :--- | :--- | :--- |
| **Gemma 2 9B** | 37 | 37 | No change |
| **Gemma 2 27B** | 37 | 40 | Slight increase (+3) |
| **Llama 3 8B** | 36 | 38 | Slight increase (+2) |
| **Llama 3 70B** | 45 | 58 | Significant increase (+13) |
| **Qwen 2.5 7B** | 37 | 37 | No change |
| **Qwen 2.5 32B** | 42 | 49 | Moderate increase (+7) |
| **GPT-4.1 Mini** | 40 | 42 | Slight increase (+2) |
| **Gemini 1.5 Pro** | 45 | 51 | Moderate increase (+6) |
| **Human** | 39 | 47 | Moderate increase (+8). **Note:** Both bars have black error bars indicating variability. |
| **Bayesian Assistant** | 58 | 81 | Very large increase (+23). This is the highest-performing category. |
**Spatial Grounding & Verification:**
* The legend is positioned in the top-left quadrant of the chart area.
* The "Random" baseline dashed line runs horizontally across the entire chart at the ~33% mark.
* The "Human" and "Bayesian Assistant" bars are colored differently (green and brown, respectively) from the blue bars used for the AI models, visually setting them apart.
* The error bars are only present on the "Human" data series, indicating measured uncertainty or variance in human performance.
### Key Observations
1. **Universal Improvement:** All categories either maintained or improved their accuracy from the "After 1st Round" to the "Final Round." No category performed worse.
2. **Performance Tiers:** A clear hierarchy is visible. The Bayesian Assistant is the top performer, followed by the largest AI models (Llama 3 70B, Gemini 1.5 Pro). Human performance and mid-sized models cluster in the middle. Smaller models (Gemma 2 9B, Qwen 2.5 7B) perform just above the random baseline.
3. **Scale Correlation:** For each model family (Gemma 2, Llama 3, Qwen 2.5), the larger parameter version (27B, 70B, 32B) outperforms its smaller counterpart in both rounds.
4. **Human vs. AI:** Human final round accuracy (47%) is surpassed by several AI models (Llama 3 70B, Qwen 2.5 32B, Gemini 1.5 Pro, Bayesian Assistant) but is higher than the smaller models.
5. **Largest Gains:** The most significant accuracy jumps between rounds were achieved by the Bayesian Assistant (+23 points) and Llama 3 70B (+13 points).
### Interpretation
This chart likely visualizes results from a multi-round reasoning or problem-solving benchmark. The "After 1st Round" score may represent initial, single-attempt performance, while the "Final Round" score reflects performance after an iterative process (e.g., self-correction, receiving feedback, or chain-of-thought refinement).
The data suggests that:
* **Iterative refinement is highly effective.** The process between rounds consistently boosts accuracy, with more capable systems (both AI and human) benefiting more dramatically.
* **Model scale matters.** Larger models within the same family show a greater capacity for improvement and higher final performance.
* **Specialized systems can excel.** The "Bayesian Assistant," which likely incorporates probabilistic reasoning or a different architectural approach, demonstrates a superior ability to leverage the iterative process, achieving near-perfect final accuracy (81%).
* **Human reasoning is variable but competitive.** The error bars on human data acknowledge individual differences. While humans are outperformed by top AI systems in this task, they still significantly outperform smaller models and the random baseline.
The chart effectively argues for the value of multi-step evaluation and highlights the performance gap between different classes of AI systems and human capability on this specific task.
</details>
Figure 2: LLMs show limited or no improvement over multiple interactions with the user. We show accuracy after the first round and final (fifth) round. We compare off-the-shelf LLMs from different model families to human participants and the Bayesian Assistant. For human participants, we only evaluate on a subset of 48 out of our 624 simulated users. The LLMs perform considerably worse than the Bayesian Assistant. Human participants demonstrate a larger improvement than most LLMs as they receive more information, but they still fall short of the accuracy that characterizes the normative Bayesian strategy. For the human study, the error bars show the averaged standard error across participants; for models, they show the standard error across the three sets of interactions with each of the 624 users.
### 2.2 LLMs Show Limited Evidence of Belief Updating
The LLMs we evaluate, like most contemporary LLMs, are first trained to predict upcoming words in a large collection of texts ("pre-training"), and are then specialized to follow user instructions provided in natural language ("instruction-tuning") [Sanh et al., 2022, Wei et al., 2022a]. Most commercially available models are closed-weights: we can query them but we cannot access their parameters. We evaluate two such closed-weights models, Gemini 1.5 Pro [Team, 2024a] and GPT-4.1 Mini [OpenAI, 2025], which were among the state-of-the-art LLMs at the time of writing [Chiang et al., 2024]. We also evaluate the following open-weights models: Gemma 2 (9B and 27B parameters) [Team, 2024b], Llama 3 (8B and 70B parameters) [Grattafiori et al., 2024], and Qwen 2.5 (7B and 32B parameters) [Yang et al., 2024a]. We chose those models because their performance was quite competitive, and their weights are openly available, which makes it possible to perform fine-tuning (see the next section). We provide these LLMs with English instructions explaining how to act as a flight booking assistant (see Fig. 1 for an example, and Appendix Table 3 for a detailed interaction).
We show results in Fig. 2. Overall, the accuracy of the LLMs after the five rounds of interaction is considerably lower than that of the Bayesian Assistant, and most of the models show little improvement after the first round of interaction (Fig. 2 shows results after the first and fifth round; for results after each of the five rounds, see Appendix Fig. 24). For an exploration of how the models' performance varies across users' possible reward functions, see Appendix Section D.2.
A range of follow-up experiments failed to produce meaningful improvement in the LLMs' behavior (for details, see Appendix Section C.1). Those include experiments with "chain-of-thought prompting" [Wei et al., 2022b, Nye et al., 2021, Kojima et al., 2022], that is, instructions that are meant to encourage the LLM to reason more explicitly (Appendix Fig. 9); an experiment with alternative, purely numerical representations of the flight options that we hypothesized might be easier for the LLMs to parse than the verbal ones we used for our main experiments (Appendix Fig. 9); a setting where we have 30 instead of five rounds of interaction (Appendix Fig. 9); and experiments with models that are only pre-trained to predict upcoming words in texts, without subsequent training to follow user instructions (Appendix Fig. 9).
We also had human participants act as the assistant to a subset of 48 simulated users (see Appendix Section A and Appendix Section F.1 for details). The human participants made recommendations for five rounds and showed a significant improvement between round 1 and 5 (p = 0.002, logistic mixed-effects model). In terms of accuracy, they perform better than small LLMs and slightly worse than larger LLMs (see Appendix Fig. 24 for performance over rounds). That being said, like all LLMs, humans also fall substantially short of the accuracy expected from the normative Bayesian strategy.
## 3 Teaching LLMs to Approximate Bayesian Reasoning
<details>
<summary>x3.png Details</summary>

### Visual Description
## Grouped Bar Chart: Accuracy Comparison of AI Models Across Different Methods
### Overview
This image is a grouped bar chart comparing the accuracy percentages of three large language models (Gemma, Llama, Qwen) under three different methods (Original, Oracle, Bayesian), alongside a standalone "Bayesian Assistant" model. Performance is measured across two evaluation stages: "After 1st Round" and "Final Round." A dashed line indicates a "Random" baseline performance.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **Y-Axis:** Labeled "Accuracy (%)". Scale runs from 0 to 100 in increments of 20.
* **X-Axis:** Categorical, listing the model-method combinations. From left to right:
* Gemma Original, Gemma Oracle, Gemma Bayesian
* Llama Original, Llama Oracle, Llama Bayesian
* Qwen Original, Qwen Oracle, Qwen Bayesian
* Bayesian Assistant
* **Legend:** Positioned in the top-left corner of the chart area.
* **After 1st Round:** Represented by bars with diagonal hatching (\\).
* **Final Round:** Represented by solid-colored bars.
* **Random:** Represented by a horizontal dashed line.
* **Data Series Colors (by model group):**
* **Gemma Group:** Blue (Original), Light Orange (Oracle), Dark Orange (Bayesian).
* **Llama Group:** Blue (Original), Light Orange (Oracle), Dark Orange (Bayesian).
* **Qwen Group:** Blue (Original), Light Orange (Oracle), Dark Orange (Bayesian).
* **Bayesian Assistant:** A single tan/beige color.
### Detailed Analysis
The chart presents paired bars for each model-method combination, showing the progression from the 1st Round to the Final Round. Error bars are visible on the "After 1st Round" bars for the Oracle and Bayesian methods.
**1. Gemma Model Group:**
* **Trend:** Accuracy increases consistently from Original to Oracle to Bayesian methods. The Final Round shows a significant improvement over the 1st Round for Oracle and Bayesian.
* **Data Points:**
* Gemma Original: After 1st Round = 37%, Final Round = 37% (no change).
* Gemma Oracle: After 1st Round â 50%, Final Round = 61%.
* Gemma Bayesian: After 1st Round â 57%, Final Round = 76%.
**2. Llama Model Group:**
* **Trend:** Similar upward trend from Original to Oracle to Bayesian. The Final Round improvement is most pronounced for the Bayesian method.
* **Data Points:**
* Llama Original: After 1st Round = 36%, Final Round = 38%.
* Llama Oracle: After 1st Round â 48%, Final Round = 62%.
* Llama Bayesian: After 1st Round â 57%, Final Round = 75%.
**3. Qwen Model Group:**
* **Trend:** Again, accuracy improves from Original to Oracle to Bayesian. The Final Round shows gains across all methods.
* **Data Points:**
* Qwen Original: After 1st Round = 37%, Final Round = 37% (no change).
* Qwen Oracle: After 1st Round â 43%, Final Round = 53%.
* Qwen Bayesian: After 1st Round â 55%, Final Round = 68%.
**4. Bayesian Assistant:**
* **Trend:** This is a single model, not a group. It shows the highest performance on the chart.
* **Data Points:** After 1st Round â 58%, Final Round = 81%.
**5. Random Baseline:**
* The dashed "Random" line is positioned at approximately 33% accuracy, serving as a lower-bound reference.
### Key Observations
Bar chart readings: within each model family (Gemma, Llama, Qwen), final-round accuracy follows Original < Oracle < Bayesian; the Oracle and Bayesian variants improve substantially from the first to the final round while the Original variants do not; the standalone Bayesian Assistant is highest overall (~81% in the final round); all variants except the near-baseline Original models clearly exceed the ~33% random chance level.
</details>
Figure 3: Supervised fine-tuning teaches LLMs to approximate probabilistic inference. We show accuracy after the first round and final (fifth) round across different assistants. We compare the original LLMs, LLMs fine-tuned on user interactions with the Bayesian Assistant, and LLMs fine-tuned on user interactions with an oracle, which always provides the correct answer. Both types of fine-tuning significantly improve the LLMs' performance, and Bayesian teaching is consistently more effective than oracle teaching. Error bars show the standard error across three random seeds (and three training runs). All results are statistically significant, $p<0.001$ (see Appendix Section G).
We next describe the supervised fine-tuning technique we use to teach the LLM to mimic the normative Bayesian model; we show that this method substantially improves the LLM's ability to update its beliefs correctly.
From a technical perspective, supervised fine-tuning is similar to the method used to train most LLMs in the first place. The model is provided with the first words of a text and is trained to predict the upcoming word. After each example, the LLM's weights are adjusted to increase the likelihood of a correct prediction if the same example is observed again. The main difference is that while in the first phase of training the texts are typically drawn from the Internet or similar resources, in the supervised fine-tuning phase the texts are constructed in a targeted way (automatically or by human writers) so as to teach the LLM particular skills [Sanh et al., 2022, Wei et al., 2022a]; to improve arithmetic skills, for example, the model may be given the text "the output of $1+1$ is $2$". We apply supervised fine-tuning to the three medium-sized open-weights models (Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B); we do not attempt to fine-tune the larger models from these families due to computational constraints. We update all of the models' weights in fine-tuning (in Appendix Section C.2, we show that a different training objective, Direct Preference Optimization [Rafailov et al., 2023], produces similar results, as does a computationally cheaper fine-tuning method, LoRA [Hu et al., 2022], which only updates a subset of the model's weights).
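The weight adjustment described above can be sketched as a single cross-entropy gradient step on a toy next-token problem. This is a minimal illustration, not the paper's training code: the five-word vocabulary, the random logits, and the learning rate are invented for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy setup: after the prompt "the output of 1+1 is", the target token is "2".
vocab = ["1", "2", "3", "is", "output"]
target = vocab.index("2")

rng = np.random.default_rng(0)
logits = rng.normal(size=len(vocab))   # stand-in for the model's output scores

# The gradient of cross-entropy w.r.t. the logits is softmax(logits) - one_hot(target).
p_before = softmax(logits)
grad = p_before.copy()
grad[target] -= 1.0

lr = 0.5
logits = logits - lr * grad            # one supervised fine-tuning step

p_after = softmax(logits)
# The correct next token is now more likely if the same example recurs.
assert p_after[target] > p_before[target]
```

The target token's probability increases after every such step, since the step raises the target logit and lowers all others.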
We explore two strategies to create supervised fine-tuning data. For both strategies, we construct 10 five-round interactions per user. These interactions follow the same format as described above (Appendix Table 3). In the first strategy, which we refer to as oracle teaching, we provide the LLM with interactions between simulated users and an "oracle" assistant that has perfect knowledge of the user's preferences, and as such always recommends the option identical to the user's choice.
The second strategy, which we call Bayesian teaching, provides the LLM with interactions between the user and the Bayesian Assistant. In this setting, the assistant will often choose flights that do not match the user's preferred choice, especially in early rounds where it has considerable uncertainty about the user's preferences. We hypothesize that, despite this fact, mimicking the Bayesian Assistant's best guesses would teach the LLM to maintain uncertainty and update its beliefs more effectively than the first strategy, in which the LLM is trained on the correct choices. This approach can be seen as a form of distillation, where a model is trained by learning to mimic another system [Hinton et al., 2015, Kim and Rush, 2016, Deng et al., 2023, Wang et al., 2023b, Li et al., 2023b, Jung et al., 2024, Yu et al., 2024, Chen et al., 2024b]. We use a uniform prior for the Bayesian Assistant that produces the supervised fine-tuning data. Other priors perform similarly (see Appendix Fig. 16).
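The kind of normative belief update the Bayesian Assistant performs can be sketched as follows. This is our minimal illustration, not the paper's implementation: the grid of candidate weight vectors, the noise level `eps`, and the Gaussian feature values are invented assumptions.

```python
import itertools
import numpy as np

# Candidate users: every weight vector in {-1, 0, 1}^4 over four flight features.
FEATURES = ["price", "duration", "stops", "airline"]
candidates = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=len(FEATURES))))
posterior = np.full(len(candidates), 1.0 / len(candidates))  # uniform prior

def update(posterior, options, chosen, eps=0.05):
    """Bayes rule: a candidate picks its own best option with probability 1 - eps."""
    rewards = candidates @ options.T                 # (n_candidates, n_options)
    best = rewards.argmax(axis=1)
    likelihood = np.where(best == chosen, 1.0 - eps, eps / (options.shape[0] - 1))
    post = posterior * likelihood
    return post / post.sum()

def recommend(posterior, options):
    """Recommend the option with the highest posterior-expected reward."""
    expected = posterior @ (candidates @ options.T)
    return int(expected.argmax())

rng = np.random.default_rng(0)
true_w = np.array([-1.0, -1.0, 0.0, 1.0])            # the simulated user's weights
for _ in range(5):                                    # five interaction rounds
    options = rng.normal(size=(3, len(FEATURES)))     # three random flights
    chosen = int((options @ true_w).argmax())         # user picks their best flight
    posterior = update(posterior, options, chosen)

idx = np.flatnonzero((candidates == true_w).all(axis=1))[0]
assert posterior[idx] > 1.0 / len(candidates)         # belief in true_w has grown
options = rng.normal(size=(3, len(FEATURES)))
recommendation = recommend(posterior, options)        # assistant's best guess
```

Transcripts of such (options, recommendation, feedback) rounds are what the Bayesian-teaching strategy feeds to the LLM, whereas oracle teaching would replace `recommendation` with the user's own choice.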
### 3.1 Fine-Tuning Teaches LLMs to Adapt to Users
Both supervised fine-tuning strategies, oracle teaching and Bayesian teaching, significantly improve the LLMs' performance on the flight recommendation task (Fig. 3). Crucially, after fine-tuning, the LLMs' performance gradually improves as more information becomes available; this contrasts with the original LLMs, which plateaued after the first round (see the substantial performance improvement between the first and last round in Fig. 3; for detailed results for each round, see Appendix Fig. 25). While there is still a performance gap between the fine-tuned LLMs and the normative Bayesian Assistant, this gap is much narrower than for the original LLMs. All three medium-sized LLMs, which before fine-tuning performed worse than both the stronger models and our human participants, markedly outperform them after fine-tuning.
We find that Bayesian teaching leads to higher accuracy and less variability across repetitions of the experiment than oracle teaching (Fig. 3). Bayesian teaching also successfully makes the LLM more Bayesian: the Bayesian-tuned LLMs' predictions agree with those of the Bayesian Assistant around 80% of the time, significantly more often than the predictions of the original LLMs and oracle-tuned LLMs do (Fig. 4). In Appendix Section D.4, we show that the effectiveness of Bayesian teaching cannot be explained by two potential confounds, and conclude that the effectiveness of this method is in fact due to the Bayesian signal it provides.
The amount of information that can be gained from the user's choice for a particular option set varies from one set to another. For example, a choice between two flight options that differ in exactly one feature provides direct evidence for the user's preference for that feature; such a choice could be more informative about the user's preferences than the choice between options that differ along multiple dimensions. We expect a model with more sophisticated probabilistic skills to show greater sensitivity to this factor. Do our fine-tuned models show such sensitivity? Focusing on the Gemma models, we find that Gemma Original does not show sensitivity to option set informativity, but both fine-tuned versions of Gemma do, with Gemma Bayesian displaying considerably more sensitivity than Gemma Oracle (Appendix Section E).
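One way to quantify an option set's informativity is the expected drop in entropy of a belief over candidate preference vectors. The sketch below is our illustration of that idea, not the measure used in the paper's appendix; the three-feature weight grid and deterministic choice rule are invented assumptions.

```python
import itertools
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Uniform belief over candidate preference vectors in {-1, 0, 1}^3.
candidates = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=3)))
prior = np.full(len(candidates), 1.0 / len(candidates))

def expected_info_gain(options):
    """Expected entropy reduction, assuming each candidate picks its best option."""
    best = (candidates @ options.T).argmax(axis=1)    # each candidate's choice
    gain = 0.0
    for o in range(options.shape[0]):
        mask = best == o
        p_choice = prior[mask].sum()                  # prob. this option is chosen
        if p_choice > 0:
            post = prior * mask
            gain += p_choice * (entropy(prior) - entropy(post / post.sum()))
    return gain

# Two flights that differ in exactly one feature directly probe the user's
# preference for that feature; a binary choice can yield at most log(2) nats.
one_feature_diff = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, -1.0]])
gain = expected_info_gain(one_feature_diff)
assert 0.0 < gain <= np.log(2) + 1e-12
```

Ranking option sets by this expected gain gives a concrete yardstick against which a model's sensitivity to informativity can be compared.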
Can the fine-tuned models accurately verbalize their beliefs? To address this question, we ask the LLMs explicitly for their beliefs about the user's preferences: we have the simulated user ask them, for example, "on a scale of 1 to 5, what is my preference for price?". We then test the accuracy of these verbalized beliefs by deriving flight recommendations from them, using the same decision procedure we use with the Bayesian Assistant. We find that this approach generally performs better than the approach we have used so far, where we directly ask for the LLMs' recommendations; that predictions based on the fine-tuned LLMs' verbalized beliefs are substantially more accurate than those based on the original LLMs' verbalized beliefs; and that the Bayesian-tuned LLMs produce more accurate beliefs than either the original LLMs or the oracle-tuned ones (for additional details, see Appendix Section B).
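Deriving a recommendation from verbalized beliefs can be sketched as follows. The conventions here are hypothetical, not the paper's exact protocol: ratings express preference for higher values of each feature, 1-5 ratings map linearly onto weights in [-1, 1], and the feature values are made up.

```python
def recommend_from_beliefs(ratings, options):
    """ratings: {feature: stated 1-5 preference}; options: list of {feature: value}."""
    weights = {f: (r - 3) / 2 for f, r in ratings.items()}   # map 1-5 onto [-1, 1]
    scores = [sum(weights[f] * opt[f] for f in weights) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)  # highest-scoring option

# A verbalized belief: the user dislikes high prices and long durations,
# is indifferent to stops, and strongly values the airline.
ratings = {"price": 1, "duration": 2, "stops": 3, "airline": 5}
flights = [
    {"price": 0.9, "duration": 0.4, "stops": 0.0, "airline": 0.2},
    {"price": 0.3, "duration": 0.6, "stops": 0.5, "airline": 0.9},
]
assert recommend_from_beliefs(ratings, flights) == 1   # cheaper, better-airline flight
```

Scoring the stated beliefs this way lets their accuracy be measured with the same yardstick as direct recommendations.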
<details>
<summary>x4.png Details</summary>

Three line charts (Gemma, Llama, Qwen) plot agreement with the Bayesian Assistant (%) against the number of interactions (0 to 5) for the Original, Oracle, and Bayesian versions of each LLM. In all three panels, the Bayesian LLM jumps from roughly 35-43% at zero interactions to 78-85% after one interaction and remains near that level; the Oracle LLM rises to roughly 53-66% and then changes little; the Original LLM stays near its 35-43% starting level throughout.
</details>
Figure 4: Fine-tuned LLMs agree more with the Bayesian Assistant. We show agreement between the LLMs and the Bayesian Assistant, measured by the proportion of trials in which the LLMs make the same predictions as the Bayesian Assistant. Fine-tuning on the Bayesian Assistant's predictions makes the LLMs more Bayesian, with the Bayesian versions of each LLM achieving the highest agreement with the Bayesian Assistant. Error bars (too small to be visible in the plot) show standard errors across three random seeds (and three training runs).
### 3.2 Fine-Tuned LLMs Generalize to New Tasks
<details>
<summary>x5.png Details</summary>

Three-part figure with one sub-chart per base model (Gemma, Llama, Qwen) in each part; the y-axis is final-round accuracy (%) throughout. (a) Line charts over the number of features (2 to 8): all variants decline as the feature count grows, with Bayesian LLM > Oracle LLM > Original LLM at every point (e.g., for Gemma, Bayesian falls from ~85% to ~52%, Oracle from ~68% to ~46%, Original from ~41% to ~35%); the Bayesian Assistant baseline falls from ~90% to ~68%, and Random stays at ~33%. (b) Bar charts for hotel recommendation: Original 36-41%, Oracle 48-56%, Bayesian 59-66%, against dashed Bayesian Assistant (~80%) and Random (~33%) baselines. (c) Bar charts for web shopping: Original 43-59%, Oracle 61-66%, Bayesian 69-73%, against dashed direct-fine-tuning-on-web-shopping (~82%) and Random (~33%) baselines.
</details>
Figure 5: Bayesian teaching generalizes outside the task used for fine-tuning. (a) Final-round accuracy gain in fine-tuned models compared to the original LLM when varying task complexity (here the number of features is a proxy for task complexity). (b) Final-round accuracy for LLMs on the hotel recommendation task, which was not seen during fine-tuning. We show the normative Bayesian Assistant's performance with brown dashed lines. (c) Final-round accuracy for LLMs on the web shopping domain, also unseen during fine-tuning. The green dashed line indicates the performance of the LLM when it is fine-tuned directly on web shopping data, such that no domain generalization is necessary. Error bars indicate the standard errors over three training runs (for web shopping) and additionally three random seeds (for flight recommendation and hotel recommendation).
As a result of Bayesian teaching, the LLMs demonstrate a greatly improved ability to approximate Bayesian probabilistic inference. Is this ability specific to the particular task the models were trained on, or do the LLMs' probabilistic skills improve more broadly? To answer this question, we evaluate the fine-tuned LLMs on a set of tasks that diverge to different extents from our original flight recommendation task (see the right panel of Fig. 1 for an overview). All tasks require the LLMs to infer the user's preferences from the user's choices over multiple interactions. Overall, as we show in the rest of this section, we find that fine-tuned LLMs show considerable generalization to new settings, and that, as before, Bayesian teaching is more effective than oracle teaching.
We first test the LLMs on variants of the flight recommendation task with different numbers of features: whereas in the interactions provided during fine-tuning, flights were characterized by four features, in this evaluation setting flights are described by between two and eight features. This requires the LLM to generalize to features that were not included in fine-tuning (e.g., the number of checked bags). In this setting, we find that both types of fine-tuning lead to large improvements in accuracy compared to the original LLMs. We also find that Bayesian teaching is considerably more effective than oracle teaching, as before (Fig. 5a). We note that as the number of features increases, the space of possible reward functions grows exponentially, and the task becomes inherently more difficult, even for the Bayesian Assistant. Despite this fact, for both fine-tuning methods, performance relative to the upper bound defined by the Bayesian Assistant drops off only moderately as the number of features increases.
The generalization experiments we have discussed so far focused on variants of the flight recommendation task. We next evaluate whether the LLMs can generalize the probabilistic skills they acquire through fine-tuning and apply them to other domains. We consider two such domains: hotel recommendations and web shopping. The hotel recommendation task is a synthetic task whose structure is similar to that of the flight recommendation task presented in fine-tuning. Here, each hotel is defined by four features: distance to downtown, price, rating, and amenities (for an example, see Appendix Table 11).
The web shopping task uses real-world products from a simulated environment [Yao et al., 2022], and differs much more substantially from the fine-tuning task than does the hotel recommendation task. It is difficult to construct a Bayesian Assistant for more natural scenarios like the web shopping task, where the space of user preferences is large and hard to specify formally. For this reason, successful transfer from synthetic settings like the flight recommendation task to more natural scenarios represents a particularly important application of Bayesian teaching. In the web shopping task, each user is defined by a set of randomly sampled goals that characterize the product they are interested in; for example, they might be looking for a shirt that is machine washable, or for a size XL shirt (see Appendix Table 1 for examples). As in the flight domain, the assistant interacts with the user for multiple rounds. In each round, a set of product options is randomly sampled from the product category (e.g., shirts), and the assistant is asked to recommend the best option. Each product is represented by a short title along with a detailed description (see Appendix Table 12 for an example). The user provides feedback at the end of each round, indicating whether or not the assistant's recommendation was correct. The user's preferred option is the one with the highest reward, as defined in Yao et al. [2022]. As mentioned above, it is difficult to construct a Bayesian Assistant for this task due to the large space of possible preferences. Instead, as an alternative upper bound on the transfer performance we can expect from the models fine-tuned on the flight recommendation task, we fine-tune LLMs directly on data from the shopping task.
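The web shopping user model can be sketched as follows. This is a deliberate simplification we invented for illustration (attribute-substring matching over goal phrases); the actual reward is the one defined in Yao et al. [2022], and the example goals and products are made up.

```python
def reward(goal_attrs, description):
    """Fraction of the user's goal attributes mentioned in a product description."""
    text = description.lower()
    return sum(a in text for a in goal_attrs) / len(goal_attrs)

def preferred(goal_attrs, descriptions):
    """The user's preferred option is the one with the highest reward."""
    return max(range(len(descriptions)),
               key=lambda i: reward(goal_attrs, descriptions[i]))

goal = {"machine washable", "size xl"}     # randomly sampled user goals
products = [
    "Classic cotton shirt, size XL, dry clean only",
    "Breathable polo shirt, machine washable, size XL",
    "Slim-fit shirt, machine washable, size M",
]
assert preferred(goal, products) == 1      # only this shirt satisfies both goals
```

A round then consists of sampling such a product set, letting the assistant recommend, and giving feedback on whether its pick matched `preferred(...)`.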
We find that LLMs fine-tuned on the flight recommendation task generalize to both hotel recommendations and web shopping: they perform much better than the original LLMs on those tasks (Fig. 5b and Fig. 5c). Bayesian teaching continues to outperform oracle teaching, though the gap is smaller for web shopping than for hotel recommendations. There remains a gap between the generalization performance of the LLMs fine-tuned on flight recommendations and the upper bound obtained by fine-tuning the LLMs directly on the web shopping interactions (green dashed line in Fig. 5c). Overall, we conclude that fine-tuning, and especially Bayesian teaching, imparts probabilistic skills that transfer substantially beyond the setting used for fine-tuning.
### 3.3 Generalization to Interactions with Human Users
The synthetically generated data we have used so far makes two simplifying assumptions: the simulated users' choices faithfully reflect the reward function that characterizes their preferences, and all reward functions are encountered equally often. In practice, these assumptions may not hold, as humans' behavior can occasionally be inconsistent with their preferences, due to inattention or other biases, and some preferences may be more common in the population than others (such as a preference for lower price). To evaluate the models in a more realistic setting, we recruit human participants to act as users. Each human participant is asked to first state their preferences for each of the flight features, and then select their preferred flight out of three options, for five different sets of options. We collect data from 10 human participants each for 50 lists of flight option sets, for a total of 500 participants (see Appendix Section A).
The performance of both fine-tuned models and the Bayesian Assistant for human users consistently improves over rounds (Fig. 6), and, as was the case for the simulated users, the Bayesian LLMs consistently outperform the Oracle LLMs; at least for some model families, the Bayesian LLMs also outperform the original LLMs. This indicates that the Bayesian LLMs generalize to human users from the simulated users on which they were fine-tuned.
All models, including the Bayesian Assistant, show substantially lower performance for humans than they did for simulated users, where accuracy after five rounds approached 80% (Fig. 3). In the Appendix Section F.2, we show that this is due to the fact that participants' choices are not always consistent with their stated preferences, and as such are impossible to predict with high accuracy (Appendix Fig. 22). For the subset of human users whose choices are perfectly consistent with their preferences, the Bayesian LLM performs much better than the original LLM (Appendix Fig. 21; see also Appendix Section D.3, where we study inconsistent simulated users).
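A participant's consistency can be measured as in the following sketch: the fraction of rounds in which their chosen option is the best one under their stated preference weights. The data layout, weights, and option values here are hypothetical.

```python
def consistency(stated_weights, rounds):
    """rounds: list of (options, chosen_index), options as lists of feature values."""
    hits = 0
    for options, chosen in rounds:
        scores = [sum(w * v for w, v in zip(stated_weights, opt)) for opt in options]
        best = max(range(len(options)), key=scores.__getitem__)
        hits += int(best == chosen)                 # choice matched stated preferences?
    return hits / len(rounds)

# Hypothetical participant: dislikes feature 1, mildly likes feature 2.
rounds = [
    ([[0.2, 0.9], [0.8, 0.1]], 0),   # consistent: option 0 scores higher
    ([[0.5, 0.5], [0.1, 0.9]], 0),   # inconsistent: option 1 scores higher
]
assert consistency([-1.0, 0.5], rounds) == 0.5
```

Participants with consistency 1.0 are the "perfectly consistent" subset analyzed in the appendix.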
Unlike for the simulated users, for human users the original LLMs perform well even after a single interaction (although, crucially, the original LLMs do not improve over interactions). We attribute the original LLMs' surprisingly strong performance to the fact that human users have generally predictable preferences (e.g., a preference for cheaper flights), such that guesses based on the LLM's priors, without any adaptation to the individual user, can be quite effective (see Appendix Figs. 20 and 21 for evidence for this hypothesis).
<details>
<summary>x6.png Details</summary>

Three line charts (Gemma, Llama, Qwen) plot accuracy (%) against the number of interactions (0 to 4) for the Original, Oracle, and Bayesian LLMs, the Bayesian Assistant (dashed), and a ~33% Random baseline (dashed). In every panel the Original LLM starts highest (~56-62%) and stays roughly flat or declines slightly; the Oracle LLM rises steadily from ~33% to ~47-51%; the Bayesian LLM starts lowest (~22-26%) but climbs steeply, overtaking the Oracle LLM after one or two interactions and ending at ~58-62%, matching or exceeding the Original LLM for Gemma and Qwen; the Bayesian Assistant tracks just below the Bayesian LLM throughout.
4. **Original LLM Stability/Decline:** The "Original LLM" shows minimal improvement or a slight decline with more interactions, suggesting it does not benefit from the iterative process in this setup.
5. **Oracle as a Mid-Tier Benchmark:** The "Oracle LLM" provides a consistent, moderate improvement over the random baseline but is outperformed by the Bayesian methods after a few interactions.
6. **Random Baseline:** The flat "Random" line at ~33% suggests a 3-class classification problem where random guessing yields one-third accuracy.
### Interpretation
This data demonstrates the effectiveness of a **Bayesian iterative refinement approach** for improving LLM accuracy on a given task. The key insight is that while the base ("Original") LLM starts with high accuracy, it cannot improve further. In contrast, the Bayesian methods, which likely incorporate feedback or uncertainty from each interaction, start poorly but learn rapidly.
The "Oracle LLM" likely represents an idealized upper bound for a non-Bayesian iterative method, showing that some improvement is possible. However, the Bayesian approach's ability to surpass both the Oracle and the Original LLM after a few interactions highlights its superior efficiency in leveraging iterative feedback. The consistency of this pattern across three different base models (Gemma, Llama, Qwen) suggests the finding is robust and not model-specific. The "Bayesian Assistant" (dashed beige) performing slightly worse than the full "Bayesian LLM" may indicate it uses a less comprehensive update mechanism. The charts argue strongly for integrating Bayesian or similar uncertainty-aware, iterative frameworks when deploying LLMs in interactive settings where multiple rounds of refinement are possible.
</details>
Figure 6: Bayesian teaching generalizes to human users. We show accuracy over rounds when the user is a human participant. The original LLMs achieve strong performance but do not show any learning behavior. In contrast, fine-tuned LLMs (with both Bayesian and Oracle teachers) improve their performance over rounds, and the Bayesian LLMs consistently outperform the Oracle LLMs. Error bars show standard errors across four random seeds (and three training runs; the error bars are not visible in the plot because they are very small).
## 4 Discussion
To interact with the world successfully, an agent needs to adapt its behavior as it obtains additional information about the statistics of its environment. To evaluate the ability of large language models (LLMs) to do so, we introduced a simple flight recommendation task where, in order to make accurate predictions, the model needs to adapt to a user's preferences over multiple interactions with the user. We tested a range of LLMs and found that they struggle to form and update probabilistic beliefs. We further found that continuing the LLMs' training through exposure to interactions between users and the Bayesian Assistant, a model that implements the normative probabilistic belief update strategy, dramatically improves the LLMs' ability to approximate probabilistic reasoning. Crucially, this improvement did not only hold for the flight recommendation task the LLM was trained on, but also generalized to variants of the flight recommendation task that the LLM had not encountered before, as well as to other tasks. Across the board, this approach, which we refer to as Bayesian teaching, was more effective than a related approach in which the LLM is fine-tuned directly on the correct answers, pointing to the effectiveness of the Bayesian training signal.
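As a reference point for what the Bayesian Assistant computes, the normative belief update can be sketched in a few lines. The three preference hypotheses, three flight options, and likelihood values below are purely illustrative, not the paper's actual task parameters:

```python
import numpy as np

def bayes_update(prior, likelihoods, observed):
    """One normative belief update: posterior is proportional to the prior
    times the likelihood of the observed choice under each hypothesis."""
    posterior = prior * likelihoods[:, observed]
    return posterior / posterior.sum()

# Illustrative setup: likelihoods[h, o] = P(user picks option o | hypothesis h).
likelihoods = np.array([
    [0.8, 0.1, 0.1],   # hypothesis: user prefers cheap flights
    [0.1, 0.8, 0.1],   # hypothesis: user prefers short flights
    [0.1, 0.1, 0.8],   # hypothesis: user prefers direct flights
])
belief = np.ones(3) / 3           # uniform prior over the three hypotheses

for choice in [0, 0, 1]:          # observed user choices over interactions
    belief = bayes_update(belief, likelihoods, choice)

# Recommend the option most likely to be accepted under the current belief.
recommendation = int(np.argmax(likelihoods.T @ belief))
```

After the three observed choices, most of the belief mass sits on the first ("cheap") hypothesis, and the recommended option is the one that hypothesis favors.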
Our paradigm differs from those used in previous investigations of LLMs' probabilistic reasoning abilities, where LLMs were expected to compute statistics explicitly [Nafar et al., 2025, Paruchuri et al., 2024] or provide probability judgments [Zhu and Griffiths, 2024, Belém et al., 2024]. In our paradigm, probabilistic reasoning is as essential as it is in explicit reasoning tasks, but, crucially, it is implicit in the task. Unlike in some recent studies, where the assistant is expected to ask questions that directly elicit the user's preferences [Li et al., 2023a, Handa et al., 2024, Piriyakulkij et al., 2023, Andukuri et al., 2024, Peng et al., 2024, Aliannejadi et al., 2021, Chen et al., 2024a, Lin et al., 2022], our setup expects the assistant to gradually infer the user's preferences by simply observing the user's choices, and to provide recommendations that are increasingly in line with the user's true preferences. Finally, our findings are consistent with those of concurrent work [Zhao et al., 2025], which also investigates LLMs' ability to infer user preferences from different types of dialogues, including a condition where the user accepts or rejects one or more options provided by the assistant (a setup similar to ours), in which the models performed poorly. Compared to this concurrent study, our work analyzes the LLMs' behavior through the lens of Bayesian inference, and demonstrates the benefits of mimicking a Bayesian model during fine-tuning compared to a more standard fine-tuning strategy in which the model is always provided with the correct answer (oracle teaching, in the terminology used in the current paper).
We observed robust generalization from the synthetic flight recommendation task on which the LLMs were fine-tuned to the more natural web shopping task. While performance was even stronger when we fine-tuned the LLM directly on interactions from this task (the green dashed line in Fig. 5), in practice it may be difficult or expensive to collect such data; our synthetic fine-tuning strategy provides an alternative that improves the LLM's probabilistic reasoning abilities across tasks, without the need to collect additional data and re-train the model on the new domain.
Our proposal is related to but distinct from approaches that embed an LLM inside a neuro-symbolic framework for probabilistic reasoning [Wong et al., 2023, Feng et al., 2024, Liu et al., 2024, Piriyakulkij et al., 2024, Grand et al., 2023, Ying et al., 2024, Ellis, 2023]. In those approaches, the LLM is used to translate between natural language inputs and formal representations, which in turn serve as input to a symbolic model that can update its beliefs according to the Bayesian framework [Wong et al., 2023]. Indeed, we provide further evidence that hybrid methods can outperform the LLM-only approach in Appendix Section B, where we describe a variation of our method in which we first ask the LLM to verbalize its beliefs about the user's preferences, and then use an external, symbolic system to make predictions based on these verbalized beliefs. The experiments described in that Appendix section show that in simple tasks where preferences can be mapped to predictions, such hybrid methods indeed outperform direct interaction with the LLM. These preliminary explorations could be developed in greater detail in future work.
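As a concrete illustration of such a hybrid variant, the sketch below parses an LLM's verbalized beliefs and leaves the prediction step to a small symbolic routine. The belief format, feature names, and scoring rule here are assumptions chosen for illustration, not the actual protocol from Appendix B:

```python
import re

def prediction_from_verbalized_beliefs(text, options):
    """Hypothetical symbolic step: parse probabilities the LLM verbalized
    (e.g. 'cheap: 0.7, short: 0.2, direct: 0.1') and recommend the option
    whose features carry the most stated belief mass."""
    beliefs = {m.group(1): float(m.group(2))
               for m in re.finditer(r"(\w+):\s*([01]?\.\d+)", text)}
    # Score each option by summing belief over the features it satisfies.
    scores = {name: sum(beliefs.get(f, 0.0) for f in feats)
              for name, feats in options.items()}
    return max(scores, key=scores.get)

# Illustrative options, each described by the preference features it satisfies.
options = {"Flight A": ["cheap"], "Flight B": ["short", "direct"]}
choice = prediction_from_verbalized_beliefs(
    "cheap: 0.7, short: 0.2, direct: 0.1", options)
```

Here "Flight A" wins (0.7 belief mass vs. 0.3), so the symbolic layer, rather than the LLM, is responsible for turning beliefs into a recommendation.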
Besides their superior performance in certain cases, neuro-symbolic methods have the benefit of greater interpretability, and their probabilistic inferences could be more robust. Crucially, however, the utility of such methods is limited to problems whose structure can be made explicit in the symbolic component of the system. By contrast, the method we propose empowers the LLM to approximate probabilistic inference on its own, such that it can apply this skill to domains that are hard to codify explicitly in a symbolic system, domains such as the web shopping task we have examined. This approach leverages LLMs' remarkable ability to generalize to new problems defined using natural language.
Notably, even in cases where the domain is simple enough for a purely symbolic model to be constructed, such models may not be consistently more accurate than LLMs. In our study, we found that while a moderate performance gap persisted between the fine-tuned models and the Bayesian Assistant for "well-behaved" simulated users, for human users, whose choices are not always consistent with their preferences, our Bayesian LLMs were in fact superior to the fully symbolic Bayesian Assistant, demonstrating LLMs' greater robustness to noise compared to symbolic models.
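The fragility of a rigid symbolic updater to such inconsistent choices is easy to see in a small sketch. Here we assume a simple "trembling-hand" noise model (our illustrative assumption, not necessarily the Assistant's formulation): with probability eps the user picks an option uniformly at random, regardless of their true preference:

```python
import numpy as np

# Deterministic likelihoods: each hypothesis predicts exactly one option.
hard = np.eye(3)

def update(belief, lik, choice):
    post = belief * lik[:, choice]
    return post / post.sum()

def noisy(lik, eps=0.2):
    # Trembling-hand mixture (an assumption, not the paper's exact model):
    # with probability eps the user picks uniformly at random.
    return (1 - eps) * lik + eps / lik.shape[1]

belief = np.ones(3) / 3
# A human user who mostly picks option 0 but slips once to option 1.
for choice in [0, 1, 0, 0]:
    belief = update(belief, noisy(hard), choice)
```

With eps = 0, the single slip would multiply the belief in the correct hypothesis by zero and the normalization would degenerate (division by zero); with eps > 0 the slip merely dents the belief, which then recovers over the remaining interactions.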
We have argued that by mimicking the Bayesian Assistant the LLMs learn to perform probabilistic inference, albeit only approximately. This hypothesis may appear surprising in light of the fact that the LLMs' training objective does not explicitly provide supervision for this skill, and that the transformer architecture does not explicitly track probability distributions: the model is trained only to predict the next word produced by the Bayesian Assistant. That being said, there is mounting evidence that in order to predict the next token successfully, LLMs can acquire sophisticated representations that match the structure of the process that generated those tokens. In the case of natural language syntax, for example, the internal representations of LLMs trained solely to predict upcoming words have been shown to encode abstract features such as syntactic role and grammatical number [Lakretz et al., 2019, Hao and Linzen, 2023, Manning et al., 2020]. It would be a fruitful direction for future work to determine how probabilistic reasoning is implemented in the LLMs' internal representations, for example by using techniques such as probes and causal interventions [Finlayson et al., 2021, Ravfogel et al., 2021, Vig et al., 2020] to find internal representations of the model's probability distributions over users' preferences, or by using circuit analysis [Wang et al., 2023a] to explore the computations through which the model updates these distributions.
The success of Bayesian teaching in imparting approximate probabilistic reasoning skills to LLMs opens up a range of questions for future work. Would the benefits of Bayesian teaching extend to larger models than we were able to fine-tune in this work, or to the recent generation of models that are explicitly trained to reason in words [Guo et al., 2025]? Does the benefit of Bayesian teaching extend to continuous domains and to real-world applications beyond the ones we evaluated (for example, interactions whose goal goes beyond shopping)? Could we provide the models with a stronger supervision signal, for example by instructing them to consider explicit probability distributions, by providing them with explicit supervision on the optimal way to update these distributions (e.g., by supervising beliefs as in Appendix Fig. 10), or by encouraging them to maintain explicit representations of users, such that the probability distributions are consistent across interactions with the same user, through methods such as supervised fine-tuning or reinforcement learning?
The goal of this study was not to replicate human behavior in LLMs, but rather to identify methods that can bring LLMs' probabilistic reasoning skills closer to the normative Bayesian strategy: for most applications we expect AI assistants to follow normative reasoning standards rather than reproduce human deviations from those standards. That being said, our comparisons between LLMs and humans point to a number of directions for future work. Our participants showed substantial deviations from the normative reasoning strategy, in line with prior work on reasoning biases [Eisape et al., 2024, Rottman and Hastie, 2016, Chaigneau et al., 2025, Tversky and Kahneman, 1974]. To what extent can people be taught to follow the normative strategy more closely? Can participants' apparent biases be explained as consequences of resource limitations [Simon, 1955]? How consistent are participants' choices with their stated preferences? Do people's deviations from the normative strategy align with those of LLMs [Eisape et al., 2024], and what properties of an LLM lead to closer alignment with humans?
While the findings from our first experiment point to the limitations of particular LLMs, the positive findings of our subsequent fine-tuning experiments can be viewed as a demonstration of the strength of the LLM "post-training" paradigm more generally: by training the LLMs on demonstrations of the normative strategy for performing the task, we were able to improve their performance considerably, suggesting that they learned to approximate the probabilistic reasoning strategy illustrated by the demonstrations. The LLMs were able to generalize this strategy to domains where it is difficult to encode it explicitly in a symbolic model, demonstrating the power of distilling a classic symbolic model into a neural network. We hypothesize that this generalization ability is, in part, responsible for LLMs' remarkable empirical success.
## Acknowledgments
We thank Stephanie Chan, Andrew Lampinen, Michael Mozer, Peter Shaw, and Zhaofeng Wu for helpful discussions.
## Author Contributions
L.Q., F.S., T.L., and S.V.S. co-led the project. S.V.S. conceptualized the project direction. L.Q. conducted the experiments and analysis. L.Q., F.S., T.L., and S.V.S. framed, analyzed and designed experiments, with inputs from K.A. and Y.K. L.Q., T.L., and S.V.S. wrote the paper with help from F.S., K.A., and Y.K.
## References
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023.
- Aliannejadi et al. [2021] M. Aliannejadi, J. Kiseleva, A. Chuklin, J. Dalton, and M. Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
- Andukuri et al. [2024] C. Andukuri, J.-P. FrÀnken, T. Gerstenberg, and N. Goodman. STaR-GATE: Teaching language models to ask clarifying questions. In First Conference on Language Modeling, 2024.
- Anthropic [2024] Anthropic. Claude 3, 2024.
- Baker et al. [2011] C. Baker, R. Saxe, and J. Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011.
- Belém et al. [2024] C. G. Belém, M. Kelly, M. Steyvers, S. Singh, and P. Smyth. Perceptions of linguistic uncertainty by language models and humans. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Brown et al. [2020] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Chaigneau et al. [2025] S. Chaigneau, N. Marchant, and B. Rehder. Breaking the chains of independence: A Bayesian uncertainty model of normative violations in human causal probabilistic reasoning. OSF, 2025.
- Chater and Manning [2006] N. Chater and C. D. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 2006.
- Chater et al. [2006] N. Chater, J. B. Tenenbaum, and A. Yuille. Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7), 2006.
- Chen et al. [2024a] S. Chen, S. Wiseman, and B. Dhingra. Chatshop: Interactive information seeking with language agents. ArXiv preprint, abs/2404.09911, 2024a.
- Chen et al. [2024b] X. Chen, H. Huang, Y. Gao, Y. Wang, J. Zhao, and K. Ding. Learning to maximize mutual information for chain-of-thought distillation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024b.
- Chiang et al. [2024] W. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, 2024.
- Christiano et al. [2017] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
- Deng et al. [2023] Y. Deng, K. Prasad, R. Fernandez, P. Smolensky, V. Chaudhary, and S. Shieber. Implicit chain of thought reasoning via knowledge distillation. ArXiv preprint, abs/2311.01460, 2023.
- Eisape et al. [2024] T. Eisape, M. Tessler, I. Dasgupta, F. Sha, S. Steenkiste, and T. Linzen. A systematic comparison of syllogistic reasoning in humans and language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
- Ellis [2023] K. Ellis. Human-like few-shot learning via Bayesian reasoning over natural language. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- Feng et al. [2024] Y. Feng, B. Zhou, W. Lin, and D. Roth. BIRD: A trustworthy Bayesian inference framework for large language models. In The Thirteenth International Conference on Learning Representations, 2024.
- Finlayson et al. [2021] M. Finlayson, A. Mueller, S. Gehrmann, S. Shieber, T. Linzen, and Y. Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.
- Grand et al. [2023] G. Grand, V. Pepe, J. Andreas, and J. Tenenbaum. Loose lips sink ships: Asking questions in battleship with language-informed program sampling. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2023.
- Grattafiori et al. [2024] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models, 2024.
- Griffiths et al. [2007] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic association. Psychological Review, 114, 2007.
- Griffiths et al. [2024] T. L. Griffiths, N. Chater, and J. B. Tenenbaum. Bayesian Models of Cognition: Reverse Engineering the Mind. The MIT Press, 2024. ISBN 9780262049412.
- Guo et al. [2025] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645, 2025.
- Ha and Schmidhuber [2018] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, 31, 2018.
- Handa et al. [2024] K. Handa, Y. Gal, E. Pavlick, N. Goodman, J. Andreas, A. Tamkin, and B. Z. Li. Bayesian preference elicitation with language models. ArXiv preprint, abs/2403.05534, 2024.
- Hao and Linzen [2023] S. Hao and T. Linzen. Verb conjugation in transformers is determined by linear encodings of subject number. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- Hinton et al. [2015] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
- Hu et al. [2022] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Hu and Levy [2023] J. Hu and R. Levy. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Koehler and James [2010] D. J. Koehler and G. James. Probability matching and strategy availability. Memory & Cognition, 38(6), 2010.
- Jern et al. [2017] A. Jern, C. G. Lucas, and C. Kemp. People learn other people's preferences through inverse decision-making. Cognition, 168, 2017. ISSN 0010-0277.
- Johnson-Laird [1980] P. N. Johnson-Laird. Mental models in cognitive science. Cognitive Science, 4(1), 1980.
- Jung et al. [2024] J. Jung, P. West, L. Jiang, F. Brahman, X. Lu, J. Fisher, T. Sorensen, and Y. Choi. Impossible distillation for paraphrasing and summarization: How to make high-quality lemonade out of small, low-quality model. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
- Kim and Rush [2016] Y. Kim and A. M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
- Kojima et al. [2022] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Kotha et al. [2024] S. Kotha, J. M. Springer, and A. Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024.
- Lakretz et al. [2019] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, and M. Baroni. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- LeCun [2022] Y. LeCun. A path towards autonomous machine intelligence. Open Review, 62(1), 2022.
- Li et al. [2023a] B. Z. Li, A. Tamkin, N. Goodman, and J. Andreas. Eliciting human preferences with language models. In The Thirteenth International Conference on Learning Representations, 2023a.
- Li et al. [2023b] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023b.
- Lin et al. [2022] J. Lin, D. Fried, D. Klein, and A. Dragan. Inferring rewards from language in context. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
- Lin et al. [2024] Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong, R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, and T. Zhang. Mitigating the alignment tax of RLHF. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Liu et al. [2024] R. Liu, J. Geng, J. Peterson, I. Sucholutsky, and T. L. Griffiths. Large language models assume people are more rational than we really are. In The Thirteenth International Conference on Learning Representations, 2024.
- Manning et al. [2020] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48), 2020.
- Nafar et al. [2025] A. Nafar, K. B. Venable, and P. Kordjamshidi. Reasoning over uncertain text by generative large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- Nye et al. [2021] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2021.
- OpenAI [2025] OpenAI. Introducing GPT-4.1 in the API, 2025.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Palan and Schitter [2018] S. Palan and C. Schitter. Prolific.ac: A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 2018.
- Paruchuri et al. [2024] A. Paruchuri, J. Garrison, S. Liao, J. B. Hernandez, J. Sunshine, T. Althoff, X. Liu, and D. McDuff. What are the odds? language models are capable of probabilistic reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Peng et al. [2024] A. Peng, Y. Sun, T. Shu, and D. Abel. Pragmatic feature preferences: Learning reward-relevant preferences from human input. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, 2024.
- Piriyakulkij et al. [2023] T. Piriyakulkij, V. Kuleshov, and K. Ellis. Active preference inference using language models and probabilistic reasoning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Piriyakulkij et al. [2024] T. Piriyakulkij, C. Langenfeld, T. A. Le, and K. Ellis. Doing experiments and revising rules with natural language and probabilistic reasoning. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024.
- Rafailov et al. [2023] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- Ravfogel et al. [2021] S. Ravfogel, G. Prasad, T. Linzen, and Y. Goldberg. Counterfactual interventions reveal the causal effect of relative clause representations on agreement prediction. In Proceedings of the 25th Conference on Computational Natural Language Learning, 2021.
- Rehder [2018] B. Rehder. Beyond Markov: Accounting for independence violations in causal reasoning. Cognitive Psychology, 103, 2018.
- Rottman and Hastie [2016] B. M. Rottman and R. Hastie. Do people reason rationally about causally related events? Markov violations, weak inferences, and failures of explaining away. Cognitive Psychology, 87, 2016.
- Sanh et al. [2022] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Simon [1955] H. A. Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, 1955.
- Sloman and Lagnado [2015] S. A. Sloman and D. Lagnado. Causality in thought. Annual Review of Psychology, 66(1), 2015.
- Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Team [2024a] G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024a.
- Team [2024b] G. Team. Gemma 2: Improving open language models at a practical size. ArXiv preprint, abs/2408.00118, 2024b.
- Tenenbaum et al. [2006] J. B. Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7), 2006. ISSN 1364-6613. Special issue: Probabilistic models of cognition.
- Tenenbaum et al. [2011] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 2011.
- Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023.
- Tversky and Kahneman [1974] A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185(4157), 1974.
- Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
- Vig et al. [2020] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. M. Shieber. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Wang et al. [2023a] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, 2023a.
- Wang et al. [2023b] P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren. SCOTT: Self-consistent chain-of-thought distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023b.
- Wang et al. [2024] Y. Wang, A. Bai, N. Peng, and C.-J. Hsieh. On the loss of context-awareness in general instruction fine-tuning. ArXiv preprint, abs/2411.02688, 2024.
- Wei et al. [2022a] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a.
- Wei et al. [2022b] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b.
- Wong et al. [2023] L. Wong, G. Grand, A. K. Lew, N. D. Goodman, V. K. Mansinghka, J. Andreas, and J. B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought. ArXiv preprint, abs/2306.12672, 2023.
- Xu and Tenenbaum [2007] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2), 2007.
- Yang et al. [2024a] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv e-prints, 2024a.
- Yang et al. [2024b] H. Yang, Y. Zhang, J. Xu, H. Lu, P.-A. Heng, and W. Lam. Unveiling the generalization power of fine-tuned large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024b.
- Yao et al. [2022] S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Ying et al. [2024] L. Ying, T. Zhi-Xuan, L. Wong, V. Mansinghka, and J. Tenenbaum. Grounding language about belief in a bayesian theory-of-mind. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2024.
- Yu et al. [2024] P. Yu, J. Xu, J. E. Weston, and I. Kulikov. Distilling system 2 into system 1. In The First Workshop on System-2 Reasoning at Scale, NeurIPSâ24, 2024.
- Zhao et al. [2025] S. Zhao, M. Hong, Y. Liu, D. Hazarika, and K. Lin. Do LLMs recognize your preferences? evaluating personalized preference following in LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- Zhu and Griffiths [2024] J.-Q. Zhu and T. Griffiths. Incoherent probability judgments in large language models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2024.
## Appendix A Experimental Details
### A.1 Simulated Users in the Flight Recommendation Task
In each round, we presented a set of $k$ flight options $O=\{o_1,...,o_k\}$ to both the simulated user and the assistant (typically $k=3$). Each flight has a departure time, a duration, a number of stops, and a cost; these four features are encoded in a vector $\bm{\phi}(o)\in\mathbb{R}^4$. For each flight option, each feature can take one of 11 values uniformly distributed between 0 and 1, except for the number of stops, which has 3 values. This defines $3\times 11^3$ unique flight options. We converted these four numbers into a textual description illustrated in Fig. 1.
The user's preferences are defined by a reward function $\bm{\theta}$ parameterized by four numbers, which indicate the user's preferences for the aforementioned features. The space $\Theta$ of reward functions includes all four-dimensional vectors with the values $\{-1,-0.5,0,0.5,1\}$, where $-1$ corresponds to a preference for low values of a feature (e.g., short flights) and $1$ to a preference for high values (e.g., long flights). Given a set of flight options $O$, the user computes the reward $r(o;\bm{\theta})=\bm{\theta}^T\bm{\phi}(o)$ of each flight $o$ and chooses the flight with the highest reward:
$$
\displaystyle o^*(O,\bm{\theta})=\operatorname{argmax}_{o\in O}\,r(o;\bm{\theta}). \tag{1}
$$
When there was a tie between multiple options, we randomly selected one of the options that had the highest reward. We excluded the reward function $(0,0,0,0)$ , that is, the completely indifferent user. This results in a total of $5^4-1=624$ possible reward functions, corresponding to 624 simulated users. We note that these simulated users are highly simplified and are not meant to capture the full complexity of humans: humans do not always choose the option that maximizes their utility [J. Koehler and James, 2010], and their preferences may evolve over time.
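The simulated-user choice rule of Equation 1, with random tie-breaking, can be sketched as follows (a minimal sketch; the function names are ours, not from any released code):

```python
import itertools
import random

def all_reward_functions():
    """All 5^4 - 1 = 624 preference vectors, excluding the indifferent user (0,0,0,0)."""
    values = [-1.0, -0.5, 0.0, 0.5, 1.0]
    return [theta for theta in itertools.product(values, repeat=4)
            if any(v != 0.0 for v in theta)]

def reward(option, theta):
    """r(o; theta) = theta^T phi(o), with phi(o) the four feature values in [0, 1]."""
    return sum(t * f for t, f in zip(theta, option))

def user_choice(options, theta, rng=random):
    """Choose the highest-reward option, breaking ties uniformly at random."""
    best = max(reward(o, theta) for o in options)
    return rng.choice([o for o in options if reward(o, theta) == best])
```

For example, a user whose only preference is for cheap flights, $\bm{\theta}=(0,0,0,-1)$, picks the option with the lowest cost feature.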
### A.2 The Bayesian Assistant
Since the space of reward functions is relatively small, we were able to perform exact Bayesian updates. In each round, given options $O$ and the user's preferred option $o^*$, the Bayesian Assistant updates its posterior as follows:
$$
\displaystyle q_B^{i+1}(\bm{\theta}\mid O^{i+1},o^{*,i+1})=\frac{p(o^{*,i+1}\mid\bm{\theta},O^{i+1})\,q_B^i(\bm{\theta})}{p(o^{*,i+1}\mid O^{i+1})}, \tag{2}
$$
where the likelihood function indicates whether the reward function is consistent with the user's choice:
$$
\displaystyle p(o^*\mid\bm{\theta},O)=\mathbbm{1}\big[\operatorname{argmax}_{o\in O}\,r(o;\bm{\theta})=o^*\big]. \tag{3}
$$
The Bayesian Assistant then makes flight recommendations based on its reward posterior mean, $\hat{\bm{\theta}}=\mathbb{E}_{q(\bm{\theta})}[\bm{\theta}]$, following Equation 1. In most experiments, we used the uniform prior (for experiments with other priors, see Supplementary Fig. C10b).
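Because the 0/1 likelihood of Equation 3 either keeps or removes each reward vector, the exact update reduces to filtering and renormalizing. A minimal sketch (names are ours; with ties, any highest-reward option is treated as consistent with the choice):

```python
def reward(option, theta):
    """r(o; theta) = theta^T phi(o)."""
    return sum(t * f for t, f in zip(theta, option))

def bayes_update(prior, options, chosen):
    """One exact posterior update (Equations 2-3): zero out reward vectors under
    which `chosen` is not a highest-reward option, then renormalize."""
    posterior = {
        theta: p for theta, p in prior.items()
        if reward(chosen, theta) == max(reward(o, theta) for o in options)
    }
    z = sum(posterior.values())  # the normalizer p(o* | O)
    return {theta: p / z for theta, p in posterior.items()}

def posterior_mean(posterior):
    """theta_hat = E_q[theta]; recommendations then follow Equation 1."""
    dim = len(next(iter(posterior)))
    return tuple(sum(p * theta[j] for theta, p in posterior.items())
                 for j in range(dim))
```

Since simulated users always act consistently with some reward vector, the normalizer $z$ never vanishes in this setting.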
### A.3 LLMs
Our main experiments focus on the instruction-tuned versions of open-weights models, including models from the Gemma 2 [Team, 2024b], Llama 3 [Grattafiori et al., 2024], and Qwen 2.5 [Yang et al., 2024a] families. We used Gemma 2 models with 9B parameters (https://huggingface.co/google/gemma-2-9b-it) and 27B parameters (https://huggingface.co/google/gemma-2-27b-it), Llama 3 models with 8B parameters (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and 70B parameters (https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), and Qwen 2.5 models with 7B parameters (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and 32B parameters (https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). We also evaluated Gemini 1.5 Pro [Team, 2024a] and GPT-4.1 Mini [OpenAI, 2025], which can only be accessed through an API, as representatives of stronger models whose weights are not accessible. All of the models we use are based on the Transformer neural network architecture [Vaswani et al., 2017]. We used greedy decoding (temperature of 0) for all experiments.
### A.4 Generalization Tasks
For the variants of the flight recommendation task (see "Fine-tuned LLMs generalize to new tasks"), we varied the number of flight features from two to eight. In addition to the four features above, the full feature set includes arrival time, layover duration, cancellation policy, and number of bags. As the number of possible reward functions grows exponentially with the number of features, we randomly sampled up to 1,000 reward functions (simulated users) for each number of features.
For the hotel recommendation task, the hotel features include distance to downtown, price, rating, and amenities. For each hotel option, the distance to downtown and price take one of 11 values uniformly distributed between 0 and 1, while rating and amenities take one of 5 values uniformly distributed between 0 and 1, resulting in $5\times 5\times 11^2$ unique hotel options. We evaluated $624$ different simulated users, as in the flight recommendation task.
For the web shopping task, we used real-world products that are publicly available at https://webshop-pnlp.github.io. We chose the 100 categories with the most products. Each product is described by a title and bullet point descriptions, whose length is limited to 800 characters. The reward of a user for a product was calculated based on text-matching heuristics on product attributes and options, following Yao et al. [2022]. For each category, we randomly sampled 10 users, each consisting of five-round interactions. Performance was evaluated on 100 held-out option sets within the same category.
To reduce the sensitivity of the results to the specific randomly selected option sets, we averaged all experiments over three random seeds for flight and hotel recommendations, and over all categories for web shopping. In each case, we report the mean and the standard error across runs and evaluation seeds.
### A.5 LLM Fine-Tuning
We used the instruction-tuned version of Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B for all fine-tuning experiments. For each reward function, we generated 10 user-assistant interactions, resulting in $624\times 10=6{,}240$ fine-tuning examples, each with five-round interactions. We experimented with fine-tuning on more examples but did not observe any significant improvement. The interactions were formatted as shown in Supplementary Table H3.
We used full fine-tuning (i.e., all parameters were updated) with a learning rate of 2e-6, a batch size of 128, and a maximum sequence length of 2048, for 1 epoch. The models were fine-tuned using the standard language modeling objective, i.e., the cross-entropy loss between the model's predicted token probabilities and the ground-truth tokens in the training data. The loss was only computed on the model's responses. For each setup, we trained three models with different random seeds. We conducted all fine-tuning experiments using $4\times$ H100 GPUs based on the standard recipe (https://github.com/huggingface/alignment-handbook). Fine-tuning Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B required about an hour for each model.
### A.6 Human Annotations
We collected two sets of human annotations for the flight recommendation task: one where the annotators act as assistants and one where they act as users. The human annotators were recruited online and paid the market rate of $12 an hour, as suggested by the Prolific platform [Palan and Schitter, 2018] we used to recruit participants. See details in Supplementary Section E.
The annotation setup for the assistant role follows the evaluation setup we used for LLMs. In each round, the annotator was asked to make recommendations from three flight options, with each represented in the same format shown to the LLMs. After making their recommendation, the annotator received feedback indicating whether their choice was correct. They were then directed to a preference questionnaire, where they provided their estimates of the user's preferences for each individual feature (see annotation interface in Supplementary Fig. G17). We sampled 48 reward functions by first grouping them based on the L2 distance between their four-dimensional parameter vector and the origin, then sampling from each group proportionally to its size. We had 15 separate participants provide annotations for each of the 48 simulated users (720 human participants in total).
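The stratified sampling of the 48 reward functions can be sketched as follows (a sketch under our own naming; because per-group counts are rounded, the total can deviate slightly from $n$, so we truncate):

```python
import math
import random

def sample_users(reward_functions, n=48, rng=random):
    """Group reward vectors by their L2 distance to the origin, then sample
    from each group proportionally to its size; truncate to at most n."""
    groups = {}
    for theta in reward_functions:
        d = round(math.sqrt(sum(t * t for t in theta)), 6)
        groups.setdefault(d, []).append(theta)
    total = sum(len(g) for g in groups.values())
    sample = []
    for g in groups.values():
        k = round(n * len(g) / total)
        sample.extend(rng.sample(g, min(k, len(g))))
    return sample[:n]
```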
When the annotator serves in the user role, we first asked them to rate their own preferences for different flight features; this serves as their reward function. Then, the annotator was asked to select their preferred option out of three flight options based on their preferences; this was repeated for five rounds. We constructed 50 such lists of five rounds of flight options, and had 10 annotators produce annotations for each of these 50 lists (500 human participants in total). We then produced three randomly shuffled variants of each of the interactions, for a total of 2,000 interactions (500 original interactions and $3\times 500$ shuffled interactions). This ensures that a particular option set is not consistently at a particular point in the interaction (for example, at the end of the interaction, where the participants may be paying less attention). To ensure quality, we required annotators to think for at least 30 seconds before making their selection.
## Appendix B Can LLMs Accurately Verbalize Their Beliefs?
The results of the fine-tuning experiments described in the main text suggest that fine-tuned models are able to infer the user's preferences, at least implicitly. Here, we test whether the LLMs can verbalize their beliefs about the user's preferences, based on the user's previous booking history, which is provided as context.
### B.1 Eliciting Beliefs About User Preferences
We elicit beliefs in one of two ways. For the open-weights models (Gemma 2, Llama 3, and Qwen 2.5), for which we have access to the probability distribution over upcoming words, we employ continuation scoring, as follows. After interacting with the LLM for one or more rounds, the user asks the LLM for its beliefs about the user's preferences, for example, "on a scale of 1 to 5, what is my preference for price?", where $1$ indicates a strong preference for cheaper flights, $3$ indicates no strong preference, and $5$ indicates a strong preference for expensive flights. We score the numbers 1, 2, 3, 4, and 5 as possible continuations of the current text and re-normalize them to form a probability distribution over these five numbers (see Table 4 for a detailed example).
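Given log-probabilities for the five candidate continuations (however they are obtained from the model), the re-normalization step amounts to a softmax restricted to those tokens. A minimal sketch, with names of our own choosing:

```python
import math

def continuation_scores(logprobs, scale=("1", "2", "3", "4", "5")):
    """Renormalize the model's log-probabilities of the scale tokens into a
    distribution over the five ratings (a softmax restricted to these tokens)."""
    logs = [logprobs[t] for t in scale]
    m = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logs]
    z = sum(exps)
    return {t: e / z for t, e in zip(scale, exps)}
```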
For closed-weights models (Gemini 1.5 Pro and GPT-4.1 Mini), where the LLM's underlying probability distribution over upcoming words is not made accessible to researchers, we ask the LLM to explicitly provide a probability distribution over each of the five points on the scale (see Table 7; for a comparison between the scoring and explicit probability judgment methods in Gemma 2, which finds that scoring yields more accurate estimates, see Section C.1). For our human participants, we use a survey similar to the one we use for closed-weights models.
We then approximate the distribution over reward functions as a factorization of these feature distributions:
$$
\displaystyle q_{\textrm{LLM}}^i(\bm{\theta}\mid O^i,o^{*,i})\approx\prod_j q_{\textrm{LLM}}^i(\bm{\theta}_j\mid O^i,o^{*,i},c_j^i), \tag{4}
$$
where $q_{\textrm{LLM}}^i(\bm{\theta}_j\mid O^i,o^{*,i},c_j^i)$ is the probability that the LLM assigns to each of the user's preferences for feature $j$ given the current context in the prompt $c_j^i$, using either scoring or explicit probability judgment. This makes an independence assumption, whereby the preference for one feature does not interact with the preference for another; because this assumption is quite strong, we cannot guarantee that it provides a complete picture of the LLM's beliefs over all possible reward functions. We elicit the LLM's beliefs by prompting it; it is possible that other techniques, such as probing, where a classifier is trained to decode the model's internal activations, could yield different results. We leave a more systematic study of this question for future work.
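Under this factorization, converting verbalized per-feature distributions into a recommendation reduces to taking the expected preference for each feature and applying Equation 1. A sketch, assuming the 1-5 scale maps linearly onto the preference values $\{-1,-0.5,0,0.5,1\}$:

```python
SCALE_TO_PREF = {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}

def expected_theta(feature_dists):
    """feature_dists[j] maps scale points 1-5 to probabilities for feature j;
    under the factorized posterior, E[theta] is computed feature by feature."""
    return tuple(sum(p * SCALE_TO_PREF[s] for s, p in dist.items())
                 for dist in feature_dists)

def recommend(options, theta_hat):
    """Belief-based recommendation: the option maximizing theta_hat^T phi(o)."""
    return max(options, key=lambda o: sum(t * f for t, f in zip(theta_hat, o)))
```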
### B.2 Evaluating the LLMâs Verbalized Beliefs
To determine whether the LLM can accurately verbalize its beliefs about the user's preferences, we derive flight recommendations from the LLM's verbalized beliefs, using the same procedure that the Bayesian Assistant uses to make recommendations based on its explicit beliefs, and evaluate the accuracy of these recommendations. We follow the same evaluation setup as in our main experiments, except that at the end of each round, we query the LLM's beliefs about the user's preferences. Importantly, this process branches out from the main dialogue, ensuring the interaction remains unaffected (Fig. 7). We also examine whether the recommendations produced in this way are consistent with the recommendations made by the LLM directly. High consistency between these two measures would suggest that the LLM's verbalized beliefs align with the implicit internal beliefs used by the LLM to make predictions in the original setup.
We also perform an analogous analysis for the experiment where human participants acted as the assistant to a simulated user. Recall that after each round we asked the participants what they thought the user's preferences were. We use these verbalized beliefs about the user's preferences as input to the same computation we used for the LLMs' verbalized beliefs. As with the LLMs, we can compute the consistency between the flight choices derived in this way and the participants' choices in the experiment. We only evaluated on the five-round interactions of the 48 simulated users for which we have human annotations.
Figure 7: Experimental design for LLM evaluation. At the end of each round, we evaluate the LLM using new option sets for which it has not received feedback. The evaluation branches out from the main interactions (that is, the evaluation performed after the first round is not included in the context of the second round). The LLM's direct evaluation, where we ask the LLM directly to choose a flight, follows the blue lines; the belief-based evaluation, where we first assess the LLM's beliefs about the user's preferences and then use them to choose the flight, follows the gray lines. The dashed lines indicate the deterministic conversion of the LLM's beliefs into flight recommendations.
Figure 8: Comparison of direct accuracy and belief-based accuracy. We show final-round accuracy based on (a) the LLM's or human's direct predictions and (b) predictions derived from their verbalized beliefs about the user's preferences. The gray dashed line indicates random performance, while the brown dashed line indicates the performance of the Bayesian Assistant. For human participants, we only evaluate on a subset of our evaluation data, which includes 48 different simulated users. (c) Final-round consistency between two predictions: the prediction directly provided by the LLM or human participants and the prediction derived from their beliefs about the user's preferences. Fine-tuned LLMs show better consistency than the original LLMs, with Bayesian LLMs achieving the highest consistency. Error bars show standard error across participants for humans and across three random seeds (and three training runs) for LLMs.
### B.3 Results
For the original LLMs, we find that the approach described in this section, where we first estimate the LLMs' beliefs about the user's preferences by explicitly querying the LLMs and then use a decision-making component that is external to the LLM, performs better than directly using the LLMs' predictions (Fig. 8b vs. Fig. 8a, original LLMs). We also find that the original LLMs' direct predictions are often inconsistent with the belief-based predictions (those derived from the beliefs elicited from the LLMs), with less than 50% alignment between the two sets of predictions (Fig. 8c). Human participants similarly show high inconsistency between the two types of predictions.
Predictions based on the fine-tuned LLMs' verbalized beliefs are substantially more accurate than those based on the original LLMs' verbalized beliefs, except for the Qwen 2.5 models (Fig. 8b, Oracle LLMs and Bayesian LLMs). This suggests that both training methods teach the LLM to infer the user's preferences and update them as more information becomes available, even though neither method provides the model with direct access to users' preferences. For both Gemma 2 and Llama 3, the Bayesian variant of the LLMs produces more accurate estimates of the user's preferences than the Oracle one. Likewise, compared to the Oracle variants, the Bayesian variants achieve higher consistency between the predictions directly provided by the LLM and those derived from the LLM's verbalized beliefs. The difference in overall accuracy between these models' direct predictions and belief-based predictions is much smaller after fine-tuning. This trend, however, does not hold for the Qwen 2.5 model: while direct prediction accuracy improves after fine-tuning, belief-based prediction accuracy remains unchanged from the original LLM. This suggests that for Gemma 2 and Llama 3, prompt-based belief elicitation may tap into a representation that is shared with the computation used to make direct predictions, but that this is not the case for Qwen 2.5.
## Appendix C Additional Results
### C.1 The Original LLMs' Poor Performance is Robust to Evaluation Setup
*(x9.png: six-panel bar chart (a–f) of final-round accuracy for direct vs. belief-based predictions, with Bayesian Assistant (~80%) and Random (~33%) baselines; panel contents are described in the Figure 9 caption.)*
Figure 9: Final-round accuracy of Gemma Original under different variations of our experimental setup. We report both the model's direct predictions (hatched bars) and the predictions derived from the model's verbalized beliefs (solid bars; Supplementary Section B). (a) We compare the original interactive setting, where we directly ask the LLM to generate predictions and provide it with feedback, with other common techniques: non-interactive prompting, where we always provide correct examples; chain-of-thought (CoT) prompting, which encourages the LLM to think step-by-step; and methods that incorporate the LLM's verbalized reward posterior distribution in the context. (b) The textual representation of the flight options, which uses natural language descriptions deterministically generated from the feature values, compared to the numerical representation, which directly uses the feature values. (c) 5-round interactions between the user and the LLM compared to 30-round interactions. (d) The scoring method, which assesses the LLM's beliefs by scoring possible continuations, compared to the generation method, where we explicitly ask the LLM to generate probability judgments. (e) Performance without versus with the verbalized user's preferences from the Bayesian model. (f) Instruction-tuned models versus their pre-trained base models. Error bars show standard errors across three random seeds.
In light of the poor performance of the original LLMs (before fine-tuning), we considered various modifications to our evaluation setting. These include prompting-based methods, that is, modifications to the instructions provided to the LLM; an alternative, numerical representation of the flight options; and a greater number of interactions. We also ablate methods that access the LLM's verbalized beliefs, explore whether providing the user's preferences improves performance, and compare the instruction-tuned versions of the models with their corresponding pre-trained versions. These robustness analyses focus on Gemma 2 9B. Overall, we do not observe significant differences across these evaluations; the only interventions we found to effectively improve model performance involved fine-tuning (see Section C.2).
#### Advanced prompting methods do not improve accuracy.
Our main experiments evaluate the LLM in an interactive setting, where the user provides it with feedback indicating whether the LLM's choice is correct. In this case, the user's feedback is always based on the LLM's prediction. We first experiment with an alternative non-interactive setting, where the context for the assistant includes all previous rounds and the option chosen by the assistant in these context rounds is always correct, a setting that better approximates the standard few-shot or in-context learning setup for LLMs (Brown et al. [2020]; see Table 10 for an example). While performance on direct prediction remains similar, we observe a performance drop when evaluating with predictions derived from the LLM's beliefs (Fig. 9, "Non-interactive").
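The non-interactive context construction can be sketched as follows; the round fields and exact wording are illustrative assumptions (Table 10 shows the actual prompt format).

```python
# Illustrative sketch of non-interactive prompt construction: every context
# round shows the correct option, approximating standard in-context learning.

def build_noninteractive_prompt(context_rounds, query_options):
    """context_rounds: [{"options": [...], "correct": index}]; returns a prompt."""
    lines = []
    for i, round_ in enumerate(context_rounds, 1):
        lines.append(f"Round {i}:")
        for j, opt in enumerate(round_["options"], 1):
            lines.append(f"  Option {j}: {opt}")
        # Unlike the interactive setting, the shown choice is always correct.
        lines.append(f"  Assistant's choice: Option {round_['correct'] + 1} (correct)")
    lines.append("New round:")
    for j, opt in enumerate(query_options, 1):
        lines.append(f"  Option {j}: {opt}")
    lines.append("Assistant's choice:")
    return "\n".join(lines)
```

In the interactive setting, by contrast, the context would contain the assistant's own (possibly wrong) choices together with the user's feedback on each.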
Chain-of-thought (CoT) prompting [Wei et al., 2022b, Nye et al., 2021, Kojima et al., 2022], which encourages the model to generate step-by-step reasoning chains, has been shown to be effective on many reasoning tasks. We evaluate the LLM using this strategy by explicitly including reasoning hints and the phrase "Let's think step by step" in the instruction (see Table 8 for an example prompt). We find that LLMs prompted with CoT perform similarly to those prompted in the way described in the main text (Fig. 9, "+ CoT").
Since inferring the user's preferences from the available information before making recommendations is crucial in our task, we further evaluate another CoT-style two-stage prompting method, where we allow the LLM to explicitly reason over the posterior distribution over reward functions. Specifically, we verbalize the LLM's reward posterior distribution using natural language and add it to the LLM's context (see Table 9 for an example). Explicitly encouraging the LLM to reason over its own reward posterior distribution improves the predictions derived from its verbalized beliefs. However, direct prediction accuracy remains similar (Fig. 9, "+ LLM Posterior").
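A minimal sketch of how a reward posterior might be verbalized for the context is shown below; the rating phrases and summary format are assumptions for illustration (Table 9 shows the actual format we used).

```python
# Hypothetical verbalization of a reward posterior, one summary per feature.
# Ratings run 1-5, where 1 = prefers low values and 3 = no preference.

RATING_PHRASES = {
    1: "strongly prefers low values",
    2: "somewhat prefers low values",
    3: "has no preference",
    4: "somewhat prefers high values",
    5: "strongly prefers high values",
}

def verbalize_posterior(posterior):
    """posterior: {feature: {rating: probability}} -> natural-language summary."""
    parts = []
    for feature, dist in posterior.items():
        top = max(dist, key=dist.get)  # most probable rating for this feature
        parts.append(f"For {feature}, the user most likely {RATING_PHRASES[top]} "
                     f"(probability {dist[top]:.0%}).")
    return " ".join(parts)
```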
Additional prompt engineering and advanced prompting techniques could potentially yield different results; in particular, some prompts may extract the model's beliefs more effectively than the ones we used. For the moment, however, our preliminary findings suggest that it is challenging to significantly improve the LLM's performance through prompting alone.
#### The LLMs' poor performance is not due to an inability to parse the flight representations.
Our main experiments use a representation that deterministically maps the feature value of each flight to a textual description (e.g., the departure time may be 02:00 PM and the duration 2 hr 30 min). While this textual representation is closer to realistic scenarios, and may therefore better align with the LLM's training distribution, this setup introduces a potential confounder that complicates the interpretation of our results: the LLM's poor performance in the flight recommendation task could be due to its inability to translate the text description into the feature space required for probabilistic reasoning. To control for this factor, we investigate an alternative numerical representation of the flight options, where we directly provide the LLM with numerical feature values in the same format we provide them to the Bayesian Assistant (e.g., the duration may be 0.9 instead of 16 hr 6 min; see Table 5 and Table 6 for examples). We find that, if anything, the textual representation outperforms its numerical counterpart in both accuracy metrics (Fig. 9). This suggests that the LLM's poor performance cannot be attributed to an inability to parse the textual format to a numerical value.
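The two representations can be sketched as follows; the field names and normalization ranges are illustrative assumptions, not the exact scheme in Tables 5 and 6.

```python
# Sketch of the two flight representations: a deterministic textual rendering
# of the feature values vs. normalized numerical values.

def to_textual(flight):
    """Render feature values as a natural-language description."""
    hours = flight["duration_hours"]
    h, m = int(hours), round((hours - int(hours)) * 60)
    return (f"Departure: {flight['departure']}, "
            f"Duration: {h} hr {m} min, "
            f"Stops: {flight['stops']}, "
            f"Price: ${flight['price']}")

def to_numerical(flight, max_hours=18.0, max_stops=3, max_price=1000.0):
    """Normalize features to [0, 1], as provided to the Bayesian Assistant."""
    return {
        "duration": flight["duration_hours"] / max_hours,
        "stops": flight["stops"] / max_stops,
        "price": flight["price"] / max_price,
    }
```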
#### Increasing the number of interactions does not improve performance.
Our previous experiments include only five rounds of interactions between the user and the LLM. To investigate the possibility that LLMs do in fact extract information from the interaction and update their beliefs, but do so more slowly than the Bayesian Assistant, we increase the number of interactions to 30. We find that Gemma Original still shows similar performance; if anything, its performance is slightly worse compared to our main experiments (Fig. 9). This indicates that simply increasing the number of interactions is unlikely to significantly improve the LLMâs performance.
#### Assessing the LLM's beliefs: Scoring continuations vs. explicit probability judgments.
In the main experiment, for the open-weights LLMs where we have access to the probabilities the LLM assigns to upcoming words, we estimate the LLM's distribution over reward functions by asking it to rate individual features and scoring the possible continuations; for flight duration, for example, we might ask it what the user's preference is on a scale of 1 to 5. We refer to this method as "scoring". Here, we compare this method to one where we instruct the LLM to assign a probability to each of the five ratings on each scale; we refer to this method as "generation" (see Table 7 for an example). The generation method is also used for experiments with the closed-weights models, as we do not have access to these LLMs' probabilities. As in the scoring method, we renormalize the probabilities to ensure that they sum to 1 (although we find that this step is typically not necessary, as they already sum to 1). Overall, we find that the scoring-based reward distribution, which we use in the main text for the open-weights models, is closer than the generation-based one to the ground-truth distribution (Fig. 9; for related results, see Hu and Levy [2023]).
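The two elicitation methods reduce to the following computations, assuming access to the rating tokens' log-probabilities (scoring) or to the model's stated probabilities (generation); the input formats here are hypothetical stand-ins.

```python
# Sketch of the two belief-elicitation methods over a 1-5 rating scale.
import math

def scoring_distribution(token_logprobs):
    """Scoring: exponentiate and renormalize the log-probabilities the model
    assigns to the five rating tokens ("1".."5")."""
    probs = {r: math.exp(token_logprobs[str(r)]) for r in range(1, 6)}
    total = sum(probs.values())
    return {r: p / total for r, p in probs.items()}

def generation_distribution(stated_probs):
    """Generation: the LLM explicitly states a probability per rating; we
    renormalize (usually a no-op, since the stated values already sum to 1)."""
    total = sum(stated_probs.values())
    return {r: p / total for r, p in stated_probs.items()}
```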
#### Can the LLM make recommendations given the user's preferences?
The flight recommendation task requires two capabilities: inferring the user's preferences and making predictions based on these preferences. We previously showed that the original LLM performs poorly at inferring the user's preferences (Fig. 8). Here, we investigate its performance at the latter. Specifically, we provide the LLM with the verbalized reward posterior distribution from the normative Bayesian model (see Table 9 for an example). In this case, the LLM only needs to make recommendations based on the provided preferences. We find that having access to the optimal reward posterior distribution improves belief-based accuracy; however, direct prediction accuracy remains similar (Fig. 9). Although our method of presenting the user's preferences to the LLM may not be optimal, these results suggest that the LLM struggles to make correct recommendations even when the user's preferences are explicitly provided.
#### Types of LLMs: Instructed model vs. pre-trained base model.
We use instruction-tuned LLMs for our main experiments. As these models are trained with an additional post-training alignment stage, their behavior is likely to differ from that of their pre-trained base model counterparts [Lin et al., 2024, Yang et al., 2024b, Wang et al., 2024, Kotha et al., 2024]. Because we expect instruction-tuning to improve the models' interactive capabilities, however, we hypothesize that the base version of Gemma would not perform better than the instruction-tuned one. As base models are not well-suited to interactive evaluation, we evaluate them using the non-interactive setting by providing them with in-context examples (see earlier in this section). We find that the base model performs comparably to the instruction-tuned one (Fig. 9); we omit the results for Llama 3 and Qwen 2.5, which were similar.
### C.2 Modifications to Training Setup
This section describes variants of the methods we used to fine-tune the LLMs on interactions with users. We only explore these variants for Bayesian teaching, which was consistently more effective than oracle teaching. We use Gemma 2 9B for all of the follow-up experiments reported in this section.
*(x10.png: three-panel bar chart (a–c) of final-round accuracy for direct vs. belief-based predictions across training objectives, fine-tuning methods, and training signals, with Bayesian Assistant and Random baselines; panel contents are described in the Figure 10 caption.)*
Figure 10: Final-round accuracy of LLMs fine-tuned with different training strategies on the flight recommendation task. We use Bayesian teaching (i.e., users' interactions with the Bayesian Assistant) for all experiments. (a) Comparison of training objectives: supervised fine-tuning (SFT) vs. direct preference optimization (DPO). (b) Fine-tuning methods: full fine-tuning, which updates all model parameters, vs. LoRA fine-tuning, a parameter-efficient method that updates only a subset of the parameters. (c) Training only on interactions between users and assistants, as in our other experiments, compared to training on the Bayesian Assistant's estimate of the user's preferences, as well as training on both interactions and the estimated preferences. Error bars show standard errors across three random seeds and three training runs.
#### Training objective: Supervised fine-tuning vs. Direct preference optimization.
In most of our experiments, we use supervised fine-tuning (SFT) to train the LLM on the oracle and Bayesian predictions. In this method, the LLM is trained to predict the upcoming token in the interaction, the same objective used during pre-training. Here, we examine the utility of reinforcement learning from human feedback (RLHF; Christiano et al. [2017], Ouyang et al. [2022], Stiennon et al. [2020]), another common practice for adapting LLMs' behavior following pre-training, in which the LLM is instead provided with an explicit signal indicating whether an output is preferable. In particular, we use direct preference optimization (DPO; Rafailov et al. [2023]), where the model is trained to assign higher probability to the preferred response than to the less preferred one. We investigate the DPO training objective by treating the Bayesian Assistant's prediction as the preferred response and a different, random recommendation as the less preferred one. We train the model with the DPO objective with a learning rate of 2e-6 and $\beta=0.1$. We find that training on Bayesian predictions works comparably well for both the SFT (used in the main experiments) and DPO objectives (Fig. 10), indicating that the approach is robust to the choice of training objective.
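The per-pair DPO loss reduces to the following computation, where the Bayesian Assistant's recommendation is the chosen response and a random other option is the rejected one; the scalar log-probabilities stand in for sums of per-token log-probs from the policy and the frozen reference model.

```python
# Minimal sketch of the DPO objective (Rafailov et al., 2023):
# loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each argument is a sequence log-probability; beta matches our setting."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log 2 when the policy matches the reference, and shrinks as the policy assigns relatively more probability to the chosen response.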
#### Full vs. parameter-efficient fine-tuning.
In our main experiments, we update all model parameters. As this approach becomes less feasible as the model size grows, a common strategy to improve training efficiency relies on parameter-efficient fine-tuning, where only a subset of parameters is updated. We evaluate this approach using Low-Rank Adaptation (LoRA; Hu et al. [2022]), a method that injects trainable rank decomposition matrices while keeping the original model weights frozen. We perform LoRA fine-tuning with a learning rate of 2e-5. Although LoRA fine-tuning performs slightly worse than full fine-tuning (Fig. 10), its performance remains comparable while training costs are significantly reduced. This demonstrates that our fine-tuning strategy can be effectively applied in computationally efficient settings, which is particularly beneficial for larger LLMs.
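The core of LoRA is a low-rank additive update to a frozen weight matrix; a schematic of the adapted linear layer follows (dimensions, initialization, and scaling follow the usual LoRA formulation, not our training code).

```python
# Schematic LoRA-adapted linear layer: frozen W plus trainable low-rank B @ A.
import numpy as np

class LoRALinear:
    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen, shape (out, in)
        self.A = rng.normal(0, 0.01, (rank, W.shape[1]))  # trainable
        self.B = np.zeros((W.shape[0], rank))             # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * x (B A)^T; with B = 0 at init, the adapted
        # layer reproduces the base model exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Only A and B (rank * (in + out) parameters per layer) receive gradient updates, which is what makes the method parameter-efficient.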
#### Providing Bayesian preference estimates in fine-tuning.
We have shown in the main text that fine-tuning the LLMs to make better recommendations also significantly improves their ability to infer the user's preferences, even though their supervision does not explicitly include the user's preferences. Here, we investigate a complementary setup, where we explicitly train the model to match the Bayesian Assistant's estimates of the user's preferences, but not to make flight recommendations. The Bayesian Assistant produces a posterior probability distribution over all reward functions after each round; we select the reward function with the highest posterior probability and provide it to the LLM, formatted as in Table 4. We find that, like training on interactions, providing the user's preferences as a fine-tuning signal improves both accuracy measures compared to Gemma Original, but the gain in direct prediction accuracy is smaller than when we fine-tune on interactions (Fig. 10). We also explore a setting where both the interactions and the preferences are provided during training; this setting leads to the best performance on both metrics, with accuracy approaching that of the Bayesian Assistant.
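Constructing this training signal amounts to a MAP selection over the posterior followed by formatting; the sketch below assumes simple dictionary representations for the posterior and the reward function, and its formatting is illustrative (Table 4 shows the actual format).

```python
# Sketch of the preference-based training signal: after each round, take the
# MAP reward function under the Bayesian Assistant's posterior and verbalize it.

def map_reward_function(posterior):
    """posterior: {reward_function_id: probability} -> MAP reward function id."""
    return max(posterior, key=posterior.get)

def format_preference_target(reward_function):
    """reward_function: {feature: rating (1-5)} -> verbalized training target."""
    return "; ".join(f"{feature}: rating {rating}"
                     for feature, rating in reward_function.items())
```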
## Appendix D Additional Analyses
### D.1 LLM Priors
*(x11.png: rating distributions (1–5) elicited from Gemma 2 9B for Departure Time, Duration, Number of Stops, and Price; probability mass is concentrated on rating 3 for most features, with a peak at rating 2 for Number of Stops. See the Figure 11 caption.)*
*(x12.png: rating distributions (1–5) elicited from Gemini 1.5 Pro for the same four flight features; probability mass is concentrated on ratings 1–2 for all features. See the Figure 11 caption.)*
Figure 11: Priors of Gemma 2 9B Original and Gemini 1.5 Pro for each flight feature. We obtain these priors via the prompting-based elicitation method (Supplementary B). A rating of 1 indicates the strongest preference for the earliest departure time, the shortest duration, the fewest number of stops, and the lowest price, while a rating of 5 indicates the opposite. A rating of 3 indicates no preference.
In the section Generalization to interactions with human users, we find that the original LLMs, before fine-tuning, were able to provide recommendations with an accuracy substantially higher than chance even before their first interaction with the user, suggesting that the LLMs' priors are aligned with human preferences. In this section, we test this hypothesis by asking two models, Gemma 2 and Gemini 1.5, for their verbalized beliefs in advance of any interaction with a particular user. Fig. 11 shows the results. For Gemma 2 9B, the hypothesis is only partly supported: the prior derived from this model assigns a high probability to "no preference" for most of the features, with the exception of the number of stops, where it reflects a moderate preference for fewer stops. By contrast, Gemini 1.5 Pro has a more diffuse prior over these features, which favors cheaper and shorter flights, as well as flights that leave earlier in the day, plausibly reflecting the preferences of most flyers. We note that the interpretation of this pattern of results is complicated by the fact that Gemma's verbalized prior beliefs may not faithfully reflect the underlying prior it uses to make recommendations before having interacted with a user.
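As a rough sketch of how verbalized ratings could be turned into a categorical prior over the 1-5 scale (a hypothetical illustration; the actual elicitation procedure is described in Supplementary B, and `scores_to_prior` is our name):

```python
def scores_to_prior(scores):
    """Normalize non-negative rating scores into a probability distribution
    over the 1-5 preference scale."""
    total = sum(scores)
    if total == 0:
        # Fall back to a uniform prior when the model expresses no preference.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

# Example: a model that concentrates mass on "no preference" (rating 3),
# resembling the Gemma 2 9B pattern described above.
prior = scores_to_prior([0.05, 0.10, 0.70, 0.10, 0.05])
assert abs(sum(prior) - 1.0) < 1e-9
```

A flatter input vector would instead yield a diffuse prior, as observed for Gemini 1.5 Pro.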
Figure 12: Variability across simulated users. We show accuracy after the first and final (fifth) rounds. (a) We compare the original LLMs, fine-tuned LLMs, and the upper bound (the Bayesian Assistant) on flight recommendation. (b) Comparison of LLMs and the upper bound (the Bayesian Assistant) on hotel recommendation. (c) Comparison of LLMs and the upper bound (LLMs fine-tuned directly on the task) for web shopping. Error bars indicate the standard deviation across reward functions (for flight and hotel recommendations) or product categories (for web shopping).
### D.2 Variability in LLM Accuracy Across Simulated Users
In our main experiments, we show results averaged over all simulated users. Here, we explore how the LLM's accuracy varies by user. As before, for flight and hotel recommendations, the user is characterized by a reward function. For web shopping, we have 10 users with different goals (i.e., preferred attributes) for each category; we average their performance and compute the standard deviation across 100 product categories (see Table 1 for examples). All methods exhibit high variance, as shown in Fig. 12.
Table 1: Example product categories and the corresponding goals of different users.
| Product Category | User's Goals (Preferred Attributes) |
| --- | --- |
| Beds | eco friendly, twin with drawers |
| | wood frame, easy assemble, twin |
| | memory foam, solid wood |
| Men's athletic shoes | running shoes, lace up |
| | non slip, mesh |
| | daily wear, color back, size 14 |
| Food & beverage | simple ingredients |
| | gluten free |
| | low sodium |
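The aggregation just described, averaging over the 10 users within each category and then taking the standard deviation across categories, can be sketched as follows (a toy illustration with made-up numbers, not the paper's data):

```python
import statistics

def aggregate(per_category_user_accuracies):
    """per_category_user_accuracies maps a product category to the accuracy
    obtained for each of its simulated users. We average within a category,
    then report mean and standard deviation across category means."""
    category_means = [statistics.mean(users)
                      for users in per_category_user_accuracies.values()]
    return statistics.mean(category_means), statistics.stdev(category_means)

# Toy example with three categories and three users each.
mean_acc, std_acc = aggregate({
    "Beds": [0.8, 0.6, 0.7],
    "Men's athletic shoes": [0.5, 0.4, 0.6],
    "Food & beverage": [0.9, 0.7, 0.8],
})
```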
Figure 13: Variability across reward functions over rounds. Error bars indicate standard deviation across reward functions.
We additionally show results over rounds in Fig. 13. We find that both the original LLMs and the Bayesian Assistant display high variance across reward functions. The variance of the Bayesian Assistant, like that of the fine-tuned LLMs, decreases as the number of interactions increases, whereas the variance of the original LLM remains largely constant across interactions. Notably, Gemma Bayesian has lower variance while maintaining performance similar to that of the Bayesian Assistant.
In particular, we hypothesize that reward functions that deviate more strongly from the LLM's prior (Supplementary Section D.1) may be harder to infer. For example, the LLM may assume that most people prefer shorter flights over longer ones, making it more difficult to infer the preferences of an "abnormal" user who prefers longer flights. To test the hypothesis that the variability across reward functions is due in part to the prior, we fit linear regression models predicting a reward function's final-round accuracy from its L2 distance to the mean of the prior reward distribution, focusing on Gemma in this experiment. We elicit the priors separately for Gemma Original, Gemma Bayesian, and Gemma Oracle. The prior of the Bayesian Assistant is uniform, as before. Before computing distances, we normalize the reward functions (dividing each by its sum) to account for the fact that some functions are equivalent; for example, the reward function $[-1,-1,-1,-1]$ is equivalent to $[-0.5,-0.5,-0.5,-0.5]$, as both will always lead the user to prefer the same flights.
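A minimal sketch of this analysis (our own function names, not the paper's code): normalize each reward function, measure its L2 distance to the normalized prior mean, and fit the least-squares slope of accuracy on distance. We assume reward functions with a nonzero sum, as in the all-negative example above.

```python
import math

def normalize(reward):
    """Divide a reward function by the sum of its entries so that scaled but
    equivalent functions map to the same point (assumes a nonzero sum)."""
    total = sum(reward)
    return [r / total for r in reward]

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy example: accuracy drops as the reward function moves away from the prior.
prior_mean = normalize([1, 1, 1, 1])
rewards = [[1, 1, 1, 1], [2, 1, 1, 1], [4, 1, 1, 1], [1, 0, 0, 0]]
accuracies = [0.9, 0.8, 0.7, 0.5]
distances = [l2_distance(normalize(r), prior_mean) for r in rewards]
c = fit_slope(distances, accuracies)  # negative, as for Gemma Original
```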
In line with this hypothesis, we find negative regression coefficients for Gemma Original, indicating that it performs worse when the reward function deviates from its prior (Fig. 14). The absolute coefficients for the Bayesian Assistant and Gemma Bayesian are similar to each other, and much smaller than that of Gemma Original. For all three of these models, the effect of L2 distance from the prior on final-round accuracy is significant (p $<$ 0.001). Gemma Oracle shows no sensitivity to this distance (p = 0.24).
Figure 14: The relationship between the final-round accuracy and the normalized L2 distance to the mean of the prior reward distribution (1000 randomly sampled points for readability). $c$ refers to the coefficient in a linear regression predicting accuracy from L2 distance. The impact of L2 distance on final-round accuracy is significant (p $<$ 0.001) for Gemma Original, Gemma Bayesian, and Bayesian Assistant, but not for Gemma Oracle (p = 0.24).
### D.3 Interacting with Non-deterministic Users
Our main experiments assume that the simulated user always makes decisions consistent with its reward function. By contrast, as we show in the section Generalization to interactions with human users, humans may behave inconsistently with their stated preferences. To simulate this real-world stochasticity, we evaluate a setting where the LLM interacts with a non-deterministic user. We add noise to the user's behavior, such that with a certain probability they select a non-optimal choice, that is, a choice that does not maximize their reward. The relationship between the percentage of noise and final-round accuracy is shown in Fig. 15. We experiment with the three variants of Gemma and with the Bayesian Assistant. As expected, performance decreases across the board as the amount of noise increases. For realistic noise values in the 10–60% range, we find that Gemma Bayesian is more robust to noise than not only Gemma Original and Gemma Oracle, but also the Bayesian Assistant, which is the best model in the noiseless setting. This robustness to noise illustrates an advantage of an LLM fine-tuned to mimic a symbolic model over the original symbolic model (see Discussion).
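The noisy user just described can be sketched as follows (a minimal illustration; `noisy_choice` and its signature are ours): with probability `noise`, the user picks uniformly among the non-optimal options instead of the reward-maximizing one.

```python
import random

def noisy_choice(options, reward_fn, noise, rng=random):
    """Return the index of the chosen option. With probability `noise`,
    the user deviates from the reward-maximizing option."""
    rewards = [reward_fn(o) for o in options]
    best = max(range(len(options)), key=lambda i: rewards[i])
    if rng.random() < noise:
        # Select uniformly among the non-optimal options.
        suboptimal = [i for i in range(len(options)) if i != best]
        return rng.choice(suboptimal)
    return best

# With noise=0 the user is deterministic and always reward-maximizing.
choice = noisy_choice([10, 30, 20], reward_fn=lambda r: r, noise=0.0)
assert choice == 1
```

At `noise=1.0` the user never picks the optimal option, which is why accuracy can fall below the level of the prior-only models at the right edge of Fig. 15.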
Figure 15: Final-round accuracy when interacting with a noisy simulated user. We add noise to the simulated user's choice such that, with some probability, the user chooses an option different from the one that maximizes its reward. We plot final-round accuracy with respect to the amount of noise. While all models show a decrease in performance as noise increases, Gemma Bayesian demonstrates greater robustness for noise levels between 10% and 60%. Error bars (too small to be visible in the plot) show standard errors across three random seeds (and three training runs).
### D.4 What Makes Bayesian Teaching Effective?
We have shown that it is more effective to fine-tune our LLMs on the Bayesian Assistant's predictions than on the user's true choices. In this section, we explore and rule out two deflationary hypotheses that might explain the effectiveness of this method, and tentatively conclude that the advantage of Bayesian teaching is in fact due to the probabilistically optimal predictions made by the Bayesian Assistant. In all of the experiments described in this section, we focus on Gemma 2 9B. We use the same list of flight option sets for all models, and vary only the supervision we provide during fine-tuning (i.e., the assistant's recommendations).
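As a minimal illustration of this setup (the function and field names here are ours, not the paper's code), the same option sets appear in every condition; only the supervision target for each fine-tuning example changes, from the user's true choice ("oracle") to the Bayesian Assistant's prediction ("bayesian"):

```python
def build_example(option_set, oracle_choice, bayesian_prediction, condition):
    """Build one fine-tuning example; only the completion's source varies
    across supervision conditions."""
    target = {"oracle": oracle_choice, "bayesian": bayesian_prediction}[condition]
    prompt = "Options: " + "; ".join(option_set)
    return {"prompt": prompt, "completion": target}

# The prompt is identical in both conditions; the supervision differs.
ex = build_example(["flight A", "flight B", "flight C"],
                   oracle_choice="flight B",
                   bayesian_prediction="flight C",
                   condition="bayesian")
assert ex["completion"] == "flight C"
```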
<details>
<summary>x17.png Details</summary>

### Visual Description
## Bar Charts: Model Accuracy Comparison
### Overview
The image displays two side-by-side bar charts comparing the accuracy (in percentage) of different AI model variants across two experimental conditions. Both charts share the same y-axis scale (0-100% Accuracy) and a common legend. The charts are labeled "a. Varying Incorrect Predictions" and "b. Varying Priors."
### Components/Axes
* **Y-Axis (Both Charts):** Labeled "Accuracy (%)". Scale runs from 0 to 100 in increments of 20.
* **X-Axis (Chart a):** Lists four model types: "Gemma Original", "Gemma Bayesian", "Gemma Oracle with Noise", and "Bayesian Assistant".
* **X-Axis (Chart b):** Lists four model types: "Gemma Bayesian (LLM-based)", "Gemma Bayesian (Uniform)", "Gemma Bayesian (LLM-opposite)", and "Bayesian Assistant".
* **Legend (Top-Left of each chart):**
* `After 1st Round`: Represented by hatched/patterned bars.
* `Final Round`: Represented by solid-colored bars.
* `Random`: Represented by a horizontal dashed line.
* **Data Labels:** Numerical accuracy values are printed directly above each bar.
* **Error Bars:** Present on most bars, indicating variability or confidence intervals.
### Detailed Analysis
#### Chart a: Varying Incorrect Predictions
This chart compares model performance when varying the type of incorrect predictions used in training or prompting.
* **Gemma Original:**
* After 1st Round (Blue, hatched): **37%**
* Final Round (Blue, solid): **37%**
* *Trend:* No improvement between rounds.
* **Gemma Bayesian:**
* After 1st Round (Orange, hatched): **57%**
* Final Round (Orange, solid): **76%**
* *Trend:* Significant upward slope from first to final round.
* **Gemma Oracle with Noise:**
* After 1st Round (Green, hatched): **40%**
* Final Round (Green, solid): **45%**
* *Trend:* Slight upward slope.
* **Bayesian Assistant:**
* After 1st Round (Brown, hatched): **58%**
* Final Round (Brown, solid): **81%**
* *Trend:* Strong upward slope, achieving the highest final accuracy in this chart.
* **Random Baseline:** The dashed line is positioned at approximately **33%**.
#### Chart b: Varying Priors
This chart compares model performance when using different prior distributions (LLM-based, Uniform, LLM-opposite) within a Bayesian framework.
* **Gemma Bayesian (LLM-based):**
* After 1st Round (Yellow, hatched): **51%**
* Final Round (Yellow, solid): **71%**
* *Trend:* Strong upward slope.
* **Gemma Bayesian (Uniform):**
* After 1st Round (Orange, hatched): **57%**
* Final Round (Orange, solid): **76%**
* *Trend:* Strong upward slope. (Note: This appears identical to "Gemma Bayesian" in Chart a).
* **Gemma Bayesian (LLM-opposite):**
* After 1st Round (Green, hatched): **50%**
* Final Round (Green, solid): **66%**
* *Trend:* Upward slope.
* **Bayesian Assistant:**
* After 1st Round (Brown, hatched): **58%**
* Final Round (Brown, solid): **81%**
* *Trend:* Strong upward slope, again achieving the highest final accuracy.
* **Random Baseline:** The dashed line is positioned at approximately **33%**.
### Key Observations
1. **Consistent Superiority of Bayesian Assistant:** In both experimental conditions, the "Bayesian Assistant" model achieves the highest final round accuracy (81%).
2. **Universal Improvement:** All models except "Gemma Original" show a clear improvement in accuracy from the "After 1st Round" to the "Final Round." The "Gemma Original" model shows no change.
3. **Impact of Bayesian Methods:** All Bayesian variants (Gemma Bayesian, Gemma Oracle with Noise, Gemma Bayesian with different priors) outperform the non-Bayesian "Gemma Original" in the final round.
4. **Prior Sensitivity (Chart b):** The choice of prior affects performance. The "Uniform" prior (76%) leads to higher final accuracy than the "LLM-based" (71%) or "LLM-opposite" (66%) priors for the Gemma Bayesian model.
5. **Baseline Comparison:** All final round results are substantially above the ~33% random chance baseline.
### Interpretation
The data demonstrates the effectiveness of fine-tuning on Bayesian predictions and of multi-round interaction.
* **Bayesian teaching adds value:** All models fine-tuned on Bayesian predictions improve markedly over Gemma Original in the final round, whereas oracle teaching with random noise ("Gemma Oracle with Noise") yields only a small gain.
* **The Bayesian Assistant sets the ceiling:** The Bayesian Assistant itself is the top performer in both charts (81% final accuracy), serving as the reference that the fine-tuned models approach.
* **Multi-round updating matters:** The jumps from "After 1st Round" to "Final Round" for the Bayesian-taught models show that they exploit later interactions to refine their predictions; Gemma Original does not improve across rounds.
* **Priors matter, but not dramatically:** The spread across priors (66% to 76%) is smaller than the gap between the Bayesian-taught models and Gemma Original, suggesting that the Bayesian updating itself, rather than the specific prior, is the main driver of the improvement.
</details>
Figure 16: Final-round accuracy of LLMs fine-tuned with different data variants. (a) Accuracy of the model using Bayesian teaching and the model using oracle teaching with random noise. (b) Accuracy of models fine-tuned on predictions from variants of the Bayesian Assistant, initialized with different priors. Error bars show standard errors across three random seeds (and three training runs).
#### Hypothesis: Incorrect predictions regularize training.
The Bayesian Assistant can make incorrect predictions, especially in the first few rounds, because it has only limited information about the user (see the Bayesian Assistant's accuracy over rounds in Fig. 24). Could these incorrect predictions regularize training and prevent overfitting, accounting for the effectiveness of Bayesian teaching? To test this hypothesis, we fine-tune the LLM using oracle teaching injected with random noise: 40% of the time, instead of predicting the user's choice, the assistant recommends one of the incorrect options at random. The proportion of incorrect predictions in this control roughly matches that of the Bayesian predictions averaged across all five interactions. Contrary to the regularization hypothesis, we find that incorrect predictions do not necessarily improve performance: the model fine-tuned on noisy user choices (Gemma Oracle with Noise) barely outperforms the original LLM and has high standard error (Fig. 16). This suggests that random noise alone cannot explain why Bayesian predictions are more effective; rather, the Bayesian Assistant's educated mistakes are more valuable than random errors.
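A minimal sketch of the noisy oracle supervision used in this control, assuming the incorrect replacement is drawn uniformly from the other options (names like `noisy_oracle_label` are ours, not the paper's):

```python
import random

def noisy_oracle_label(options, user_choice, p_incorrect=0.4, rng=random):
    """Supervision for the 'Gemma Oracle with Noise' control: with
    probability `p_incorrect`, replace the user's true choice with one
    of the incorrect options chosen uniformly at random.

    `options` is the list of flight options; `user_choice` is the index
    of the option the simulated user actually picked.
    """
    if rng.random() < p_incorrect:
        wrong = [i for i in range(len(options)) if i != user_choice]
        return rng.choice(wrong)
    return user_choice
```

With `p_incorrect=0.4`, roughly 40% of the fine-tuning targets differ from the user's true choice, matching the error rate of the Bayesian predictions averaged over the five interactions.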
#### Hypothesis: The LLM benefits from the correct prior.
We initialize the Bayesian Assistant using the uniform prior, which assigns equal probability to all reward functions, and therefore aligns with the data generation process of our evaluation. One hypothesis is that the LLM benefits from this correct prior (in the sense that it is calibrated to the distribution of simulated users in our experiment), which makes the predictions of the Bayesian Assistant more effective for supervised fine-tuning.
To test this hypothesis, we fine-tune Gemma three times, using the predictions of three variants of the Bayesian Assistant, initialized with three different priors: the uniform prior, the LLM-based prior obtained from Gemma Original (see Fig. 11), and the prior that is contrary to the LLM-based one (e.g., if Gemma's prior favors cheaper flights, this prior would instead prefer more expensive flights). The results are shown in Fig. 16. LLMs fine-tuned on predictions from all three Bayesian models perform well, and dramatically better than the original LLM. The choice of prior does influence the performance of the fine-tuned LLMs. The model fine-tuned on Bayesian predictions using the uniform prior, which matches the distribution of users in our sample, achieves the best accuracy. The LLM-based prior, despite being biased and spiky, leads to accuracy that is only slightly worse. The LLM-opposite prior, which is both biased and mismatched with the LLM's beliefs, leads to a more significant performance drop. That being said, the vast gap between all three LLMs fine-tuned on Bayesian predictions and Gemma Original suggests that the correct prior alone does not fully explain the effectiveness of Bayesian teaching.
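The paper describes the LLM-opposite prior only by example. One plausible construction, for a discrete grid of candidate reward weight vectors, is to give each reward function the prior mass the LLM assigns to its negation, so that a preference for cheap flights becomes a preference for expensive ones (the function name and this exact construction are our assumptions):

```python
import numpy as np

def opposite_prior(thetas, prior):
    """Flip an LLM-based prior over reward functions: assign each
    candidate reward vector theta the prior mass of -theta.

    `thetas`: (n, d) array of candidate reward weight vectors.
    `prior`:  (n,) array of prior probabilities over those vectors.
    """
    thetas = np.asarray(thetas, dtype=float)
    prior = np.asarray(prior, dtype=float)
    flipped = np.empty_like(prior)
    for i, t in enumerate(thetas):
        # Index of the candidate closest to -theta (exact match on a
        # symmetric grid).
        j = int(np.argmin(np.linalg.norm(thetas + t, axis=1)))
        flipped[i] = prior[j]
    return flipped / flipped.sum()
```

On a symmetric grid this exactly reverses the LLM's stated preferences while keeping the prior normalized.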
### D.5 Qualitative Example
In Fig. 17, we show a qualitative example of the evolution of the reward distributions of Gemma Original and the Bayesian Assistant over interactions. In this case, since the user's true reward function differs significantly from the LLM's prior, both Gemma Original and the Bayesian Assistant perform poorly at the start of the interactions. However, while the Bayesian Assistant gradually converges toward the ground-truth reward function after a few rounds, Gemma Original continues to assign high probability to reward functions that are inconsistent with its observations.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart Grid: LLM vs. Bayesian Probability by Round
### Overview
The image displays a 2x5 grid of bar charts comparing the performance of two models, LLM (top row) and Bayesian (bottom row), across five sequential rounds (Round 1 to Round 5). Each chart plots probability percentages against the "L2 Distance Rank" of predictions. A legend identifies three data series: a dashed blue line for "Ground-truth," red bars for "Incorrect" predictions, and green bars for "Correct" predictions. The overall visualization appears to track how each model's prediction accuracy and confidence (as probability) evolve relative to the ground truth rank over multiple rounds.
### Components/Axes
* **Layout:** Two rows, five columns. Each column represents a round (1-5).
* **Y-Axis (Top Row):** Labeled "LLM Probability (%)". Scale ranges from 0 to 100 in increments of 20.
* **Y-Axis (Bottom Row):** Labeled "Bayesian Probability (%)". Scale ranges from 0 to 100 in increments of 20.
* **X-Axis (All Charts):** Labeled "L2 Distance Rank". Scale ranges from 0 to 500 in increments of 100.
* **Legend:** Positioned in the top-right corner of the entire figure.
* `--- Ground-truth` (dashed blue vertical line)
* `Incorrect` (solid red bar)
* `Correct` (solid green bar)
* **Titles:** Each column is titled "Round 1", "Round 2", "Round 3", "Round 4", "Round 5" at the top of the respective column.
### Detailed Analysis
**Top Row (LLM Probability):**
* **Round 1:** One prominent red bar at approximately L2 Rank 300 with a height of ~65%. A smaller red bar is near rank 100 (~10%). No green bars are visible.
* **Round 2:** Two prominent red bars: one near rank 150 (~38%) and another near rank 200 (~38%). Smaller red bars appear near ranks 100 and 300. No green bars.
* **Round 3:** Two prominent red bars: one near rank 350 (~52%) and another near rank 380 (~35%). Smaller red bars near ranks 300 and 400. No green bars.
* **Round 4:** One very tall red bar near rank 420 (~75%). A smaller red bar near rank 100 (~8%). No green bars.
* **Round 5:** Two prominent red bars: one near rank 250 (~20%) and another near rank 280 (~30%). Smaller red bars near ranks 200 and 300. No green bars.
**Bottom Row (Bayesian Probability):**
* **Round 1:** One prominent green bar at approximately L2 Rank 300 with a height of ~65%. A smaller red bar is near rank 100 (~10%).
* **Round 2:** Multiple green bars: a cluster between ranks 200-300, with the tallest near rank 250 (~20%) and another near rank 300 (~28%). Several small red bars are also present in the same rank range.
* **Round 3:** Two prominent red bars: one near rank 250 (~22%) and another near rank 300 (~35%). Smaller red bars near ranks 0 and 100. No green bars.
* **Round 4:** One prominent green bar at rank 0 (the ground-truth line) with a height of ~55%. No red bars are visible.
* **Round 5:** One prominent green bar at rank 0 with a height of ~58%. No red bars are visible.
**Ground-Truth Line:** In all ten charts, a dashed blue vertical line is consistently positioned at L2 Distance Rank = 0.
### Key Observations
1. **Model Performance Divergence:** The LLM model (top row) shows exclusively incorrect predictions (red bars) across all five rounds. The Bayesian model (bottom row) shows a mix of incorrect and correct predictions in early rounds (1-3) but transitions to exclusively correct predictions (green bars) at the ground-truth rank (0) in Rounds 4 and 5.
2. **Trend in Bayesian Model:** The Bayesian model demonstrates a clear trend of improvement. Its correct predictions shift from being at a high L2 rank (~300 in Round 1) to being perfectly aligned with the ground truth (rank 0) by Round 4, maintaining this in Round 5.
3. **Trend in LLM Model:** The LLM model shows no correct predictions. The rank and probability of its incorrect predictions fluctuate across rounds without a clear improving trend.
4. **Probability Magnitude:** The highest probability assigned by the LLM to an incorrect prediction is ~75% (Round 4). The highest probability assigned by the Bayesian model to a correct prediction is ~65% (Round 1), later settling at ~55-58% for perfect-rank predictions.
### Interpretation
The figure compares how Gemma Original (top) and the Bayesian Assistant (bottom) redistribute probability over candidate reward functions as evidence accumulates; the correct answer sits at L2 Distance Rank 0.
* **The Bayesian Assistant converges.** Its probability mass shifts from reward functions far from the ground truth in early rounds to the ground-truth reward function (rank 0) by Rounds 4 and 5, with the early mixture of correct and incorrect predictions resolving over time.
* **Gemma Original does not.** It continues to concentrate probability on reward functions whose predictions are inconsistent with the observed choices, with no clear improvement across rounds, and is often confidently wrong (up to ~75% on an incorrect prediction in Round 4).
* **Calibration differs.** The Bayesian Assistant's moderate confidence (~55-65%) on correct predictions reflects appropriate residual uncertainty, whereas Gemma Original's high confidence on incorrect predictions indicates poor calibration on this task.
In summary, the image provides visual evidence that the Bayesian Assistant adapts and becomes accurate over rounds, while Gemma Original remains persistently and confidently incorrect.
</details>
Figure 17: The reward distributions of Gemma Original (top) and the Bayesian Assistant (bottom) over multiple rounds. The reward functions are sorted by their normalized L2 distance from the ground-truth (GT) reward function, indicated by the blue dashed line at $x=0$. Red indicates that the reward function's prediction on the given options is incorrect, while green indicates that its prediction is correct.
## Appendix E Sensitivity to the Informativeness of Option Sets
In each round of the flight recommendation task, we present the model with a set of three flight options, and the user's choice among those options. The amount of information that can be gained through this process varies from round to round. For example, a choice between two flight options that differ in exactly one feature could be more informative than the choice between options that differ along multiple dimensions: the minimal pair of options provides direct evidence for the user's preference for the particular feature. We expect a strong probabilistic reasoner to be sensitive to this factor: when the user's choice between a particular set of options provides more information about their preferences, we expect the system to update its beliefs more substantially.
In this section we test whether LLMs display this behavior. In contrast with the main experiments, where we sample the option sets randomly, here we sample them based on their informativeness. To measure the amount of information contained in a set of options $O$ , we define the ground truth information gain as
$$
g(O,o^*,p(\bm{\theta}),q(\bm{\theta})) = \mathrm{KL}\bigl(p(\bm{\theta})\,\|\,q(\bm{\theta})\bigr) - \mathrm{KL}\bigl(p(\bm{\theta})\,\|\,q(\bm{\theta}\mid O,o^*)\bigr) = \log q(\bm{\theta}^*\mid O,o^*) - \log q(\bm{\theta}^*), \tag{5}
$$
where $p(\bm{\theta})=\delta(\bm{\theta}^*)$ is a point mass on the ground-truth reward function and $q(\bm{\theta})$ is either $q_B(\bm{\theta})$ or $q_{\textit{LLM}}(\bm{\theta})$. This metric captures the increase in the posterior probability of the ground-truth reward function (that is, the user's true reward function) after this set of options has been observed. Note that $g$ is relative to the model that is used to update the probability distribution; we use $g_\textit{B}$ and $g_\textit{LLM}$ to refer to the gain derived from the Bayesian Assistant and the LLM, respectively.
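With a discrete set of candidate reward functions, the ground-truth information gain of Eq. 5 can be computed from a single Bayesian update. In this sketch the softmax choice likelihood with inverse temperature `beta` is an illustrative assumption, not necessarily the paper's exact user model:

```python
import numpy as np

def ground_truth_info_gain(options, chosen, thetas, q, theta_star_idx, beta=5.0):
    """Increase in log probability that the updating distribution q
    assigns to the true reward function after observing the user's
    choice (Eq. 5): log q(theta* | O, o*) - log q(theta*).

    `options`: (k, d) feature vectors; `chosen`: index of o*;
    `thetas`: (n, d) candidate reward functions; `q`: (n,) current
    distribution over them; `theta_star_idx`: index of the true theta.
    """
    options = np.asarray(options, float)
    thetas = np.asarray(thetas, float)
    rewards = thetas @ options.T                      # (n, k)
    logits = beta * rewards
    like = np.exp(logits - logits.max(axis=1, keepdims=True))
    like = like / like.sum(axis=1, keepdims=True)     # P(choice | theta)
    post = q * like[:, chosen]
    post = post / post.sum()                          # Bayes update
    return np.log(post[theta_star_idx]) - np.log(q[theta_star_idx])
```

A choice consistent with the true reward function yields a positive gain; an uninformative option set (all candidates predicting the same choice) yields a gain near zero.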
### E.1 Experimental Setup
We randomly sample 5,000 candidate option sets, compute the ground truth information gain of each one based on the Bayesian Assistant, and select the option set that leads to the desired value of $g_\textit{B}$ . The performance is evaluated at the end of a five-round interaction, and the ground truth information gain is averaged over these five rounds. We evaluate the Bayesian Assistant as well as Gemma Original, Gemma Oracle, and Gemma Bayesian; as in our main experiments, the Bayesian Assistant is initialized with the uniform prior.
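The selection procedure can be sketched generically; `gain_fn` stands in for the Bayesian-Assistant-based ground-truth information gain, and the function name is ours:

```python
import random

def select_option_set(candidates, gain_fn, target_gain, n_sample=5000, rng=random):
    """Draw up to `n_sample` candidate option sets and keep the one
    whose ground-truth information gain (per `gain_fn`) is closest to
    the desired value, mirroring the sampling procedure in E.1."""
    pool = candidates if len(candidates) <= n_sample else rng.sample(candidates, n_sample)
    return min(pool, key=lambda O: abs(gain_fn(O) - target_gain))
```

Varying `target_gain` over a grid produces option sets spanning the x-axis of Fig. 18.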
<details>
<summary>x19.png Details</summary>

### Visual Description
## [Line Charts with Confidence Intervals]: Comparative Performance of Gemma Model Variants
### Overview
The image contains two side-by-side line charts (labeled **a** and **b**) that plot the performance of different versions of a "Gemma" model against a common x-axis metric. Both charts include shaded regions representing confidence intervals or variance around the mean trend lines. The charts compare a baseline model ("Gemma Original"), an idealized version ("Gemma Oracle"), and a Bayesian variant ("Gemma Bayesian"). Chart **a** also includes two reference baselines: "Bayesian Assistant" and "Random."
### Components/Axes
**Common X-Axis (Both Charts):**
* **Label:** `Avg. Bayesian GT Information Gain`
* **Scale:** Linear, ranging from 0.2 to 1.2, with major ticks at 0.2, 0.4, 0.6, 0.8, 1.0, and 1.2.
**Chart a (Left):**
* **Y-Axis Label:** `Final-round Accuracy (%)`
* **Scale:** Linear, ranging from 0 to 100, with major ticks at 0, 20, 40, 60, 80, and 100.
* **Legend (Bottom-Left Corner):**
* `Gemma Original` (Solid blue line)
* `Gemma Oracle` (Solid light orange line)
* `Gemma Bayesian` (Solid orange line)
* `Bayesian Assistant` (Dashed brown line)
* `Random` (Dashed gray line)
**Chart b (Right):**
* **Y-Axis Label:** `Avg. LLM GT Information Gain`
* **Scale:** Linear, ranging from -0.4 to 0.3, with major ticks at -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, and 0.3.
* **Legend (Bottom-Right Corner):**
* `Gemma Original` (Blue dots)
* `Gemma Oracle` (Light orange dots)
* `Gemma Bayesian` (Orange dots)
### Detailed Analysis
**Chart a: Final-round Accuracy vs. Information Gain**
* **Trend Verification & Data Points (Approximate):**
1. **Gemma Original (Blue):** A flat, horizontal line. Accuracy remains constant at approximately **35%** across the entire x-axis range (0.2 to 1.2).
2. **Random (Gray Dashed):** A flat, horizontal line, nearly identical to Gemma Original, at approximately **35%**.
3. **Gemma Oracle (Light Orange):** Starts at ~55% accuracy at x=0.2. It rises to a peak of approximately **60%** around x=0.6, then gradually declines to about **55%** by x=1.2. The shaded confidence band is widest around the peak.
4. **Gemma Bayesian (Orange):** Starts at ~55% at x=0.2. It shows a steady, monotonic increase, reaching approximately **80%** accuracy by x=1.2. The trend is clearly upward-sloping.
5. **Bayesian Assistant (Brown Dashed):** Starts at ~55% at x=0.2. It exhibits the steepest and most consistent upward trend, reaching nearly **100%** accuracy by x=1.2.
**Chart b: LLM GT Information Gain vs. Bayesian GT Information Gain**
* **Trend Verification & Data Points (Approximate):**
1. **Gemma Original (Blue Dots):** The data points are scattered around the **y=0** line across the entire x-axis range. There is no clear upward or downward trend; the values fluctuate roughly between -0.05 and +0.05.
2. **Gemma Oracle (Light Orange Dots):** Starts with negative values (~ -0.3 at x=0.2). It increases sharply, crossing into positive territory around x=0.45, peaks at approximately **+0.15** near x=0.7, and then shows a slight decline to about **+0.12** by x=1.2.
3. **Gemma Bayesian (Orange Dots):** Starts with the most negative values (~ -0.4 at x=0.2). It shows a strong, consistent upward trend, crossing zero around x=0.5, and continues to rise, reaching approximately **+0.25** by x=1.2. This series shows the highest final values.
### Key Observations
1. **Performance Hierarchy:** In both metrics (Accuracy and Information Gain), the `Gemma Bayesian` model significantly outperforms the `Gemma Original` baseline and generally outperforms the `Gemma Oracle` at higher levels of Bayesian GT Information Gain (x > ~0.6).
2. **Baseline Comparison:** The `Gemma Original` model performs identically to the `Random` baseline in final accuracy (Chart a) and shows no net information gain (Chart b), suggesting it is not effectively utilizing the provided information.
3. **Oracle Limitation:** The `Gemma Oracle` model shows an initial performance boost but its gains plateau and even slightly decline at higher x-values, indicating a potential ceiling or overfitting effect.
4. **Strongest Performer:** The `Bayesian Assistant` (in Chart a) demonstrates the most robust and scalable performance, achieving near-perfect accuracy as the information gain metric increases.
5. **Correlation:** There is a clear positive correlation between the x-axis metric (`Avg. Bayesian GT Information Gain`) and the performance of the Bayesian-informed models (`Gemma Bayesian`, `Bayesian Assistant`) on both y-axis metrics.
### Interpretation
The charts probe how well each model exploits informative option sets. In Chart a, the Bayesian Assistant translates higher ground-truth information gain into near-perfect final accuracy, and Gemma Bayesian tracks it most closely; Gemma Original stays at the random baseline, indicating that it does not exploit the information signal at all. Gemma Oracle benefits from informative options only up to a point, after which its accuracy plateaus and slightly declines.
Chart b reinforces this picture: the information gain derived from Gemma Bayesian correlates most strongly with the Bayesian ground-truth gain, while Gemma Original's hovers around zero across the whole range.
The stark contrast between the ascending Gemma Bayesian line and the flat Gemma Original/Random lines in Chart a provides strong visual evidence for the efficacy of Bayesian teaching, with the Bayesian Assistant serving as the normative upper bound on what can be extracted from each option set.
</details>
Figure 18: Analysis of sensitivity of LLMs to the informativeness of option sets. (a) Effect of option set informativity on model performance. Option set informativity is defined by ground-truth information gain, i.e., the increase in the log probability assigned by the Bayesian Assistant to the ground truth reward function after observing the provided options. We plot accuracy after five interactions as a function of option set informativity averaged over the five interactions. (b) The relationship between ground-truth information gain computed from the Bayesian Assistant and from LLMs.
### E.2 Results
The Bayesian Assistantâs performance consistently improves as option sets become more informative: after observing highly informative options, its performance is almost perfect (Fig. 18). Gemma Original does not show sensitivity to option set informativity, but the fine-tuned models are much more sensitive to this factor: their performance positively correlates with the Bayesian ground-truth information gain up to a certain point. Gemma Bayesian saturates later than Gemma Oracle, and achieves higher final accuracy, especially in the highly informative regime.
### E.3 Comparing LLM-derived and Bayesian Information Gain
Recall that information gain is relative to the model that is used to update the probability distributions: $g_\textit{LLM}$ quantifies the amount of information the LLM can absorb from a particular set of options, whereas $g_\textit{B}$ quantifies the amount that the ideal Bayesian reasoner can absorb. How does $g_\textit{LLM}$ relate to $g_\textit{B}$ for each of the variants of Gemma? We find that the correlation between the two measures is weak for Gemma Original (Fig. 18). For Gemma Oracle and Gemma Bayesian, we observe a more complex pattern. When $g_\textit{B}$ is small, there is a positive relationship between the two metrics, indicating that options that are informative from the Bayesian perspective are beneficial for the fine-tuned LLMs. In this part of the range, the information gain derived from Gemma Bayesian shows a stronger correlation with $g_\textit{B}$ compared with Gemma Oracle. When $g_\textit{B}$ is large, however, the relationship levels off and we no longer see a correlation between $g_\textit{B}$ and $g_\textit{LLM}$ for either of the fine-tuned models. This suggests that even Gemma Bayesian only approximates and does not fully implement the normative Bayesian reasoning strategy.
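The regime-dependent relationship described above can be checked with a simple split; the threshold separating "small" and "large" $g_\textit{B}$ is a free choice here, since the text describes the split only qualitatively:

```python
import numpy as np

def regime_correlations(g_b, g_llm, threshold):
    """Correlate LLM-derived and Bayesian ground-truth information gain
    separately in the low- and high-informativeness regimes.

    Returns (correlation for g_b < threshold, correlation for the rest).
    """
    g_b, g_llm = np.asarray(g_b, float), np.asarray(g_llm, float)
    low = g_b < threshold

    def corr(mask):
        return float(np.corrcoef(g_b[mask], g_llm[mask])[0, 1])

    return corr(low), corr(~low)
```

A pattern like the one in the text would show a strong positive correlation in the low regime and a correlation near zero in the high regime.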
## Appendix F Human Experiments
### F.1 Humans As Assistants
#### Participants.
For the experiment where human participants acted as the assistant to a simulated user, we recruited 720 participants through the Prolific platform [Palan and Schitter, 2018]. Each human participant interacted with one simulated user from a subset of 48 simulated users (out of the total 624 users), which we sampled based on the L2 distance of their reward function from the origin. The average age of human participants was 37.2 (SD=12.5). Of those, 54.9% identified as male (395), 44.6% as female (321), and 0.6% preferred not to say (4). The major nationalities of human participants were the United States at 32.5% (234), United Kingdom at 23.2% (167), South Africa at 10.3% (74), Canada at 7.6% (55), and Poland at 4.4% (32). By ethnicity, 62.5% (450) were White, 17.4% (125) were Black, 11.9% (86) were Asian, and 5.6% (40) were Mixed. All participants reported using English as their primary language.
#### Procedure.
At the beginning of the experiment, each participant was asked to complete a preference questionnaire to indicate their initial guess of the user's preferences for each individual feature. The participant subsequently proceeded to the annotation round, where they made a recommendation from three flight options. After the selection, the human annotator received feedback indicating whether their choice was correct. They were then redirected to the preference questionnaire to report their updated beliefs about the user's preferences. This completed one round. The annotator repeated the same procedure for five rounds. Following these five rounds, we also implemented a quality control annotation round where the annotator interacted with a typical user with a highly informative option list (differing only in one feature dimension). We expected this quality control round to be very easy for participants who were paying close attention to the task. We filtered out participants who failed the quality control annotation. The mean and median completion time (including the quality control annotation) was 9.35 minutes and 7.90 minutes, respectively, with a standard deviation of 5.08 minutes.
#### Additional Results.
Our main results show the accuracy of human assistants using their direct predictions of the user's preferred choices. Since we also ask the annotator to rate their beliefs about the user's preferences after each round, we can also use these estimated preferences to make recommendations, following the same procedure we use in Section B. This allows us to evaluate on the larger held-out set and reduce noise. As shown in Fig. 19, we find that while the accuracy of the human annotators' direct predictions may not monotonically improve from one round to the next, their beliefs about the user's preferences become consistently more accurate over rounds.
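A minimal sketch of turning the reported beliefs into recommendations, assuming the per-feature ratings can be read as a linear reward weight vector (that exact scaling, and the function name, are our assumptions):

```python
import numpy as np

def recommend_from_beliefs(options, belief_weights):
    """Score each flight option by the dot product with the annotator's
    belief-derived weight vector and recommend the highest-scoring one.

    `options`: (k, d) feature vectors for the k flight options;
    `belief_weights`: (d,) per-feature preference ratings.
    """
    options = np.asarray(options, float)
    weights = np.asarray(belief_weights, float)
    return int(np.argmax(options @ weights))
```

Applying this to every option set in the held-out pool yields the belief-based accuracy curve of Fig. 19b.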
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Charts: Direct vs. Belief-based Prediction Accuracy
### Overview
The image contains two side-by-side line charts comparing the performance of two prediction methods ("Direct" and "Beliefs") against a "Random" baseline. Both charts plot prediction accuracy (as a percentage) against the number of interactions. The charts are labeled "a. Direct Prediction" and "b. Belief-based Prediction on Held-out Set."
### Components/Axes
**Common Elements (Both Charts):**
* **Y-Axis:** Labeled "Accuracy (%)". Scale runs from 0 to 100, with major tick marks at 0, 20, 40, 60, 80, and 100.
* **X-Axis:** Labeled "# Interactions". Represents a discrete count of interactions.
* **Legend:** Located in the top-right corner of each chart's plot area.
* A green line with diamond markers represents the model's performance.
* A gray dashed line represents the "Random" baseline.
* **Data Series:** Each model's performance is shown as a green line with diamond markers at each data point. Vertical error bars extend above and below each marker.
**Chart-Specific Elements:**
* **Chart (a):**
* **Title:** "a. Direct Prediction"
* **X-Axis Range:** 0 to 4 interactions.
* **Legend Label for Green Line:** "Direct"
* **Chart (b):**
* **Title:** "b. Belief-based Prediction on Held-out Set"
* **X-Axis Range:** 0 to 5 interactions.
* **Legend Label for Green Line:** "Beliefs"
### Detailed Analysis
**Chart (a): Direct Prediction**
* **Trend:** The "Direct" method's accuracy shows an initial increase and then plateaus.
* **Data Points (Approximate):**
* Interactions: 0 | Accuracy: ~35% (Error bar spans ~25% to ~45%)
* Interactions: 1 | Accuracy: ~40% (Error bar spans ~30% to ~50%)
* Interactions: 2 | Accuracy: ~48% (Error bar spans ~38% to ~58%)
* Interactions: 3 | Accuracy: ~47% (Error bar spans ~37% to ~57%)
* Interactions: 4 | Accuracy: ~47% (Error bar spans ~37% to ~57%)
* **Random Baseline:** The dashed "Random" line is constant at approximately 33% accuracy across all interaction counts.
**Chart (b): Belief-based Prediction on Held-out Set**
* **Trend:** The "Beliefs" method's accuracy shows a steady, gradual increase across all measured interactions.
* **Data Points (Approximate):**
* Interactions: 0 | Accuracy: ~37% (Error bar spans ~30% to ~44%)
* Interactions: 1 | Accuracy: ~42% (Error bar spans ~35% to ~49%)
* Interactions: 2 | Accuracy: ~44% (Error bar spans ~37% to ~51%)
* Interactions: 3 | Accuracy: ~46% (Error bar spans ~39% to ~53%)
* Interactions: 4 | Accuracy: ~48% (Error bar spans ~41% to ~55%)
* Interactions: 5 | Accuracy: ~50% (Error bar spans ~43% to ~57%)
* **Random Baseline:** The dashed "Random" line is constant at approximately 33% accuracy across all interaction counts.
### Key Observations
1. **Superiority Over Random:** Both the "Direct" and "Beliefs" methods consistently outperform the "Random" baseline (33%) at every measured point after zero interactions.
2. **Performance Trajectory:** The "Direct" method (Chart a) appears to reach a performance ceiling or plateau after 2 interactions. In contrast, the "Beliefs" method (Chart b) demonstrates a continuous, albeit slowing, upward trend in accuracy up to 5 interactions.
3. **Initial Performance:** At 0 interactions, both methods start at a similar accuracy level (~35-37%), which is only marginally better than random guessing.
4. **Variability:** The error bars for both methods are substantial, indicating significant variance in performance across different runs or samples. The overlap in error bars between consecutive points suggests the improvements, while visible in the trend, may not always be statistically distinct at each step.
### Interpretation
The data suggests that incorporating a "belief-based" mechanism into the prediction model leads to more sustained learning over multiple interactions compared to a "direct" prediction approach. While both methods improve upon a random baseline, the direct method's performance gains saturate quickly. The belief-based method's steady climb implies it may be better at accumulating knowledge or refining its internal state with each interaction, leading to better generalization on a held-out set. This is a classic pattern in machine learning where a more complex model (beliefs) can capture incremental improvements that a simpler model (direct) cannot sustain. The large error bars, however, caution that the exact performance can be noisy, and the observed trends represent average behavior. The key takeaway is that for tasks requiring sequential interactions, a belief-augmented architecture appears more promising for long-term performance gains.
</details>
Figure 19: Accuracy of the human assistant over rounds. (a) Based on the human's direct predictions on provided option sets. (b) Based on the human's beliefs about the user's preferences on held-out option sets. Error bars show the averaged standard error across human participants.
#### Qualitative Analysis.
One pattern we observe in human assistants is that they tend to favor simpler heuristics when there is limited evidence. For example, in Table 2, we show that when there are multiple valid user preferences, human assistants may rely on simpler heuristics, e.g., in this example, always choosing the cheapest flight. In contrast, the fine-tuned Gemma Bayesian model does not seem to exhibit this behavior.
Table 2: Qualitative examples of LLM and human predictions. Here, the user strongly prefers an early departure time, weakly prefers a short flight duration, and has no preference for the number of stops and the price. Most human participants tend to favor a simpler heuristic, i.e., always choosing the cheapest flight, while Gemma Bayesian does not seem to exhibit this behavior.
| Option | Departure Time | Duration | # Stops | Price | True Choice | Gemma Bayesian | Human Choices |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flight 1 | 05:12 PM | 30 min | 1 | $190 | Flight 1 | Flight 1 | Flight 1: 66.7% |
| Flight 2 | 03:36 PM | 12 hr 12 min | 2 | $460 | | | Flight 2: 26.7% |
| Flight 3 | 10:00 PM | 10 hr 15 min | 2 | $640 | | | Flight 3: 6.7% |
| Flight 1 | 06:48 PM | 4 hr 24 min | 1 | $370 | Flight 2 | Flight 2 | Flight 1: 40.0% |
| Flight 2 | 07:36 AM | 16 hr 6 min | 1 | $100 | | | Flight 2: 33.3% |
| Flight 3 | 10:00 PM | 20 hr | 0 | $550 | | | Flight 3: 26.7% |
| Flight 1 | 10:00 PM | 30 min | 1 | $280 | Flight 3 | Flight 3 | Flight 1: 60.0% |
| Flight 2 | 08:24 PM | 30 min | 0 | $910 | | | Flight 2: 0.0% |
| Flight 3 | 06:00 AM | 8 hr 18 min | 0 | $370 | | | Flight 3: 40.0% |
### F.2 Humans As Users
#### Participants.
For the experiment where human participants acted as the users, we recruited 500 participants through the Prolific platform. The average age of the participants was 38.7 (SD=13.6); 51.0% identified as male (255), 48.4% as female (242), and 0.6% preferred not to say (3). The major nationalities of human participants were the United States at 40.0% (200), United Kingdom at 16.0% (80), South Africa at 9.0% (45), Canada at 7.8% (39), and Australia at 5.6% (28), with smaller representations from other countries. In terms of ethnicity, 65.2% (326) identified as White, 15.0% (75) as Black, 8.4% (42) as Asian, 7.2% (36) as Mixed, and 4.0% (20) as Other. All participants reported that English is their primary language.
#### Procedure.
Each participant in this experiment was first asked to complete a preference questionnaire to indicate their own preferences for different flight features. They then proceeded to the annotation rounds, where they needed to select their preferred option out of three flight options. To ensure quality, we required annotators to think for at least 30 seconds before making their selection. The procedure continued for five rounds. Participants were told to make choices consistent with their initially stated preferences throughout all five rounds. The mean and median completion times were 6.43 minutes and 5.18 minutes, respectively, with a standard deviation of 3.51 minutes.
#### Additional Results.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Bar Charts: Flight Rating Distributions
### Overview
The image displays four separate bar charts arranged horizontally in a single row. Each chart visualizes the probability distribution (in percentage) of user ratings (1 to 5) for a different flight attribute. The charts share identical axes and styling, using blue bars on a white background.
### Components/Axes
* **Chart Titles (Top of each subplot):** "Departure Time", "Duration", "Number of Stops", "Price".
* **Y-Axis (Common to all charts):** Labeled "Probability (%)". The scale runs from 0 to 100 with major tick marks at 0, 20, 40, 60, 80, and 100.
* **X-Axis (Common to all charts):** Labeled "Rating". The scale shows discrete categories: 1, 2, 3, 4, 5.
* **Data Series:** Each chart contains five vertical bars, one for each rating category (1-5). All bars are the same shade of blue. There is no legend, as the color is consistent and the categories are defined by the x-axis.
### Detailed Analysis
**1. Departure Time Chart:**
* **Trend:** The distribution is relatively flat with a slight peak at rating 2.
* **Approximate Values:**
* Rating 1: ~25%
* Rating 2: ~30% (Highest)
* Rating 3: ~20%
* Rating 4: ~20%
* Rating 5: ~8% (Lowest)
**2. Duration Chart:**
* **Trend:** A clear peak at rating 2, followed by a steady decline for higher ratings.
* **Approximate Values:**
* Rating 1: ~28%
* Rating 2: ~40% (Highest)
* Rating 3: ~25%
* Rating 4: ~8%
* Rating 5: ~2% (Lowest)
**3. Number of Stops Chart:**
* **Trend:** A very strong skew towards rating 1, with a sharp drop-off for all other ratings.
* **Approximate Values:**
* Rating 1: ~55% (Highest)
* Rating 2: ~35%
* Rating 3: ~5%
* Rating 4: ~5%
* Rating 5: ~2% (Lowest)
**4. Price Chart:**
* **Trend:** Similar to the Duration chart, with a pronounced peak at rating 2.
* **Approximate Values:**
* Rating 1: ~38%
* Rating 2: ~48% (Highest)
* Rating 3: ~10%
* Rating 4: ~4%
* Rating 5: ~1% (Lowest)
### Key Observations
* **Consistent Low Ratings:** Across all four attributes, ratings 4 and 5 consistently have the lowest probabilities, rarely exceeding 10%.
* **Dominant Ratings:** Each attribute has a dominant rating category: "Number of Stops" is overwhelmingly rated 1, while "Duration" and "Price" are most frequently rated 2. "Departure Time" has the most even distribution.
* **Attribute Sensitivity:** The "Number of Stops" chart shows the most polarized response, suggesting users have a very strong preference (likely for fewer stops, hence the high rating 1 probability). "Departure Time" appears to be the least polarizing factor.
### Interpretation
This data likely represents user satisfaction or preference ratings for different aspects of a flight booking or experience. The distributions suggest:
1. **"Number of Stops" is the most critical differentiator.** The extreme skew towards rating 1 indicates that a non-stop flight (or a flight with the minimum number of stops) is highly valued and is perhaps a primary decision-making factor. Flights with more stops receive disproportionately poor ratings.
2. **"Duration" and "Price" are important but follow a similar, less extreme pattern.** The peak at rating 2 suggests that for these factors, a "good but not perfect" offering is most common or most accepted. Very long durations or very high prices (which would correspond to ratings 4 and 5) are strongly avoided or result in poor satisfaction.
3. **"Departure Time" is a more neutral factor.** The flatter distribution implies that user satisfaction with departure times is more varied and less tied to a single ideal. This could mean acceptable departure windows are broader, or that this factor is less important in the overall rating.
**Overall Implication:** To maximize positive ratings (or booking likelihood), a flight service should prioritize offering non-stop routes. Competitive pricing and reasonable duration are also key, but the data shows a clear hierarchy where minimizing stops is paramount. The low probabilities for ratings 4 and 5 across the board indicate that extreme negatives in any category are rare, suggesting the dataset may come from a context where baseline service levels are already acceptable.
</details>
Figure 20: The distributions of human participants' initial preferences for different flight features. A rating of 1 indicates the strongest preference for the earliest departure time, the shortest duration, the fewest number of stops, and the lowest price, while a rating of 5 indicates the opposite. A rating of 3 indicates no preference.
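The rating scheme in Figure 20 can be read as a signed weight per feature. The sketch below is a hypothetical illustration of that mapping and of scoring options with it; the paper's actual reward parameterization is not specified here, and `rating_to_weight` and `score_option` are names introduced for this example only.

```python
# Hypothetical sketch: map a 1-5 rating to a signed preference weight.
# Rating 1 -> strong preference for the "low" end of a feature
# (earliest departure, shortest duration, fewest stops, lowest price),
# 3 -> indifference, 5 -> strong preference for the "high" end.
def rating_to_weight(rating: int) -> float:
    assert 1 <= rating <= 5
    return (3 - rating) / 2.0  # 1 -> +1.0, 3 -> 0.0, 5 -> -1.0

def score_option(ratings, features_normalized):
    # features_normalized: each feature scaled to [0, 1], where 0 is the
    # low end. A positive weight rewards low feature values.
    return sum(rating_to_weight(r) * (1.0 - f)
               for r, f in zip(ratings, features_normalized))

# A user who strongly prefers cheap flights (price rating 1) and is
# otherwise indifferent (ratings 3) scores the cheap option higher:
ratings = [3, 3, 3, 1]                 # departure, duration, stops, price
cheap = [0.5, 0.5, 0.5, 0.0]           # lowest price
pricey = [0.5, 0.5, 0.5, 1.0]          # highest price
assert score_option(ratings, cheap) > score_option(ratings, pricey)
```

Under this reading, the skew toward rating 1 for "Number of Stops" in Figure 20 corresponds to a strongly positive weight on having few stops for most participants.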
<details>
<summary>x22.png Details</summary>

### Visual Description
## [Chart Pair]: Distribution of Reward Functions and Accuracy Comparison
### Overview
The image contains two distinct charts presented side-by-side. The left chart (a) is a bar chart showing the probability distribution across a set of reward function indices. The right chart (b) is a line chart comparing the accuracy of four different models or methods over a series of interactions. The overall context appears to be an analysis of reinforcement learning or AI alignment, focusing on reward functions and model performance.
### Components/Axes
**Chart a: Distribution of Reward Function**
* **Title:** "a. Distribution of Reward Function"
* **X-axis:** Label: "Reward Function Index". Scale: Linear, marked from 0 to 600 in increments of 100.
* **Y-axis:** Label: "Probability (%)". Scale: Linear, marked from 0 to 5 in increments of 1.
* **Data Series:** A single series represented by vertical blue bars. Each bar corresponds to a specific reward function index.
**Chart b: Accuracy on Human Reward Fn. Set**
* **Title:** "b. Accuracy on Human Reward Fn. Set"
* **X-axis:** Label: "# Interactions". Scale: Discrete integers from 0 to 5.
* **Y-axis:** Label: "Accuracy (%)". Scale: Linear, marked from 0 to 100 in increments of 20.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries:
1. `Gemma Original` - Solid blue line with diamond markers.
2. `Gemma Oracle` - Solid yellow line with square markers.
3. `Gemma Bayesian` - Solid orange line with circle markers.
4. `Bayesian Assistant` - Dashed brown line with diamond markers.
### Detailed Analysis
**Chart a: Distribution of Reward Function**
The distribution is highly non-uniform and sparse. The probability mass is concentrated in several distinct clusters or peaks across the index range.
* **Highest Peak:** The tallest bar is located at approximately index 150-160, reaching a probability of nearly 4%.
* **Other Major Peaks:** Significant clusters of high probability appear around indices 250-280, 350-400, and 450-500. The peaks in these clusters generally range between 2% and 3.5% probability.
* **Low-Probability Regions:** Large stretches of indices, particularly between 0-100, 200-250, and 500-600, show very low or near-zero probability, indicated by very short or absent bars.
* **Overall Shape:** The chart suggests that out of over 600 possible reward functions, only a specific subset (perhaps 50-100) are considered probable or relevant in this context.
**Chart b: Accuracy on Human Reward Fn. Set**
This chart tracks the performance of four methods as the number of interactions increases.
* **Trend Verification & Data Points (Approximate):**
* **Gemma Original (Blue):** The line is nearly flat, showing minimal improvement. It starts at ~55% accuracy at 0 interactions and ends at ~52% at 5 interactions. The trend is essentially stagnant.
* **Gemma Oracle (Yellow):** Shows a steady, moderate upward trend. Starts at ~40% (0 interactions), rises to ~50% (1), ~55% (2), ~58% (3), ~60% (4), and ends at ~62% (5).
* **Gemma Bayesian (Orange):** Shows a strong, consistent upward trend. Starts the lowest at ~28% (0 interactions), then climbs sharply to ~55% (1), ~65% (2), ~70% (3), ~75% (4), and ends at ~78% (5).
* **Bayesian Assistant (Dashed Brown):** Shows the strongest performance and trend. Starts at ~40% (0 interactions), rises to ~60% (1), ~70% (2), ~75% (3), ~80% (4), and ends at the highest point of ~82% (5). It consistently outperforms the other methods after the first interaction.
### Key Observations
1. **Performance Hierarchy:** After 1 interaction, a clear and consistent performance hierarchy is established and maintained: Bayesian Assistant > Gemma Bayesian > Gemma Oracle > Gemma Original.
2. **Impact of Bayesian Methods:** Both methods incorporating "Bayesian" in their name (Gemma Bayesian and Bayesian Assistant) show significantly greater improvement with more interactions compared to the non-Bayesian methods.
3. **The "Oracle" Baseline:** The Gemma Oracle, which likely represents an idealized or informed baseline, is outperformed by the Bayesian methods after just 1-2 interactions.
4. **Stagnation of Baseline:** The Gemma Original model shows no benefit from additional interactions, suggesting it lacks an effective mechanism for learning or adaptation from the provided feedback.
5. **Sparse Reward Distribution:** Chart (a) indicates that the "Human Reward Fn. Set" referenced in chart (b)'s title is not a uniform set but is composed of a sparse selection of specific, probable reward functions.
### Interpretation
The data tells a compelling story about model adaptation and the value of Bayesian approaches in learning from human feedback (or a set of human-aligned reward functions).
* **What the data suggests:** The primary finding is that models equipped with Bayesian updating mechanisms (Gemma Bayesian, Bayesian Assistant) are far more effective at leveraging successive interactions to improve their accuracy on a target set of human reward functions. The Bayesian Assistant, in particular, demonstrates the most efficient and highest overall learning curve.
* **How elements relate:** Chart (a) provides crucial context for chart (b). The sparse distribution of reward functions implies that the learning task is not about mastering a broad, uniform space, but about identifying and aligning with a specific, narrow set of "correct" or "human-preferred" functions. The success of the Bayesian methods suggests they are particularly adept at this kind of targeted identification and alignment.
* **Notable anomalies/outliers:** The complete lack of improvement in the Gemma Original model is the most striking anomaly. It serves as a critical control, highlighting that the improvements seen in the other models are due to their specific architectural or algorithmic choices (like Bayesian inference) and not merely a function of receiving more interactions.
* **Underlying implication:** The results argue strongly for the integration of probabilistic, Bayesian reasoning into AI systems designed to learn from human feedback. Such systems appear to be more sample-efficient (learning faster from fewer interactions) and ultimately more capable of achieving high alignment accuracy. The "Assistant" variant likely incorporates additional design elements that further optimize this learning process.
</details>
Figure 21: Analysis of human reward functions. (a) Distribution of human reward functions. (b) Accuracy over rounds on the subset of the original data where the simulated user's reward function is in the set of reward functions stated by the human participants. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x23.png Details</summary>

### Visual Description
## [Multi-Panel Chart]: Human Consistency and Model Accuracy Analysis
### Overview
The image is a composite figure containing three main panels (a, b, c), each presenting statistical analyses related to human user consistency and the accuracy of different AI models (Gemma variants and a Bayesian Assistant) across multiple interaction rounds. The data appears to come from a study evaluating model performance in a task involving option sets, with results broken down by all data and a "High Consistency" subset.
### Components/Axes
The figure is divided into three horizontal panels, each with a title:
* **Panel a:** "Human User Average Consistency"
* **Panel b:** "Accuracy on Human-annotated Option Sets"
* **Panel c:** "Accuracy on Held-out Option Sets"
**Panel a Components:**
1. **Left Chart (Line Graph):**
* **Title:** Human User Average Consistency
* **Y-axis:** "Consistency (%)" ranging from 0 to 100.
* **X-axis:** "Round" with discrete markers at 1, 2, 3, 4, 5.
* **Data Series:** A single blue line with diamond markers.
2. **Right Chart (Histogram):**
* **Y-axis:** "Probability (%)" ranging from 0 to 35.
* **X-axis:** "Avg. Consistency (%)" with bins labeled 0, 20, 40, 60, 80, 100.
* **Data Series:** A series of blue vertical bars representing a probability distribution.
**Panel b & c Components (Identical Structure):**
Each panel contains two side-by-side line charts.
1. **Left Chart:** Subtitled "All".
2. **Right Chart:** Subtitled "High Consistency".
3. **Shared Axes:**
* **Y-axis:** "Accuracy (%)" ranging from 0 to 100.
* **X-axis:** "# Interactions". Panel b ranges from 0 to 4. Panel c ranges from 0 to 5.
4. **Legend (Present in the right chart of each panel, bottom-right corner):**
* **Gemma Original:** Blue line with diamond markers.
* **Gemma Oracle:** Light orange line with circle markers.
* **Gemma Bayesian:** Dark orange line with diamond markers.
* **Bayesian Assistant:** Gray dashed line with circle markers.
### Detailed Analysis
**Panel a: Human User Average Consistency**
* **Line Graph Trend:** The blue line shows a slight dip from Round 1 to Round 2, then a gradual, shallow upward trend through Round 5.
* **Approximate Data Points:**
* Round 1: ~65%
* Round 2: ~58%
* Round 3: ~60%
* Round 4: ~63%
* Round 5: ~63%
* **Histogram Distribution:** The distribution of average consistency across users is roughly unimodal and slightly left-skewed.
* **Approximate Probabilities by Bin:**
* 0-20%: ~3%
* 20-40%: ~7%
* 40-60%: ~19%
* 60-80%: ~31% (Mode)
* 80-100%: ~27%
* >100%: ~12% (Note: This bin extends beyond the 100% label, suggesting some values may be slightly above 100% or the binning is inclusive).
**Panel b: Accuracy on Human-annotated Option Sets**
* **"All" Data Chart:**
* **Gemma Original (Blue):** Starts high (~61% at 0 interactions) and remains nearly flat, ending at ~61% at 4 interactions.
* **Gemma Oracle (Light Orange):** Starts low (~32% at 0), rises steadily to ~51% at 4 interactions.
* **Gemma Bayesian (Dark Orange):** Starts lowest (~21% at 0), shows the steepest increase, surpassing Gemma Oracle around 2 interactions, and ends highest at ~62% at 4 interactions.
* **Bayesian Assistant (Gray Dashed):** Starts at ~33% at 0, increases steadily to ~56% at 4 interactions.
* **"High Consistency" Data Chart:**
* **Gemma Original (Blue):** Starts at ~64% at 0, remains flat until a slight rise to ~65% at 4 interactions.
* **Gemma Oracle (Light Orange):** Starts at ~33% at 0, rises to ~52% at 4 interactions.
* **Gemma Bayesian (Dark Orange):** Starts at ~21% at 0, rises sharply to ~67% at 4 interactions, becoming the top performer.
* **Bayesian Assistant (Gray Dashed):** Starts at ~34% at 0, rises to ~62% at 4 interactions.
**Panel c: Accuracy on Held-out Option Sets**
* **"All" Data Chart:**
* **Gemma Original (Blue):** Starts at ~64% at 0, declines to ~57% at 1 interaction, then slowly recovers to ~61% at 5 interactions.
* **Gemma Oracle (Light Orange):** Starts at ~40% at 0, rises to ~53% at 5 interactions.
* **Gemma Bayesian (Dark Orange):** Starts at ~19% at 0, rises steeply to ~61% at 5 interactions.
* **Bayesian Assistant (Gray Dashed):** Starts at ~40% at 0, rises to ~59% at 5 interactions.
* **"High Consistency" Data Chart:**
* **Gemma Original (Blue):** Starts at ~65% at 0, dips to ~60% at 1, and ends at ~59% at 5 interactions.
* **Gemma Oracle (Light Orange):** Starts at ~39% at 0, rises to ~55% at 5 interactions.
* **Gemma Bayesian (Dark Orange):** Starts at ~19% at 0, rises sharply to ~67% at 5 interactions.
* **Bayesian Assistant (Gray Dashed):** Starts at ~39% at 0, rises to ~64% at 5 interactions.
### Key Observations
1. **Human Consistency:** Human user consistency is moderate (averaging around 60-65%) and relatively stable across rounds, with most users falling in the 60-80% consistency range.
2. **Model Learning Curves:** All models except "Gemma Original" show a clear positive learning curve, with accuracy improving as the number of interactions increases.
3. **Bayesian Model Superiority:** The "Gemma Bayesian" model consistently shows the most dramatic improvement, starting from the lowest accuracy at 0 interactions but often achieving the highest or near-highest accuracy by the final interaction, especially in the "High Consistency" subsets.
4. **Impact of Data Subset:** Models generally achieve higher final accuracy scores on the "High Consistency" data subsets compared to the "All" data subsets in both panels b and c.
5. **Gemma Original Plateau:** The "Gemma Original" model shows little to no improvement with more interactions, suggesting it may not be adapting or learning within this interaction framework.
### Interpretation
This data suggests a study where AI models are iteratively refined or queried over multiple interactions to perform a task (likely involving selecting or evaluating options). The key findings are:
* **The value of iterative interaction:** Models that incorporate Bayesian updating (Gemma Bayesian, Bayesian Assistant) leverage additional interactions to significantly improve their accuracy, while the base model (Gemma Original) does not.
* **Human data as a filter:** Performance is notably better when evaluated on option sets where human users were highly consistent. This implies that model accuracy is sensitive to the clarity or consensus of the underlying task, and high human consistency may indicate a less ambiguous problem space where models can excel.
* **Cold-start problem:** The Bayesian models start with very low accuracy at zero interactions but overcome this quickly. This highlights a potential trade-off: these models may require a few interactions to "warm up" and gather sufficient data to make accurate predictions, but their ceiling for improvement is higher.
* **Task Difficulty:** The general trend of improvement across interactions for most models indicates the task is learnable. The initial low scores for Bayesian models might reflect a deliberate exploration phase or a prior that is updated with interaction data.
In essence, the figure argues for the efficacy of Bayesian approaches in interactive learning settings, particularly when dealing with data that has a strong signal of human consensus.
</details>
Figure 22: Results on interactions with real human users. (a) Consistency between the human users' choices and the predictions derived from their initially stated preferences. We show user consistency over rounds and the distribution of the average user consistency. Error bars show standard errors across the five-round option-set lists. (b) Accuracy over rounds on human-annotated option sets. We show the results for all human users and for users with high consistency, i.e., those whose choices matched their initially stated preferences in 4 or 5 of the rounds (40.4% of the data). (c) Accuracy over rounds on the held-out set, where the preferred choices are deterministically computed based on the human user's preferences. Error bars show standard errors across three random seeds (and three training runs).
In the main paper we report results for this more realistic setting, where the model interacts with real human users on the flight recommendation task. Surprisingly, we also find that the original LLMs achieve good performance, unlike what was observed in earlier experiments.
We hypothesize that two factors may contribute to their improved performance. First, unlike our simulated users, whose preferences are uniformly sampled from the space of possible reward functions, human preferences are biased towards particular types of functions; in Fig. 21 we show that some reward functions are considerably more common than others in our sample of human participants. For example, most participants report preferring cheaper flights (see Fig. 20). As such, a viable strategy for the original LLM could be to rely on its prior knowledge about user preferences to make relatively good recommendations. To investigate this further, we filter results for simulated users based on the reward functions stated by the human participants in Fig. 21. We observe that in this case as well, Gemma Original achieves a higher accuracy of around 60% (as opposed to 37% in Fig. 2), matching the high accuracy it obtained in Fig. 6. This makes it clear that the bias in human preferences in this experiment contributes to the stronger performance of the original LLMs.
Second, human users may not necessarily behave consistently with their preferences; that is, their choices may differ from those that would reflect their initially stated preferences. Indeed, note how in Fig. 21 the gap between the original LLM and the Bayesian LLM increases significantly when evaluating on consistent simulated users. To quantify this potential discrepancy, we compute the consistency between the human user's choices and the predictions derived from their preferences. The latter are obtained by mapping their stated preferences to corresponding reward functions and selecting the option with the highest reward accordingly. In line with our hypothesis, the average consistency is relatively low at 60%, with chance performance being 33.3% (Fig. 22).
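The consistency metric described above amounts to comparing each actual choice against the argmax-reward option implied by the stated preferences. A minimal sketch, where `reward` is a stand-in for the reward function derived from a participant's questionnaire answers (the name and data layout are illustrative, not the paper's code):

```python
# Minimal sketch of the consistency metric: for each round, check whether
# the human's actual choice matches the option with the highest reward
# under the reward function implied by their initially stated preferences.
def consistency(choices, option_sets, reward):
    """choices[i]: index the human picked in round i;
    option_sets[i]: list of options shown in round i;
    reward: scores a single option."""
    matches = sum(
        choice == max(range(len(options)), key=lambda j: reward(options[j]))
        for choice, options in zip(choices, option_sets)
    )
    return matches / len(choices)

# Toy example: a user whose stated preferences imply always taking the
# cheapest flight (reward = negative price).
reward = lambda opt: -opt["price"]
option_sets = [
    [{"price": 190}, {"price": 460}, {"price": 640}],
    [{"price": 370}, {"price": 100}, {"price": 550}],
]
choices = [0, 0]  # picked the cheapest in round 1, but not in round 2
print(consistency(choices, option_sets, reward))  # 0.5
```

With three options per round, a user choosing uniformly at random would score about 33.3% on this metric, which is the chance baseline quoted above.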
We further break down the performance by user consistency over rounds and show results for high-consistency users; that is, users whose choices were consistent with their stated preferences in 4 or 5 of the rounds (Fig. 22). We find that all models perform better for the high-consistency users. Specifically, when user consistency is high, the improvement of Gemma Bayesian over Gemma Original increases.
Finally, to limit the effect of such inconsistencies, while still retaining the real interactions between human users and the model, we also evaluate the LLMs on a held-out set of 100 randomly sampled options that simulate perfectly consistent users; to do so we use the preferred options derived from the participantsâ initially stated preferences rather than the participantsâ actual choices. As shown in Fig. 22, when removing inconsistency from the evaluation data, Gemma Bayesian achieves the best performance. Gemma Original performs best initially, likely due to its correct prior about human users, but its performance decreases over rounds, indicating its limited ability to incorporate the simulated userâs feedback.
### F.3 Human Annotation Interface
We show the human annotation interface where humans act as the assistant in Fig. 23. The interface allows the human annotator to select the best option from three flight options, rate their estimation of the userâs preferences, and check the flight booking history from previous rounds. The annotation interface where humans act as the user is similar.
## Appendix G Statistical Analyses
This appendix reports analyses that test whether Bayesian teaching leads to statistically significant improvements over the baselines. We fit linear mixed-effects models treating method (Bayesian teaching, oracle teaching, and the original) and model family (Gemma, Llama, and Qwen) as fixed effects while controlling for various sources of randomness. For flight and hotel recommendation, we include training run, evaluation random seed, and reward function as random effects. For web shopping, we treat training run and product category as random effects. Overall, the models demonstrate statistically significant differences between methods across all domains and all three model families (Gemma, Llama, and Qwen).
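A minimal sketch of this kind of mixed-effects specification using `statsmodels`, on synthetic data (the column names, effect sizes, and the single random intercept are illustrative assumptions; the paper's models include additional random effects):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic accuracy data: method and model family as fixed effects,
# training run as a grouping factor for a random intercept.
rng = np.random.default_rng(0)
rows = []
for method, effect in [("Original", 0.0), ("Oracle", 0.2), ("Bayesian", 0.35)]:
    for family in ["Gemma", "Llama", "Qwen"]:
        for run in range(3):          # training run (random effect grouping)
            for _ in range(20):       # evaluation episodes
                acc = 0.37 + effect + rng.normal(0, 0.05)
                rows.append(dict(method=method, family=family,
                                 run=f"{family}-{run}", accuracy=acc))
df = pd.DataFrame(rows)

# Treatment coding with "Original" as the baseline, so coefficients are
# read as improvements over the original LLM.
model = smf.mixedlm("accuracy ~ C(method, Treatment('Original')) + family",
                    df, groups=df["run"])
result = model.fit()
print(result.summary())
```

The fitted coefficients for the Oracle and Bayesian levels then play the role of the per-method improvements reported below, with p-values and confidence intervals coming from the same summary.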
In flight recommendation, the original LLM achieves a baseline accuracy of 37.0% (95% CI: 30.6–43.5%). The Oracle LLM performs significantly better with a 24.0% increase (95% CI: 16.6–31.4%, p $<$ 0.001), while the Bayesian LLM shows an even more substantial 38.5% increase (95% CI: 31.1–45.9%, p $<$ 0.001). Model family shows no significant effect on performance, with differences between model families all non-significant. The interaction between method and model family was not statistically significant (minimum p = 0.19). Within each model family, improvements between all methods are significant (p $<$ 0.001), with the exception of Qwen Oracle versus Qwen Original, which shows slightly weaker but still significant improvement (p = 0.002).
In hotel recommendation, the original LLM achieves a baseline accuracy of 36.7% (95% CI: 32.1–41.3%). The Oracle LLM performs significantly better with a 16.7% increase (95% CI: 11.4–22.0%, p $<$ 0.001), while the Bayesian LLM shows a 29.4% increase (95% CI: 24.1–34.7%, p $<$ 0.001). Model family shows no significant main effect on performance. The interaction between method and model family is not statistically significant (all interaction p-values $>$ 0.11). Within each model family, most pairwise comparisons show p-values $<$ 0.001, with two exceptions: Llama Bayesian versus Llama Oracle shows weaker significance (p = 0.001), and Qwen Oracle versus Qwen Original shows weaker significance (p = 0.002).
In web shopping, the original LLM achieves a baseline accuracy of 54.0% (95% CI: 49.6–58.4%). The Oracle LLM performs significantly better with a 7.1% increase (95% CI: 2.3–11.8%, p = 0.013), while the Bayesian LLM shows a more substantial 18.6% increase (95% CI: 13.8–23.4%, p $<$ 0.001). Unlike the other domains, model family shows a significant effect, with Qwen showing a significant decrease of -11.1% (95% CI: -17.0 to -5.3%, p = 0.003) compared to the baseline. There is also a significant interaction between the Oracle method and Qwen (15.8%, 95% CI: 9.0–22.6%, p = 0.001). Within-family pairwise comparisons show different patterns: for Gemma, all method comparisons are significant (Original-Oracle: p = 0.033; others p $<$ 0.001); for Llama, Original-Oracle is non-significant (p = 0.199) while Original-Bayesian (p = 0.001) and Oracle-Bayesian (p = 0.004) are significant; for Qwen, Original-Oracle and Original-Bayesian are highly significant (p $<$ 0.001), but Oracle-Bayesian is non-significant (p = 0.282).
## Appendix H Results Details
We show results over rounds for different models and methods in Fig. 24–27. For each, we show the accuracy based on the LLM's or the human's direct prediction ("direct") and the accuracy based on predictions derived from their beliefs about the user's preferences ("beliefs"), where available.
<details>
<summary>x24.png Details</summary>

### Visual Description
Screenshot of the web-based annotation interface, titled "Select the Best Option" and marked "Round 1 of 5". It is laid out vertically in four parts:
1. **Flight selection:** three option boxes (Flight 1: departure 02:00 PM, duration 30 min, 1 stop, $370; Flight 2: departure 02:00 PM, duration 4 hr 24 min, 0 stops, $730; Flight 3: departure 03:36 PM, duration 16 hr 6 min, 0 stops, $1000), with "Submit Selection" and "Check Summary" buttons below.
2. **Preferences Questionnaire:** four symmetric 5-point Likert scales, one per attribute (departure time, flight duration, number of stops, price). On each scale, 1 and 2 express strong and mild preference for one pole (earlier departure, shorter flights, non-stop flights, cheaper flights), 3 is "no strong preference", and 4 and 5 mirror the opposite pole. A "Submit All Responses" button closes the section.
3. **Annotation Summary:** a gray box repeating the three flight descriptions for Round 1 and showing "Correct Option: Flight 1" versus "Your Selection: Flight 2".
4. A final "Back to Annotation" button.
The three options form a multi-attribute trade-off (duration vs. price vs. stops) with no option dominating on all attributes, and the summary gives the annotator immediate feedback against the predefined correct choice.
</details>
Figure 23: Example of annotation interface where humans act as the flight recommendation assistant. The human annotator was asked to select the best option and rate their estimation of the user's preferences. We also allow the annotator to check a summary of previous flight booking history. The annotation interface where humans act as the user is similar.
<details>
<summary>x25.png Details</summary>

### Visual Description
A 3×3 grid of line charts plotting Accuracy (%) against # interactions (0–5), one panel per subject: Gemma 2 9B, Gemma 2 27B, Llama 3 8B (row 1); Llama 3 70B, Qwen 2.5 7B, Qwen 2.5 32B (row 2); GPT-4.1 Mini, Gemini 1.5 Pro, Human (row 3). Each panel shows four series: Direct (blue), Beliefs (orange), the Bayesian Assistant reference (dashed), and a Random baseline flat at ~33.3% (a three-choice task). The Bayesian Assistant rises near-linearly from ~35% to ~80% in every panel. For Gemma 2 9B/27B, Llama 3 8B, and GPT-4.1 Mini, Beliefs jumps to ~50% after one interaction and then plateaus, while Direct stays below 40%. For Llama 3 70B and Gemini 1.5 Pro, Direct and Beliefs improve steadily in parallel, ending near 50–60%. For Qwen 2.5 7B, both remain near the baseline; for Qwen 2.5 32B, Direct (~48% at 5) outperforms Beliefs (~40%). The Human panel includes error bars; Direct and Beliefs are intertwined with substantial variability, ending near 50%.
</details>
Figure 24: Accuracy over rounds for different original LLMs. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x26.png Details</summary>

### Visual Description
A 3×3 grid of line charts plotting Accuracy (%) against # interactions (0–5), with rows for model families (Gemma, Llama, Qwen) and columns for conditions (Original, Oracle, Bayesian). Each panel shows Direct (blue), Beliefs (orange), the Bayesian Assistant reference (dashed), and a Random baseline flat at ~33%. The Bayesian Assistant reaches ~81% at 5 interactions in every panel. In the Original column, Beliefs improves only to a plateau near ~48% for Gemma and Llama while Direct gains little, and both stay near the baseline for Qwen. In the Oracle column, Direct and Beliefs grow steadily in parallel for Gemma and Llama (~60–65% at 5); for Qwen, Direct reaches ~53% while Beliefs stays flat. In the Bayesian column, all methods converge to ~72–78% for Gemma and Llama; for Qwen, Direct reaches ~68% while Beliefs remains near the baseline throughout.
</details>
Figure 25: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the flight recommendation task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x27.png Details</summary>

### Visual Description
A 3×3 grid of line charts plotting Accuracy (%) against # interactions (0–5), with rows for model families (Gemma, Llama, Qwen) and columns for conditions (Original, Oracle, Bayesian). Each panel shows Direct (blue), Beliefs (orange), the Bayesian Assistant reference (dashed), and a Random baseline flat at ~33%. The Bayesian Assistant rises steeply to ~80% at 5 interactions in every panel. In the Original column, Beliefs peaks near ~50% after one interaction and then plateaus or slowly declines for Gemma and Llama while Direct stays near the baseline; for Qwen, both remain near the baseline. In the Oracle column, Direct and Beliefs rise steadily to ~53–57% for Gemma and Llama; for Qwen, Direct reaches ~48% while Beliefs stays flat. In the Bayesian column, Direct and Beliefs reach ~62–65% for Gemma and Llama; for Qwen, Direct reaches ~59% while Beliefs remains near the baseline.
</details>
Figure 26: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the hotel recommendation task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x28.png Details</summary>

### Visual Description
A 3×3 grid of line charts plotting Accuracy (%) against # interactions (0–5), with rows for model families (Gemma, Llama, Qwen) and columns for conditions (Original, Oracle, Bayesian). Each panel shows three series: Direct (blue solid), Direct Fine-tuning (green dashed), and a Random baseline flat at ~35%. Direct Fine-tuning starts near ~55% at 0 interactions and converges to ~85% by 5 interactions in every panel, with the largest gain between 0 and 1 interaction. The Direct series varies by condition: it grows least in the Original column (weakest for Qwen, plateauing near ~45%), improves in the Oracle column (~60–65% at 5), and improves most in the Bayesian column (~70–75% at 5, with a sharp initial rise). The overall pattern is consistent across the three model families.
</details>
Figure 27: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the web shopping task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three training runs, where applicable.
## Appendix I Example Interactions
We show example interactions in Tables 3–12.
Table 3: Example of a two-round interaction between the user and the LLM.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2.
> Which flight is the best option?
> Flight 1: departure time: 04:00 PM, duration: 18 hr 3 min, number of stops: 2, price: $280
> Flight 2: departure time: 10:48 AM, duration: 6 hr 21 min, number of stops: 1, price: $370
> Flight 3: departure time: 06:48 PM, duration: 10 hr 5 min, number of stops: 1, price: $810
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 4: Example of prompting the LLM for the user's preference on a single feature.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2.
> On a scale of 1 to 5, what is my preference for price?
> - 1: I strongly prefer a cheaper flight
> - 2: I prefer a cheaper flight
> - 3: I have no strong preference
> - 4: I prefer a more expensive flight
> - 5: I strongly prefer a more expensive flight
>
> **Model:** Your preference for price is: 1.
Table 5: Example of a two-round interaction between the user and the LLM using the numerical representation of flights.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 0.7, duration: 0.5, number of stops: 1.0, price: 0.9
> Flight 2: departure time: 0.9, duration: 0.6, number of stops: 0.0, price: 0.7
> Flight 3: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2.
> Which flight is the best option?
> Flight 1: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
> Flight 2: departure time: 0.2, duration: 0.3, number of stops: 0.5, price: 0.2
> Flight 3: departure time: 0.8, duration: 0.8, number of stops: 1.0, price: 0.9
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
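To illustrate how the numerical representation above can drive a preference-based choice, here is a minimal sketch, not the paper's model: the linear-utility form and the specific weights are assumptions. Each flight is scored as a preference-weighted sum of its normalized features, and the argmax is selected.

```python
FEATURES = ["departure time", "duration", "number of stops", "price"]

def best_flight(flights, weights):
    """Return the 1-indexed flight with the highest preference-weighted score.

    flights: list of dicts mapping feature name -> value in [0, 1]
    weights: dict mapping feature name -> signed preference weight
             (negative = prefer smaller values, positive = prefer larger)
    """
    scores = [sum(weights[f] * flight[f] for f in FEATURES) for flight in flights]
    return scores.index(max(scores)) + 1

# Round-1 flights from the transcript above (numerical representation).
flights = [
    {"departure time": 0.7, "duration": 0.5, "number of stops": 1.0, "price": 0.9},
    {"departure time": 0.9, "duration": 0.6, "number of stops": 0.0, "price": 0.7},
    {"departure time": 0.5, "duration": 0.9, "number of stops": 0.5, "price": 0.1},
]
# Hypothetical user: strongly dislikes stops, mildly dislikes price.
weights = {"departure time": 0.0, "duration": 0.0, "number of stops": -1.0, "price": -0.3}
print(best_flight(flights, weights))  # -> 2
```

Under these assumed weights the nonstop Flight 2 wins, matching the user's stated preference in the transcript.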
Table 6: Example of prompting the LLM for the user's preference on a single feature using the numerical representation of flights.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 0.7, duration: 0.5, number of stops: 1.0, price: 0.9
> Flight 2: departure time: 0.9, duration: 0.6, number of stops: 0.0, price: 0.7
> Flight 3: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2.
> On a scale of 1 to 5, what is my preference for duration?
> - 1: I strongly prefer the minimum value
> - 2: I prefer the minimum value
> - 3: I have no strong preference
> - 4: I prefer the maximum value
> - 5: I strongly prefer the maximum value
Table 7: Example of prompting the LLM for the user's preference on a single feature using the model's directly generated probability judgements.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2.
> On a scale of 1 to 5, what is my preference for price?
> - 1: I strongly prefer a cheaper flight
> - 2: I prefer a cheaper flight
> - 3: I have no strong preference
> - 4: I prefer a more expensive flight
> - 5: I strongly prefer a more expensive flight
>
> Provide an integer between 0 and 100 (%) that reflects the probability of each scale. Format your response exactly as follows: - 1: ??% ...
>
> **Model:** The probabilities of each scale are:
> - 1: 70%
> - 2: 10%
> - 3: 15%
> - 4: 5%
> - 5: 0%
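The verbalized probabilities elicited above can be consumed programmatically. A small sketch, assuming the response follows the `- <scale>: <pct>%` format requested in the prompt (the regex and renormalization are our additions), parses the model's answer into a normalized distribution:

```python
import re

def parse_scale_probs(response: str) -> dict:
    """Extract '- <scale>: <pct>%' lines and renormalize so they sum to 1."""
    raw = {int(s): int(p) for s, p in re.findall(r"-\s*([1-5]):\s*(\d+)%", response)}
    total = sum(raw.values())  # may differ from 100 due to rounding
    return {scale: pct / total for scale, pct in raw.items()}

# The model response from the transcript above.
response = """The probabilities of each scale are:
- 1: 70%
- 2: 10%
- 3: 15%
- 4: 5%
- 5: 0%"""
probs = parse_scale_probs(response)
print(probs[1])  # -> 0.7
```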
Table 8: Example of a first-round interaction between the user and the LLM using chain-of-thought prompting.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. First, infer my preferences by reasoning about each feature. For each feature, estimate the probability distribution of my preference across a 1-to-5 scale. For example, you might estimate a 30% probability that I strongly prefer an earlier morning flight (scale 1), a 10% probability that I prefer an earlier morning flight (scale 2), a 20% probability that I have no strong preference (scale 3), and so on. Then, use these probabilities to determine the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
> Let's think step by step. End your response with "The best option is Flight <your choice>.".
>
> **Model:** Let's analyze each flight feature and estimate your preferences: ... <Model's reasoning chain> ... The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
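The chain-of-thought prompt above asks the model to turn per-feature scale distributions into a choice. One way to make that step concrete is a sketch along the following lines; the scale-to-weight mapping and the hypothetical probability estimates are assumptions, not the paper's model. Each distribution is collapsed to an expected weight, and flights are ranked by expected utility:

```python
# Assumed mapping from the 1-5 preference scale to a signed weight:
# 1 = strongly prefer smaller values, 5 = strongly prefer larger values.
SCALE_WEIGHTS = {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}

def expected_weight(scale_probs):
    """Collapse a distribution over the 1-5 scale to E[weight]."""
    return sum(p * SCALE_WEIGHTS[s] for s, p in scale_probs.items())

def rank_flights(flights, feature_probs):
    """Return the 1-indexed flight maximizing expected utility.

    flights: list of dicts mapping feature -> value in [0, 1]
    feature_probs: dict mapping feature -> distribution over the 1-5 scale
    """
    weights = {f: expected_weight(d) for f, d in feature_probs.items()}
    scores = [sum(weights[f] * x[f] for f in weights) for x in flights]
    return max(range(len(flights)), key=scores.__getitem__) + 1

# Hypothetical estimates: the user likely prefers cheap, nonstop flights.
feature_probs = {
    "price": {1: 0.6, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.0},
    "number of stops": {1: 0.7, 2: 0.2, 3: 0.1, 4: 0.0, 5: 0.0},
}
flights = [
    {"price": 0.3, "number of stops": 0.5},
    {"price": 0.7, "number of stops": 0.0},
    {"price": 0.9, "number of stops": 0.0},
]
print(rank_flights(flights, feature_probs))  # -> 2
```

Under these assumed estimates, the dislike of stops outweighs the price difference, so the cheaper of the two nonstop flights wins.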
Table 9: Example of a first-round interaction between the user and the LLM with verbalized reward posterior distribution.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
>
> **Model:** Based on the current information, the probabilities for each preference scale across all features are:
> The probabilities for each scale of your preference for departure time are:
> - 1: I strongly prefer an earlier morning departure time, 0.3%
> - 2: I prefer an earlier morning departure time, 1.0%
> - 3: I have no strong preference, 97.6%
> - 4: I prefer a later evening departure time, 1.1%
> - 5: I strongly prefer a later evening departure time, 0.1%
>
> The probabilities for each scale of your preference for duration are: ...
>
> **User:** Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is incorrect. I prefer Flight 1.
>
> **Model:** Based on the current information, the probabilities for each preference scale across all features are:
> The probabilities for each scale of your preference for departure time are:
> - 1: I strongly prefer an earlier morning departure time, 0.1%
> - 2: I prefer an earlier morning departure time, 4.1%
> - 3: I have no strong preference, 2.5%
> - 4: I prefer a later evening departure time, 93.2%
> - 5: I strongly prefer a later evening departure time, 0.1%
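Posterior updates of the kind verbalized above can be sketched as a discrete Bayes rule over preference scales. The following is a minimal illustration for a single feature, assuming a scale-to-weight mapping and a softmax choice likelihood; it is not the paper's exact model.

```python
import math

# Assumed mapping from the 1-5 preference scale to a signed weight.
SCALE_WEIGHTS = {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def update_posterior(prior, feature_values, chosen_index):
    """Bayes update: posterior(scale) ∝ prior(scale) * P(choice | scale)."""
    posterior = {}
    for scale, p in prior.items():
        utilities = [SCALE_WEIGHTS[scale] * v for v in feature_values]
        likelihood = softmax(utilities)[chosen_index]
        posterior[scale] = p * likelihood
    z = sum(posterior.values())
    return {s: p / z for s, p in posterior.items()}

# Uniform prior over the five scales; the user chose the flight with the
# latest departure time (index 2), so mass shifts toward scales 4-5.
prior = {s: 0.2 for s in SCALE_WEIGHTS}
posterior = update_posterior(prior, [0.2, 0.5, 0.9], chosen_index=2)
```

Repeating this update after each round of feedback, feature by feature, yields per-feature distributions analogous to those verbalized in the transcript above.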
Table 10: Example of a two-round interaction between the user and the LLM using the non-interactive setting. The blue part highlights the difference compared to the interactive setting in Table 3.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which flight is the best option?
> Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
> Which flight is the best option?
> Flight 1: departure time: 04:00 PM, duration: 18 hr 3 min, number of stops: 2, price: $280
> Flight 2: departure time: 10:48 AM, duration: 6 hr 21 min, number of stops: 1, price: $370
> Flight 3: departure time: 06:48 PM, duration: 10 hr 5 min, number of stops: 1, price: $810
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 11: Example of a first-round interaction between the user and the LLM on the hotel recommendation task.
> **User:** Help me select the best hotels for my trips. I have specific preferences for what I like and dislike in a hotel, and these preferences remain the same. You need to figure out my preferences and select the best hotels for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which hotel is the best option?
> Hotel 1: distance to downtown: 4 miles, price: $550, rating: 3 stars, amenities: free parking and free breakfast
> Hotel 2: distance to downtown: 3 miles, price: $820, rating: 2 stars, amenities: free parking, free breakfast, and pool
> Hotel 3: distance to downtown: 2.3 miles, price: $370, rating: 1 stars, amenities: free parking
>
> **Model:** The best option is Hotel 3.
>
> **User:** Your option Hotel 3 is incorrect. I prefer Hotel 2.
Table 12: Example of a first-round interaction between the user and the LLM on the web shopping task.
> **User:** Help me select the best product. I have specific preferences for what I like and dislike in a product, and these preferences remain the same. You need to figure out my preferences and select the best products for me. Use your best judgment if you are unsure. Do not say you need more information.
> Which product is the best option?
> Product 1: Title: Chic D Independence Day Table Runner 72 Inches Long, Gnome Cotton Linen Spring Table Cloth Runners for Wedding Party Dinning Coffee Holiday, Farmhouse Style, USA Flag Description:
> - 14x72inch Dining Table Runner Size: 36x183cm, fits round, square or rectangular tables that seat 4 to 8 people.
> - High Quality Cotton Linen Table Runner: 30%cotton, 70%burlap. triangle hem, wrinkle-free, no fade. easy to maintain and reusable.
> - Our elegant table runner is perfect for holidays, catering, birthday party, dinning, brunches, potlucks, BBQ's, buffets, garden, bridal party etc
> - Decorative Table Runner can be used as decor on kitchen tables; dining tables; entry tables; desks & more
> - Custom table runner, u also can personalized your tabletop decoration
>
> Color: black white Size: 13x108inch
> Product 2: Title: Ambesonne Orange Mandala Coffee Table, Pastel Colored Flourishes and Dark Toned Details Spring Bloom, Acrylic Glass Center Table with Wooden Frame for Offices Dorms, Small, Peach Dark Brown Cream Description:
> - Measurements - 24" Long x 18" Wide x 15" High coffee table with patterns that will add vitality to your living room.
> - Made from - High Quality Beech Wooden Frame and Acrylic Glass Table Top. Durable and Elegant!
> - Features- Light weight and easy to move around. You can place anywhere and enjoy its beauty!
> - Easy To Use- It is easy to assembly with numbered parts and clear instructions.
> - Versatile- Small coffee table, to keep your necessities within reach. Practical item for your living room!
>
> Color: blue purple Size: large
> Product 3: Title: White Round Dining Table and 4 Chairs, Mid-Century Modern Coffee Table Round Kitchen Table Small Office Table for Office Kitchen Dining Room (Round Table+4 Pale Grey Chairs) Description:
> - Dining Table and Chair Size: The dining table size is 35.4*35.4*29.5 inch, dining chairs is 36*17.5 inch. Our table and chairs suitable for 4-6 people. This glass table can also be used as a conference table, coffee table
> - Glass Dining Table and Chair Material: The table top is made of clear glass and equipped with solid wood metal legs, with a modern design style and elegant appearance. The Chair is made of velvet and solid wood metal cushions, the dining table and chair is an essential furniture
>
> Size: round table with wood legs
>
> **Model:** The best option is Product 3.
>
> **User:** Your option Product 3 is incorrect. I prefer Product 2.
Color: blue purple Size: large Product 3: Title: White Round Dining Table and 4 Chairs,Mid-Century Modern Coffee Table Round Kitchen Table Small Office Table for Office Kitchen Dining Room (Round Table+4 Pale Grey Chairs) Description: - Dining Table and Chair Size:The dining table size is 35.4*35.4*29.5 inch, dining chairs is 36*17.5 inch.Our table and chairs suitable for 4-6 people. This glass table can also be used as a conference table, coffee table - Glass Dining Table and Chair Material:The table top is made of clear glass and equipped with solid wood metal legs, with a modern design style and elegant appearance.The Chair is made of velvet and solid wood metal cushions, the dining table and chair is an essential furniture Size: round table with wood legs Model: The best option is Product 3. User: Your option Product 3 is incorrect. I prefer Product 2. |