# Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
**Authors**: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
> Google DeepMind · University of British Columbia · Vector Institute · Google Research · New York University
linluqiu@mit.edu, svansteenkiste@google.com, linzen@google.com
## Abstract
Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user's preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
## 1 Introduction
Humans interact with the world based on our beliefs about it. To effectively support decision making, our beliefs need to correspond to the structure of the world as much as possible; in other words, our beliefs need to be supported by appropriate "world models" [Johnson-Laird, 1980, Ha and Schmidhuber, 2018, LeCun, 2022, Wong et al., 2023]. We typically do not have perfect knowledge about the outside world; to the extent that we are uncertain about our environment, our beliefs need to be probabilistic, reflecting this uncertainty. And for these beliefs to remain relevant as the world changes, or as new information about the world becomes available, we need to update our beliefs to reflect the new information. The framework of Bayesian inference describes the normative way in which new information should trigger a change in one's beliefs so as to maximize the effectiveness of these beliefs as a foundation for acting in the world [Chater et al., 2006]. The Bayesian framework has informed a substantial body of work in cognitive science, which has identified both areas where humans act as the framework predicts, as well as deviations from it [Griffiths et al., 2024, Jern et al., 2017, Tenenbaum et al., 2011, Xu and Tenenbaum, 2007, Baker et al., 2011, Tenenbaum et al., 2006, Chater and Manning, 2006, Griffiths et al., 2007, Chaigneau et al., 2025, Rehder, 2018, Rottman and Hastie, 2016, Sloman and Lagnado, 2015].
In the last few years, artificial intelligence systems based on large language models (LLMs) have become dramatically more capable than in the past [Team, 2024a, Achiam et al., 2023, Anthropic, 2024, Team, 2024b, Touvron et al., 2023, Guo et al., 2025]. Far outgrowing their original motivation as methods to estimate the probabilities of different word sequences, these systems are now being used for applications where they interact with users and with the outside world. As with humans, for the LLMs' interactions with users to be effective, the LLMs' beliefs need to reflect their experience with the user and to be continuously updated as more information becomes available. Here, we ask: do LLMs act as if they have probabilistic beliefs that are updated as expected from normative Bayesian inference? To the extent that the LLMs' behavior deviates from the normative Bayesian strategy, how can we minimize these deviations?
We begin to study these questions using a simple controlled setting: a flight recommendation task [Lin et al., 2022], illustrated in Fig. 1. This task involves multiple rounds of interactions between a simulated user and an LLM, where the LLM is acting as a flight booking assistant. In each round, the assistant is given a small number of flight options, and is expected to recommend one of them to the user, based on the user's preferences. The user's preferences are not directly communicated to the LLM: it only observes the choices the user makes among the flight options. To make optimal recommendations, then, the LLM must construct an implicit model of the factors that shape the user's preferences, and must reason probabilistically about those factors as it learns about the user's choices across multiple sets of flight options.
We compare the LLMs' behavior to that of a model that follows the normative Bayesian strategy, which we refer to as the Bayesian Assistant. This model maintains a probability distribution that reflects its beliefs about the user's preferences, and uses Bayes' rule to update this distribution as new information about the user's choices becomes available. Unlike many real-life scenarios, where it is difficult to specify and implement the Bayesian strategy computationally, in this controlled setting this strategy can be computed exactly, allowing us to precisely estimate the extent to which LLMs deviate from it.
We use this framework to evaluate a range of LLMs and find that they all perform significantly worse than the normative Bayesian Assistant (Fig. 2). Most importantly, in contrast to the Bayesian Assistant, which gradually improves its recommendations as it receives additional information about the user's choices, LLMs' performance often plateaus after a single interaction, pointing to a limited ability to adapt to new information.
We then introduce Bayesian teaching, a strategy to teach an LLM to approximate Bayesian reasoning. We provide the LLM with examples of interactions between the user and the Bayesian Assistant, and have the LLM mimic those interactions. We find that, by leading the LLMs to gradually adapt to the user over the course of the interactions, this method substantially improves the LLMs' performance on the flight recommendation task. Crucially, teaching the LLMs to mimic the Bayesian Assistant in one task allows them to generalize to other tasks that similarly require making decisions under uncertainty; those include not only different variants of the flight recommendation task, but also a related hotel recommendation task, as well as a web shopping task with real-world products (Fig. 1), a much more complex task for which it is difficult to specify and implement a fully Bayesian model.
Notably, while the Bayesian Assistant often makes incorrect predictions as it reasons under uncertainty, especially in the early rounds of interaction, we find that it is a more effective teacher than a teacher that directly provides the LLMs with users' choices (which we refer to as an oracle teacher); in other words, the Bayesian model's educated guesses make for a stronger learning signal than the correct answers. Overall, we conclude that through observing the Bayesian Assistant perform a particular task, the LLMs are able to approximate transferable probabilistic reasoning skills.
To summarize our contributions: we first identify significant limitations of off-the-shelf LLMs in tasks that require forming and updating probabilistic beliefs. We then demonstrate that, by having the LLMs mimic a normative Bayesian model, we can effectively teach them to approximate probabilistic belief updates, and show that these skills can generalize to new environments. These findings suggest that LLMs can be used in interactive settings where information is provided gradually, including complex application domains where implementing an exact Bayesian model is difficult. More generally, our results highlight a unique strength of deep learning models such as LLMs: they can learn to mimic a symbolic model and generalize its strategy to domains that are too complex to specify in a classic symbolic model.
## 2 Evaluating Belief Updates via Flight Recommendations
[Figure 1 image (x1.png): schematic of the three tasks (flight recommendation, hotel recommendation, and web shopping), each showing a multi-round user-assistant dialogue with feedback, alongside bar charts of the user's preference weights (e.g., duration, number of stops, and price for flights; distance, amenities, and rating for hotels).]
Figure 1: Evaluating and improving LLMs' probabilistic belief updates. The flight recommendation task (left) involves multi-round interactions between a user and a flight booking assistant. In each round, the assistant is asked to recommend to the user one of three available flight options. The assistant is then shown the flight that was in fact chosen by the user (based on the user's reward function, which characterizes the user's preferences). To make good recommendations, the assistant needs to infer the user's preferences from the user's choices. To teach the LLM to reason probabilistically, we fine-tune the LLM on interactions between users and a Bayesian Assistant, which represents the normative way to update beliefs about the user's preferences. We then evaluate the fine-tuned model on the flight recommendation task as well as two new tasks (right).
We first describe the simplified flight recommendation task, derived from Lin et al. [2022], that we use to evaluate the LLMs. In this task, we have the LLMs interact with a simulated user for five rounds. In each round, three flight options are presented to both the user and the assistant. Each flight is defined by a departure time, a duration, a number of stops, and a cost (see Fig. 1). Each simulated user is characterized by a set of preferences: for each feature, they can have a strong or weak preference for high or low values of the feature (e.g., they may prefer longer or shorter flights), or no preference regarding this feature. We refer to this set of preferences as the user's reward function. We have 624 possible users in total (see Appendix Section A). These preferences, which determine the flights that the user chooses, are not directly revealed to the assistant. The goal of the assistant is to recommend the flight that matches the user's choice. At the end of each round, the user indicates to the assistant whether or not it chose correctly, and provides it with the correct answer.
After each round, we evaluate the accuracy of the assistant's recommendations for 100 new sets of three flights that differ from the ones on which the assistant has received feedback. We do not provide any feedback to the assistant for these new flight option sets (see Appendix Fig. 7 for the evaluation workflow).
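As a concrete illustration of this setup, the following Python sketch implements simulated users and their choice rule under our reading of the description above. The feature names, the normalization of feature values to $[0,1]$, the assumed linear reward, and the weight encoding (strong/weak preferences as weights of $\pm 2$/$\pm 1$, no preference as $0$) are illustrative assumptions rather than the exact implementation; note that five preference settings per feature across four features yields $5^4-1=624$ users once the fully indifferent user is excluded, matching the count above.

```python
import random

# Illustrative encoding (not the paper's exact code): each feature gets a
# preference weight of +/-2 (strong), +/-1 (weak), or 0 (no preference).
FEATURES = ["departure_time", "duration", "num_stops", "price"]
WEIGHTS = [-2, -1, 0, 1, 2]

def sample_flight(rng: random.Random) -> dict:
    """Hypothetical flight generator; feature values normalized to [0, 1]."""
    return {f: rng.random() for f in FEATURES}

def sample_user(rng: random.Random) -> dict:
    """Sample a reward function: one weight per feature. Excluding the fully
    indifferent user gives 5**4 - 1 = 624 possible users, matching the text."""
    while True:
        user = {f: rng.choice(WEIGHTS) for f in FEATURES}
        if any(w != 0 for w in user.values()):
            return user

def reward(user: dict, flight: dict) -> float:
    """Assumed linear reward: weighted sum of normalized feature values."""
    return sum(user[f] * flight[f] for f in FEATURES)

def user_choice(user: dict, options: list) -> int:
    """The simulated user deterministically picks the highest-reward option."""
    return max(range(len(options)), key=lambda i: reward(user, options[i]))
```

Under these assumptions, evaluation amounts to sampling 100 fresh option sets and comparing the assistant's recommendation to `user_choice` without providing feedback.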
### 2.1 The Bayesian Assistant
Because the users' preferences are only revealed gradually, through their choices among flight options, we cannot expect the LLMs to reach perfect accuracy immediately after a single round of interaction. As an upper bound on the LLMs' performance, we define a Bayesian Assistant, which implements the strategy that optimally takes into account the evidence about the user's preferences that accumulates over rounds of interaction. This entails maintaining uncertainty about those preferences when the evidence is partial: instead of committing to a single most likely reward function, which could turn out to be incorrect in future rounds, the assistant maintains a probability distribution over possible reward functions. After each round, the Bayesian Assistant updates its distribution over reward functions using Bayes' rule: the probability of each reward function after the round (the posterior) is computed based on its probability before the round (the prior) and whether or not it was compatible with the user's choice (the likelihood). This normative model represents the best performance that we can possibly expect from any system. Because the number of possible reward functions is small, we are able to perform exact Bayesian inference (see Appendix Section A).
This method requires us to define the Bayesian Assistant's initial prior distribution, that is, its probabilistic assumptions about which user preferences are more likely, in advance of any interaction with the user. We use an uninformed prior, where all possible sets of user preferences are equally likely (for experiments with alternative priors, see Appendix Section D.4).
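Because the hypothesis space is small, the Bayesian Assistant can be implemented directly. The sketch below (reusing `FEATURES`, `WEIGHTS`, and `user_choice` from the previous sketch) assumes the deterministic 0/1 likelihood described above: a reward function is compatible with a round exactly when it predicts the user's observed choice.

```python
import itertools

def all_users():
    """Enumerate all 624 candidate reward functions."""
    for ws in itertools.product(WEIGHTS, repeat=len(FEATURES)):
        if any(w != 0 for w in ws):
            yield dict(zip(FEATURES, ws))

class BayesianAssistant:
    def __init__(self):
        users = list(all_users())
        # Uninformed prior: every candidate reward function is equally likely.
        self.belief = {tuple(sorted(u.items())): 1.0 / len(users) for u in users}

    def recommend(self, options: list) -> int:
        """Pick the option most likely to be the user's choice under the
        current belief (marginalizing over candidate reward functions)."""
        scores = [0.0] * len(options)
        for u, p in self.belief.items():
            scores[user_choice(dict(u), options)] += p
        return max(range(len(options)), key=scores.__getitem__)

    def update(self, options: list, chosen: int) -> None:
        """Bayes' rule with a 0/1 likelihood: zero out incompatible reward
        functions, then renormalize. With a consistent simulated user the
        posterior never collapses to zero, since the true user is enumerated."""
        for u in self.belief:
            if user_choice(dict(u), options) != chosen:
                self.belief[u] = 0.0
        total = sum(self.belief.values())
        self.belief = {u: p / total for u, p in self.belief.items()}
```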
[Figure 2 image (x2.png): bar chart of accuracy (%) after the 1st round and the final round for Gemma 2 9B, Gemma 2 27B, Llama 3 8B, Llama 3 70B, Qwen 2.5 7B, Qwen 2.5 32B, GPT-4.1 Mini, Gemini 1.5 Pro, human participants, and the Bayesian Assistant, with a random baseline (dashed line) at roughly 33%. The Bayesian Assistant reaches roughly 58% after the first round and 81% in the final round, while most LLMs remain between roughly 37% and 58%.]
Figure 2: LLMs show limited or no improvement over multiple interactions with the user. We show accuracy after the first round and final (fifth) round. We compare off-the-shelf LLMs from different model families to human participants and the Bayesian Assistant. For human participants, we only evaluate on a subset of 48 out of our 624 simulated users. The LLMs perform considerably worse than the Bayesian Assistant. Human participants demonstrate a larger improvement than most LLMs as they receive more information, but they still fall short of the accuracy that characterizes the normative Bayesian strategy. For the human study, the error bars show the averaged standard error across participants; for models, they show the standard error across the three sets of interactions with each of the 624 users.
### 2.2 LLMs Show Limited Evidence of Belief Updating
The LLMs we evaluate, like most contemporary LLMs, are first trained to predict upcoming words in a large collection of texts ("pre-training"), and are then specialized to follow user instructions provided in natural language ("instruction-tuning") [Sanh et al., 2022, Wei et al., 2022a]. Most commercially available models are closed-weights: we can query them but we cannot access their parameters. We evaluate two such closed-weights models, Gemini 1.5 Pro [Team, 2024a] and GPT-4.1 Mini [OpenAI, 2025], which were among the state-of-the-art LLMs at the time of writing [Chiang et al., 2024]. We also evaluate the following open-weights models: Gemma 2 (9B and 27B parameters) [Team, 2024b], Llama 3 (8B and 70B parameters) [Grattafiori et al., 2024], and Qwen 2.5 (7B and 32B parameters) [Yang et al., 2024a]. We chose these models because their performance is quite competitive and their weights are openly available, which makes it possible to perform fine-tuning (see the next section). We provide these LLMs with English instructions explaining how to act as a flight booking assistant (see Fig. 1 for an example, and Appendix Table 3 for a detailed interaction).
We show results in Fig. 2. Overall, the accuracy of the LLMs after the five rounds of interaction is considerably lower than that of the Bayesian Assistant, and most of the models show little improvement after the first round of interaction (Fig. 2 shows results after the first and fifth round; for results after each of the five rounds, see Appendix Fig. 24). For an exploration of how the models' performance varies across users' possible reward functions, see Appendix Section D.2.
A range of follow-up experiments failed to produce meaningful improvement in the LLMs' behavior (for details, see Appendix Section C.1). Those include experiments with "chain-of-thought prompting" [Wei et al., 2022b, Nye et al., 2021, Kojima et al., 2022], that is, instructions that are meant to encourage the LLM to reason more explicitly (Appendix Fig. 9); an experiment with alternative, purely numerical representations of the flight options that we hypothesized might be easier for the LLMs to parse than the verbal ones we used for our main experiments (Appendix Fig. 9); a setting where we have 30 instead of five rounds of interaction (Appendix Fig. 9); and experiments with models that are only pre-trained to predict upcoming words in texts, without subsequent training to follow user instructions (Appendix Fig. 9).
We also had human participants act as the assistant to a subset of 48 simulated users (see Appendix Section A and Appendix Section F.1 for details). The human participants made recommendations for five rounds and showed a significant improvement between rounds 1 and 5 (p = 0.002, logistic mixed-effects model). In terms of accuracy, they perform better than the smaller LLMs and slightly worse than the larger ones (see Appendix Fig. 24 for performance over rounds). That being said, like the LLMs, human participants also fall substantially short of the accuracy expected from the normative Bayesian strategy.
## 3 Teaching LLMs to Approximate Bayesian Reasoning
[Figure 3 image (x3.png): bar chart of accuracy (%) after the 1st and final rounds for the Original, Oracle-tuned, and Bayesian-tuned versions of Gemma, Llama, and Qwen, alongside the Bayesian Assistant, with a random baseline (dashed line) and error bars.]
Figure 3: Supervised fine-tuning teaches LLMs to approximate probabilistic inference. We show accuracy after the first round and final (fifth) round across different assistants. We compare the original LLMs, LLMs fine-tuned on user interactions with the Bayesian Assistant, and LLMs fine-tuned on user interactions with an oracle, which always provides the correct answer. Both types of fine-tuning significantly improve the LLMs' performance, and Bayesian teaching is consistently more effective than oracle teaching. Error bars show the standard error across three random seeds (and three training runs). All results are statistically significant, $p<0.001$ (see Appendix Section G).
We next describe the supervised fine-tuning technique we use to teach the LLM to mimic the normative Bayesian model; we show that this method substantially improves the LLM's ability to update its beliefs correctly.
From a technical perspective, supervised fine-tuning is similar to the method used to train most LLMs in the first place. The model is provided with the first words of a text and is trained to predict the upcoming word. After each example, the LLM's weights are adjusted to increase the likelihood of a correct prediction if the same example is observed again. The main difference is that while in the first phase of training the texts are typically drawn from the Internet or similar resources, in the supervised fine-tuning phase the texts are constructed in a targeted way (automatically or by human writers) so as to teach the LLM particular skills [Sanh et al., 2022, Wei et al., 2022a]; to improve arithmetic skills, for example, the model may be given the text "the output of $1+1$ is $2$". We apply supervised fine-tuning to the three medium-sized open-weights models (Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B); we do not attempt to fine-tune the larger models from these families due to computational constraints. We update all of the models' weights in fine-tuning (in Appendix Section C.2, we show that a different training objective, Direct Preference Optimization [Rafailov et al., 2023], produces similar results, as does a computationally cheaper fine-tuning method, LoRA [Hu et al., 2022], which only updates a subset of the model's weights).
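As a minimal sketch of these mechanics, a single full-parameter fine-tuning step with the Hugging Face `transformers` API might look as follows (using the instruction-tuned Gemma 2 9B as an example); the optimizer settings, batching, and any masking of non-assistant tokens in the loss are not specified above, so treat those details as assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical hyperparameters; the actual training configuration may differ.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(dialogue_text: str) -> float:
    """One gradient step of next-token prediction on a training dialogue."""
    batch = tokenizer(dialogue_text, return_tensors="pt")
    # With labels tied to input_ids, the model computes the standard causal
    # cross-entropy loss over the whole sequence (shifted internally).
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```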
We explore two strategies for creating supervised fine-tuning data. For both strategies, we construct 10 five-round interactions per user. These interactions follow the same format as described above (Appendix Table 3). In the first strategy, which we refer to as oracle teaching, we provide the LLM with interactions between simulated users and an "oracle" assistant that has perfect knowledge of the user's preferences, and as such always recommends the option that the user in fact chooses.
The second strategy, which we call Bayesian teaching, provides the LLM with interactions between the user and the Bayesian Assistant. In this setting, the assistant will often choose flights that do not match the user's preferred choice, especially in early rounds, where it has considerable uncertainty about the user's preferences. We hypothesize that, despite this fact, mimicking the Bayesian Assistant's best guesses would teach the LLM to maintain uncertainty and update its beliefs more effectively than the first strategy, where the LLM is trained on the correct choices. This approach can be seen as a form of distillation, where a model is trained by learning to mimic another system [Hinton et al., 2015, Kim and Rush, 2016, Deng et al., 2023, Wang et al., 2023b, Li et al., 2023b, Jung et al., 2024, Yu et al., 2024, Chen et al., 2024b]. We use a uniform prior for the Bayesian Assistant that produces the supervised fine-tuning data. Other priors perform similarly (see Appendix Fig. 16).
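Under the assumptions of the earlier sketches, generating a single training dialogue for either teaching strategy might look as follows; the dialogue template is illustrative (modeled on the example in Fig. 1), not the exact prompt format.

```python
def build_training_dialogue(user: dict, rng: random.Random, n_rounds: int = 5,
                            oracle: bool = False) -> str:
    """Roll out a five-round interaction and render it as fine-tuning text.
    With oracle=True the assistant's turn is the user's true choice (oracle
    teaching); otherwise it is the Bayesian Assistant's best guess under its
    current posterior (Bayesian teaching)."""
    assistant = BayesianAssistant()
    lines = []
    for _ in range(n_rounds):
        options = [sample_flight(rng) for _ in range(3)]
        truth = user_choice(user, options)
        guess = truth if oracle else assistant.recommend(options)
        lines.append(f"Options: {options}")
        lines.append(f"Assistant: The best option is Flight {guess + 1}.")
        if guess == truth:
            lines.append(f"User: Your option Flight {guess + 1} is correct.")
        else:
            lines.append(f"User: Your option Flight {guess + 1} is incorrect. "
                         f"I prefer Flight {truth + 1}.")
        assistant.update(options, truth)  # feedback reveals the true choice
    return "\n".join(lines)
```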
### 3.1 Fine-Tuning Teaches LLMs to Adapt to Users
Both supervised fine-tuning strategies, oracle teaching and Bayesian teaching, significantly improve the LLMs' performance on the flight recommendation task (Fig. 3). Crucially, after fine-tuning, the LLMs' performance gradually improves as more information becomes available; this contrasts with the original LLMs, which plateaued after the first round (see the substantial performance improvement between the first and last round in Fig. 3; for detailed results for each round, see Appendix Fig. 25). While there is still a performance gap between the fine-tuned LLMs and the normative Bayesian Assistant, this gap is much narrower than for the original LLMs. All three medium-sized LLMs, which before fine-tuning performed worse than either stronger models or our human participants, markedly outperform them after fine-tuning.
We find that Bayesian teaching leads to higher accuracy and less variability across repetitions of the experiment than oracle teaching (Fig. 3). Bayesian teaching also successfully makes the LLM more Bayesian: the Bayesian-tuned LLMs' predictions agree with those of the Bayesian Assistant around 80% of the time, significantly more often than do the predictions of the original LLMs and oracle-tuned LLMs (Fig. 4). In Appendix Section D.4, we show that the effectiveness of Bayesian teaching cannot be explained by two potential confounds, and conclude that the effectiveness of this method is in fact due to the Bayesian signal it provides.
The amount of information that can be gained from the user's choice for a particular option set varies from one set to another. For example, a choice between two flight options that differ in exactly one feature provides direct evidence for the user's preference for that feature; such a choice could be more informative about the user's preferences than the choice between options that differ along multiple dimensions. We expect a model with more sophisticated probabilistic skills to show greater sensitivity to this factor. Do our fine-tuned models show such sensitivity? Focusing on the Gemma models, we find that Gemma Original does not show sensitivity to option set informativity, but both fine-tuned versions of Gemma do, with Gemma Bayesian displaying considerably more sensitivity than Gemma Oracle (Appendix Section E).
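One natural way to quantify the informativity of an option set under the Bayesian model is the drop in posterior entropy after observing the user's choice; this is our reading of the notion, and the measure used in Appendix Section E may differ.

```python
import copy
import math

def belief_entropy(belief: dict) -> float:
    """Shannon entropy (bits) of a distribution over reward functions."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def informativity(assistant: BayesianAssistant, options: list, chosen: int) -> float:
    """Reduction in posterior entropy from observing this round's choice; a
    choice between options that differ in a single feature tends to prune
    more hypotheses and thus yields a larger drop."""
    probe = copy.deepcopy(assistant)
    before = belief_entropy(probe.belief)
    probe.update(options, chosen)
    return before - belief_entropy(probe.belief)
```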
Can the fine-tuned models accurately verbalize their beliefs? To address this question, we ask the LLMs explicitly for their beliefs about the user's preferences: we have the simulated user ask them, for example, "on a scale of 1 to 5, what is my preference for price?". We then test the accuracy of these verbalized beliefs by deriving flight recommendations from those beliefs, using the same decision procedure we use with the Bayesian Assistant. We find that this approach generally performs better than the approach we have used so far, where we directly ask for the LLMs' recommendations; that predictions based on the fine-tuned LLMs' verbalized beliefs are substantially more accurate than those based on the original LLMs' verbalized beliefs; and that the Bayesian-tuned LLMs produce more accurate beliefs than either the original LLMs or the oracle-tuned ones (for additional details, see Appendix Section B).
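A sketch of how such verbalized beliefs can be turned into recommendations, mirroring the decision procedure of the simulated users; the mapping of the 1-to-5 rating scale onto the weight encoding of the earlier sketches is a hypothetical choice.

```python
def recommend_from_verbalized_beliefs(ratings: dict, options: list) -> int:
    """Map elicited 1-5 ratings onto the assumed weight scale (1 -> -2, ...,
    5 -> +2; hypothetical mapping) and pick the option with the highest
    implied reward, reusing the simulated users' decision rule."""
    implied_user = {f: ratings[f] - 3 for f in FEATURES}
    return user_choice(implied_user, options)
```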
[Figure 4 image (x4.png): line charts of agreement with the Bayesian Assistant (%) versus number of interactions (0-5) for Gemma, Llama, and Qwen, each comparing the Original, Oracle, and Bayesian versions of the LLM.]
Figure 4: Fine-tuned LLMs agree more with the Bayesian Assistant. We show agreement between the LLMs and the Bayesian Assistant, measured by the proportion of trials in which the LLM makes the same prediction as the Bayesian Assistant. Fine-tuning on the Bayesian Assistant's predictions makes the LLMs more Bayesian, with the Bayesian versions of each LLM achieving the highest agreement with the Bayesian Assistant. Error bars (too small to be visible in the plot) show standard errors across three random seeds (and three training runs).
### 3.2 Fine-Tuned LLMs Generalize to New Tasks
[Figure 5 image (x5.png): (a) final-round accuracy (%) versus number of flight features (2-8) for Original, Oracle, and Bayesian LLMs, the Bayesian Assistant, and a random baseline; (b) final-round accuracy on hotel recommendation and (c) on web shopping for the Original, Oracle-tuned, and Bayesian-tuned versions of Gemma, Llama, and Qwen, with Bayesian Assistant, random, and (for web shopping) direct fine-tuning baselines.]
Figure 5: Bayesian teaching generalizes outside the task used for fine-tuning. (a) Final-round accuracy gain in fine-tuned models compared to the original LLM when varying task complexity (here the number of features is a proxy for task complexity). (b) Final-round accuracy for LLMs on the hotel recommendation task, which was not seen during fine-tuning. We show the normative Bayesian Assistant's performance with brown dashed lines. (c) Final-round accuracy for LLMs on the web shopping domain, also unseen during fine-tuning. The green dashed line indicates the performance of the LLM when it is fine-tuned directly on web shopping data, such that no domain generalization is necessary. Error bars indicate the standard errors over three training runs (for web shopping) and additionally three random seeds (for flight recommendation and hotel recommendation).
As a result of Bayesian teaching, the LLMs demonstrate a greatly improved ability to approximate Bayesian probabilistic inference. Is this ability specific to the particular task the models were trained on, or do the LLMs' probabilistic skills improve more broadly? To answer this question, we evaluate the fine-tuned LLMs on a set of tasks that diverge to different extents from our original flight recommendation task (see the right panel of Fig. 1 for an overview). All tasks require the LLMs to infer the user's preferences from the user's choices over multiple interactions. Overall, as we show in the rest of this section, we find that fine-tuned LLMs show considerable generalization to new settings, and that, as before, Bayesian teaching is more effective than oracle teaching.
We first test the LLMs on variants of the flight recommendation task with different numbers of features: whereas in the interactions provided during fine-tuning, flights were characterized by four features, in this evaluation setting flights are described by between two and eight features. This requires the LLM to generalize to features that were not included in fine-tuning (e.g., the number of checked bags). In this setting, we find that both types of fine-tuning lead to a large improvement in accuracy compared to the original LLMs. We also find that Bayesian teaching is considerably more effective than oracle teaching, as before (Fig. 5a). We note that as the number of features increases, the space of possible reward functions grows exponentially, and the task becomes inherently more difficult, even for the Bayesian Assistant. Despite this fact, for both fine-tuning methods, performance relative to the upper bound defined by the Bayesian Assistant drops off only moderately as the number of features increases.
The generalization experiments we have discussed so far focused on variants of the flight recommendation task. We next evaluate whether the LLMs can generalize the probabilistic skills they acquire through fine-tuning and apply them to other domains. We consider two such domains: hotel recommendations and web shopping. The hotel recommendation task is a synthetic task whose structure is similar to that of the flight recommendation task presented in fine-tuning. Here, each hotel is defined by four features: distance to downtown, price, rating, and amenities (for an example, see Appendix Table 11).
The web shopping task uses real-world products from a simulated environment [Yao et al., 2022], and differs much more substantially from the fine-tuning task than does the hotel recommendation task. It is difficult to construct a Bayesian Assistant for more natural scenarios like the web shopping task, where the space of user preferences is large and hard to specify formally. For this reason, successful transfer from synthetic settings like the flight recommendation task to more natural scenarios represents a particularly important application of Bayesian teaching. In the web shopping task, each user is defined by a set of randomly sampled goals that characterize the product they are interested in; for example, they might be looking for a shirt that is machine washable, or for a size XL shirt (see Appendix Table 1 for examples). As in the flight domain, the assistant interacts with the user for multiple rounds. In each round, a set of product options is randomly sampled from the product category (e.g., shirts), and the assistant is asked to recommend the best option. Each product is represented by a short title along with a detailed description (see Appendix Table 12 for an example). The user provides feedback at the end of each round, indicating whether or not the assistant's recommendation was correct. The user's preferred option is the one with the highest reward, as defined in Yao et al. [2022]. As mentioned above, it is difficult to construct a Bayesian Assistant for this task due to the large space of possible preferences. Instead, as an alternative upper bound on the transfer performance we can expect from the models fine-tuned on the flight recommendation task, we fine-tune LLMs directly on data from the shopping task.
We find that LLMs fine-tuned on the flight recommendation task generalize to both hotel recommendations and web shopping: they perform much better than the original LLMs on those tasks (Fig. 5b and Fig. 5c). Bayesian teaching continues to outperform oracle teaching, though the gap is smaller for web shopping than for hotel recommendations. There remains a gap between the generalization performance of the LLMs fine-tuned on flight recommendations and the upper bound obtained by fine-tuning the LLMs directly on the web shopping interactions (green dashed line in Fig. 5c). Overall, we conclude that fine-tuning, and especially Bayesian teaching, imparts probabilistic skills that transfer substantially beyond the setting used for fine-tuning.
### 3.3 Generalization to Interactions with Human Users
The synthetically generated data we have used so far makes two simplifying assumptions: the simulated users' choices faithfully reflect the reward function that characterizes their preferences, and all reward functions are encountered equally often. In practice, these assumptions may not hold, as humans' behavior could occasionally be inconsistent with their preferences, due to inattention or other biases, and some preferences may be more common in the population than others (such as a preference for lower price). To evaluate the models in a more realistic setting, we recruit human participants to act as users. Each human participant is asked to first state their preferences for each of the flight features, and then select their preferred flight out of three options, for five different sets of options. We collect data from 10 human participants each for 50 lists of flight option sets, for a total of 500 participants (see Appendix Section A).
The performance of both fine-tuned models and the Bayesian Assistant for human users consistently improves over rounds (Fig. 6), and, as was the case for the simulated users, the Bayesian LLMs consistently outperform the Oracle LLMs; at least for some model families, the Bayesian LLMs also outperform the original LLMs. This indicates that the Bayesian LLMs generalize to human users from the simulated users on which they were fine-tuned.
All models, including the Bayesian Assistant, show substantially lower performance for human users than they did for simulated users, where accuracy after five rounds approached 80% (Fig. 3). In Appendix Section F.2, we show that this is due to the fact that participants' choices are not always consistent with their stated preferences, and as such are impossible to predict with high accuracy (Appendix Fig. 22). For the subset of human users whose choices are perfectly consistent with their preferences, the Bayesian LLM performs much better than the original LLM (Appendix Fig. 21; see also Appendix Section D.3, where we study inconsistent simulated users).
Unlike for the simulated users, for human users the original LLMs perform well even after a single interaction (although, crucially, the original LLMs do not improve over interactions). We attribute the original LLMs' surprisingly strong performance to the fact that human users have generally predictable preferences (e.g., a preference for cheaper flights), such that guesses based on the LLM's priors, without any adaptation to the individual user, can be quite effective (see Appendix Figs. 20 and 21 for evidence for this hypothesis).
[Figure 6 image (x6.png): line charts of accuracy (%) versus number of interactions with human users for Gemma, Llama, and Qwen, comparing the Original, Oracle, and Bayesian LLMs, the Bayesian Assistant, and a random baseline.]
Figure 6: Bayesian teaching generalizes to human users. We show accuracy over rounds when the user is a human participant. The original LLMs achieve strong performance but do not show any learning behavior. In contrast, fine-tuned LLMs (with both Bayesian and Oracle teachers) improve their performance over rounds, and the Bayesian LLMs consistently outperform the Oracle LLMs. Error bars show standard errors across four random seeds (and three training runs); the error bars are not visible in the plot because they are very small.
## 4 Discussion
To interact with the world successfully, an agent needs to adapt its behavior as it obtains additional information about the statistics of its environment. To evaluate the ability of large language models (LLMs) to do so, we introduced a simple flight recommendation task where, in order to make accurate predictions, the model needs to adapt to a user's preferences over multiple interactions with that user. We tested a range of LLMs and found that they struggle to form and update probabilistic beliefs. We further found that continuing the LLMs' training through exposure to interactions between users and the Bayesian Assistant, a model that implements the normative probabilistic belief update strategy, dramatically improves the LLMs' ability to approximate probabilistic reasoning. Crucially, this improvement held not only for the flight recommendation task the LLM was trained on, but also generalized to variants of the flight recommendation task that the LLM had not encountered before, as well as to other tasks. Across the board, this approach, which we refer to as Bayesian teaching, was more effective than a related approach where the LLM is fine-tuned directly on the correct answers, pointing to the effectiveness of the Bayesian training signal.
Our paradigm differs from those used in previous investigations of LLMs' probabilistic reasoning abilities, where LLMs were expected to compute statistics explicitly [Nafar et al., 2025, Paruchuri et al., 2024] or provide probability judgments [Zhu and Griffiths, 2024, Belém et al., 2024]. In our paradigm, probabilistic reasoning is as essential as it is in explicit reasoning tasks, but, crucially, it is implicit in the task. Unlike in some recent studies, where the assistant is expected to ask questions that directly elicit the user's preferences [Li et al., 2023a, Handa et al., 2024, Piriyakulkij et al., 2023, Andukuri et al., 2024, Peng et al., 2024, Aliannejadi et al., 2021, Chen et al., 2024a, Lin et al., 2022], our setup expects the assistant to gradually infer the user's preferences by simply observing the user's choices, and to provide recommendations that are increasingly in line with the user's true preferences. Finally, our findings are consistent with those of concurrent work [Zhao et al., 2025], which also investigates LLMs' ability to infer user preferences from different types of dialogues, including a condition where the user accepts or rejects one or more options provided by the assistant (a setup similar to ours), in which the models performed poorly. Compared to this concurrent study, our work analyzes the LLMs' behavior through the lens of Bayesian inference, and demonstrates the benefits of mimicking a Bayesian model in fine-tuning compared to a more standard fine-tuning strategy, where the model is always provided with the correct answer (oracle teaching, in the terminology we use in the current paper).
We observed robust generalization from the synthetic flight recommendation task on which the LLMs were fine-tuned to the more natural web shopping task. While performance was even stronger when we fine-tuned the LLM directly on interactions from this task (the green dashed line in Fig. 5), in practice it may be difficult or expensive to collect such data; our synthetic fine-tuning strategy provides an alternative that improves the LLM's probabilistic reasoning abilities across tasks, without requiring additional data collection or re-training on the new domain.
Our proposal is related to but distinct from approaches that embed an LLM inside a neuro-symbolic framework for probabilistic reasoning [Wong et al., 2023, Feng et al., 2024, Liu et al., 2024, Piriyakulkij et al., 2024, Grand et al., 2023, Ying et al., 2024, Ellis, 2023]. In those approaches, the LLM is used to translate between natural language inputs and formal representations, which in turn serve as input to a symbolic model that can update its beliefs according to the Bayesian framework [Wong et al., 2023]. Indeed, we provide further evidence that hybrid methods can outperform the LLM-only approach in Appendix Section B, where we describe a variation of our method in which we first ask the LLM to verbalize its beliefs about the user's preferences, and then use an external, symbolic system to make predictions based on these verbalized beliefs. The experiments described in that Appendix section show that in simple tasks where preferences can be mapped to predictions, such hybrid methods indeed outperform direct interaction with the LLM. Our preliminary explorations of this approach can be developed in greater detail in future work.
Besides their superior performance in certain cases, neuro-symbolic methods have the benefit of greater interpretability, and their probabilistic inferences could be more robust. Crucially, however, the utility of such methods is limited to problems whose structure can be made explicit in the symbolic component of the system. By contrast, the method we propose empowers the LLM to approximate probabilistic inference on its own, such that it can apply this skill to domains that are hard to codify explicitly in a symbolic system, such as the web shopping task we have examined. This approach leverages LLMs' remarkable ability to generalize to new problems defined using natural language.
Notably, even in cases where the domain is simple enough for a purely symbolic model to be constructed, such models may not be consistently more accurate than LLMs. In our study, we found that while a moderate performance gap persisted between the fine-tuned models and the Bayesian Assistant for "well-behaved" simulated users, for human users, whose choices are not always consistent with their preferences, our Bayesian LLMs were in fact superior to the fully symbolic Bayesian Assistant, demonstrating LLMs' greater robustness to noise compared to symbolic models.
We have argued that, through mimicking the Bayesian Assistant, the LLMs learn to perform probabilistic inference, albeit only approximately. This hypothesis may appear surprising given that the LLMs' training objective provides no explicit supervision for this skill, and that the transformer architecture does not explicitly track probability distributions: the model is trained only to predict the next word produced by the Bayesian Assistant. That being said, there is mounting evidence that in order to predict the next token successfully, LLMs can acquire sophisticated representations that match the structure of the process that generated those tokens. In the case of natural language syntax, for example, the internal representations of LLMs trained solely to predict upcoming words have been shown to encode abstract features such as syntactic role and grammatical number [Lakretz et al., 2019, Hao and Linzen, 2023, Manning et al., 2020]. It would be a fruitful direction for future work to determine how probabilistic reasoning is implemented in the LLMs' internal representations, for example by using techniques such as probes and causal interventions [Finlayson et al., 2021, Ravfogel et al., 2021, Vig et al., 2020] to find internal representations of the model's probability distributions over users' preferences, or by using circuit analysis [Wang et al., 2023a] to explore the computations through which the model updates these distributions.
The success of Bayesian teaching in imparting approximate probabilistic reasoning skills to LLMs opens up a range of questions for future work. Would the benefits of Bayesian teaching extend to larger models than we were able to fine-tune in this work, or to the recent generation of models that are explicitly trained to reason in words [Guo et al., 2025]? Does the benefit of Bayesian teaching extend to continuous domains and real-world applications beyond the ones we evaluated (for example, interactions whose goal goes beyond shopping)? Could we provide the models with a stronger supervision signal, whether through supervised fine-tuning or reinforcement learning: for example, by instructing them to consider explicit probability distributions, by providing them with explicit supervision on the optimal way to update these distributions (for example, by supervising beliefs as in Appendix Fig. 10), or by encouraging them to maintain explicit representations of users such that the probability distributions are consistent across interactions with the same user?
The goal of this study was not to replicate human behavior in LLMs, but rather to identify methods that can bring LLMs' probabilistic reasoning skills closer to the normative Bayesian strategy: for most applications, we expect AI assistants to follow normative reasoning standards rather than reproduce human deviations from that standard. That being said, our comparisons between LLMs and humans point to a number of directions for future work. Our participants showed substantial deviations from the normative reasoning strategy, in line with prior work on reasoning biases [Eisape et al., 2024, Rottman and Hastie, 2016, Chaigneau et al., 2025, Tversky and Kahneman, 1974]. To what extent can people be taught to follow the normative strategy more closely? Can participants' apparent biases be explained as consequences of resource limitations [Simon, 1955]? How consistent are participants' choices with their stated preferences? Do people's deviations from the normative strategy align with those of LLMs [Eisape et al., 2024], and what properties of an LLM lead to closer alignment with humans?
While our findings from our first experiment point to the limitations of particular LLMs, the positive findings of our subsequent fine-tuning experiments can be viewed as a demonstration of the strength of the LLM "post-training" paradigm more generally: by training the LLMs on demonstrations of the normative strategy for performing the task, we were able to improve their performance considerably, suggesting that they learned to approximate the probabilistic reasoning strategy illustrated by the demonstrations. The LLMs were able to generalize this strategy to domains where it is difficult to encode explicitly in a symbolic model, demonstrating the power of distilling a classic symbolic model into a neural network. We hypothesize that this generalization ability is, in part, responsible for LLMs' remarkable empirical success.
## Acknowledgments
We thank Stephanie Chan, Andrew Lampinen, Michael Mozer, Peter Shaw, and Zhaofeng Wu for helpful discussions.
## Author Contributions
L.Q., F.S., T.L., and S.V.S. co-led the project. S.V.S. conceptualized the project direction. L.Q. conducted the experiments and analysis. L.Q., F.S., T.L., and S.V.S. framed, analyzed and designed experiments, with inputs from K.A. and Y.K. L.Q., T.L., and S.V.S. wrote the paper with help from F.S., K.A., and Y.K.
## References
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023.
- Aliannejadi et al. [2021] M. Aliannejadi, J. Kiseleva, A. Chuklin, J. Dalton, and M. Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
- Andukuri et al. [2024] C. Andukuri, J.-P. FrÀnken, T. Gerstenberg, and N. Goodman. STaR-GATE: Teaching language models to ask clarifying questions. In First Conference on Language Modeling, 2024.
- Anthropic [2024] Anthropic. Claude 3, 2024.
- Baker et al. [2011] C. Baker, R. Saxe, and J. Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011.
- Belém et al. [2024] C. G. Belém, M. Kelly, M. Steyvers, S. Singh, and P. Smyth. Perceptions of linguistic uncertainty by language models and humans. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Brown et al. [2020] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Chaigneau et al. [2025] S. Chaigneau, N. Marchant, and B. Rehder. Breaking the chains of independence: A bayesian uncertainty model of normative violations in human causal probabilistic reasoning. OSF, 2025.
- Chater and Manning [2006] N. Chater and C. D. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 2006.
- Chater et al. [2006] N. Chater, J. B. Tenenbaum, and A. Yuille. Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7), 2006.
- Chen et al. [2024a] S. Chen, S. Wiseman, and B. Dhingra. Chatshop: Interactive information seeking with language agents. ArXiv preprint, abs/2404.09911, 2024a.
- Chen et al. [2024b] X. Chen, H. Huang, Y. Gao, Y. Wang, J. Zhao, and K. Ding. Learning to maximize mutual information for chain-of-thought distillation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024b.
- Chiang et al. [2024] W. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, 2024.
- Christiano et al. [2017] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
- Deng et al. [2023] Y. Deng, K. Prasad, R. Fernandez, P. Smolensky, V. Chaudhary, and S. Shieber. Implicit chain of thought reasoning via knowledge distillation. ArXiv preprint, abs/2311.01460, 2023.
- Eisape et al. [2024] T. Eisape, M. Tessler, I. Dasgupta, F. Sha, S. Steenkiste, and T. Linzen. A systematic comparison of syllogistic reasoning in humans and language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
- Ellis [2023] K. Ellis. Human-like few-shot learning via bayesian reasoning over natural language. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- Feng et al. [2024] Y. Feng, B. Zhou, W. Lin, and D. Roth. BIRD: A trustworthy bayesian inference framework for large language models. In The Thirteenth International Conference on Learning Representations, 2024.
- Finlayson et al. [2021] M. Finlayson, A. Mueller, S. Gehrmann, S. Shieber, T. Linzen, and Y. Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.
- Grand et al. [2023] G. Grand, V. Pepe, J. Andreas, and J. Tenenbaum. Loose lips sink ships: Asking questions in battleship with language-informed program sampling. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2023.
- Grattafiori et al. [2024] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models, 2024.
- Griffiths et al. [2007] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic association. Psychological Review, 114, 2007.
- Griffiths et al. [2024] T. L. Griffiths, N. Chater, and J. B. Tenenbaum. Bayesian Models of Cognition: Reverse Engineering the Mind. The MIT Press, 2024. ISBN 9780262049412.
- Guo et al. [2025] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645, 2025.
- Ha and Schmidhuber [2018] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, 31, 2018.
- Handa et al. [2024] K. Handa, Y. Gal, E. Pavlick, N. Goodman, J. Andreas, A. Tamkin, and B. Z. Li. Bayesian preference elicitation with language models. ArXiv preprint, abs/2403.05534, 2024.
- Hao and Linzen [2023] S. Hao and T. Linzen. Verb conjugation in transformers is determined by linear encodings of subject number. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- Hinton et al. [2015] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
- Hu et al. [2022] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Hu and Levy [2023] J. Hu and R. Levy. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Koehler and James [2010] D. J. Koehler and G. James. Probability matching and strategy availability. Memory & Cognition, 38(6), 2010.
- Jern et al. [2017] A. Jern, C. G. Lucas, and C. Kemp. People learn other peopleâs preferences through inverse decision-making. Cognition, 168, 2017. ISSN 0010-0277.
- Johnson-Laird [1980] P. N. Johnson-Laird. Mental models in cognitive science. Cognitive Science, 4(1), 1980.
- Jung et al. [2024] J. Jung, P. West, L. Jiang, F. Brahman, X. Lu, J. Fisher, T. Sorensen, and Y. Choi. Impossible distillation for paraphrasing and summarization: How to make high-quality lemonade out of small, low-quality model. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
- Kim and Rush [2016] Y. Kim and A. M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
- Kojima et al. [2022] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Kotha et al. [2024] S. Kotha, J. M. Springer, and A. Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024.
- Lakretz et al. [2019] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, and M. Baroni. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- LeCun [2022] Y. LeCun. A path towards autonomous machine intelligence. Open Review, 62(1), 2022.
- Li et al. [2023a] B. Z. Li, A. Tamkin, N. Goodman, and J. Andreas. Eliciting human preferences with language models. In The Thirteenth International Conference on Learning Representations, 2023a.
- Li et al. [2023b] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023b.
- Lin et al. [2022] J. Lin, D. Fried, D. Klein, and A. Dragan. Inferring rewards from language in context. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
- Lin et al. [2024] Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong, R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, and T. Zhang. Mitigating the alignment tax of RLHF. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Liu et al. [2024] R. Liu, J. Geng, J. Peterson, I. Sucholutsky, and T. L. Griffiths. Large language models assume people are more rational than we really are. In The Thirteenth International Conference on Learning Representations, 2024.
- Manning et al. [2020] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48), 2020.
- Nafar et al. [2025] A. Nafar, K. B. Venable, and P. Kordjamshidi. Reasoning over uncertain text by generative large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- Nye et al. [2021] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2021.
- OpenAI [2025] OpenAI. Introducing GPT-4.1 in the API, 2025.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Palan and Schitter [2018] S. Palan and C. Schitter. Prolific.ac: A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 2018.
- Paruchuri et al. [2024] A. Paruchuri, J. Garrison, S. Liao, J. B. Hernandez, J. Sunshine, T. Althoff, X. Liu, and D. McDuff. What are the odds? language models are capable of probabilistic reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- Peng et al. [2024] A. Peng, Y. Sun, T. Shu, and D. Abel. Pragmatic feature preferences: Learning reward-relevant preferences from human input. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, 2024.
- Piriyakulkij et al. [2023] T. Piriyakulkij, V. Kuleshov, and K. Ellis. Active preference inference using language models and probabilistic reasoning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Piriyakulkij et al. [2024] T. Piriyakulkij, C. Langenfeld, T. A. Le, and K. Ellis. Doing experiments and revising rules with natural language and probabilistic reasoning. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024.
- Rafailov et al. [2023] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- Ravfogel et al. [2021] S. Ravfogel, G. Prasad, T. Linzen, and Y. Goldberg. Counterfactual interventions reveal the causal effect of relative clause representations on agreement prediction. In Proceedings of the 25th Conference on Computational Natural Language Learning, 2021.
- Rehder [2018] B. Rehder. Beyond Markov: Accounting for independence violations in causal reasoning. Cognitive Psychology, 103, 2018.
- Rottman and Hastie [2016] B. M. Rottman and R. Hastie. Do people reason rationally about causally related events? Markov violations, weak inferences, and failures of explaining away. Cognitive Psychology, 87, 2016.
- Sanh et al. [2022] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Simon [1955] H. A. Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, 1955.
- Sloman and Lagnado [2015] S. A. Sloman and D. Lagnado. Causality in thought. Annual Review of Psychology, 66(1), 2015.
- Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Team [2024a] G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024a.
- Team [2024b] G. Team. Gemma 2: Improving open language models at a practical size. ArXiv preprint, abs/2408.00118, 2024b.
- Tenenbaum et al. [2006] J. B. Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7), 2006. ISSN 1364-6613. Special issue: Probabilistic models of cognition.
- Tenenbaum et al. [2011] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 2011.
- Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023.
- Tversky and Kahneman [1974] A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185(4157), 1974.
- Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
- Vig et al. [2020] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. M. Shieber. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Wang et al. [2023a] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, 2023a.
- Wang et al. [2023b] P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren. SCOTT: Self-consistent chain-of-thought distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023b.
- Wang et al. [2024] Y. Wang, A. Bai, N. Peng, and C.-J. Hsieh. On the loss of context-awareness in general instruction fine-tuning. ArXiv preprint, abs/2411.02688, 2024.
- Wei et al. [2022a] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a.
- Wei et al. [2022b] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b.
- Wong et al. [2023] L. Wong, G. Grand, A. K. Lew, N. D. Goodman, V. K. Mansinghka, J. Andreas, and J. B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought. ArXiv preprint, abs/2306.12672, 2023.
- Xu and Tenenbaum [2007] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2), 2007.
- Yang et al. [2024a] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv e-prints, 2024a.
- Yang et al. [2024b] H. Yang, Y. Zhang, J. Xu, H. Lu, P.-A. Heng, and W. Lam. Unveiling the generalization power of fine-tuned large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024b.
- Yao et al. [2022] S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Ying et al. [2024] L. Ying, T. Zhi-Xuan, L. Wong, V. Mansinghka, and J. Tenenbaum. Grounding language about belief in a bayesian theory-of-mind. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2024.
- Yu et al. [2024] P. Yu, J. Xu, J. E. Weston, and I. Kulikov. Distilling system 2 into system 1. In The First Workshop on System-2 Reasoning at Scale, NeurIPSâ24, 2024.
- Zhao et al. [2025] S. Zhao, M. Hong, Y. Liu, D. Hazarika, and K. Lin. Do LLMs recognize your preferences? evaluating personalized preference following in LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- Zhu and Griffiths [2024] J.-Q. Zhu and T. Griffiths. Incoherent probability judgments in large language models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2024.
## Appendix A Experimental Details
### A.1 Simulated Users in the Flight Recommendation Task
In each round, we presented a set of $k$ flight options $\mathcal{O}=\{o_{1},...,o_{k}\}$ to both the simulated user and the assistant (typically $k=3$). Each flight has a departure time, a duration, a number of stops, and a cost; these four features are encoded in a vector $\bm{\phi}(o)\in\mathbb{R}^{4}$. For each flight option, each feature can take one of 11 values uniformly distributed between 0 and 1, except for the number of stops, which has 3 values. This defines $3\times 11^{3}$ unique flight options. We converted these four numbers into a textual description, as illustrated in Fig. 1.
The user's preferences are defined by a reward function $\bm{\theta}$ parameterized by four numbers, which indicate the user's preferences for the aforementioned features. The space $\Theta$ of reward functions includes all four-dimensional vectors with the values $\{-1,-0.5,0,0.5,1\}$, where $-1$ corresponds to a preference for low values of this feature (e.g., short flights) and $1$ to a preference for high values (e.g., long flights). Given a set of flight options $\mathcal{O}$, the user computes the reward $r(o;\bm{\theta})=\bm{\theta}^{T}\bm{\phi}(o)$ of each flight $o$, and chooses the flight with the highest reward:
$$
\displaystyle o^{*}(\mathcal{O},\bm{\theta})=\textrm{argmax}_{o\in\mathcal{O}}\,r(o;\bm{\theta}). \tag{1}
$$
When there was a tie between multiple options, we randomly selected one of the options with the highest reward. We excluded the reward function $(0,0,0,0)$, that is, the completely indifferent user. This results in a total of $5^{4}-1=624$ possible reward functions, corresponding to 624 simulated users. We note that these simulated users are highly simplified and are not meant to capture the full complexity of humans: humans do not always choose the option that maximizes their utility [Koehler and James, 2010], and their preferences may evolve over time.
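For concreteness, the following is a minimal sketch of this simulated user in Python; the helper names and the random sampling of option sets are our own illustration, not released code.

```python
import random

import numpy as np

FEATURE_VALUES = np.linspace(0.0, 1.0, 11)  # departure time, duration, cost
STOP_VALUES = np.linspace(0.0, 1.0, 3)      # number of stops takes 3 values
PREF_VALUES = [-1.0, -0.5, 0.0, 0.5, 1.0]   # possible per-feature weights


def sample_reward_function(rng: random.Random) -> np.ndarray:
    """Sample theta uniformly, excluding the indifferent user (0, 0, 0, 0)."""
    while True:
        theta = np.array([rng.choice(PREF_VALUES) for _ in range(4)])
        if np.any(theta != 0.0):
            return theta


def sample_option(rng: random.Random) -> np.ndarray:
    """phi(o): departure time, duration, #stops, and cost, scaled to [0, 1]."""
    return np.array([rng.choice(FEATURE_VALUES), rng.choice(FEATURE_VALUES),
                     rng.choice(STOP_VALUES), rng.choice(FEATURE_VALUES)])


def user_choice(options: np.ndarray, theta: np.ndarray,
                rng: random.Random) -> int:
    """Pick the option maximizing r(o; theta) = theta^T phi(o) (Equation 1),
    breaking ties uniformly at random."""
    rewards = options @ theta
    best = np.flatnonzero(rewards == rewards.max())
    return int(rng.choice(best))


rng = random.Random(0)
theta = sample_reward_function(rng)
options = np.stack([sample_option(rng) for _ in range(3)])  # k = 3
print(user_choice(options, theta, rng))
```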
### A.2 The Bayesian Assistant
Since the space of reward functions is relatively small, we were able to perform exact Bayesian updates. In each round, given options $\mathcal{O}$ and the user's preferred option $o^{*}$, the Bayesian Assistant updates its posterior as follows:
$$
\displaystyle q^{i+1}_{B}(\bm{\theta}|\mathcal{O}^{i+1},o^{*i+1})=\frac{p(o^{*i+1}|\bm{\theta},\mathcal{O}^{i+1})q^{i}_{B}(\bm{\theta})}{p(o^{*i+1}|\mathcal{O}^{i+1})}, \tag{2}
$$
where the likelihood function indicates whether the reward function is consistent with the user's choice:
$$
\displaystyle p(o^{*}|\bm{\theta},\mathcal{O})=\mathbbm{1}\big[r(o^{*};\bm{\theta})=\max_{o\in\mathcal{O}}r(o;\bm{\theta})\big]. \tag{3}
$$
The Bayesian Assistant then makes flight recommendations based on its reward posterior mean, $\hat{\bm{\theta}}=\mathbb{E}_{q(\bm{\theta})}[\bm{\theta}]$, following Equation 1. In most experiments, we used a uniform prior (for experiments with other priors, see Supplementary Fig. C10b).
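In code, this exact update amounts to maintaining a probability vector over the 624 enumerable reward functions; a minimal sketch (our variable names, not released code):

```python
import itertools

import numpy as np

# Enumerate all 5^4 - 1 = 624 reward functions (excluding full indifference).
THETAS = np.array([t for t in itertools.product([-1, -0.5, 0, 0.5, 1], repeat=4)
                   if any(v != 0 for v in t)])


def bayes_update(posterior: np.ndarray, options: np.ndarray,
                 chosen: int) -> np.ndarray:
    """One round of Equation 2. The likelihood (Equation 3) is 1 for every
    theta under which the chosen option attains the maximal reward."""
    rewards = THETAS @ options.T                            # (624, k)
    likelihood = rewards[:, chosen] == rewards.max(axis=1)
    posterior = posterior * likelihood
    # For the simulated users above, the true theta is never ruled out,
    # so the normalizer stays positive.
    return posterior / posterior.sum()


def recommend(posterior: np.ndarray, options: np.ndarray) -> int:
    """Recommend via the posterior-mean reward function (Equation 1)."""
    theta_hat = posterior @ THETAS                          # E_q[theta]
    return int(np.argmax(options @ theta_hat))


prior = np.full(len(THETAS), 1.0 / len(THETAS))             # uniform prior
```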
### A.3 LLMs
Our main experiments focus on the instruction-tuned versions of open-weights models, including models from the Gemma 2 [Team, 2024b], Llama 3 [Grattafiori et al., 2024], and Qwen 2.5 [Yang et al., 2024a] families. We used Gemma 2 models with 9B parameters (https://huggingface.co/google/gemma-2-9b-it) and 27B parameters (https://huggingface.co/google/gemma-2-27b-it), Llama 3 models with 8B parameters (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and 70B parameters (https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), and Qwen 2.5 models with 7B parameters (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and 32B parameters (https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). We also evaluated Gemini 1.5 Pro [Team, 2024a] and GPT-4.1 Mini [OpenAI, 2025], which can only be accessed through an API, as representatives of stronger models whose weights are not accessible. All of the models we use are based on the Transformer neural network architecture [Vaswani et al., 2017]. We used greedy decoding (temperature of 0) for all experiments.
### A.4 Generalization Tasks
For the variants of the flight recommendation task (see "Fine-tuned LLMs generalize to new tasks"), we varied the number of flight features, ranging from two to eight. The full set of features includes, in addition to the four above: arrival time, layover duration, cancellation policy, and number of bags. As the number of possible reward functions grows exponentially with the number of features, we randomly sampled up to 1,000 reward functions (simulated users) for each number of features.
For the hotel recommendation task, the hotel features include distance to downtown, price, rating, and amenities. For each hotel option, the distance to downtown and price take one of 11 values uniformly distributed between 0 and 1, while rating and amenities take one of 5 values uniformly distributed between 0 and 1, resulting in $5\times 5\times 11^{2}$ unique hotel options. We evaluated $624$ different simulated users, as in the flight recommendation task.
For the web shopping task, we used real-world products that are publicly available at https://webshop-pnlp.github.io. We chose the 100 categories with the most products. Each product is described by a title and bullet point descriptions, whose length is limited to 800 characters. The reward of a user for a product was calculated based on text-matching heuristics on product attributes and options, following Yao et al. [2022]. For each category, we randomly sampled 10 users, each consisting of five-round interactions. Performance was evaluated on 100 held-out option sets within the same category.
To reduce the sensitivity of the results to the specific randomly selected option sets, we averaged all experiments over three random seeds for flight and hotel recommendations, and over all categories for web shopping. In each case, we report the mean and the standard error across runs and evaluation seeds.
### A.5 LLM Fine-Tuning
We used the instruction-tuned versions of Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B for all fine-tuning experiments. For each reward function, we generated 10 user–assistant interactions, resulting in $624\times 10=6,240$ fine-tuning examples, each consisting of a five-round interaction. We experimented with fine-tuning on more examples but did not observe any significant improvement. The interactions were formatted as shown in Supplementary Table H3.
We used full fine-tuning (i.e., all parameters were updated) with a learning rate of 2e-6, a batch size of 128, and a maximum sequence length of 2048, for 1 epoch. The models were fine-tuned using the standard language modeling objective, i.e., the cross-entropy loss between the model's predicted token probabilities and the ground-truth tokens in the training data. The loss was computed only on the model's responses. For each setup, we trained three models with different random seeds. We conducted all fine-tuning experiments using 4 $\times$ H100 GPUs based on the standard recipe (https://github.com/huggingface/alignment-handbook). Fine-tuning Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B required about an hour per model.
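As a rough illustration, the snippet below sketches a response-masked language-modeling loss of the kind described above; the tensor shapes and the mask convention are our assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F


def response_only_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                          response_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy, counted only where the target token belongs
    to the assistant's response (response_mask == 1 at those positions).

    Assumed shapes: logits (B, T, V); input_ids, response_mask (B, T)."""
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    targets = input_ids[:, 1:]
    mask = response_mask[:, 1:].float()
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        targets.reshape(-1), reduction="none").view(targets.shape)
    return (token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```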
### A.6 Human Annotations
We collected two sets of human annotations for the flight recommendation task: one where the annotators act as assistants and one where they act as users. The annotators were recruited online and paid the market rate of $12 an hour, as suggested by Prolific [Palan and Schitter, 2018], the platform we used to recruit participants. See details in Supplementary Section E.
The annotation setup for the assistant role follows the evaluation setup we used for LLMs. In each round, the annotator was asked to make recommendations from three flight options, with each represented in the same format shown to the LLMs. After making their recommendation, the annotator received feedback indicating whether their choice was correct. They were then directed to a preference questionnaire, where they provided their estimates of the userâs preferences for each individual feature (see annotation interface in Supplementary Fig. G17). We sampled 48 reward functions by first grouping them based on the L2 distance between their four-dimensional parameter vector and the origin, then sampling from each group proportionally to its size. We had 15 separate participants provide annotations for each of the 48 simulated users (720 human participants in total).
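A minimal sketch of this stratified sampling (our helper names; the paper does not specify how rounding is corrected to reach exactly 48, so the snippet leaves that unhandled):

```python
import itertools
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
thetas = [t for t in itertools.product([-1, -0.5, 0, 0.5, 1], repeat=4)
          if any(v != 0 for v in t)]

# Group reward functions by the L2 distance of theta from the origin.
groups = defaultdict(list)
for theta in thetas:
    groups[round(float(np.linalg.norm(theta)), 6)].append(theta)

# Sample from each group proportionally to its size (48 users in total,
# up to rounding).
sampled = []
for members in groups.values():
    n = round(48 * len(members) / len(thetas))
    picks = rng.choice(len(members), size=min(n, len(members)), replace=False)
    sampled.extend(members[i] for i in picks)
```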
When the annotator served in the user role, we first asked them to rate their own preferences for different flight features; these ratings serve as their reward function. Then, the annotator was asked to select their preferred option out of three flight options based on their preferences; this was repeated for five rounds. We constructed 50 such lists of five rounds of flight options, and had 10 annotators produce annotations for each of these 50 lists (500 human participants in total). We then produced three randomly shuffled variants of each of the interactions, for a total of 2,000 interactions (500 original interactions and $3\times 500$ shuffled interactions). This ensures that a particular option set does not consistently appear at a particular point in the interaction (for example, at the end, where participants may be paying less attention). To ensure quality, we required annotators to think for at least 30 seconds before making their selection.
## Appendix B Can LLMs Accurately Verbalize Their Beliefs?
The results of the fine-tuning experiments described in the main text suggest that fine-tuned models are able to infer the user's preferences, at least implicitly. Here, we test whether the LLMs can verbalize their beliefs about the user's preferences, based on the user's previous booking history, which is provided as context.
### B.1 Eliciting Beliefs About User Preferences
We elicit beliefs in one of two ways. For the open-weights models (Gemma 2, Llama 3, and Qwen 2.5), for which we have access to the probability distribution over upcoming words, we employ continuation scoring, as follows. After interacting with the LLM for one or more rounds, the user asks the LLM for its beliefs about the user's preferences, for example, "on a scale of 1 to 5, what is my preference for price?", where $1$ indicates a strong preference for cheaper flights, $3$ indicates no strong preference, and $5$ indicates a strong preference for expensive flights. We score the numbers 1, 2, 3, 4, and 5 as possible continuations of the current text and renormalize them to form a probability distribution over these five numbers (see Table 4 for a detailed example).
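The snippet below sketches continuation scoring with the Hugging Face `transformers` library. The checkpoint name matches the Gemma 2 9B model listed in Appendix A.3, but the prompt string is only illustrative (the actual prompt appears in Table 4).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")

# Illustrative elicitation prompt appended to the interaction history.
prompt = "... On a scale of 1 to 5, what is my preference for price? Answer: "
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Score "1" ... "5" as continuations and renormalize into a distribution
# (assumes each digit encodes to a single token, as with these tokenizers).
ids = [tokenizer.encode(str(i), add_special_tokens=False)[0]
       for i in range(1, 6)]
probs = torch.softmax(next_token_logits[ids], dim=0)
```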
For closed-weights models (Gemini 1.5 Pro and GPT-4.1 Mini), where the LLM's underlying probability distribution over upcoming words is not made accessible to researchers, we ask the LLM to explicitly provide a probability distribution over each of the five points on the scale (see Table 7; for a comparison between the scoring and explicit probability judgment methods in Gemma 2, which finds that scoring yields more accurate estimates, see Section C.1). For our human participants, we use a survey similar to the one we use for closed-weights models.
We then approximate the distribution over reward functions as a factorization of these feature distributions:
$$
\displaystyle q_{LLM}^{i}(\bm{\theta}|\mathcal{O}^{i},o^{*i})\approx\prod_{j}q_{LLM}^{i}(\bm{\theta}_{j}|\mathcal{O}^{i},o^{*i},c_{j}^{i}), \tag{4}
$$
where $q_{LLM}^{i}(\bm{\theta}_{j}|\mathcal{O}^{i},o^{*i},c_{j}^{i})$ is the probability that the LLM assigns to each of the user's preferences for feature $j$ given the current context in the prompt $c_{j}^{i}$, using either scoring or explicit probability judgment. This makes an independence assumption, whereby the preference for one feature does not interact with the preference for another; because this assumption is quite strong, we cannot guarantee that it provides a complete picture of the LLM's beliefs over all possible reward functions. We elicit the LLM's beliefs by prompting it; it is possible that other techniques, such as probing, where a classifier is trained to decode the model's internal activations, could yield different results. We leave a more systematic study of this question for future work.
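When these factorized beliefs are converted into recommendations using the Bayesian Assistant's posterior-mean procedure (Section B.2 below), the computation reduces to a per-feature expectation followed by Equation 1. A minimal sketch, with our own variable names and the scale-to-weight mapping implied by the setup in Appendix A.1:

```python
import numpy as np

SCALE_TO_WEIGHT = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # scale points 1..5


def belief_based_recommendation(feature_beliefs: np.ndarray,
                                options: np.ndarray) -> int:
    """feature_beliefs: (4, 5) array whose row j is the elicited distribution
    over the five scale points for feature j. Under the independence
    assumption of Equation 4, the posterior-mean reward function is the
    per-feature expectation; we then recommend via Equation 1."""
    theta_hat = feature_beliefs @ SCALE_TO_WEIGHT  # E[theta_j] per feature
    return int(np.argmax(options @ theta_hat))
```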
### B.2 Evaluating the LLMâs Verbalized Beliefs
To determine whether the LLM can accurately verbalize its beliefs about the user's preferences, we derive flight recommendations from the LLM's verbalized beliefs, using the same procedure that the Bayesian Assistant uses to make recommendations based on its explicit beliefs, and evaluate the accuracy of these recommendations. We follow the same evaluation setup as in our main experiments, except that at the end of each round, we query the LLM's beliefs about the user's preferences. Importantly, this process branches out from the main dialogue, ensuring that the interaction remains unaffected (Fig. 7). We also examine whether the recommendations produced in this way are consistent with the recommendations made by the LLM directly. High consistency between these two measures would suggest that the LLM's verbalized beliefs align with the implicit internal beliefs the LLM uses to make predictions in the original setup.
We also perform an analogous analysis for the experiment where human participants acted as the assistant to a simulated user. Recall that after each round we asked the participants what they thought the user's preferences were. We use these verbalized beliefs about the user's preferences as input to the same computation we used for the LLMs' verbalized beliefs. As with the LLMs, we can compute the consistency between the flight choices derived in this way and the participants' choices in the experiment. We evaluated only on the five-round interactions of the 48 simulated users for which we have human annotations.
Figure 7: Experimental design for LLM evaluation. At the end of each round, we evaluate the LLM using new option sets for which it has not received feedback. The evaluation branches out from the main interactions (that is, the evaluation performed after the first round is not included in the context of the second round). The LLM's direct evaluation, where we ask the LLM directly to choose a flight, follows the blue lines; the belief-based evaluation, where we first assess the LLM's beliefs about the user's preferences and then use them to choose the flight, follows the gray lines. The dashed lines indicate the deterministic conversion of the LLM's beliefs into flight recommendations.
[x8.png: bar charts with three panels comparing (a) final-round accuracy (%) for direct prediction, (b) final-round accuracy (%) for belief-based prediction, and (c) final-round consistency (%) between direct and belief-based predictions, across the original, Oracle, and Bayesian variants of Gemma, Llama, and Qwen, as well as GPT-4.1 Mini, Gemini 1.5 Pro, and human participants, with Bayesian Assistant and Random baselines.]
</details>
Figure 8: Comparison of direct accuracy and belief-based accuracy. We show final-round accuracy based on (a) the LLM's or human's direct predictions and (b) predictions derived from their verbalized beliefs about the user's preferences. The gray dashed line indicates random performance, while the brown dashed line indicates the performance of the Bayesian Assistant. For human participants, we only evaluate on a subset of our evaluation data, which includes 48 different simulated users. (c) Final-round consistency between two predictions: the prediction directly provided by the LLM or human participants and the prediction derived from their beliefs about the user's preferences. Fine-tuned LLMs show better consistency than the original LLMs, with Bayesian LLMs achieving the highest consistency. Error bars show standard error across participants for humans and across three random seeds (and three training runs) for LLMs.
### B.3 Results
For the original LLMs, we find that the approach described in this section, where we first estimate the LLMs' beliefs about the user's preferences by explicitly querying the LLMs and then use a decision-making component that is external to the LLM, performs better than directly using the LLMs' predictions (Fig. 8b vs. Fig. 8a, original LLMs). We also find that the original LLMs' direct predictions are often inconsistent with the belief-based predictions (those derived from the beliefs elicited from the LLMs), with less than 50% alignment between the two sets of predictions (Fig. 8c). Human participants similarly show high inconsistency between the two types of predictions.
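To make the external decision-making component concrete, the sketch below shows one way such a belief-based prediction could be computed; the rating-to-weight mapping, function names, and data layout are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of the external decision-making component: ratings
# 1..5 are mapped to reward weights, and each flight is scored by its
# expected reward under the elicited beliefs.
RATING_TO_WEIGHT = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # assumed mapping

def belief_based_prediction(belief, flights):
    """belief: feature -> length-5 probability vector over ratings.
    flights: list of dicts mapping feature -> normalized feature value."""
    weights = {feat: float(RATING_TO_WEIGHT @ p) for feat, p in belief.items()}
    scores = [sum(weights[feat] * flight[feat] for feat in weights)
              for flight in flights]
    return int(np.argmax(scores))  # index of the recommended flight

belief = {"duration": np.array([0.7, 0.2, 0.05, 0.03, 0.02]),
          "price": np.array([0.5, 0.3, 0.1, 0.05, 0.05])}
flights = [{"duration": 0.9, "price": 0.2}, {"duration": 0.3, "price": 0.6}]
print(belief_based_prediction(belief, flights))  # picks the shorter flight
```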
Predictions based on the fine-tuned LLMs' verbalized beliefs are substantially more accurate than those based on the original LLMs' verbalized beliefs, except for the Qwen 2.5 models (Fig. 8a and Fig. 8b, Oracle LLMs and Bayesian LLMs). This suggests that both training methods teach the LLM to infer the user's preferences and update them as more information becomes available, even though neither method provides the model with direct access to users' preferences. For both Gemma 2 and Llama 3, the Bayesian variant of the LLMs produces more accurate estimates of the user's preferences than the Oracle one. Likewise, compared to the Oracle variants, the Bayesian variants achieve higher consistency between the predictions directly provided by the LLM and those derived from the LLM's verbalized beliefs. The difference in overall accuracy between these models' direct predictions and belief-based predictions is much smaller after fine-tuning. This trend, however, does not hold for the Qwen 2.5 model: while direct prediction accuracy improves after fine-tuning, belief-based prediction accuracy remains unchanged from the original LLM. This suggests that for Gemma 2 and Llama 3, prompt-based prediction elicitation may tap into a representation that is shared with the computation used to make direct predictions, but that this is not the case for Qwen 2.5.
## Appendix C Additional Results
### C.1 The Original LLMs' Poor Performance is Robust to Evaluation Setup
Figure 9: Final-round accuracy of Gemma Original under different variations of our experimental setup. We report both the model's direct predictions (hatched bars) and the predictions derived from the model's verbalized beliefs (solid bars; Supplementary Section B). (a) We compare the original interactive setting, where we directly ask the LLM to generate predictions and provide it with feedback, with other common techniques: non-interactive prompting, where we always provide correct examples; chain-of-thought (CoT) prompting, which encourages the LLM to think step-by-step; and methods that incorporate the LLM's verbalized reward posterior distribution in the context. (b) The textual representation of the flight options, which uses natural language descriptions deterministically generated from the feature values, compared to the numerical representation, which directly uses the feature values. (c) 5-round interactions between the user and the LLM compared to 30-round interactions. (d) The scoring method, which assesses the LLM's beliefs by scoring possible continuations, compared to the generation method, where we explicitly ask the LLM to generate probability judgments. (e) Performance without versus with the verbalized user's preferences from the Bayesian model. (f) Instruction-tuned models versus their pre-trained base models. Error bars show standard errors across three random seeds.
In light of the poor performance of the original LLMs (before fine-tuning), we considered various modifications to our evaluation setting. These include prompting-based methods, that is, modifications to the instructions provided to the LLM; an alternative, numerical representation of the flight options; and a greater number of interactions. We also ablate methods that access the LLM's verbalized beliefs, explore whether providing the user's preferences improves performance, and compare the instruction-tuned versions of the models with their corresponding pre-trained versions. These robustness analyses focus on Gemma 2 9B. Overall, we do not observe significant differences across these evaluations; the only interventions that effectively improved model performance involved fine-tuning (see Section C.2).
#### Advanced prompting methods do not improve accuracy.
Our main experiments evaluate the LLM in an interactive setting, where the user provides it with feedback indicating whether the LLM's choice is correct. In this case, the user's feedback is always based on the LLM's prediction. We first experiment with an alternative non-interactive setting, where the context for the assistant includes all previous rounds and the option chosen by the assistant in these context rounds is always correct, a setting that better approximates the standard few-shot or in-context learning setup for LLMs (Brown et al. [2020]; see Table 10 for an example). While performance on direct prediction remains similar, we observe a performance drop when evaluating with predictions derived from the LLM's beliefs (Fig. 9, "Non-interactive").
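A rough sketch of how such a non-interactive context could be assembled is shown below; the prompt wording and data layout are our own assumptions, and the paper's actual template is in its Table 10.

```python
def build_noninteractive_context(rounds, query):
    """Hypothetical prompt builder for the non-interactive setting: every
    in-context round shows the correct option, mimicking standard few-shot
    prompting rather than feedback on the model's own predictions."""
    lines = []
    for r in rounds:
        lines.append(f"Flight options: {r['options']}")
        lines.append(f"Best flight for this user: {r['correct']}")
    lines.append(f"Flight options: {query}")
    lines.append("Best flight for this user:")
    return "\n".join(lines)

rounds = [{"options": "A, B, C", "correct": "B"},
          {"options": "D, E, F", "correct": "F"}]
print(build_noninteractive_context(rounds, "G, H, I"))
```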
Chain-of-thought (CoT) prompting [Wei et al., 2022b, Nye et al., 2021, Kojima et al., 2022], which encourages the model to generate step-by-step reasoning chains, has been shown to be effective on many reasoning tasks. We evaluate the LLM using this strategy by explicitly including reasoning hints and the phrase "Let's think step by step" in the instruction (see Table 8 for an example prompt). We find that LLMs prompted with CoT perform similarly to those prompted in the way described in the main text (Fig. 9, "+ CoT").
Since inferring the user's preferences from current information before making recommendations is crucial in our task, we further evaluate another CoT-style two-stage prompting method, where we allow the LLM to explicitly reason over the posterior distribution over reward functions. Specifically, we verbalize the LLM's reward posterior distribution in natural language and add it to the LLM's context (see Table 9 for an example). Explicitly encouraging the LLM to reason over its own reward posterior distribution improves the predictions derived from its verbalized beliefs; direct prediction accuracy, however, remains similar (Fig. 9, "+ LLM Posterior").
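As a rough illustration of this two-stage method, the helper below renders a posterior over reward functions as text that can be appended to the LLM's context; the function, its output format, and the example posterior are all hypothetical (the paper's actual template is in its Table 9).

```python
def verbalize_posterior(posterior, features, top_k=3):
    """Hypothetical helper: render the reward posterior as natural language
    for the '+ LLM Posterior' setting."""
    lines = []
    for reward_fn, prob in sorted(posterior.items(), key=lambda kv: -kv[1])[:top_k]:
        desc = ", ".join(f"{feat} weight {w:+.1f}"
                         for feat, w in zip(features, reward_fn))
        lines.append(f"- With probability {prob:.0%}, the user's preferences are: {desc}.")
    return "Your current beliefs about the user's preferences:\n" + "\n".join(lines)

posterior = {(-1.0, 0.0, -0.5, -1.0): 0.4, (0.0, 0.0, -1.0, -1.0): 0.25}
features = ["departure time", "duration", "number of stops", "price"]
print(verbalize_posterior(posterior, features))
```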
Additional prompt engineering and advanced prompting techniques could potentially yield different results; in particular, some prompts may more effectively extract the model's beliefs than the ones we used. For the moment, however, our preliminary findings suggest that it is challenging to significantly improve the LLM's performance through prompting alone.
#### The LLMs' poor performance is not due to an inability to parse the flight representations.
Our main experiments use a representation that deterministically maps the feature value of each flight to a textual description (e.g., the departure time may be 02:00 PM and the duration 2 hr 30 min). While this textual representation is closer to realistic scenarios, and may therefore better align with the LLM's training distribution, this setup introduces a potential confounder that complicates the interpretation of our results: the LLM's poor performance on the flight recommendation task could be due to its inability to translate the text description into the feature space required for probabilistic reasoning. To control for this factor, we investigate an alternative numerical representation of the flight options, where we directly provide the LLM with numerical feature values in the same format we provide them to the Bayesian Assistant (e.g., the duration may be 0.9 instead of 16 hr 6 min; see Table 5 and Table 6 for examples). We find that, if anything, the textual representation outperforms its numerical counterpart on both accuracy metrics (Fig. 9). This suggests that the LLM's poor performance cannot be attributed to an inability to parse the textual format into numerical values.
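For intuition, a deterministic textual mapping of this kind might look like the sketch below; the scaling constant and formatting are our own assumptions, not the mapping actually used in the paper.

```python
def duration_to_text(norm_duration, max_hours=18.0):
    """Hypothetical mapping from a normalized duration in [0, 1] to a textual
    description; the paper's actual scaling constant may differ."""
    total_min = round(norm_duration * max_hours * 60)
    hours, minutes = divmod(total_min, 60)
    return f"{hours} hr {minutes} min"

print(duration_to_text(0.25))  # "4 hr 30 min" under these assumptions
```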
#### Increasing the number of interactions does not improve performance.
Our previous experiments include only five rounds of interactions between the user and the LLM. To investigate the possibility that LLMs do in fact extract information from the interaction and update their beliefs, but do so more slowly than the Bayesian Assistant, we increase the number of interactions to 30. We find that Gemma Original still shows similar performance; if anything, its performance is slightly worse compared to our main experiments (Fig. 9). This indicates that simply increasing the number of interactions is unlikely to significantly improve the LLM's performance.
#### Assessing the LLM's beliefs: Scoring continuations vs. explicit probability judgments.
In the main experiment, for the open-weights LLMs where we have access to the probabilities the LLM assigns to upcoming words, we estimate the LLM's distribution over reward functions by asking it to rate individual features and scoring the possible continuations; for flight duration, for example, we might ask it what the user's preference is on a scale of 1 to 5. We refer to this method as "scoring". Here, we compare this method to one where we instruct the LLM to assign a probability to each of the five ratings on each scale; we refer to this method as "generation" (see Table 7 for an example). The generation method is also used for experiments with the closed-weights models, as we do not have access to these LLMs' probabilities. As in the scoring method, we renormalize the probabilities to ensure that they sum to 1 (although we find that this step is typically not necessary, as they already sum to 1). Overall, we find that the scoring-based reward distribution, which we use in the main text for the open-weights models, is closer than the generation-based one to the ground-truth distribution (Fig. 9; for related results, see Hu and Levy [2023]).
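A minimal sketch of the scoring method, assuming a Hugging Face causal LM and single-token ratings, is shown below; the function name and prompt handling are illustrative rather than the paper's code.

```python
import torch

def elicit_rating_distribution(model, tokenizer, prompt):
    """Sketch of the 'scoring' elicitation: score the continuations '1'..'5'
    by their next-token probabilities, then renormalize so the five ratings
    sum to 1."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = next_token_logits.softmax(dim=-1)
    rating_ids = [tokenizer.encode(str(r), add_special_tokens=False)[0]
                  for r in range(1, 6)]
    scores = probs[rating_ids]
    return (scores / scores.sum()).tolist()  # renormalized over the 5 ratings
```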
#### Can the LLM make recommendations given the user's preferences?
The flight recommendation task requires two capabilities: inferring the user's preferences and making predictions based on these preferences. We previously showed that the original LLM performs poorly at inferring the user's preferences (Fig. 8). Here, we investigate its performance on the latter capability. Specifically, we provide the LLM with the verbalized reward posterior distribution from the normative Bayesian model (see Table 9 for an example). In this case, the LLM only needs to make recommendations based on the provided preferences. We find that having access to the optimal reward posterior distribution improves belief-based accuracy; the direct prediction accuracy, however, remains similar (Fig. 9). Although our method of presenting the user's preferences to the LLM may not be optimal, these results suggest that the LLM struggles to make correct recommendations even when the user's preferences are explicitly provided.
#### Types of LLMs: Instructed model vs. pre-trained base model.
We use instruction-tuned LLMs for our main experiments. As these models are trained with an additional post-training alignment stage, their behavior is likely to differ from that of their pre-trained base model counterparts [Lin et al., 2024, Yang et al., 2024b, Wang et al., 2024, Kotha et al., 2024]. Because we expect instruction tuning to improve the models' interactive capabilities, however, we hypothesize that the base version of Gemma would not perform better than the instruction-tuned one. As base models are not well suited to interactive evaluation, we evaluate them in the non-interactive setting by providing them with in-context examples (see earlier in this section). We find that the base model performs comparably to the instruction-tuned one (Fig. 9); we omit the results for Llama 3 and Qwen 2.5, which were similar.
### C.2 Modifications to Training Setup
This supplementary section describes variants of the methods we used to fine-tune the LLMs on interactions with users. We only explore these variants for Bayesian teaching, which was consistently more effective than oracle teaching. We use Gemma 2 9B for all of the follow-up experiments reported in this section.
Figure 10: Final-round accuracy of LLMs fine-tuned with different training strategies on the flight recommendation task. We use Bayesian teaching (i.e., users' interactions with the Bayesian Assistant) for all experiments. (a) Comparison of training objectives: supervised fine-tuning (SFT) vs. direct preference optimization (DPO). (b) Fine-tuning methods: full fine-tuning, which updates all model parameters, vs. LoRA fine-tuning, a parameter-efficient method that updates only a subset of the parameters. (c) Training only on interactions between users and assistants, as in our other experiments, compared to training on the Bayesian Assistant's estimate of the user's preferences, as well as training on both the interactions and the estimated preferences. Error bars show standard errors across three random seeds and three training runs.
#### Training objective: Supervised fine-tuning vs. Direct preference optimization.
In most of our experiments, we use supervised fine-tuning (SFT) to train the LLM on the oracle and Bayesian predictions. In this method, the LLM is trained to predict the upcoming token in the interaction, the same objective used during pre-training. Here, we examine the utility of reinforcement learning from human feedback (RLHF; Christiano et al. [2017], Ouyang et al. [2022], Stiennon et al. [2020]), another common practice for adapting LLMs' behavior following pre-training, in which the LLM is instead provided with an explicit signal indicating whether an output is preferable. In particular, we use direct preference optimization (DPO; Rafailov et al. [2023]), where the model is trained to assign higher probability to the preferred response than to the less preferred one. We investigate the DPO training objective by treating the Bayesian Assistant's prediction as the preferred response and a different, random recommendation as the less preferred one. We train the model with the DPO objective with a learning rate of 2e-6 and $\beta=0.1$. We find that training on Bayesian predictions works comparably well with the SFT (used in the main experiments) and DPO objectives (Fig. 10), indicating that the approach is robust to the choice of training objective.
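For reference, a minimal sketch of the DPO objective as applied here is given below; the sequence log-probabilities are assumed to be precomputed, and the function is illustrative rather than the training code actually used.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO objective (Rafailov et al., 2023). Here the 'chosen' response is
    the Bayesian Assistant's recommendation and the 'rejected' one a random
    alternative; inputs are summed token log-probabilities per sequence."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference
    rejected_margin = logp_rejected - ref_logp_rejected
    # Push the chosen margin above the rejected margin.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```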
#### Full vs. parameter-efficient fine-tuning.
In our main experiments, we update all model parameters. As this approach becomes less feasible as the model size grows, a common strategy to improve training efficiency relies on parameter-efficient fine-tuning, where only a subset of parameters is updated. We evaluate this approach using Low-Rank Adaptation (LoRA; Hu et al. [2022]), a method that injects trainable rank-decomposition matrices while keeping the original model weights frozen. We perform LoRA fine-tuning with a learning rate of 2e-5. While LoRA fine-tuning performs slightly worse than full fine-tuning (Fig. 10), it achieves comparable performance at a significantly reduced training cost. This demonstrates that our fine-tuning strategy can be effectively applied in computationally efficient settings, which is particularly beneficial for larger LLMs.
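A sketch of this setup with the peft library is below; the paper reports only the learning rate (2e-5), so the rank, alpha, and target modules here are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
lora_config = LoraConfig(
    r=16,                  # rank of the trainable decomposition (assumed)
    lora_alpha=32,         # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```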
#### Providing Bayesian preference estimates in fine-tuning.
We have shown in the main text that fine-tuning the LLMs to make better recommendations also significantly improves their ability to infer the user's preferences, even though their supervision does not explicitly include the user's preferences. Here, we investigate a complementary setup, where we explicitly train the model to match the Bayesian Assistant's estimates of the user's preferences, but not to make flight recommendations. The Bayesian Assistant produces a posterior probability distribution over all reward functions after each round; we select the reward function with the highest posterior probability and provide it to the LLM, formatted as in Table 4. We find that, like training on interactions, providing the user's preferences as a fine-tuning signal improves both accuracy measures compared to Gemma Original, but the gain in direct prediction accuracy is smaller than when we fine-tune on interactions (Fig. 10). We also explore a setting where both the interactions and the preferences are provided during training; this setting leads to the best performance on both metrics, with accuracy approaching that of the Bayesian Assistant.
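Selecting the training target from the Assistant's posterior is straightforward; a small sketch with our own names and toy data follows.

```python
import numpy as np

def map_preference_target(reward_functions, posterior_probs):
    """Pick the reward function with the highest posterior probability under
    the Bayesian Assistant; this estimate is then verbalized (as in the
    paper's Table 4) and used as an additional fine-tuning target."""
    return reward_functions[int(np.argmax(posterior_probs))]

rewards = np.array([[-1.0, 0.0, -0.5, -1.0],
                    [0.0, 0.0, -1.0, -1.0]])
print(map_preference_target(rewards, np.array([0.3, 0.7])))  # second function
```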
## Appendix D Additional Analyses
### D.1 LLM Priors
Figure 11: Priors of Gemma 2 9B Original and Gemini 1.5 Pro for each flight feature. We obtain these priors via the prompting-based elicitation method (Supplementary B). A rating of 1 indicates the strongest preference for the earliest departure time, the shortest duration, the fewest stops, and the lowest price, while a rating of 5 indicates the opposite. A rating of 3 indicates no preference.
In the section Generalization to interactions with human users, we find that the original LLMs, before fine-tuning, were able to provide recommendations with an accuracy substantially higher than chance even before their first interaction with the user, suggesting that the LLMs' priors are aligned with human preferences. In this section, we test this hypothesis by asking two models, Gemma 2 and Gemini 1.5, for their verbalized beliefs in advance of any interaction with a particular user. Fig. 11 shows the results. For Gemma 2 9B, the hypothesis is only partly supported: the prior derived from this model assigns a high probability to "no preference" for most of the features, with the exception of the number of stops, where it reflects a moderate preference for fewer stops. By contrast, Gemini 1.5 Pro has a more diffuse prior over these features, which favors cheaper and shorter flights, as well as flights that leave earlier in the day, plausibly reflecting the preferences of most flyers. We note that the interpretation of this pattern of results is complicated by the fact that Gemma's verbalized prior beliefs may not faithfully reflect the underlying prior it uses to make recommendations before having interacted with a user.
Figure 12: Variability across simulated users. We show accuracy after the first and final (fifth) rounds. (a) We compare the original LLMs, fine-tuned LLMs, and the upper bound (the Bayesian Assistant) on flight recommendation. (b) Comparison of LLMs and the upper bound (the Bayesian Assistant) on hotel recommendation. (c) Comparison of LLMs and the upper bound (LLMs fine-tuned directly on the task) for web shopping. Error bars indicate the standard deviation across reward functions (for flight and hotel recommendations) or product categories (for web shopping).
### D.2 Variability in LLM Accuracy Across Simulated Users
In our main experiments, we show results averaged over all simulated users. Here, we explore how the LLM's accuracy varies by user. As before, for flight and hotel recommendations, each user is characterized by a reward function. For web shopping, we have 10 users with different goals (i.e., preferred attributes) for each category; we average their performance and compute the standard deviation across 100 product categories (see Table 1 for examples). All methods exhibit high variance, as shown in Fig. 12.
Table 1: Example product categories and the corresponding goals of different users.
| Product Category | User's Goals (Preferred Attributes) |
| --- | --- |
| Beds | eco friendly, twin with drawers |
| | wood frame, easy assemble, twin |
| | memory foam, solid wood |
| Men's athletic shoes | running shoes, lace up |
| | non slip, mesh |
| | daily wear, color back, size 14 |
| Food & beverage | simple ingredients |
| | gluten free |
| | low sodium |
Figure 13: Variability across reward functions over rounds. Error bars indicate standard deviation across reward functions.
We additionally show results over rounds in Fig. 13. We find that both the original LLMs and the Bayesian Assistant display high variance across reward functions. While the variance of the Bayesian Assistant decreases as the number of interactions increases, as does that of the fine-tuned LLMs, the variance of the original LLMs remains largely constant across interactions. Notably, Gemma Bayesian has lower variance while maintaining performance similar to that of the Bayesian Assistant.
In particular, we hypothesize that reward functions that deviate more strongly from the LLM's prior (Supplementary Section D.1) may be harder to infer. For example, the LLM may assume most people prefer shorter flights over longer ones, making it more difficult to infer the preferences of an "abnormal" user who prefers longer flights. To test the hypothesis that the variability across reward functions is due in part to the prior, we fit linear regression models predicting a reward function's final-round accuracy from its L2 distance to the mean of the prior reward distribution, focusing on Gemma in this experiment. We elicit the priors separately for Gemma Original, Gemma Bayesian, and Gemma Oracle. The prior of the Bayesian Assistant is uniform, as before. Before computing distances, we normalize the reward functions (divide them by their sum) to account for the fact that some functions are equivalent; for example, the reward function $[-1,-1,-1,-1]$ is equivalent to $[-0.5,-0.5,-0.5,-0.5]$, as both will always lead the user to prefer the same flights.
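A sketch of this analysis, with our own variable names, is given below; it assumes each reward function has a nonzero sum, as in the paper's normalization.

```python
import numpy as np
from scipy.stats import linregress

def prior_distance_regression(reward_fns, final_acc, prior_mean):
    """Normalize each reward function by its sum, compute its L2 distance to
    the mean of the prior reward distribution, and regress final-round
    accuracy on that distance."""
    normalized = reward_fns / reward_fns.sum(axis=1, keepdims=True)
    distances = np.linalg.norm(normalized - prior_mean, axis=1)
    fit = linregress(distances, final_acc)
    return fit.slope, fit.pvalue  # the coefficient c and its significance
```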
In line with this hypothesis, we find negative regression coefficients for Gemma Original, indicating that it performs worse when the reward function deviates from its prior (Fig. 14). The absolute coefficients for the Bayesian Assistant and Gemma Bayesian are similar to each other, and much smaller than that of Gemma Original. For all three of these models, the effect of L2 distance from the prior on final-round accuracy is significant (p $<$ 0.001). Gemma Oracle does not show sensitivity to this distance (p = 0.24).
<details>
<summary>x15.png Details</summary>

### Visual Description
## Scatter Plots: Final-round Accuracy vs. L2 Distance from Prior Mean
### Overview
The image presents four scatter plots, each representing a different model: Gemma Original, Gemma Oracle, Gemma Bayesian, and Bayesian Assistant. Each plot visualizes the relationship between "Final-round Accuracy (%)" on the y-axis and "L2 Distance from Prior Mean" on the x-axis. A dashed line is fitted to each scatter plot, and the correlation coefficient 'c' is displayed for each model.
### Components/Axes
* **X-axis (all plots):** L2 Distance from Prior Mean (Scale: 0.0 to 2.0, approximately)
* **Y-axis (all plots):** Final-round Accuracy (%) (Scale: 0 to 100, approximately)
* **Plots (from left to right):**
* Gemma Original (Data points are blue)
* Gemma Oracle (Data points are light orange)
* Gemma Bayesian (Data points are orange)
* Bayesian Assistant (Data points are light brown/grey)
* **Correlation Line (all plots):** Dashed black line representing the linear correlation.
* **Correlation Coefficient (all plots):** Value 'c' displayed near the bottom-right of each plot.
### Detailed Analysis or Content Details
**1. Gemma Original (Blue):**
* Trend: The data points show a weak negative correlation. As the L2 Distance from Prior Mean increases, the Final-round Accuracy tends to decrease slightly.
* Correlation Coefficient (c): -10.46
* Data Distribution: Points are scattered, with a concentration of points between 20% and 80% accuracy, and L2 distances ranging from 0.0 to 2.0.
**2. Gemma Oracle (Light Orange):**
* Trend: The data points show a weak positive correlation. As the L2 Distance from Prior Mean increases, the Final-round Accuracy tends to increase slightly.
* Correlation Coefficient (c): 0.58
* Data Distribution: Points are more spread out than Gemma Original, with a wider range of accuracy values (from approximately 20% to 95%) and L2 distances (0.0 to 2.0).
**3. Gemma Bayesian (Orange):**
* Trend: The data points show a strong positive correlation. As the L2 Distance from Prior Mean increases, the Final-round Accuracy tends to increase significantly.
* Correlation Coefficient (c): 1.48
* Data Distribution: Points are clustered along a diagonal line, indicating a strong relationship between the two variables. Accuracy values range from approximately 40% to 100%, and L2 distances range from 0.0 to 2.0.
**4. Bayesian Assistant (Light Brown/Grey):**
* Trend: The data points show a weak positive correlation. As the L2 Distance from Prior Mean increases, the Final-round Accuracy tends to increase slightly.
* Correlation Coefficient (c): 1.01
* Data Distribution: Points are widely scattered, with a concentration of points between 60% and 90% accuracy, and L2 distances ranging from 0.0 to 2.0.
### Key Observations
* Gemma Bayesian exhibits the strongest positive correlation between L2 Distance from Prior Mean and Final-round Accuracy.
* Gemma Original exhibits a negative correlation, suggesting that greater distance from the prior mean leads to lower accuracy.
* Gemma Oracle and Bayesian Assistant show weak positive correlations.
* The correlation coefficients vary significantly across the models, indicating different relationships between the two variables.
### Interpretation
The plots demonstrate how different models respond to variations in their distance from a prior mean. A positive correlation suggests that moving further from the prior mean leads to higher accuracy, potentially indicating that the model is learning and adapting. A negative correlation suggests the opposite, that deviating from the prior mean degrades performance. The strength of the correlation, as indicated by the 'c' value, quantifies the degree of this relationship.
The Gemma Bayesian model's strong positive correlation suggests it benefits significantly from exploring solutions further from its initial prior. Conversely, the Gemma Original model's negative correlation suggests it performs best when staying close to its prior. The Gemma Oracle and Bayesian Assistant models show more moderate responses.
These differences could be due to variations in model architecture, training data, or the specific task being performed. The plots provide valuable insights into the behavior of each model and how they leverage prior knowledge during the learning process. The 'c' values are not standard correlation coefficients (bounded between -1 and 1); as the figure caption notes, each is the slope of a linear regression predicting accuracy from L2 distance.
</details>
Figure 14: The relationship between the final-round accuracy and the normalized L2 distance to the mean of the prior reward distribution (1000 randomly sampled points for readability). $c$ refers to the coefficient in a linear regression predicting accuracy from L2 distance. The impact of L2 distance on final-round accuracy is significant (p $<$ 0.001) for Gemma Original, Gemma Bayesian, and Bayesian Assistant, but not for Gemma Oracle (p = 0.24).
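For concreteness, the regression behind Fig. 14 can be sketched as follows. This is a minimal illustration with placeholder arrays standing in for the reward vectors and per-user accuracies; the normalization of the L2 distance is an assumption for the sketch.

```python
# Hypothetical sketch of the Fig. 14 analysis: regress final-round accuracy
# on each user's normalized L2 distance from the prior mean reward vector.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
thetas = rng.normal(size=(1000, 4))        # placeholder reward vectors
accuracy = rng.uniform(0, 100, size=1000)  # placeholder final-round accuracies (%)

prior_mean = thetas.mean(axis=0)           # mean of the prior reward distribution
dists = np.linalg.norm(thetas - prior_mean, axis=1)
dists /= dists.max()                       # illustrative normalization to [0, 1]

res = stats.linregress(dists, accuracy)    # c in Fig. 14 is the fitted slope
print(f"c = {res.slope:.2f}, p = {res.pvalue:.3g}")
```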
### D.3 Interacting with Non-deterministic Users
Our main experiments assume the simulated user always makes decisions that are consistent with its reward function. By contrast, as we show in the section Generalization to interactions with human users, humans may behave inconsistently with their stated preferences. To simulate this real-world stochasticity, we evaluate a setting where the LLM interacts with a non-deterministic user. We add noise to the user's behavior, such that with a certain probability they select a non-optimal choice, that is, a choice that does not maximize their reward. The relationship between the percentage of noise and final-round accuracy is shown in Fig. 15. We experiment with the three variants of Gemma, and with the Bayesian Assistant. As expected, performance decreases across the board as the amount of noise increases. For realistic noise values in the 10–60% range, we find that Gemma Bayesian is more robust to noise than not only Gemma Original and Gemma Oracle, but also the Bayesian Assistant, which is the best model in the noiseless setting. This robustness to noise illustrates an advantage of an LLM fine-tuned to mimic a symbolic model compared to the original symbolic model (see Discussion).
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Final-Round Accuracy vs. Noise Percentage
### Overview
This line chart depicts the relationship between noise percentage and final-round accuracy for four different models: Gemma Original, Gemma Oracle, Gemma Bayesian, and Bayesian Assistant. The chart illustrates how each model's accuracy degrades as the level of noise increases.
### Components/Axes
* **X-axis:** Noise (%) - Ranges from 0% to 100%, with markers at 0, 20, 40, 60, 80, and 100.
* **Y-axis:** Final-round Accuracy (%) - Ranges from 0% to 100%, with markers at 0, 20, 40, 60, 80, and 100.
* **Legend:** Located in the top-right corner, identifying each line with a color and label:
* Gemma Original (Blue)
* Gemma Oracle (Yellow/Orange)
* Gemma Bayesian (Orange)
* Bayesian Assistant (Gray)
### Detailed Analysis
* **Gemma Original (Blue):** The line is relatively flat, indicating minimal change in accuracy with increasing noise.
* At 0% Noise: Approximately 38% Accuracy.
* At 20% Noise: Approximately 38% Accuracy.
* At 40% Noise: Approximately 38% Accuracy.
* At 60% Noise: Approximately 36% Accuracy.
* At 80% Noise: Approximately 34% Accuracy.
* At 100% Noise: Approximately 32% Accuracy.
* **Gemma Oracle (Yellow/Orange):** The line slopes downward, showing a decrease in accuracy as noise increases.
* At 0% Noise: Approximately 82% Accuracy.
* At 20% Noise: Approximately 72% Accuracy.
* At 40% Noise: Approximately 62% Accuracy.
* At 60% Noise: Approximately 52% Accuracy.
* At 80% Noise: Approximately 42% Accuracy.
* At 100% Noise: Approximately 32% Accuracy.
* **Gemma Bayesian (Orange):** The line exhibits a steeper downward slope than Gemma Oracle, indicating a more significant decrease in accuracy with increasing noise.
* At 0% Noise: Approximately 78% Accuracy.
* At 20% Noise: Approximately 68% Accuracy.
* At 40% Noise: Approximately 58% Accuracy.
* At 60% Noise: Approximately 48% Accuracy.
* At 80% Noise: Approximately 38% Accuracy.
* At 100% Noise: Approximately 20% Accuracy.
* **Bayesian Assistant (Gray):** The line is between Gemma Original and Gemma Oracle, showing a moderate decrease in accuracy with increasing noise.
* At 0% Noise: Approximately 62% Accuracy.
* At 20% Noise: Approximately 58% Accuracy.
* At 40% Noise: Approximately 52% Accuracy.
* At 60% Noise: Approximately 46% Accuracy.
* At 80% Noise: Approximately 38% Accuracy.
* At 100% Noise: Approximately 30% Accuracy.
### Key Observations
* Gemma Original maintains a relatively stable accuracy level across all noise percentages, though that level remains close to chance.
* Gemma Bayesian is the most sensitive to noise, experiencing the largest drop in accuracy as noise increases.
* Gemma Oracle and Bayesian Assistant fall between these two extremes, exhibiting moderate sensitivity to noise.
* All models show a decline in accuracy as noise increases, but the rate of decline varies significantly.
### Interpretation
The chart suggests that Gemma Original's accuracy is the least affected by noise, though only because it remains near chance at every noise level, while Gemma Bayesian shows the largest absolute drop. This could be due to differences in the models' training methodologies. The data indicates that the fine-tuned Bayesian approach, while offering much higher accuracy in clean environments (as seen at 0% noise), degrades as noise increases, yet retains a clear advantage over Gemma Original across moderate noise levels. The consistent downward trend across all models highlights the general principle that noise negatively impacts accuracy, and the differences in the slopes of the lines reveal the varying degrees of robustness among the models.
</details>
Figure 15: Final-round accuracy when interacting with a noisy simulated user. We add noise to the simulated user's choice such that with some probability the user chooses an option that is different from the one that maximizes its reward. We plot final-round accuracy with respect to the amount of noise. While all models show a decrease in performance as noise increases, Gemma Bayesian demonstrates greater robustness for noise levels between 10% and 60%. Error bars (too small to be visible in the plot) show standard errors across three random seeds (and three training runs).
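As a minimal sketch, the noisy simulated user described above can be implemented as follows; the function and variable names are illustrative, not taken from our pipeline.

```python
import random

def noisy_user_choice(rewards: list[float], noise: float, rng: random.Random) -> int:
    """Index of the user's chosen option among the presented set."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    if rng.random() < noise:
        # With probability `noise`, pick a non-optimal option at random.
        return rng.choice([i for i in range(len(rewards)) if i != best])
    return best

rng = random.Random(0)
choices = [noisy_user_choice([0.2, 0.9, 0.5], noise=0.4, rng=rng) for _ in range(5)]
```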
### D.4 What Makes Bayesian Teaching Effective?
We have shown that it is more effective to fine-tune our LLMs on the Bayesian Assistant's predictions than on the user's true choices. In this section, we explore and rule out two deflationary hypotheses that might explain the effectiveness of this method, and tentatively conclude that the advantage of Bayesian teaching is in fact due to the probabilistically optimal predictions made by the Bayesian Assistant. In all of the experiments described in this section, we focus on Gemma 2 9B. We use the same list of flight option sets for all models, and vary only the supervision we provide during fine-tuning (i.e., the assistant's recommendations).
<details>
<summary>x17.png Details</summary>

### Visual Description
## Bar Charts: Accuracy of Prediction Models
### Overview
The image presents two bar charts (labeled 'a' and 'b') comparing the accuracy of different prediction models under varying conditions. Chart 'a' focuses on "Varying Incorrect Predictions," while chart 'b' focuses on "Varying Priors." Both charts measure accuracy as a percentage, ranging from 0% to 100%, on the y-axis. The x-axis represents different models or configurations. Each model is evaluated after the first round and after a final round of processing. A "Random" accuracy baseline is also provided.
### Components/Axes
**Common Elements:**
* **Y-axis:** Accuracy (%) - Scale ranges from 0 to 100, with increments of 20.
* **Legend:**
* "After 1st Round" (Blue)
* "Final Round" (Orange)
* "Random" (Gray dashed line)
* **X-axis:** Model/Configuration names.
**Chart a (Varying Incorrect Predictions):**
* **X-axis Labels:** Gemma Original, Gemma Bayesian, Gemma Oracle with Noise, Bayesian Assistant.
**Chart b (Varying Priors):**
* **X-axis Labels:** Gemma Bayesian (LLM-based), Gemma Bayesian (Uniform), Gemma Bayesian (LLM-opposite), Bayesian Assistant.
### Detailed Analysis or Content Details
**Chart a (Varying Incorrect Predictions):**
* **Gemma Original:** After 1st Round: Approximately 37%. Final Round: Approximately 37%.
* **Gemma Bayesian:** After 1st Round: Approximately 57%. Final Round: Approximately 76%.
* **Gemma Oracle with Noise:** After 1st Round: Approximately 40% with an error bar extending roughly +/- 5%. Final Round: Approximately 45% with an error bar extending roughly +/- 5%.
* **Bayesian Assistant:** After 1st Round: Approximately 58%. Final Round: Approximately 81%.
* **Random:** Approximately 37% (represented by a dashed gray line).
**Chart b (Varying Priors):**
* **Gemma Bayesian (LLM-based):** After 1st Round: Approximately 51%. Final Round: Approximately 71%.
* **Gemma Bayesian (Uniform):** After 1st Round: Approximately 57%. Final Round: Approximately 76%.
* **Gemma Bayesian (LLM-opposite):** After 1st Round: Approximately 50%. Final Round: Approximately 66%.
* **Bayesian Assistant:** After 1st Round: Approximately 58%. Final Round: Approximately 81%.
* **Random:** Approximately 37% (represented by a dashed gray line).
### Key Observations
* In both charts, the "Bayesian Assistant" consistently achieves the highest accuracy in the final round, reaching approximately 81%.
* The "Random" baseline accuracy is around 37% in both charts.
* In Chart a, the "Gemma Original" model shows no improvement between the first and final rounds.
* In Chart a, the "Gemma Bayesian" model shows a significant improvement from the first to the final round.
* In Chart b, all models show improvement from the first to the final round.
* The error bars on "Gemma Oracle with Noise" in Chart a indicate a higher degree of uncertainty in its performance.
### Interpretation
The data suggests that incorporating Bayesian methods, particularly when used with an assistant model, significantly improves prediction accuracy. The "Bayesian Assistant" consistently outperforms other models across both experimental setups.
The difference between Chart a and Chart b highlights the impact of the type of incorrect predictions and priors used. Chart a, focusing on varying incorrect predictions, shows that the "Gemma Original" model is unable to learn from its mistakes, while the "Gemma Bayesian" model benefits from the Bayesian approach. Chart b, focusing on varying priors, demonstrates that the choice of prior (LLM-based, Uniform, LLM-opposite) influences performance, but the Bayesian Assistant remains the most robust.
The consistent performance of the "Random" baseline suggests a lower bound on achievable accuracy. The error bars on the "Gemma Oracle with Noise" model indicate that adding noise can introduce uncertainty, potentially hindering performance. The overall trend indicates that Bayesian approaches are effective in improving prediction accuracy, especially when combined with an assistant model capable of leveraging prior knowledge and learning from errors.
</details>
Figure 16: Final-round accuracy of LLMs fine-tuned with different data variants. (a) Accuracy of the model using Bayesian teaching and the model using oracle teaching with random noise. (b) Accuracy of models fine-tuned on predictions from variants of the Bayesian Assistant, initialized with different priors. Error bars show standard errors across three random seeds (and three training runs).
#### Hypothesis: Incorrect predictions regularize training.
The Bayesian Assistant can make incorrect predictions, especially in the first few rounds, because it has only limited information about the user (see the Bayesian Assistant's accuracy over rounds in Fig. 24). Could these incorrect predictions regularize training and prevent overfitting, accounting for the effectiveness of Bayesian teaching? To test this hypothesis, we fine-tune the LLM using oracle teaching injected with random noise: 40% of the time, instead of predicting the user's choice, the assistant recommends one of the incorrect options at random. The proportion of incorrect predictions in this control roughly matches that of the Bayesian predictions averaged across all five interactions. Contrary to the regularization hypothesis, we find that incorrect predictions do not necessarily improve performance: the model fine-tuned on the noisy user choices (Gemma Oracle with Noise) barely outperforms the original LLM and has high standard error (Fig. 16). This suggests that random noise alone cannot explain why Bayesian predictions are more effective; rather, the Bayesian Assistant's educated mistakes are more valuable than random errors.
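For concreteness, the supervision used in this control can be sketched as follows; the 40% rate matches the Bayesian Assistant's average error rate, while the helper name is illustrative.

```python
import random

NOISE_RATE = 0.4  # roughly the Bayesian Assistant's average error rate

def noisy_oracle_target(true_choice: int, n_options: int, rng: random.Random) -> int:
    """Fine-tuning target for oracle teaching with injected random noise."""
    if rng.random() < NOISE_RATE:
        # Recommend one of the incorrect options at random.
        return rng.choice([i for i in range(n_options) if i != true_choice])
    return true_choice  # otherwise, the user's actual choice

rng = random.Random(0)
targets = [noisy_oracle_target(0, 3, rng) for _ in range(10)]
```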
#### Hypothesis: The LLM benefits from the correct prior.
We initialize the Bayesian Assistant using the uniform prior, which assigns equal probability to all reward functions, and therefore aligns with the data generation process of our evaluation. One hypothesis is that the LLM benefits from this correct prior (in the sense that it is calibrated to the distribution of simulated users in our experiment), which makes the predictions of the Bayesian Assistant more effective for supervised fine-tuning.
To test this hypothesis, we fine-tune Gemma three times, using the predictions of three variants of the Bayesian Assistant, initialized with three different priors: the uniform prior, the LLM-based prior obtained from Gemma Original (see Fig. 11), and the prior that is contrary to the LLM-based one (e.g., if Gemma's prior favors cheaper flights, this prior would instead prefer more expensive flights). The results are shown in Fig. 16. LLMs fine-tuned on predictions from all three Bayesian models perform very well and dramatically better than the original LLM. The choice of prior does influence the performance of the fine-tuned LLMs. The model fine-tuned on Bayesian predictions using the uniform prior, which matches the distribution of users in our sample, achieves the best accuracy. The LLM-based prior, despite being biased and spiky, leads to accuracy that is only slightly worse. The LLM-opposite prior, which is both biased and mismatched with the LLM's beliefs, leads to a more significant performance drop. That being said, the vast gap between all three LLMs fine-tuned on Bayesian predictions and Gemma Original suggests that the correct prior alone does not fully explain the effectiveness of Bayesian teaching.
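As an illustration, one way to construct the LLM-opposite prior is to assign each reward function the probability that the LLM-based prior assigns to its negation; this particular construction is an assumption for the sketch below, not a specification of our implementation.

```python
import numpy as np

def opposite_prior(thetas: np.ndarray, llm_prior: np.ndarray) -> np.ndarray:
    """Hypothetical prior that reverses the preferences of an LLM-based prior.

    thetas: (n, d) array of candidate reward functions.
    llm_prior: (n,) probabilities assigned by the LLM-based prior.
    """
    # dists[i, j] = distance between theta_j and -theta_i, so argmin over j
    # finds the candidate closest to each reward function's negation.
    dists = np.linalg.norm(thetas[:, None, :] + thetas[None, :, :], axis=-1)
    nearest_negation = dists.argmin(axis=1)
    p = llm_prior[nearest_negation]  # probability of each function's "opposite"
    return p / p.sum()               # renormalize to a valid distribution
```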
### D.5 Qualitative Example
In Fig. 17, we show a qualitative example of the evolution of the reward distributions of Gemma Original and the Bayesian Assistant over interactions. In this case, since the user's true reward function differs significantly from the LLM's prior, both Gemma Original and the Bayesian Assistant perform poorly at the start of the interactions. However, while the Bayesian Assistant gradually converges toward the ground-truth reward function after a few rounds, Gemma Original continues to assign high probability to reward functions that are inconsistent with its observations.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Chart: LLM and Bayesian Probability vs. L2 Distance Rank
### Overview
The image presents a series of five histograms, arranged in two rows and five columns, comparing the probability distributions of Large Language Model (LLM) and Bayesian methods against the L2 Distance Rank across five rounds. Each histogram visualizes the probability (in percentage) of different L2 Distance Ranks, with a vertical line indicating the "Ground-truth" rank. The chart aims to compare the accuracy of LLM and Bayesian approaches in identifying correct answers based on their distance ranking.
### Components/Axes
* **X-axis:** L2 Distance Rank (ranging from 0 to 500, with increments of 100).
* **Y-axis:** Probability (%) (ranging from 0 to 100, with increments of 20).
* **Legend:**
* Ground-truth (represented by a dashed light blue vertical line)
* Incorrect (represented by a red histogram)
* Correct (represented by a green histogram)
* **Titles:** Each column is labeled with "Round 1", "Round 2", "Round 3", "Round 4", and "Round 5" respectively.
* **Rows:** The top row represents LLM Probability, and the bottom row represents Bayesian Probability.
### Detailed Analysis or Content Details
**Round 1:**
* **LLM Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 70-80%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
* **Bayesian Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
**Round 2:**
* **LLM Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
* **Bayesian Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
**Round 3:**
* **LLM Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
* **Bayesian Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
**Round 4:**
* **LLM Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
* **Bayesian Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
**Round 5:**
* **LLM Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
* **Bayesian Probability:** The "Incorrect" distribution (red) peaks around 400-500 with a probability of approximately 60-70%. The "Correct" distribution (green) peaks around 0-100 with a probability of approximately 80-90%. The Ground-truth line is at approximately 200.
Across all rounds, the "Correct" distribution consistently shows a higher probability at lower L2 Distance Ranks (0-100), while the "Incorrect" distribution peaks at higher ranks (400-500). The Ground-truth line consistently falls within the range of higher L2 Distance Ranks.
### Key Observations
* Both LLM and Bayesian methods show a similar pattern across all rounds.
* The "Correct" distribution is heavily skewed towards lower L2 Distance Ranks, indicating that lower-ranked options are more likely to be correct.
* The "Incorrect" distribution is heavily skewed towards higher L2 Distance Ranks, indicating that higher-ranked options are more likely to be incorrect.
* The Ground-truth rank consistently falls within the range of higher L2 Distance Ranks, suggesting that the correct answer is often not the closest one according to the L2 distance metric.
* There is minimal variation in the distributions across the five rounds.
### Interpretation
The data suggests that both LLM and Bayesian methods tend to favor options with lower L2 Distance Ranks, but the correct answer is often found at a higher rank. This implies that the L2 Distance metric may not be a reliable indicator of correctness in this context. The consistent pattern across all rounds suggests that this is not a random occurrence but a systematic bias. The high probability of "Correct" answers at low L2 Distance Ranks indicates that the methods are often correct when the closest option is the right one, but they struggle when the correct answer is further away in terms of L2 distance. This could be due to the nature of the problem, where the correct answer may require more nuanced reasoning that is not captured by the L2 distance metric. The lack of significant change across rounds suggests that the methods do not learn or adapt to this bias over time. Further investigation is needed to understand why the correct answer is often found at a higher L2 Distance Rank and to develop methods that can better identify correct answers in this scenario.
</details>
Figure 17: The reward distributions of Gemma Original (top) and the Bayesian Assistant (bottom) over multiple rounds. The reward functions are sorted by their normalized L2 distance from the ground-truth (GT) reward function, indicated by the blue dashed line at $x=0$. Red indicates that the reward function's prediction on the given options is incorrect, while green indicates that its prediction is correct.
## Appendix E Sensitivity to the Informativeness of Option Sets
In each round of the flight recommendation task, we present the model with a set of three flight options, and the user's choice among those options. The amount of information that can be gained through this process varies from round to round. For example, a choice between flight options that differ in exactly one feature can be more informative than a choice between options that differ along multiple dimensions: the minimal pair of options provides direct evidence for the user's preference for that particular feature. We expect a strong probabilistic reasoner to be sensitive to this factor: when the user's choice among a particular set of options provides more information about their preferences, we expect the system to update its beliefs more substantially.
In this section we test whether LLMs display this behavior. In contrast with the main experiments, where we sample the option sets randomly, here we sample them based on their informativeness. To measure the amount of information contained in a set of options $\mathcal{O}$ , we define the ground truth information gain as
$$
\begin{aligned}
g(\mathcal{O},o^{*},p(\bm{\theta}),q(\bm{\theta})) &= \mathrm{KL}(p(\bm{\theta})\,||\,q(\bm{\theta}))-\mathrm{KL}(p(\bm{\theta})\,||\,q(\bm{\theta}|\mathcal{O},o^{*})) \\
&= \log q(\bm{\theta}^{*}|\mathcal{O},o^{*})-\log q(\bm{\theta}^{*}),
\end{aligned} \tag{5}
$$
where $p(\bm{\theta})=\delta(\bm{\theta}^{*})$ and $q(\bm{\theta})$ is either $q_{B}(\bm{\theta})$ or $q_{LLM}(\bm{\theta})$ . This metric captures the increase in the posterior probability of the ground-truth reward function (that is, the user's true reward function) after this set of options has been observed. Note that $g$ is relative to the model that is used to update the probability distribution; we use $g_{\textit{B}}$ and $g_{\textit{LLM}}$ to refer to the gain derived from the Bayesian Assistant and the LLM, respectively.
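For the discrete hypothesis space and deterministic simulated user of our main experiments, Eq. (5) reduces to a simple computation. The sketch below makes these assumptions explicit; the function names are illustrative.

```python
import numpy as np

def posterior(q: np.ndarray, thetas: np.ndarray, options: np.ndarray, choice: int) -> np.ndarray:
    """q(theta | O, o*): keep only reward functions consistent with the choice.

    Assumes a deterministic user, so at least one hypothesis (the true one)
    always survives the update.
    """
    preds = np.argmax(thetas @ options.T, axis=1)  # each theta's predicted option
    post = q * (preds == choice)
    return post / post.sum()

def info_gain(q: np.ndarray, thetas: np.ndarray, options: np.ndarray,
              choice: int, true_idx: int) -> float:
    """Eq. (5): increase in log posterior probability of the true reward function."""
    post = posterior(q, thetas, options, choice)
    return float(np.log(post[true_idx]) - np.log(q[true_idx]))
```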
### E.1 Experimental Setup
We randomly sample 5,000 candidate option sets, compute the ground truth information gain of each one based on the Bayesian Assistant, and select the option set that leads to the desired value of $g_{\textit{B}}$ . The performance is evaluated at the end of a five-round interaction, and the ground truth information gain is averaged over these five rounds. We evaluate the Bayesian Assistant as well as Gemma Original, Gemma Oracle, and Gemma Bayesian; as in our main experiments, the Bayesian Assistant is initialized with the uniform prior.
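A minimal sketch of this selection procedure, reusing the `info_gain` helper from the previous sketch (`sample_option_set` is a hypothetical generator of candidate option sets):

```python
import numpy as np

def select_option_set(target_gain, q, thetas, true_idx, sample_option_set,
                      n_candidates=5000):
    """Pick the candidate option set whose g_B is closest to the target value."""
    best, best_err = None, np.inf
    for _ in range(n_candidates):
        options = sample_option_set()                        # (K, d) feature matrix
        choice = int(np.argmax(options @ thetas[true_idx]))  # deterministic user
        err = abs(info_gain(q, thetas, options, choice, true_idx) - target_gain)
        if err < best_err:
            best, best_err = options, err
    return best
```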
<details>
<summary>x19.png Details</summary>

### Visual Description
## Charts: Performance Comparison of LLM Models
### Overview
The image presents two scatter plots (labeled 'a' and 'b') comparing the performance of three Large Language Models (LLMs): Gemma Original, Gemma Oracle, and Gemma Bayesian, against a Bayesian Assistant and a Random baseline. Plot 'a' shows Final-round Accuracy as a function of Avg. Bayesian GT Information Gain, while plot 'b' shows Avg. LLM GT Information Gain as a function of Avg. Bayesian GT Information Gain. Shaded areas represent confidence intervals.
### Components/Axes
**Plot a:**
* **X-axis:** Avg. Bayesian GT Information Gain (ranging from approximately 0.2 to 1.2)
* **Y-axis:** Final-round Accuracy (%) (ranging from 0 to 100)
* **Data Series:**
* Gemma Original (solid blue line)
* Gemma Oracle (dashed blue line)
* Gemma Bayesian (solid orange line)
* Bayesian Assistant (dashed gray line)
* Random (dashed gray line)
**Plot b:**
* **X-axis:** Avg. Bayesian GT Information Gain (ranging from approximately 0.2 to 1.2)
* **Y-axis:** Avg. LLM GT Information Gain (ranging from approximately -0.4 to 0.3)
* **Data Series:**
* Gemma Original (blue dots)
* Gemma Oracle (blue dots)
* Gemma Bayesian (orange dots)
**Legend:**
The legend is positioned in the top-right corner of each plot, clearly associating colors and line styles with each model.
### Detailed Analysis or Content Details
**Plot a:**
* **Gemma Original:** The line is relatively flat, hovering around 20-30% accuracy across the range of Bayesian Information Gain.
* **Gemma Oracle:** Similar to Gemma Original, the line is flat, remaining around 20-30% accuracy.
* **Gemma Bayesian:** The line shows a clear upward trend. Starting around 40% accuracy at an Information Gain of 0.4, it rises to approximately 85% accuracy at an Information Gain of 1.0, then plateaus and slightly declines to around 75% at 1.2.
* **Bayesian Assistant:** The dashed gray line starts around 60% accuracy at 0.4, rises to approximately 95% at 0.8, and then remains relatively stable around 90-100% for the rest of the range.
* **Random:** The dashed gray line is nearly flat, fluctuating around 20-30% accuracy.
**Plot b:**
* **Gemma Original:** The data points show a generally flat trend, fluctuating around 0.05 to 0.15.
* **Gemma Oracle:** The data points show a generally flat trend, fluctuating around 0.05 to 0.15.
* **Gemma Bayesian:** The data points show a clear upward trend. Starting around -0.2 at an Information Gain of 0.4, it rises to approximately 0.25 at an Information Gain of 1.0, then declines slightly to around 0.2 at 1.2.
### Key Observations
* Gemma Bayesian consistently outperforms Gemma Original and Gemma Oracle in both plots.
* The Bayesian Assistant consistently achieves the highest accuracy in Plot a.
* Plot b shows a strong correlation between Avg. Bayesian GT Information Gain and Avg. LLM GT Information Gain for Gemma Bayesian.
* Gemma Original and Gemma Oracle show minimal improvement in accuracy or information gain as Bayesian Information Gain increases.
### Interpretation
The data suggests that incorporating Bayesian information gain significantly improves the performance of the Gemma model (Gemma Bayesian). The upward trend in both plots for Gemma Bayesian indicates that as the model gains more information through the Bayesian process, its accuracy (Plot a) and information gain (Plot b) increase. The flat lines for Gemma Original and Gemma Oracle suggest that they do not effectively utilize the Bayesian information gain. The Bayesian Assistant serves as a strong benchmark, demonstrating the potential of Bayesian approaches. The Random baseline highlights the importance of informed decision-making over chance. The confidence intervals (shaded areas in Plot a) indicate the variability in the results, but the overall trends remain clear. The decline in Gemma Bayesian's performance at the highest Information Gain values in Plot a might indicate a point of diminishing returns or potential overfitting.
</details>
Figure 18: Analysis of sensitivity of LLMs to the informativeness of option sets. (a) Effect of option set informativity on model performance. Option set informativity is defined by ground-truth information gain, i.e., the increase in the log probability assigned by the Bayesian Assistant to the ground truth reward function after observing the provided options. We plot accuracy after five interactions as a function of option set informativity averaged over the five interactions. (b) The relationship between ground-truth information gain computed from the Bayesian Assistant and from LLMs.
### E.2 Results
The Bayesian Assistant's performance consistently improves as option sets become more informative: after observing highly informative options, its performance is almost perfect (Fig. 18). Gemma Original does not show sensitivity to option set informativity, but the fine-tuned models are much more sensitive to this factor: their performance positively correlates with the Bayesian ground-truth information gain up to a certain point. Gemma Bayesian saturates later than Gemma Oracle, and achieves higher final accuracy, especially in the highly informative regime.
### E.3 Comparing LLM-derived and Bayesian Information Gain
Recall that information gain is relative to the model that is used to update the probability distributions: $g_{\textit{LLM}}$ quantifies the amount of information the LLM can absorb from a particular set of options, whereas $g_{\textit{B}}$ quantifies the amount that the ideal Bayesian reasoner can absorb. How does $g_{\textit{LLM}}$ relate to $g_{\textit{B}}$ for each of the variants of Gemma? We find that the correlation between the two measures is weak for Gemma Original (Fig. 18). For Gemma Oracle and Gemma Bayesian, we observe a more complex pattern. When $g_{\textit{B}}$ is small, there is a positive relationship between the two metrics, indicating that options that are informative from the Bayesian perspective are beneficial for the fine-tuned LLMs. In this part of the range, the information gain derived from Gemma Bayesian shows a stronger correlation with $g_{\textit{B}}$ compared with Gemma Oracle. When $g_{\textit{B}}$ is large, however, the relationship levels off and we no longer see a correlation between $g_{\textit{B}}$ and $g_{\textit{LLM}}$ for either of the fine-tuned models. This suggests that even Gemma Bayesian only approximates, and does not fully implement, the normative Bayesian reasoning strategy.
## Appendix F Human Experiments
### F.1 Humans As Assistants
#### Participants.
For the experiment where human participants acted as the assistant to a simulated user, we recruited 720 participants through the Prolific platform [Palan and Schitter, 2018]. Each human participant interacted with one simulated user from a subset of 48 simulated users (out of the total 624 users), which we sampled based on the L2 distance of their reward function from the origin. The average age of human participants was 37.2 (SD=12.5). Of those, 54.9% identified as male (395), 44.6% as female (321), and 0.6% preferred not to say (4). The major nationalities of human participants were the United States at 32.5% (234), United Kingdom at 23.2% (167), South Africa at 10.3% (74), Canada at 7.6% (55), and Poland at 4.4% (32). By ethnicity, 62.5% (450) were White, 17.4% (125) were Black, 11.9% (86) were Asian, and 5.6% (40) were Mixed. All participants reported using English as their primary language.
#### Procedure.
At the beginning of the experiment, each participant was asked to complete a preference questionnaire to indicate their initial guess of the user's preferences for each individual feature. The participant subsequently proceeded to the annotation round, where they made recommendations from three flight options. After the selection, the human annotator received feedback indicating whether their choice was correct. They were then redirected to the preference questionnaire to report their updated beliefs about the user's preferences. This completed one round. The annotator repeated the same procedure for five rounds. Following these five rounds, we also implemented a quality control annotation round where the annotator interacted with a typical user with a highly informative option list (differing only in one feature dimension). We expected this quality control round to be very easy for participants who were paying close attention to the task. We filtered out participants who failed the quality control annotation. The mean and median completion time (including the quality control annotation) was 9.35 minutes and 7.90 minutes, respectively, with a standard deviation of 5.08 minutes.
#### Additional Results.
Our main results show the accuracy of human assistants using their direct predictions of the user's preferred choices. Since we also ask the annotator to rate their beliefs about the user's preferences after each round, we can also use these estimated preferences to make recommendations, following the same procedure we use in Section B. This allows us to evaluate on the larger held-out set and reduce noise. As shown in Fig. 19, we find that while the accuracy of the human annotators' direct predictions may not monotonically improve from one round to the next, their beliefs about the user's preferences become consistently more accurate over rounds.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Charts: Prediction Accuracy vs. Number of Interactions
### Overview
The image presents two line charts, labeled 'a. Direct Prediction' and 'b. Belief-based Prediction on Held-out Set'. Both charts compare the accuracy of a prediction method against a random baseline, as a function of the number of interactions. Accuracy is measured in percentage (%).
### Components/Axes
Both charts share the following components:
* **X-axis:** Labeled "# Interactions", ranging from 0 to 4 in chart 'a' and 0 to 5 in chart 'b'. The axis is discrete, representing integer values.
* **Y-axis:** Labeled "Accuracy (%)", ranging from 0 to 100.
* **Legend:** Located in the top-left corner of each chart.
* 'Direct' (or 'Beliefs') - Represented by a solid green line with light green shaded error bars.
* 'Random' - Represented by a gray dashed line.
### Detailed Analysis or Content Details
**Chart a: Direct Prediction**
* **Direct (Green Line):** The line starts at approximately 32% accuracy at 0 interactions, rises to a peak of roughly 52% at 2 interactions, then declines slightly to around 48% at 4 interactions. The error bars indicate a significant degree of variability, ranging from approximately 25% to 60% across all interaction levels.
* 0 Interactions: ~32% ± ~10%
* 1 Interaction: ~40% ± ~15%
* 2 Interactions: ~52% ± ~10%
* 3 Interactions: ~48% ± ~10%
* 4 Interactions: ~48% ± ~10%
* **Random (Gray Dashed Line):** The line is relatively flat, starting at approximately 30% and remaining around 30-35% throughout all interaction levels.
**Chart b: Belief-based Prediction on Held-out Set**
* **Beliefs (Green Line):** The line shows an increasing trend, starting at approximately 38% accuracy at 0 interactions and rising to around 52% at 5 interactions. The error bars are substantial, ranging from approximately 30% to 60% across all interaction levels.
* 0 Interactions: ~38% ± ~10%
* 1 Interaction: ~42% ± ~10%
* 2 Interactions: ~46% ± ~10%
* 3 Interactions: ~48% ± ~10%
* 4 Interactions: ~50% ± ~10%
* 5 Interactions: ~52% ± ~10%
* **Random (Gray Dashed Line):** Similar to chart 'a', the line is relatively flat, starting at approximately 30% and remaining around 30-35% throughout all interaction levels.
### Key Observations
* In both charts, the 'Direct'/'Beliefs' method consistently outperforms the 'Random' baseline.
* The error bars are large in both charts, indicating high variance in the results.
* Chart 'a' shows a peak in accuracy at 2 interactions, followed by a slight decline.
* Chart 'b' demonstrates a consistent increase in accuracy with increasing interactions.
### Interpretation
These charts compare the performance of two prediction methods, a 'Direct' prediction method and a 'Belief-based' method, against a random baseline. The number of interactions appears to represent the amount of data or experience used by the prediction methods.
Chart 'a' suggests that the 'Direct' prediction method benefits from a small number of interactions (up to 2), but further interactions do not lead to significant improvements and may even cause a slight decrease in accuracy. The large error bars suggest that the performance is highly variable and sensitive to the specific data.
Chart 'b' indicates that the 'Belief-based' method improves with more interactions, suggesting that it is able to learn and refine its predictions over time. Again, the large error bars highlight the variability in the results.
The consistent outperformance of both methods over the 'Random' baseline suggests that they are both capable of learning something from the data, but the high variance indicates that the results may not be robust or generalizable. The difference between the two charts could be due to the nature of the data used (direct vs. held-out set) or the specific algorithms employed. The held-out set in chart 'b' likely provides a more realistic assessment of the model's ability to generalize to unseen data.
</details>
Figure 19: Accuracy of the human assistant over rounds. (a) Based on the humanâs direct predictions on provided option sets. (b) Based on the humanâs beliefs about the userâs preferences on held-out option sets. Error bars show the averaged standard error across human participants.
#### Qualitative Analysis.
One pattern we observe in human assistants is that they tend to favor simpler heuristics when there is limited evidence. For example, in Table 2, we show that when there are multiple valid user preferences, human assistants may fall back on a simpler heuristic, in this case always choosing the cheapest flight. In contrast, the fine-tuned Gemma Bayesian model does not seem to exhibit this behavior.
Table 2: Qualitative examples of LLM and human predictions. Here, the user strongly prefers an early departure time, weakly prefers a short flight duration, and has no preference for the number of stops and the price. Most human participants tend to favor a simpler heuristic, i.e., always choosing the cheapest flight, while Gemma Bayesian does not seem to exhibit this behavior.
| Option | Departure | Duration | Stops | Price | User's Choice | Gemma Bayesian | Human Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flight 1 | 05:12 PM | 30 min | 1 | $190 | Flight 1 | Flight 1 | Flight 1: 66.7% |
| Flight 2 | 03:36 PM | 12 hr 12 min | 2 | $460 | | | Flight 2: 26.7% |
| Flight 3 | 10:00 PM | 10 hr 15 min | 2 | $640 | | | Flight 3: 6.7% |
| Flight 1 | 06:48 PM | 4 hr 24 min | 1 | $370 | Flight 2 | Flight 2 | Flight 1: 40.0% |
| Flight 2 | 07:36 AM | 16 hr 6 min | 1 | $100 | | | Flight 2: 33.3% |
| Flight 3 | 10:00 PM | 20 hr | 0 | $550 | | | Flight 3: 26.7% |
| Flight 1 | 10:00 PM | 30 min | 1 | $280 | Flight 3 | Flight 3 | Flight 1: 60.0% |
| Flight 2 | 08:24 PM | 30 min | 0 | $910 | | | Flight 2: 0.0% |
| Flight 3 | 06:00 AM | 8 hr 18 min | 0 | $370 | | | Flight 3: 40.0% |
### F.2 Humans As Users
#### Participants.
For the experiment where human participants acted as the users, we recruited 500 participants through the Prolific platform. The average age of the participants was 38.7 (SD=13.6); 51.0% identified as male (255), 48.4% as female (242), and 0.6% preferred not to say (3). The major nationalities of human participants were the United States at 40.0% (200), United Kingdom at 16.0% (80), South Africa at 9.0% (45), Canada at 7.8% (39), and Australia at 5.6% (28), with smaller representations from other countries. In terms of ethnicity, 65.2% (326) identified as White, 15.0% (75) as Black, 8.4% (42) as Asian, 7.2% (36) as Mixed, and 4.0% (20) as Other. All participants reported that English is their primary language.
#### Procedure.
Each participant in this experiment was first asked to complete a preference questionnaire to indicate their own preferences for different flight features. They then proceeded to the annotation rounds, where they needed to select their preferred option out of three flight options. To ensure quality, we required annotators to think for at least 30 seconds before making their selection. The procedure continued for five rounds. Participants were told to make choices consistent with their initially stated preferences throughout all five rounds. The mean and median completion times were 6.43 minutes and 5.18 minutes, respectively, with a standard deviation of 3.51 minutes.
#### Additional Results.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Bar Charts: Distribution of Ratings for Travel Attributes
### Overview
The image presents four separate bar charts, each displaying the probability distribution of ratings (from 1 to 5) for different travel attributes: Departure Time, Duration, Number of Stops, and Price. The y-axis of each chart represents Probability (in percentage), while the x-axis represents the Rating.
### Components/Axes
* **X-axis:** Rating (1 to 5) - common to all four charts.
* **Y-axis:** Probability (%) - common to all four charts, ranging from 0% to 100%.
* **Chart Titles:**
* Departure Time
* Duration
* Number of Stops
* Price
* **No Legend:** The charts do not have a legend, as each bar represents the probability for a specific rating.
### Detailed Analysis or Content Details
**1. Departure Time:**
* The highest probability is around 25-30% for a rating of 2.
* The probability decreases as the rating increases.
* Approximate values:
* Rating 1: ~5%
* Rating 2: ~28%
* Rating 3: ~20%
* Rating 4: ~12%
* Rating 5: ~5%
**2. Duration:**
* The highest probability is around 40-45% for a rating of 2.
* The probability is relatively high for ratings 1 and 2, then decreases sharply for ratings 3, 4, and 5.
* Approximate values:
* Rating 1: ~15%
* Rating 2: ~42%
* Rating 3: ~15%
* Rating 4: ~8%
* Rating 5: ~5%
**3. Number of Stops:**
* The highest probability is around 55-60% for a rating of 1.
* The probability decreases rapidly as the rating increases.
* Approximate values:
* Rating 1: ~58%
* Rating 2: ~30%
* Rating 3: ~7%
* Rating 4: ~3%
* Rating 5: ~2%
**4. Price:**
* The highest probability is around 45-50% for a rating of 2.
* The probability decreases as the rating increases, but is relatively stable for ratings 3, 4, and 5.
* Approximate values:
* Rating 1: ~10%
* Rating 2: ~48%
* Rating 3: ~18%
* Rating 4: ~12%
* Rating 5: ~8%
### Key Observations
* The "Number of Stops" attribute has a strong preference for lower ratings (rating of 1 is most common).
* "Departure Time", "Duration", and "Price" have a more distributed preference, with ratings of 2 being the most common.
* Ratings of 5 are relatively uncommon for all attributes.
### Interpretation
The data suggests that travelers generally prefer fewer stops, while they are more neutral about Departure Time, Duration, and Price. The distribution of ratings indicates that most travelers do not give extremely high or low ratings to these attributes, suggesting a moderate level of satisfaction or dissatisfaction. The high probability for rating 1 in "Number of Stops" indicates that direct flights (or flights with very few stops) are highly valued. The relatively low probabilities for ratings of 5 across all attributes suggest that there is room for improvement in all areas. The data could be used to prioritize improvements in areas where ratings are lower, or to maintain the quality of attributes that already receive high ratings.
</details>
Figure 20: The distributions of human participants' initial preferences for different flight features. A rating of 1 indicates the strongest preference for the earliest departure time, the shortest duration, the fewest number of stops, and the lowest price, while a rating of 5 indicates the opposite. A rating of 3 indicates no preference.
<details>
<summary>x22.png Details</summary>

### Visual Description
## Chart: Reward Function Distribution and Accuracy on Human Reward Function Set
### Overview
The image presents two charts side-by-side. The first chart (a) displays the distribution of a reward function, showing the probability of each reward function index. The second chart (b) illustrates the accuracy of different models (Gemma Original, Gemma Oracle, Gemma Bayesian, and Bayesian Assistant) on a human reward function set, plotted against the number of interactions.
### Components/Axes
**Chart a: Distribution of Reward Function**
* **X-axis:** Reward Function Index (ranging from approximately 0 to 600)
* **Y-axis:** Probability (%) (ranging from 0 to approximately 3.5)
* **Title:** Distribution of Reward Function
**Chart b: Accuracy on Human Reward Fn. Set**
* **X-axis:** # Interactions (ranging from 0 to 5)
* **Y-axis:** Accuracy (%) (ranging from 0 to 100)
* **Title:** Accuracy on Human Reward Fn. Set
* **Legend:**
* Gemma Original (Blue, marked with triangles)
* Gemma Oracle (Light Orange, marked with circles)
* Gemma Bayesian (Orange, marked with squares)
* Bayesian Assistant (Light Pink, marked with crosses)
### Detailed Analysis or Content Details
**Chart a: Distribution of Reward Function**
The distribution is a histogram-like plot. There is a large peak around an index of approximately 75-125, with a probability of around 3.3%. There are several smaller peaks and valleys throughout the range of reward function indices, indicating a non-uniform distribution. The probability generally decreases as the index moves away from the initial peak, with several smaller peaks appearing between approximately 200 and 550.
**Chart b: Accuracy on Human Reward Fn. Set**
* **Gemma Original (Blue Triangles):** Starts at approximately 55% accuracy at 0 interactions, dips to around 48% at 1 interaction, and remains relatively stable around 50-55% for the remaining interactions.
* **Gemma Oracle (Light Orange Circles):** Starts at approximately 25% accuracy at 0 interactions, rises sharply to around 60% at 1 interaction, and continues to increase to approximately 75% at 5 interactions.
* **Gemma Bayesian (Orange Squares):** Starts at approximately 45% accuracy at 0 interactions, rises to around 70% at 2 interactions, and continues to increase to approximately 78% at 5 interactions.
* **Bayesian Assistant (Light Pink Crosses):** Starts at approximately 50% accuracy at 0 interactions, rises to around 65% at 1 interaction, and continues to increase to approximately 80% at 5 interactions.
### Key Observations
* The reward function distribution (Chart a) is highly variable, with a prominent peak and numerous smaller fluctuations.
* In Chart b, the Gemma Oracle, Gemma Bayesian, and Bayesian Assistant models all demonstrate an increasing trend in accuracy as the number of interactions increases.
* The Gemma Original model shows a slight decrease in accuracy after the first interaction and remains relatively flat.
* The Bayesian Assistant consistently achieves the highest accuracy across all interaction levels.
* The Gemma Oracle shows the most significant improvement in accuracy with increasing interactions.
### Interpretation
The distribution of the reward function (Chart a) suggests that the reward landscape is complex and not uniformly distributed. This complexity could pose challenges for reinforcement learning algorithms.
Chart b demonstrates the effectiveness of incorporating Bayesian methods and oracle feedback into the Gemma models. The Gemma Oracle, which presumably benefits from access to perfect information, shows a substantial improvement in accuracy with more interactions. The Gemma Bayesian and Bayesian Assistant models also exhibit improved performance, indicating that Bayesian inference can help the models learn more effectively from limited data. The relatively flat performance of the Gemma Original model suggests that it struggles to adapt to the human reward function set without the benefits of oracle feedback or Bayesian reasoning. The consistent high performance of the Bayesian Assistant suggests it is the most robust and effective model for this task. The initial dip in accuracy for Gemma Original could be due to overfitting to the initial data or a temporary instability during the learning process.
</details>
Figure 21: Analysis of human reward functions. (a) Distribution of human reward functions. (b) Accuracy over rounds on the subset of the original data where the simulated user's reward function is in the set of reward functions stated by the human participants. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x23.png Details</summary>

### Visual Description
## Charts: Model Performance & Human Consistency
### Overview
The image contains three charts evaluating the performance of different models (Gemma Original, Gemma Oracle, Gemma Bayesian, and Bayesian Assistant) in relation to human user consistency. The charts assess accuracy on both human-annotated and held-out option sets, under both all consistency levels and high consistency levels. The first chart (a) shows the distribution of human user average consistency.
### Components/Axes
* **Chart a:**
* **X-axis:** Round (1 to 5)
* **Y-axis:** Consistency (%)
* **Chart Type:** Line graph and Histogram
* **Charts b & c:**
* **X-axis:** # Interactions (0 to 4/5)
* **Y-axis:** Accuracy (%)
* **Chart Type:** Line graph
* **Legend:**
* Gemma Original (Blue)
* Gemma Oracle (Orange)
* Gemma Bayesian (Yellow)
* Bayesian Assistant (Brown)
* **Sub-Charts:** Each chart (b & c) is split into two sub-charts: "All" and "High Consistency".
### Detailed Analysis or Content Details
**Chart a: Human User Average Consistency**
* The line graph shows a decreasing trend in consistency from Round 1 to Round 5.
* Round 1: Approximately 75% consistency.
* Round 2: Approximately 60% consistency.
* Round 3: Approximately 50% consistency.
* Round 4: Approximately 45% consistency.
* Round 5: Approximately 40% consistency.
* The histogram shows the probability distribution of average consistency percentages.
* The distribution is right-skewed, with a peak between 60% and 80%.
* The probability decreases as consistency decreases below 60%.
* The probability is low for consistency above 80%.
**Chart b: Accuracy on Human-annotated Option Sets**
* **All:**
* Gemma Original: Starts at approximately 65% accuracy and remains relatively stable around 60-70% across all interactions.
* Gemma Oracle: Starts at approximately 20% accuracy and increases to around 40% by interaction 4.
* Gemma Bayesian: Starts at approximately 20% accuracy and increases to around 50% by interaction 4.
* Bayesian Assistant: Starts at approximately 20% accuracy and increases to around 45% by interaction 4.
* **High Consistency:**
* Gemma Original: Starts at approximately 60% accuracy and increases to around 70% by interaction 4.
* Gemma Oracle: Starts at approximately 40% accuracy and remains relatively stable around 40-50% across all interactions.
* Gemma Bayesian: Starts at approximately 40% accuracy and increases to around 60% by interaction 4.
* Bayesian Assistant: Starts at approximately 40% accuracy and increases to around 60% by interaction 4.
**Chart c: Accuracy on Held-out Option Sets**
* **All:**
* Gemma Original: Starts at approximately 60% accuracy and decreases to around 40% by interaction 5.
* Gemma Oracle: Starts at approximately 20% accuracy and increases to around 40% by interaction 3, then plateaus.
* Gemma Bayesian: Starts at approximately 20% accuracy and increases to around 40% by interaction 3, then plateaus.
* Bayesian Assistant: Starts at approximately 20% accuracy and increases to around 40% by interaction 3, then plateaus.
* **High Consistency:**
* Gemma Original: Starts at approximately 40% accuracy and increases to around 60% by interaction 2, then plateaus.
* Gemma Oracle: Starts at approximately 20% accuracy and increases to around 50% by interaction 3, then plateaus.
* Gemma Bayesian: Starts at approximately 20% accuracy and increases to around 50% by interaction 3, then plateaus.
* Bayesian Assistant: Starts at approximately 20% accuracy and increases to around 50% by interaction 3, then plateaus.
### Key Observations
* Human consistency decreases with each round of interaction.
* Gemma Original consistently performs well on human-annotated and held-out option sets, especially under high consistency conditions.
* Gemma Oracle, Gemma Bayesian, and Bayesian Assistant show improvement with more interactions, but generally perform lower than Gemma Original.
* Accuracy is generally higher for high consistency data compared to all data.
* The "All" data in Chart c shows Gemma Original's accuracy decreasing with more interactions, while the "High Consistency" data shows it plateauing.
### Interpretation
The data suggests that while human consistency diminishes over repeated interactions, the models, particularly Gemma Original, can maintain or even improve accuracy, especially when dealing with consistent human input. The improvement of Gemma Oracle, Gemma Bayesian, and the Bayesian Assistant with more interactions suggests they benefit from learning, but are still less robust than Gemma Original. The divergence in Gemma Original's performance between the "All" and "High Consistency" datasets in Chart c indicates that the model is more sensitive to inconsistent human input. This could be due to the model being trained on more consistent data, or it could be a limitation of the model's ability to handle noisy data. The histogram in Chart a highlights the inherent variability in human judgment, which is a crucial factor to consider when evaluating model performance. The models' performance is clearly tied to the quality and consistency of the human input.
</details>
Figure 22: Results on interactions with real human users. (a) Consistency between the human users' choices and the predictions derived from their initially stated preferences. We show user consistency over rounds and the distribution of the average user consistency. Error bars show standard errors across lists of five-round option sets. (b) Accuracy over rounds on human-annotated option sets. We show the results of all human users and users with high consistency, i.e., their choices matched their initially stated preferences in 4 or 5 of the rounds (40.4% of the data). (c) Accuracy over rounds on the held-out set, where the preferred choices are deterministically computed based on the human user's preferences. Error bars show standard errors across three random seeds (and three training runs).
In the main paper we report results for this more realistic setting, in which the model interacts with real human users on the flight recommendation task. Surprisingly, we find that the original LLMs achieve good performance in this setting, unlike in the earlier experiments.
We hypothesize that two factors may contribute to its improved performance. First, unlike our simulated users, whose preferences are uniformly sampled from the space of possible reward functions, human preferences are biased towards particular types of functions: in Fig. 21 we show that some reward functions are considerably more common than others in our sample of human participants. For example, most participants report preferring cheaper flights (see Fig. 20). As such, a viable strategy for the original LLM could be to rely on its prior knowledge about user preferences to make relatively good recommendations. To investigate this further, we filter results for simulated users based on reward functions stated by the human participants in Fig. 21. We observe that in this case too, Gemma Original achieves a higher accuracy of around 60% (as opposed to 37% in Fig. 2), matching the high accuracy it obtained in Fig. 6. This makes it clear that the bias among the human preferences in this experiment contributes to the stronger performance of original LLMs.
Second, human users may not behave consistently with their preferences, i.e., their choices may differ from those that would reflect their initially stated preferences. Indeed, note how in Fig. 21 the gap between the original LLM and the Bayesian LLM increases significantly when evaluating on consistent simulated users. To quantify this potential discrepancy, we compute the consistency between the human user's choices and the predictions derived from their preferences. The latter are obtained by mapping their stated preferences to corresponding reward functions and selecting the option with the highest reward accordingly. In line with our hypothesis, the average consistency is relatively low at 60%, with chance performance being 33.3% (Fig. 22).
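A minimal sketch of this consistency computation, assuming option features are encoded as raw magnitudes (departure time, duration, number of stops, price) and using an illustrative linear mapping from questionnaire ratings to feature weights:

```python
import numpy as np

def rating_to_reward(ratings) -> np.ndarray:
    """Map 1-5 questionnaire ratings to signed feature weights.

    Rating 1 (prefers the lowest value) -> -2, rating 3 (no preference) -> 0,
    rating 5 (prefers the highest value) -> +2. This mapping is illustrative.
    """
    return np.asarray(ratings, dtype=float) - 3.0

def consistency(ratings, option_sets, choices) -> float:
    """Fraction of rounds where the user's actual choice matches the
    highest-reward option under their stated preferences (chance = 1/3)."""
    theta = rating_to_reward(ratings)
    predicted = [int(np.argmax(options @ theta)) for options in option_sets]
    return float(np.mean([p == c for p, c in zip(predicted, choices)]))
```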
We further break down performance by user consistency over rounds and show results for high-consistency users, i.e., users whose choices were consistent with their stated preferences in 4 or 5 of the rounds (Fig. 22). We find that all models perform better for high-consistency users; in particular, the improvement of Gemma Bayesian over Gemma Original increases when user consistency is high.
Finally, to limit the effect of such inconsistencies while still retaining the real interactions between human users and the model, we also evaluate the LLMs on a held-out set of 100 randomly sampled options that simulates perfectly consistent users: we use the preferred options derived from the participants' initially stated preferences rather than the participants' actual choices. As shown in Fig. 22, when inconsistency is removed from the evaluation data, Gemma Bayesian achieves the best performance. Gemma Original performs best initially, likely due to its accurate prior about human users, but its performance decreases over rounds, indicating its limited ability to incorporate the simulated user's feedback.
### F.3 Human Annotation Interface
We show the human annotation interface where humans act as the assistant in Fig. 23. The interface allows the human annotator to select the best option from three flight options, rate their estimate of the user's preferences, and check the flight booking history from previous rounds. The annotation interface where humans act as the user is similar.
## Appendix G Statistical Analyses
This appendix reports analyses that test whether Bayesian teaching leads to statistically significant improvements over the baselines. We fit linear mixed-effects models treating method (Bayesian teaching, oracle teaching, and the original) and model family (Gemma, Llama, and Qwen) as fixed effects while controlling for various sources of randomness. For flight and hotel recommendation, we include training run, evaluation random seed, and reward function as random effects. For web shopping, we treat training run and product category as random effects. Overall, we find statistically significant differences between methods across all domains and all three model families (Gemma, Llama, and Qwen).
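As a rough sketch of such an analysis (assuming statsmodels in Python; the exact software, model formula, and column names such as "accuracy", "method", "family", "training_run", "seed", and "reward_fn" are hypothetical):

```python
# Hedged sketch of the flight/hotel mixed-effects analysis.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("flight_results.csv")  # hypothetical table: one row per evaluation unit

# Fixed effects: method and model family, plus their interaction.
# Random effects: training run as the grouping factor, with evaluation seed
# and reward function modeled as variance components.
model = smf.mixedlm(
    "accuracy ~ C(method, Treatment('Original')) * C(family, Treatment('Gemma'))",
    data=df,
    groups=df["training_run"],
    vc_formula={"seed": "0 + C(seed)", "reward_fn": "0 + C(reward_fn)"},
)
result = model.fit()
print(result.summary())  # coefficient estimates, 95% CIs, and p-values
```

Note that statsmodels nests variance components within the grouping factor, so this sketch only approximates fully crossed random effects; packages such as lme4 in R support crossed random effects directly.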
In flight recommendation, the original LLM achieves a baseline accuracy of 37.0% (95% CI: 30.6–43.5%). The Oracle LLM performs significantly better, with a 24.0% increase (95% CI: 16.6–31.4%, p $<$ 0.001), while the Bayesian LLM shows an even more substantial 38.5% increase (95% CI: 31.1–45.9%, p $<$ 0.001). Model family shows no significant effect on performance, with all differences between model families non-significant. The interaction between method and model family is not statistically significant (minimum p = 0.19). Within each model family, improvements between all methods are significant (p $<$ 0.001), with the exception of Qwen Oracle versus Qwen Original, which shows a slightly weaker but still significant improvement (p = 0.002).
In hotel recommendation, the original LLM achieves a baseline accuracy of 36.7% (95% CI: 32.1–41.3%). The Oracle LLM performs significantly better, with a 16.7% increase (95% CI: 11.4–22.0%, p $<$ 0.001), while the Bayesian LLM shows a 29.4% increase (95% CI: 24.1–34.7%, p $<$ 0.001). Model family shows no significant main effect on performance. The interaction between method and model family is not statistically significant (all interaction p-values $>$ 0.11). Within each model family, most pairwise comparisons show p-values $<$ 0.001, with two exceptions: Llama Bayesian versus Llama Oracle (p = 0.001) and Qwen Oracle versus Qwen Original (p = 0.002), both still significant but with somewhat weaker evidence.
In web shopping, the original LLM achieves a baseline accuracy of 54.0% (95% CI: 49.6–58.4%). The Oracle LLM performs significantly better, with a 7.1% increase (95% CI: 2.3–11.8%, p = 0.013), while the Bayesian LLM shows a more substantial 18.6% increase (95% CI: 13.8–23.4%, p $<$ 0.001). Unlike in the other domains, model family shows a significant effect, with Qwen showing a significant decrease of 11.1% (95% CI: -17.0% to -5.3%, p = 0.003) compared to the baseline. There is also a significant interaction between the Oracle method and Qwen (15.8%, 95% CI: 9.0–22.6%, p = 0.001). Within-family pairwise comparisons show different patterns: for Gemma, all method comparisons are significant (Original-Oracle: p = 0.033; others p $<$ 0.001); for Llama, Original-Oracle is non-significant (p = 0.199) while Original-Bayesian (p = 0.001) and Oracle-Bayesian (p = 0.004) are significant; for Qwen, Original-Oracle and Original-Bayesian are highly significant (p $<$ 0.001), but Oracle-Bayesian is non-significant (p = 0.282).
## Appendix H Results Details
We show results over rounds for different models and methods in Figs. 24–27. For each, we show the accuracy based on the LLM's or the human's direct predictions ("direct") and, where available, the accuracy based on predictions derived from their beliefs about the user's preferences ("beliefs").
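To make the "beliefs" accuracy concrete, the sketch below scores each option by its expected reward under a verbalized posterior over preference scales (the format shown in Table 9). As before, the linear reward and the Likert-to-weight mapping are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: choose the option with the highest expected reward under a
# distribution over 1-5 preference scales for each feature.
import numpy as np

SCALES = np.array([1, 2, 3, 4, 5])
WEIGHTS = (SCALES - 3) / 2.0  # scale 1 -> -1 (prefer smaller), 5 -> +1 (prefer larger)

def belief_choice(options: np.ndarray, beliefs: np.ndarray) -> int:
    # options: (n_options, n_features) feature values normalized to [0, 1].
    # beliefs: (n_features, 5) row-stochastic matrix; beliefs[f, s] is the
    # probability that the user's preference for feature f is scale s + 1.
    expected_weights = beliefs @ WEIGHTS           # (n_features,)
    expected_rewards = options @ expected_weights  # (n_options,)
    return int(np.argmax(expected_rewards))
```

Because the reward is linear in the weights, taking the expectation over scales is equivalent to collapsing each feature's posterior to its mean weight before scoring the options.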
<details>
<summary>x24.png Details</summary>

Screenshot of the flight-selection annotation interface ("Select the Best Option", "Round 1 of 6"): three flight options listed with departure time, duration, number of stops, and price; a preference questionnaire with 5-point Likert scales for departure time, flight duration, number of stops, and price; buttons to submit a selection and check a summary; and an annotation summary showing the options, the correct option, and the annotator's selection.
</details>
Figure 23: Example of the annotation interface where humans act as the flight recommendation assistant. The human annotator was asked to select the best option and rate their estimate of the user's preferences. We also allow the annotator to check a summary of the previous flight booking history. The annotation interface where humans act as the user is similar.
<details>
<summary>x25.png Details</summary>

3×3 grid of line charts, one per model (Gemma 2 9B, Gemma 2 27B, Llama 3 8B, Llama 3 70B, Qwen 2.5 7B, Qwen 2.5 32B, GPT-4.1 Mini, Gemini 1.5 Pro, Human). X-axis: # interactions (0–5); y-axis: Accuracy (%). Series: Direct, Beliefs, Bayesian Assistant, and a Random baseline, with error bars.
</details>
Figure 24: Accuracy over rounds for different original LLMs. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x26.png Details</summary>

3×3 grid of line charts, one per model-and-method combination (Gemma/Llama/Qwen × Original/Oracle/Bayesian) on the flight recommendation task. X-axis: # interactions (0–5); y-axis: Accuracy (%). Series: Direct, Beliefs, Bayesian Assistant, and a Random baseline.
</details>
Figure 25: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the flight recommendation task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x27.png Details</summary>

3×3 grid of line charts, one per model-and-method combination (Gemma/Llama/Qwen × Original/Oracle/Bayesian) on the hotel recommendation task. X-axis: # interactions (0–5); y-axis: Accuracy (%). Series: Direct, Beliefs, Bayesian Assistant, and a Random baseline.
</details>
Figure 26: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the hotel recommendation task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three random seeds (and three training runs).
<details>
<summary>x28.png Details</summary>

3×3 grid of line charts, one per model-and-method combination (Gemma/Llama/Qwen × Original/Oracle/Bayesian) on the web shopping task. X-axis: # interactions (0–5); y-axis: Accuracy (%). Series: Direct, Direct Fine-tuning, and a Random baseline.
</details>
Figure 27: Accuracy over rounds for different original LLMs and fine-tuned LLMs on the web shopping task. We show accuracy based on direct predictions and accuracy based on predictions derived from their beliefs about users' preferences. Error bars show standard errors across three training runs, where applicable.
## Appendix I Example Interactions
We show example interactions in Tables 3–12.
Table 3: Example of a two-round interaction between the user and the LLM.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2. Which flight is the best option?
> - Flight 1: departure time: 04:00 PM, duration: 18 hr 3 min, number of stops: 2, price: $280
> - Flight 2: departure time: 10:48 AM, duration: 6 hr 21 min, number of stops: 1, price: $370
> - Flight 3: departure time: 06:48 PM, duration: 10 hr 5 min, number of stops: 1, price: $810
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 4: Example of prompting the LLM for the user's preference on a single feature.

> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2. On a scale of 1 to 5, what is my preference for price?
> - 1: I strongly prefer a cheaper flight
> - 2: I prefer a cheaper flight
> - 3: I have no strong preference
> - 4: I prefer a more expensive flight
> - 5: I strongly prefer a more expensive flight
>
> **Model:** Your preference for price is: 1.
Table 5: Example of a two-round interaction between the user and the LLM using the numerical representation of flights.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 0.7, duration: 0.5, number of stops: 1.0, price: 0.9
> - Flight 2: departure time: 0.9, duration: 0.6, number of stops: 0.0, price: 0.7
> - Flight 3: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2. Which flight is the best option?
> - Flight 1: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
> - Flight 2: departure time: 0.2, duration: 0.3, number of stops: 0.5, price: 0.2
> - Flight 3: departure time: 0.8, duration: 0.8, number of stops: 1.0, price: 0.9
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 6: Example of prompting the LLM for the user's preference on a single feature using the numerical representation of flights.

> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 0.7, duration: 0.5, number of stops: 1.0, price: 0.9
> - Flight 2: departure time: 0.9, duration: 0.6, number of stops: 0.0, price: 0.7
> - Flight 3: departure time: 0.5, duration: 0.9, number of stops: 0.5, price: 0.1
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2. On a scale of 1 to 5, what is my preference for duration?
> - 1: I strongly prefer the minimum value
> - 2: I prefer the minimum value
> - 3: I have no strong preference
> - 4: I prefer the maximum value
> - 5: I strongly prefer the maximum value
Table 7: Example of prompting the LLM for the user's preference on a single feature using the model's directly generated probability judgements.

> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 1.
>
> **User:** Your option Flight 1 is incorrect. I prefer Flight 2. On a scale of 1 to 5, what is my preference for price?
> - 1: I strongly prefer a cheaper flight
> - 2: I prefer a cheaper flight
> - 3: I have no strong preference
> - 4: I prefer a more expensive flight
> - 5: I strongly prefer a more expensive flight
>
> Provide an integer between 0 and 100 (%) that reflects the probability of each scale. Format your response exactly as follows: - 1: ??% ...
>
> **Model:** The probabilities of each scale are:
> - 1: 70%
> - 2: 10%
> - 3: 15%
> - 4: 5%
> - 5: 0%
Table 8: Example of a first-round interaction between the user and the LLM using chain-of-thought prompting.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. First, infer my preferences by reasoning about each feature. For each feature, estimate the probability distribution of my preference across a 1-to-5 scale. For example, you might estimate a 30% probability that I strongly prefer an earlier morning flight (scale 1), a 10% probability that I prefer an earlier morning flight (scale 2), a 20% probability that I have no strong preference (scale 3), and so on. Then, use these probabilities to determine the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> Let's think step by step. End your response with "The best option is Flight <your choice>.".
>
> **Model:** Let's analyze each flight feature and estimate your preferences: ... <Model's reasoning chain> ... The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 9: Example of a first-round interaction between the user and the LLM with verbalized reward posterior distribution.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information.
>
> **Model:** Based on the current information, the probabilities for each preference scale across all features are: The probabilities for each scale of your preference for departure time are:
> - 1: I strongly prefer an earlier morning departure time, 0.3%
> - 2: I prefer an earlier morning departure time, 1.0%
> - 3: I have no strong preference, 97.6%
> - 4: I prefer a later evening departure time, 1.1%
> - 5: I strongly prefer a later evening departure time, 0.1%
>
> The probabilities for each scale of your preference for duration are: ...
>
> **User:** Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is incorrect. I prefer Flight 1.
>
> **Model:** Based on the current information, the probabilities for each preference scale across all features are: The probabilities for each scale of your preference for departure time are:
> - 1: I strongly prefer an earlier morning departure time, 0.1%
> - 2: I prefer an earlier morning departure time, 4.1%
> - 3: I have no strong preference, 2.5%
> - 4: I prefer a later evening departure time, 93.2%
> - 5: I strongly prefer a later evening departure time, 0.1%
Table 10: Example of a two-round interaction between the user and the LLM using the non-interactive setting. The blue part highlights the difference compared to the interactive setting in Table 3.
> **User:** Help me select the best flights for my trips. I have specific preferences for what I like and dislike in a flight, and these preferences remain the same. You need to figure out my preferences and select the best flights for me. Use your best judgment if you are unsure. Do not say you need more information. Which flight is the best option?
> - Flight 1: departure time: 02:00 PM, duration: 2 hr 30 min, number of stops: 1, price: $370
> - Flight 2: departure time: 10:00 PM, duration: 4 hr 24 min, number of stops: 0, price: $730
> - Flight 3: departure time: 03:36 PM, duration: 16 hr 6 min, number of stops: 0, price: $900
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct. Which flight is the best option?
> - Flight 1: departure time: 04:00 PM, duration: 18 hr 3 min, number of stops: 2, price: $280
> - Flight 2: departure time: 10:48 AM, duration: 6 hr 21 min, number of stops: 1, price: $370
> - Flight 3: departure time: 06:48 PM, duration: 10 hr 5 min, number of stops: 1, price: $810
>
> **Model:** The best option is Flight 2.
>
> **User:** Your option Flight 2 is correct.
Table 11: Example of a first-round interaction between the user and the LLM on the hotel recommendation task.
> **User:** Help me select the best hotels for my trips. I have specific preferences for what I like and dislike in a hotel, and these preferences remain the same. You need to figure out my preferences and select the best hotels for me. Use your best judgment if you are unsure. Do not say you need more information. Which hotel is the best option?
> - Hotel 1: distance to downtown: 4 miles, price: $550, rating: 3 stars, amenities: free parking and free breakfast
> - Hotel 2: distance to downtown: 3 miles, price: $820, rating: 2 stars, amenities: free parking, free breakfast, and pool
> - Hotel 3: distance to downtown: 2.3 miles, price: $370, rating: 1 stars, amenities: free parking
>
> **Model:** The best option is Hotel 3.
>
> **User:** Your option Hotel 3 is incorrect. I prefer Hotel 2.
Table 12: Example of a first-round interaction between the user and the LLM on the web shopping task.
> **User:** Help me select the best product. I have specific preferences for what I like and dislike in a product, and these preferences remain the same. You need to figure out my preferences and select the best products for me. Use your best judgment if you are unsure. Do not say you need more information. Which product is the best option?
>
> Product 1: Title: Chic D Independence Day Table Runner 72 Inches Long, Gnome Cotton Linen Spring Table Cloth Runners for Wedding Party Dinning Coffee Holiday, Farmhouse Style, USA Flag. Description:
> - 14x72inch Dining Table Runner Size: 36x183cm, fits round, square or rectangular tables that seat 4 to 8 people.
> - High Quality Cotton Linen Table Runner: 30%cotton, 70%burlap. triangle hem, wrinkle-free, no fade. easy to maintain and reusable.
> - Our elegant table runner is perfect for holidays, catering, birthday party, dinning, brunches, potlucks, BBQ's, buffets, garden, bridal party etc
> - Decorative Table Runner can be used as decor on kitchen tables; dining tables; entry tables; desks & more
> - Custom table runner, u also can personalized your tabletop decoration
>
> Color: black white. Size: 13x108inch.
>
> Product 2: Title: Ambesonne Orange Mandala Coffee Table, Pastel Colored Flourishes and Dark Toned Details Spring Bloom, Acrylic Glass Center Table with Wooden Frame for Offices Dorms, Small, Peach Dark Brown Cream. Description:
> - Measurements - 24" Long x 18" Wide x 15" High coffee table with patterns that will add vitality to your living room.
> - Made from - High Quality Beech Wooden Frame and Acrylic Glass Table Top. Durable and Elegant!
> - Features- Light weight and easy to move around. You can place anywhere and enjoy its beauty!
> - Easy To Use- It is easy to assembly with numbered parts and clear instructions.
> - Versatile- Small coffee table, to keep your necessities within reach. Practical item for your living room!
>
> Color: blue purple. Size: large.
>
> Product 3: Title: White Round Dining Table and 4 Chairs,Mid-Century Modern Coffee Table Round Kitchen Table Small Office Table for Office Kitchen Dining Room (Round Table+4 Pale Grey Chairs). Description:
> - Dining Table and Chair Size:The dining table size is 35.4*35.4*29.5 inch, dining chairs is 36*17.5 inch.Our table and chairs suitable for 4-6 people. This glass table can also be used as a conference table, coffee table
> - Glass Dining Table and Chair Material:The table top is made of clear glass and equipped with solid wood metal legs, with a modern design style and elegant appearance.The Chair is made of velvet and solid wood metal cushions, the dining table and chair is an essential furniture
>
> Size: round table with wood legs.
>
> **Model:** The best option is Product 3.
>
> **User:** Your option Product 3 is incorrect. I prefer Product 2.