## A Tutorial on Thompson Sampling
Daniel J. Russo 1 , Benjamin Van Roy 2 , Abbas Kazerouni 2 , Ian Osband 3 and Zheng Wen 4
1 Columbia University
2 Stanford University
3 Google DeepMind
4 Adobe Research
## ABSTRACT
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
In memory of Arthur F. Veinott, Jr.
## 1 Introduction
The multi-armed bandit problem has been the subject of decades of intense study in statistics, operations research, electrical engineering, computer science, and economics. A 'one-armed bandit' is a somewhat antiquated term for a slot machine, which tends to 'rob' players of their money. The colorful name for our problem comes from a motivating story in which a gambler enters a casino and sits down at a slot machine with multiple levers, or arms, that can be pulled. When pulled, an arm produces a random payout drawn independently of the past. Because the distribution of payouts corresponding to each arm is not listed, the player can learn it only by experimenting. As the gambler learns about the arms' payouts, she faces a dilemma: in the immediate future she expects to earn more by exploiting arms that yielded high payouts in the past, but by continuing to explore alternative arms she may learn how to earn higher payouts in the future. Can she develop a sequential strategy for pulling arms that balances this tradeoff and maximizes the cumulative payout earned? The following Bernoulli bandit problem is a canonical example.
Example 1.1. ( Bernoulli Bandit ) Suppose there are K actions, and when played, any action yields either a success or a failure. Action
k ∈ { 1 , ..., K } produces a success with probability θ k ∈ [0 , 1]. The success probabilities ( θ 1 , .., θ K ) are unknown to the agent, but are fixed over time, and therefore can be learned by experimentation. The objective, roughly speaking, is to maximize the cumulative number of successes over T periods, where T is relatively large compared to the number of arms K .
The 'arms' in this problem might represent different banner ads that can be displayed on a website. Users arriving at the site are shown versions of the website with different banner ads. A success is associated either with a click on the ad, or with a conversion (a sale of the item being advertised). The parameters θ k represent either the click-throughrate or conversion-rate among the population of users who frequent the site. The website hopes to balance exploration and exploitation in order to maximize the total number of successes.
A naive approach to this problem involves allocating some fixed fraction of time periods to exploration and in each such period sampling an arm uniformly at random, while aiming to select successful actions in other time periods. We will observe that such an approach can be quite wasteful even for the simple Bernoulli bandit problem described above and can fail completely for more complicated problems.
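To make the setting concrete, here is a minimal simulator for the bandit of Example 1.1 (the class name, parameter values, and seed are ours, chosen for illustration, not taken from the text):

```python
import random

class BernoulliBandit:
    """Simulates the K-armed Bernoulli bandit of Example 1.1.

    The success probabilities theta are hidden from the agent;
    pulling arm k returns 1 with probability theta[k], else 0.
    """

    def __init__(self, theta, seed=0):
        self.theta = theta              # true success probabilities (unknown to the agent)
        self.rng = random.Random(seed)  # seeded for reproducibility

    def pull(self, k):
        return 1 if self.rng.random() < self.theta[k] else 0

# A short run: the agent only ever sees the 0/1 rewards, never theta.
bandit = BernoulliBandit([0.9, 0.8, 0.7])
rewards = [bandit.pull(0) for _ in range(1000)]
print(sum(rewards))  # empirical success count, close to 1000 * 0.9
```

Any exploration strategy must work with only these observed 0/1 outcomes.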
Problems like the Bernoulli bandit described above have been studied in the decision sciences since the second world war, as they crystallize the fundamental trade-off between exploration and exploitation in sequential decision making. But the information revolution has created significant new opportunities and challenges, which have spurred a particularly intense interest in this problem in recent years. To understand this, let us contrast the Internet advertising example given above with the problem of choosing a banner ad to display on a highway. A physical banner ad might be changed only once every few months, and once posted will be seen by every individual who drives on the road. There is value to experimentation, but data is limited, and the cost of trying a potentially ineffective ad is enormous. Online, a different banner ad can be shown to each individual out of a large pool of users, and data from each such interaction is stored. Small-scale experiments are now a core tool at most leading Internet companies.
Our interest in this problem is motivated by this broad phenomenon. Machine learning is increasingly used to make rapid data-driven decisions. While standard algorithms in supervised machine learning learn passively from historical data, these systems often drive the generation of their own training data through interacting with users. An online recommendation system, for example, uses historical data to optimize current recommendations, but the outcomes of these recommendations are then fed back into the system and used to improve future recommendations. As a result, there is enormous potential benefit in the design of algorithms that not only learn from past data, but also explore systematically to generate useful data that improves future performance. There are significant challenges in extending algorithms designed to address Example 1.1 to treat more realistic and complicated decision problems. To understand some of these challenges, consider the problem of learning by experimentation to solve a shortest path problem.
Example 1.2. (Online Shortest Path) An agent commutes from home to work every morning. She would like to commute along the path that requires the least average travel time, but she is uncertain of the travel time along different routes. How can she learn efficiently and minimize the total travel time over a large number of trips?
Figure 1.1: Shortest path problem.
<details>
<summary>Image 1 Details</summary>

A directed graph with vertices numbered 1 through 12. Vertex 1 is the source and vertex 12 the destination, and each edge (i, j) is labeled with its mean travel time θ<sub>i,j</sub>. Several branching paths lead from vertex 1 to vertex 12 (for example, 1 → 2 → 7 → 10 → 12), so the agent must choose among many routes.
</details>
We can formalize this as a shortest path problem on a graph G = ( V, E ) with vertices V = { 1 , ..., N } and edges E . An example is illustrated in Figure 1.1. Vertex 1 is the source (home) and vertex N is the destination (work). Each vertex can be thought of as an intersection, and for two vertices i, j ∈ V , an edge ( i, j ) ∈ E is present if there is a direct road connecting the two intersections. Suppose that traveling along an edge e ∈ E requires time θ e on average. If these parameters were known, the agent would select a path ( e 1 , .., e n ), consisting of a sequence of adjacent edges connecting vertices 1 and N , such that the expected total time θ e 1 + ... + θ e n is minimized. Instead, she chooses paths in a sequence of periods. In period t , the realized time y t,e to traverse edge e is drawn independently from a distribution with mean θ e . The agent sequentially chooses a path x t , observes the realized travel time ( y t,e ) e ∈ x t along each edge in the path, and incurs cost c t = ∑ e ∈ x t y t,e equal to the total travel time. By exploring intelligently, she hopes to minimize cumulative travel time ∑ T t =1 c t over a large number of periods T .
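If the mean travel times were known, the benchmark path could be computed directly. The sketch below does so with a simple dynamic program over a small illustrative graph (the edges and θ values here are invented for this example, not taken from Figure 1.1):

```python
# Expected-shortest-path computation when the mean travel times theta_e
# are known -- the benchmark the agent in Example 1.2 is trying to learn.
theta = {
    (1, 2): 3.0, (1, 3): 1.0,
    (2, 4): 1.0, (3, 4): 4.0,
    (2, 5): 6.0, (4, 5): 1.5,
}

def shortest_expected_path(theta, source, dest):
    """Dynamic program over vertices in increasing label order
    (valid here because every edge goes from a lower to a higher label)."""
    best = {source: (0.0, [source])}   # vertex -> (cost so far, path)
    vertices = sorted({v for edge in theta for v in edge})
    for v in vertices:
        if v not in best:
            continue
        cost_v, path_v = best[v]
        for (i, j), t in theta.items():
            if i == v and (j not in best or cost_v + t < best[j][0]):
                best[j] = (cost_v + t, path_v + [j])
    return best[dest]

cost, path = shortest_expected_path(theta, 1, 5)
print(cost, path)  # 5.5 [1, 2, 4, 5]
```

The learning problem is hard precisely because the agent cannot run this computation: the θ values must be inferred from noisy travel-time observations.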
This problem is conceptually similar to the Bernoulli bandit in Example 1.1, but here the number of actions is the number of paths in the graph, which generally scales exponentially in the number of edges. This raises substantial challenges. For moderate-sized graphs, trying each possible path would require a prohibitive number of samples, and algorithms that require enumerating and searching through the set of all paths to reach a decision will be computationally intractable. An efficient approach therefore needs to leverage the statistical and computational structure of the problem.
In this model, the agent observes the travel time along each edge traversed in a given period. Other feedback models are also natural: the agent might start a timer as she leaves home and check it once she arrives, effectively only tracking the total travel time of the chosen path. This is closer to the Bernoulli bandit model, where only the realized reward (or cost) of the chosen arm was observed. We have also taken the random edge-delays y t,e to be independent, conditioned on θ e . A more realistic model might treat these as correlated random variables, reflecting that neighboring roads are likely to be congested at the same time. Rather than design a specialized algorithm for each possible statistical
model, we seek a general approach to exploration that accommodates flexible modeling and works for a broad array of problems. We will see that Thompson sampling accommodates such flexible modeling, and offers an elegant and efficient approach to exploration in a wide range of structured decision problems, including the shortest path problem described here.
Thompson sampling - also known as posterior sampling and probability matching - was first proposed in 1933 (Thompson, 1933; Thompson, 1935) for allocating experimental effort in two-armed bandit problems arising in clinical trials. The algorithm was largely ignored in the academic literature until recently, although it was independently rediscovered several times in the interim (Wyatt, 1997; Strens, 2000) as an effective heuristic. Now, more than eight decades after it was introduced, Thompson sampling has seen a surge of interest among industry practitioners and academics. This was spurred partly by two influential articles that displayed the algorithm's strong empirical performance (Chapelle and Li, 2011; Scott, 2010). In the subsequent five years, the literature on Thompson sampling has grown rapidly. Adaptations of Thompson sampling have now been successfully applied in a wide variety of domains, including revenue management (Ferreira et al. , 2015), marketing (Schwartz et al. , 2017), web site optimization (Hill et al. , 2017), Monte Carlo tree search (Bai et al. , 2013), A/B testing (Graepel et al. , 2010), Internet advertising (Graepel et al. , 2010; Agarwal, 2013; Agarwal et al. , 2014), recommendation systems (Kawale et al. , 2015), hyperparameter tuning (Kandasamy et al. , 2018), and arcade games (Osband et al. , 2016a); and have been used at several companies, including Adobe, Amazon (Hill et al. , 2017), Facebook, Google (Scott, 2010; Scott, 2015), LinkedIn (Agarwal, 2013; Agarwal et al. , 2014), Microsoft (Graepel et al. , 2010), Netflix, and Twitter.
The objective of this tutorial is to explain when, why, and how to apply Thompson sampling. A range of examples are used to demonstrate how the algorithm can be used to solve a variety of problems and provide clear insight into why it works and when it offers substantial benefit over naive alternatives. The tutorial also provides guidance on approximations to Thompson sampling that can simplify computation
as well as practical considerations like prior distribution specification, safety constraints, and nonstationarity. Accompanying this tutorial, we also release a Python package 1 that reproduces all experiments and figures presented. This resource is valuable not only for reproducible research, but also as a reference implementation that may help practitioners build intuition for how to practically implement some of the ideas and algorithms we discuss in this tutorial. A concluding section discusses theoretical results that aim to develop an understanding of why Thompson sampling works, highlights settings where Thompson sampling performs poorly, and discusses alternative approaches studied in recent literature. As a baseline and backdrop for our discussion of Thompson sampling, we begin with an alternative approach that does not actively explore.
1 Python code and documentation are available at https://github.com/iosband/ts_tutorial.
## 2 Greedy Decisions
Greedy algorithms serve as perhaps the simplest and most common approach to online decision problems. The following two steps are taken to generate each action: (1) estimate a model from historical data and (2) select the action that is optimal for the estimated model, breaking ties in an arbitrary manner. Such an algorithm is greedy in the sense that an action is chosen solely to maximize immediate reward. Figure 2.1 illustrates such a scheme. At each time t , a supervised learning algorithm fits a model to historical data pairs H t -1 = (( x 1 , y 1 ) , . . . , ( x t -1 , y t -1 )), generating an estimate ˆ θ of model parameters. The resulting model can then be used to predict the reward r t = r ( y t ) from applying action x t . Here, y t is an observed outcome, while r is a known function that represents the agent's preferences. Given estimated model parameters ˆ θ , an optimization algorithm selects the action x t that maximizes expected reward, assuming that θ = ˆ θ . This action is then applied to the exogenous system and an outcome y t is observed.
A shortcoming of the greedy approach, which can severely curtail performance, is that it does not actively explore. To understand this issue, it is helpful to focus on the Bernoulli bandit setting of Example 1.1. In that context, the observations are rewards, so r t = r ( y t ) = y t .
Figure 2.1: Online decision algorithm.
<details>
<summary>Image 2 Details</summary>

A block diagram of the scheme described in the text: an "online decision algorithm" block, containing "supervised learning" and "optimizer" components, interacts with a "system" block. The supervised learning component fits an estimate θ̂ to past observations and passes it to the optimizer, the optimizer applies an action x<sub>t</sub> to the system, and the system returns an outcome y<sub>t</sub> with associated reward r<sub>t</sub> = r(y<sub>t</sub>).
</details>
At each time t , a greedy algorithm would generate an estimate ˆ θ k of the mean reward for each k th action, and select the action that attains the maximum among these estimates.
Suppose there are three actions with mean rewards θ ∈ R 3 . In particular, each time an action k is selected, a reward of 1 is generated with probability θ k . Otherwise, a reward of 0 is generated. The mean rewards are not known to the agent. Instead, the agent's beliefs in any given time period about these mean rewards can be expressed in terms of posterior distributions. Suppose that, conditioned on the observed history H t -1 , posterior distributions are represented by the probability density functions plotted in Figure 2.2. These distributions represent beliefs after the agent tries actions 1 and 2 one thousand times each, action 3 three times, receives cumulative rewards of 600, 400, and 1, respectively, and synthesizes these observations with uniform prior distributions over mean rewards of each action. They indicate that the agent is confident that mean rewards for actions 1 and 2 are close to their expectations of approximately 0 . 6 and 0 . 4. On the other hand, the agent is highly uncertain about the mean reward of action 3, though its expectation is 0 . 4.
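These beliefs follow from the conjugate beta-Bernoulli update developed in the next section; assuming uniform priors, the posterior parameters are simple counts, as this sketch of the arithmetic shows:

```python
# Posterior parameters implied by the history described above, starting
# from uniform Beta(1, 1) priors: the first parameter grows with observed
# successes, the second with observed failures.
trials    = [1000, 1000, 3]
successes = [600, 400, 1]

posteriors = [(1 + s, 1 + (n - s)) for n, s in zip(trials, successes)]
means = [a / (a + b) for a, b in posteriors]
print(posteriors)                      # [(601, 401), (401, 601), (2, 3)]
print([round(m, 3) for m in means])    # [0.6, 0.4, 0.4]
```

Actions 1 and 2 have tight posteriors because their parameters sum to 1002, while action 3's posterior, with parameters summing to only 5, remains dispersed.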
The greedy algorithm would select action 1, since that offers the maximal expected mean reward. Since the uncertainty around this expected mean reward is small, observations are unlikely to change the expectation substantially, and therefore, action 1 is likely to be selected
Figure 2.2: Probability density functions over mean rewards.
<details>
<summary>Image 3 Details</summary>

A line chart of three probability density functions over the mean reward (x-axis from 0 to 1). The densities for actions 1 and 2 are sharply peaked near 0.6 and 0.4, respectively, reflecting the agent's confidence after one thousand trials of each, while the density for action 3 is broad, reflecting the agent's uncertainty after only three trials.
</details>
ad infinitum. It seems reasonable to avoid action 2, since it is extremely unlikely that θ 2 > θ 1 . On the other hand, if the agent plans to operate over many time periods, it should try action 3. This is because there is some chance that θ 3 > θ 1 , and if this turns out to be the case, the agent will benefit from learning that and applying action 3. To learn whether θ 3 > θ 1 , the agent needs to try action 3, but the greedy algorithm is unlikely to ever do so. The algorithm fails to account for uncertainty in the mean reward of action 3, which should entice the agent to explore and learn about that action.
Dithering is a common approach to exploration that operates through randomly perturbing actions that would be selected by a greedy algorithm. One version of dithering, called ε-greedy exploration, applies the greedy action with probability 1 - ε and otherwise selects an action uniformly at random. Though this form of exploration can improve behavior relative to a purely greedy approach, it wastes resources by failing to 'write off' actions regardless of how unlikely they are to be optimal. To understand why, consider again the posterior distributions of Figure 2.2. Action 2 has almost no chance of being optimal, and therefore does not deserve experimental trials, while the uncertainty surrounding action 3 warrants exploration. However, ε-greedy exploration would allocate an equal number of experimental trials to each action. Though only half of the exploratory actions are wasted in this example, the issue is exacerbated as the number of possible actions increases. Thompson sampling, introduced more than eight decades ago (Thompson, 1933), provides an alternative to dithering that more intelligently allocates exploration effort.
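A minimal sketch of the dithering rule just described (the function name and interface are ours):

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Dithering: exploit the current best estimate with probability
    1 - epsilon, otherwise pick an arm uniformly at random.
    Note the uniform draw ignores how unpromising an arm is."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda k: estimates[k])

# With the Figure 2.2 estimates, exploration steps land on action 2
# exactly as often as on action 3, even though probing action 2 is futile.
print(epsilon_greedy([0.6, 0.4, 0.4], epsilon=0.1))
```

The weakness is visible in the uniform branch: the exploration distribution is independent of the posterior uncertainty about each arm.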
## 3 Thompson Sampling for the Bernoulli Bandit
To digest how Thompson sampling (TS) works, it is helpful to begin with a simple context that builds on the Bernoulli bandit of Example 1.1 and incorporates a Bayesian model to represent uncertainty.
Example 3.1. (Beta-Bernoulli Bandit) Recall the Bernoulli bandit of Example 1.1. There are K actions. When played, an action k produces a reward of one with probability θ k and a reward of zero with probability 1 -θ k . Each θ k can be interpreted as an action's success probability or mean reward. The mean rewards θ = ( θ 1 , ..., θ K ) are unknown, but fixed over time. In the first period, an action x 1 is applied, and a reward r 1 ∈ { 0 , 1 } is generated with success probability P ( r 1 = 1 | x 1 , θ ) = θ x 1 . After observing r 1 , the agent applies another action x 2 , observes a reward r 2 , and this process continues.
Let the agent begin with an independent prior belief over each θ k . Take these priors to be beta-distributed with parameters α = ( α 1 , . . . , α K ) and β = ( β 1 , . . . , β K ). In particular, for each action k , the prior probability density function of θ k is
$$p ( \theta _ { k } ) = \frac { \Gamma ( \alpha _ { k } + \beta _ { k } ) } { \Gamma ( \alpha _ { k } ) \Gamma ( \beta _ { k } ) } \theta _ { k } ^ { \alpha _ { k } - 1 } ( 1 - \theta _ { k } ) ^ { \beta _ { k } - 1 } ,$$
where Γ denotes the gamma function. As observations are gathered, the distribution is updated according to Bayes' rule. It is particularly convenient to work with beta distributions because of their conjugacy properties. In particular, each action's posterior distribution is also beta with parameters that can be updated according to a simple rule:
$$( \alpha _ { k } , \beta _ { k } ) \leftarrow \begin{cases} ( \alpha _ { k } , \beta _ { k } ) & \text{if } x _ { t } \neq k \\ ( \alpha _ { k } , \beta _ { k } ) + ( r _ { t } , 1 - r _ { t } ) & \text{if } x _ { t } = k . \end{cases}$$
Note that for the special case of α k = β k = 1, the prior p ( θ k ) is uniform over [0 , 1]. Note that only the parameters of a selected action are updated. The parameters ( α k , β k ) are sometimes called pseudocounts, since α k or β k increases by one with each observed success or failure, respectively. A beta distribution with parameters ( α k , β k ) has mean α k / ( α k + β k ), and the distribution becomes more concentrated as α k + β k grows. Figure 2.2 plots probability density functions of beta distributions with parameters ( α 1 , β 1 ) = (601 , 401), ( α 2 , β 2 ) = (401 , 601), and ( α 3 , β 3 ) = (2 , 3).
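The update rule above can be written directly in code (a minimal illustration; the function name and the small example history are of our choosing):

```python
def update(alpha, beta, k, r):
    """Conjugate update for the beta-Bernoulli bandit: only the
    pseudo-counts of the selected action k change, by (r, 1 - r)."""
    alpha, beta = list(alpha), list(beta)
    alpha[k] += r
    beta[k] += 1 - r
    return alpha, beta

# Uniform Beta(1, 1) priors over three actions.
alpha, beta = [1, 1, 1], [1, 1, 1]
alpha, beta = update(alpha, beta, k=0, r=1)  # action 0 succeeds
alpha, beta = update(alpha, beta, k=0, r=0)  # action 0 fails
print(alpha, beta)  # [2, 1, 1] [2, 1, 1]
```

After one success and one failure, action 0's posterior is Beta(2, 2) with mean 1/2, while the unselected actions retain their uniform priors.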
Algorithm 3.1 presents a greedy algorithm for the beta-Bernoulli bandit. In each time period t , the algorithm generates an estimate ˆ θ k = α k / ( α k + β k ), equal to its current expectation of the success probability θ k . The action x t with the largest estimate ˆ θ k is then applied, after which a reward r t is observed and the distribution parameters α x t and β x t are updated.
TS, specialized to the case of a beta-Bernoulli bandit, proceeds similarly, as presented in Algorithm 3.2. The only difference is that the success probability estimate ˆ θ k is randomly sampled from the posterior distribution, which is a beta distribution with parameters α k and β k , rather than taken to be the expectation α k / ( α k + β k ). To avoid a common misconception, it is worth emphasizing that TS does not sample ˆ θ k from the posterior distribution of the binary value y t that would be observed if action k were selected. In particular, ˆ θ k represents a statistically plausible success probability rather than a statistically plausible observation.
## Algorithm 3.1 BernGreedy( K,α,β )

```
Algorithm 3.1 BernGreedy(K, α, β)
1: for t = 1, 2, ... do
2:   # estimate model:
3:   for k = 1, ..., K do
4:     θ̂_k ← α_k / (α_k + β_k)
5:   end for
6:
7:   # select and apply action:
8:   x_t ← argmax_k θ̂_k
9:   Apply x_t and observe r_t
10:
11:  # update distribution:
12:  (α_{x_t}, β_{x_t}) ← (α_{x_t} + r_t, β_{x_t} + 1 - r_t)
13: end for
```
## Algorithm 3.2 BernTS( K,α,β )

```
Algorithm 3.2 BernTS(K, α, β)
1: for t = 1, 2, ... do
2:   # sample model:
3:   for k = 1, ..., K do
4:     Sample θ̂_k ~ beta(α_k, β_k)
5:   end for
6:
7:   # select and apply action:
8:   x_t ← argmax_k θ̂_k
9:   Apply x_t and observe r_t
10:
11:  # update distribution:
12:  (α_{x_t}, β_{x_t}) ← (α_{x_t} + r_t, β_{x_t} + 1 - r_t)
13: end for
```
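For concreteness, here is a runnable Python sketch of the TS scheme just described, paired with a simulated Bernoulli environment (the function signature, true θ values, horizon, and seed are our illustrative choices):

```python
import random

def bern_ts(K, alpha, beta, pull, T, rng=random):
    """Thompson sampling for the beta-Bernoulli bandit.

    pull(k) plays arm k in the environment and returns a 0/1 reward.
    """
    alpha, beta = list(alpha), list(beta)
    for _ in range(T):
        # sample model: one draw per arm from its beta posterior
        theta_hat = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        # select and apply action
        x = max(range(K), key=lambda k: theta_hat[k])
        r = pull(x)
        # update distribution of the selected arm only
        alpha[x] += r
        beta[x] += 1 - r
    return alpha, beta

rng = random.Random(0)
theta = [0.9, 0.8, 0.7]  # hidden truth, used only by the simulator
alpha, beta = bern_ts(3, [1, 1, 1], [1, 1, 1],
                      pull=lambda k: int(rng.random() < theta[k]),
                      T=2000, rng=rng)
pulls = [a + b - 2 for a, b in zip(alpha, beta)]
print(pulls)  # pulls typically concentrate on arm 0, the best arm
```

Because each arm's pull count can be recovered from its pseudo-counts, the final (α, β) pairs summarize the whole history of play.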
To understand how TS improves on greedy actions with or without dithering, recall the three-armed Bernoulli bandit with posterior distributions illustrated in Figure 2.2. In this context, a greedy action would forgo the potentially valuable opportunity to learn about action 3. With dithering, equal chances would be assigned to probing actions 2 and 3, though probing action 2 is virtually futile since it is extremely unlikely to be optimal. TS, on the other hand, would sample actions 1, 2, or 3, with probabilities approximately equal to 0 . 82, 0, and 0 . 18, respectively. In each case, this is the probability that the random estimate drawn for the action exceeds those drawn for other actions. Since these estimates are drawn from posterior distributions, each of these probabilities is also equal to the probability that the corresponding action is optimal, conditioned on observed history. As such, TS explores to resolve uncertainty where there is a chance that resolution will help the agent identify the optimal action, but avoids probing where feedback would not be helpful.
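The action probabilities quoted above can be checked by Monte Carlo: repeatedly draw one θ̂ per action from the Figure 2.2 posteriors and record which action attains the maximum draw (a sketch, with sample size and seed of our own choosing):

```python
import random

# Monte Carlo estimate of the TS action-selection probabilities under the
# Figure 2.2 posteriors: Beta(601, 401), Beta(401, 601), and Beta(2, 3).
rng = random.Random(0)
params = [(601, 401), (401, 601), (2, 3)]
counts = [0, 0, 0]
n = 100_000
for _ in range(n):
    draws = [rng.betavariate(a, b) for a, b in params]
    counts[draws.index(max(draws))] += 1

probs = [c / n for c in counts]
print([round(p, 2) for p in probs])  # approximately [0.82, 0.0, 0.18]
```

Action 2's draws are concentrated near 0.4 and essentially never exceed action 1's draws near 0.6, so its selection probability is negligible, while action 3's dispersed posterior puts meaningful mass above 0.6.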
It is illuminating to compare simulated behavior of TS to that of a greedy algorithm. Consider a three-armed beta-Bernoulli bandit with mean rewards θ_1 = 0.9, θ_2 = 0.8, and θ_3 = 0.7. Let the prior distribution over each mean reward be uniform. Figure 3.1 plots results based on ten thousand independent simulations of each algorithm. Each simulation is over one thousand time periods. In each simulation, actions are randomly rank-ordered for the purpose of tie-breaking so that the greedy algorithm is not biased toward selecting any particular action. Each data point represents the fraction of simulations for which a particular action is selected at a particular time.
Figure 3.1: Probability that the greedy algorithm and Thompson sampling select each action.
From the plots, we see that the greedy algorithm does not always converge on action 1, which is the optimal action. This is because the algorithm can get stuck, repeatedly applying a poor action. For example, suppose the algorithm applies action 3 over the first couple of time periods and receives a reward of 1 on both occasions. The algorithm would then continue to select action 3, since the expected mean reward of either alternative remains at 0.5. With repeated selection of action 3, the expected mean reward converges to the true value of 0.7, which reinforces the agent's commitment to action 3. TS, on the other hand, learns to select action 1 within the thousand periods. This is evident from the fact that, in an overwhelmingly large fraction of simulations, TS selects action 1 in the final period.
The performance of online decision algorithms is often studied and compared through plots of regret. The per-period regret of an algorithm over a time period t is the difference between the mean reward of an optimal action and the mean reward of the action selected by the algorithm. For the Bernoulli bandit problem, we can write this as regret_t(θ) = max_k θ_k − θ_{x_t}. Figure 3.2a plots per-period regret realized by the greedy algorithm and TS, again averaged over ten thousand simulations. The average per-period regret of TS vanishes as time progresses. That is not the case for the greedy algorithm.
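Given a trace of selected actions from a simulation, per-period regret is a one-line computation; the helper below, with names of our own choosing, applies the Bernoulli-bandit regret formula directly.

```python
def per_period_regret(theta, actions):
    """For each selected action x_t, compute max_k theta_k - theta_{x_t}."""
    best = max(theta)
    return [best - theta[x] for x in actions]
```

For example, `per_period_regret([0.9, 0.8, 0.7], [0, 2, 1])` yields regrets of 0.0, 0.2, and 0.1 (up to floating-point rounding); averaging such traces across many simulations produces plots like Figure 3.2.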
Comparing algorithms with fixed mean rewards raises questions about the extent to which the results depend on the particular choice of θ. As such, it is often useful to also examine regret averaged over plausible values of θ. A natural approach to this involves sampling many instances of θ from the prior distributions and generating an independent simulation for each. Figure 3.2b plots averages over ten thousand such simulations, with each action reward sampled independently from a uniform prior for each simulation. Qualitative features of these plots are similar to those we inferred from Figure 3.2a, though regret in Figure 3.2a is generally smaller over early time periods and larger over later time periods, relative to Figure 3.2b. The smaller regret in early time periods is due to the fact that with θ = (0.9, 0.8, 0.7), mean rewards are closer than for a typical randomly sampled θ, and therefore the regret of randomly selected actions is smaller. The fact that per-period regret of TS is larger in Figure 3.2a than Figure 3.2b over later time periods, like period 1000, is also a consequence of proximity among rewards with θ = (0.9, 0.8, 0.7). In this case, the difference is due to the fact that it takes longer to differentiate actions than it would for a typical randomly sampled θ.
Figure 3.2: Regret from applying greedy and Thompson sampling algorithms to the three-armed Bernoulli bandit.
## 4 General Thompson Sampling
TS can be applied fruitfully to a broad array of online decision problems beyond the Bernoulli bandit, and we now consider a more general setting. Suppose the agent applies a sequence of actions x 1 , x 2 , x 3 , . . . to a system, selecting each from a set X . This action set could be finite, as in the case of the Bernoulli bandit, or infinite. After applying action x t , the agent observes an outcome y t , which the system randomly generates according to a conditional probability measure q θ ( ·| x t ). The agent enjoys a reward r t = r ( y t ), where r is a known function. The agent is initially uncertain about the value of θ and represents his uncertainty using a prior distribution p .
Algorithms 4.1 and 4.2 present greedy and TS approaches in an abstract form that accommodates this very general problem. The two differ in the way they generate model parameters θ̂. The greedy algorithm takes θ̂ to be the expectation of θ with respect to the distribution p, while TS draws a random sample from p. Both algorithms then apply actions that maximize expected reward under their respective models. Note that, if there is a finite set of possible observations y_t, this expectation is given by
$$\mathbb { E } _ { q _ { \hat { \theta } } } [ r ( y _ { t } ) | x _ { t } = x ] = \sum _ { o } q _ { \hat { \theta } } ( o | x ) r ( o ) .$$
The distribution p is updated by conditioning on the realized observation y_t. If θ is restricted to values from a finite set, this conditional distribution can be written by Bayes rule as
$$( 4 . 2 ) \quad \mathbb { P } _ { p , q } ( \theta = u | x _ { t } , y _ { t } ) = \frac { p ( u ) q _ { u } ( y _ { t } | x _ { t } ) } { \sum _ { v } p ( v ) q _ { v } ( y _ { t } | x _ { t } ) } .$$
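For θ restricted to a finite set, update (4.2) amounts to reweighting a dictionary of probabilities by the likelihood and normalizing. The sketch below is our own; it assumes the likelihood is supplied as a function `q(u, y, x)` returning q_u(y | x).

```python
def update_posterior(p, q, x, y):
    """Bayes-rule update (4.2): p(u) <- p(u) q_u(y|x) / sum_v p(v) q_v(y|x).
    p: dict mapping each parameter value u to its prior probability."""
    weights = {u: p[u] * q(u, y, x) for u in p}
    z = sum(weights.values())
    return {u: w / z for u, w in weights.items()}
```

For instance, for a single Bernoulli arm with θ ∈ {0.2, 0.8} equally likely a priori, observing y = 1 shifts the posterior probabilities to (0.2, 0.8).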
## Algorithm 4.1 Greedy( X , p, q, r )
```
1: for t = 1, 2, ... do
2:   #estimate model:
3:   θ̂ ← E_p[θ]
4:
5:   #select and apply action:
6:   x_t ← argmax_{x∈X} E_{q_θ̂}[r(y_t) | x_t = x]
7:   Apply x_t and observe y_t
8:
9:   #update distribution:
10:  p ← P_{p,q}(θ ∈ · | x_t, y_t)
11: end for
```
## Algorithm 4.2 Thompson( X , p, q, r )
```
1: for t = 1, 2, ... do
2:   #sample model:
3:   Sample θ̂ ∼ p
4:
5:   #select and apply action:
6:   x_t ← argmax_{x∈X} E_{q_θ̂}[r(y_t) | x_t = x]
7:   Apply x_t and observe y_t
8:
9:   #update distribution:
10:  p ← P_{p,q}(θ ∈ · | x_t, y_t)
11: end for
```
The Bernoulli bandit with a beta prior serves as a special case of this more general formulation. In this special case, the set of actions is X = {1, ..., K} and only rewards are observed, so y_t = r_t. Observations and rewards are modeled by conditional probabilities q_θ(1 | k) = θ_k and q_θ(0 | k) = 1 − θ_k. The prior distribution is encoded by vectors α and β, with probability density function given by:
$$p ( \theta ) = \prod _ { k = 1 } ^ { K } \frac { \Gamma ( \alpha _ { k } + \beta _ { k } ) } { \Gamma ( \alpha _ { k } ) \Gamma ( \beta _ { k } ) } \theta _ { k } ^ { \alpha _ { k } - 1 } ( 1 - \theta _ { k } ) ^ { \beta _ { k } - 1 } ,$$
where Γ denotes the gamma function. In other words, under the prior distribution, components of θ are independent and beta-distributed, with parameters α and β .
For this problem, the greedy algorithm (Algorithm 4.1) and TS (Algorithm 4.2) begin each tth iteration with posterior parameters (α_k, β_k) for k ∈ {1, ..., K}. The greedy algorithm sets θ̂_k to the expected value E_p[θ_k] = α_k/(α_k + β_k), whereas TS randomly draws θ̂_k from a beta distribution with parameters (α_k, β_k). Each algorithm then selects the action x that maximizes E_{q_θ̂}[r(y_t) | x_t = x] = θ̂_x. After applying the selected action, a reward r_t = y_t is observed, and belief distribution parameters are updated according to
$$( \alpha , \beta ) \gets ( \alpha + r _ { t } 1 _ { x _ { t } } , \beta + ( 1 - r _ { t } ) 1 _ { x _ { t } } ) ,$$
where 1_{x_t} is a vector with component x_t equal to 1 and all other components equal to 0.
Algorithms 4.1 and 4.2 can also be applied to much more complex problems. As an example, let us consider a version of the shortest path problem presented in Example 1.2.
Example 4.1. (Independent Travel Times) Recall the shortest path problem of Example 1.2. The model is defined with respect to a directed graph G = ( V, E ), with vertices V = { 1 , . . . , N } , edges E , and mean travel times θ ∈ R N . Vertex 1 is the source and vertex N is the destination. An action is a sequence of distinct edges leading from source to destination. After applying action x t , for each traversed edge e ∈ x t , the agent observes a travel time y t,e that is independently sampled from a distribution with mean θ e . Further, the agent incurs a cost of ∑ e ∈ x t y t,e , which can be thought of as a reward r t = -∑ e ∈ x t y t,e .
Consider a prior for which each θ_e is independent and log-Gaussian-distributed with parameters µ_e and σ²_e. That is, ln(θ_e) ∼ N(µ_e, σ²_e) is Gaussian-distributed. Hence, E[θ_e] = e^{µ_e + σ²_e/2}. Further, take y_{t,e} | θ to be independent across edges e ∈ E and log-Gaussian-distributed with parameters ln(θ_e) − σ̃²/2 and σ̃², so that E[y_{t,e} | θ_e] = θ_e. Conjugacy properties accommodate a simple rule for updating the distribution of θ_e upon observation of y_{t,e}:
$$( 4 . 3 ) \quad ( \mu _ { e } , \sigma _ { e } ^ { 2 } ) \leftarrow \left ( \frac { \frac { 1 } { \sigma _ { e } ^ { 2 } } \mu _ { e } + \frac { 1 } { \tilde { \sigma } ^ { 2 } } \left ( \ln ( y _ { t , e } ) + \frac { \tilde { \sigma } ^ { 2 } } { 2 } \right ) } { \frac { 1 } { \sigma _ { e } ^ { 2 } } + \frac { 1 } { \tilde { \sigma } ^ { 2 } } } , \frac { 1 } { \frac { 1 } { \sigma _ { e } ^ { 2 } } + \frac { 1 } { \tilde { \sigma } ^ { 2 } } } \right ) .$$
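Update (4.3) can be sketched as a small function; `update_edge` and its argument names are our own, with `var_tilde` standing in for σ̃².

```python
import math

def update_edge(mu_e, var_e, y, var_tilde):
    """Conjugate update (4.3) for one edge's posterior over ln(theta_e).
    mu_e, var_e: current posterior mean and variance of ln(theta_e);
    y: observed travel time on the edge; var_tilde: observation noise."""
    precision = 1.0 / var_e + 1.0 / var_tilde
    mean = (mu_e / var_e + (math.log(y) + var_tilde / 2.0) / var_tilde) / precision
    return mean, 1.0 / precision
```

The update is a standard precision-weighted average: the posterior precision is the sum of the prior and observation precisions, and the mean weights ln(y_{t,e}) + σ̃²/2 against the prior mean accordingly.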
To motivate this formulation, consider an agent who commutes from home to work every morning. Suppose possible paths are represented by a graph G = (V, E). Suppose the agent knows the travel distance d_e associated with each edge e ∈ E but is uncertain about average travel times. It would be natural for her to construct a prior for which expectations are equal to travel distances. With the log-Gaussian prior, this can be accomplished by setting µ_e = ln(d_e) − σ²_e/2. Note that the parameters µ_e and σ²_e also express a degree of uncertainty; in particular, the prior variance of mean travel time along an edge is (e^{σ²_e} − 1)d²_e.
The greedy algorithm (Algorithm 4.1) and TS (Algorithm 4.2) can be applied to Example 4.1 in a computationally efficient manner. Each algorithm begins each tth iteration with posterior parameters (µ_e, σ²_e) for each e ∈ E. The greedy algorithm sets θ̂_e to the expected value E_p[θ_e] = e^{µ_e + σ²_e/2}, whereas TS randomly draws θ̂_e from a log-Gaussian distribution with parameters µ_e and σ²_e. Each algorithm then selects its action x to maximize E_{q_θ̂}[r(y_t) | x_t = x] = −∑_{e∈x_t} θ̂_e. This can be cast as a deterministic shortest path problem, which can be solved efficiently, for example, via Dijkstra's algorithm. After applying the selected action, an outcome y_t is observed, and belief distribution parameters (µ_e, σ²_e), for each e ∈ E, are updated according to (4.3).
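The sample-then-optimize step for Example 4.1 can be sketched as follows, using a standard heap-based Dijkstra over sampled edge costs. All function and variable names here are our own; `mu` and `var` are assumed to be dicts mapping each directed edge (u, v) to its posterior parameters.

```python
import heapq
import math
import random

def dijkstra(costs, source, dest):
    """Deterministic shortest path: costs maps edge (u, v) -> positive cost.
    Returns the edge sequence of a shortest source-to-dest path."""
    adj = {}
    for (u, v), c in costs.items():
        adj.setdefault(u, []).append((v, c))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dest:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # walk back from the destination to recover the edge sequence
    path, v = [], dest
    while v != source:
        path.append((prev[v], v))
        v = prev[v]
    return list(reversed(path))

def ts_select_path(mu, var, source, dest):
    """One 'sample model / select action' step of Algorithm 4.2 for Example 4.1."""
    # sample model: theta_hat_e ~ log-Gaussian(mu_e, var_e) for each edge
    theta_hat = {e: math.exp(random.gauss(mu[e], math.sqrt(var[e]))) for e in mu}
    # select action: the path minimizing total sampled travel time
    return dijkstra(theta_hat, source, dest)
```

After traversing the selected path, the agent would apply the conjugate update (4.3) to each traversed edge's (µ_e, σ²_e).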
Figure 4.1 presents results from applying greedy and TS algorithms to Example 4.1, with the graph taking the form of a binomial bridge, as shown in Figure 4.2, except with twenty rather than six stages, so there are 184,756 paths from source to destination. Prior parameters are set to µ_e = −1/2 and σ²_e = 1 so that E[θ_e] = 1, for each e ∈ E, and the conditional distribution parameter is σ̃² = 1. Each data point represents an average over ten thousand independent simulations.
The plots of regret demonstrate that the performance of TS converges quickly to optimal, while that is far from true for the greedy algorithm. We also plot results generated by ε-greedy exploration, varying ε. For each trip, with probability 1 − ε, this algorithm traverses a path produced by a greedy algorithm. Otherwise, the algorithm samples a path randomly. Though this form of exploration can be helpful, the plots demonstrate that learning progresses at a far slower pace than with TS. This is because ε-greedy exploration is not judicious in how it selects paths to explore. TS, on the other hand, orients exploration effort towards informative rather than entirely random paths.
Plots of cumulative travel time relative to optimal offer a sense of the fraction of driving time wasted due to lack of information. Each point plots an average of the ratio between the travel time incurred over some number of days and the minimal expected travel time given θ. With TS, this ratio converges to one at a respectable rate. The same cannot be said for ε-greedy approaches.
Figure 4.1: Performance of Thompson sampling and ε-greedy algorithms in the shortest path problem.
Figure 4.2: A binomial bridge with six stages.
Algorithm 4.2 can be applied to problems with complex information structures, and there is often substantial value to careful modeling of such structures. As an example, we consider a more complex variation of the binomial bridge example.
Example 4.2. (Correlated Travel Times) As with Example 4.1, let each θ e be independent and log-Gaussian-distributed with parameters µ e and σ 2 e . Let the observation distribution be characterized by
$$y _ { t , e } = \zeta _ { t , e } \eta _ { t } \nu _ { t , \ell ( e ) } \theta _ { e } ,$$
where each ζ_{t,e} represents an idiosyncratic factor associated with edge e, η_t represents a factor that is common to all edges, ℓ(e) indicates whether edge e resides in the lower half of the binomial bridge, and ν_{t,0} and ν_{t,1} represent factors that bear a common influence on edges in the upper and lower halves, respectively. We take each ζ_{t,e}, η_t, ν_{t,0}, and ν_{t,1} to be independent log-Gaussian-distributed with parameters −σ̃²/6 and σ̃²/3. The distributions of the shocks ζ_{t,e}, η_t, ν_{t,0}, and ν_{t,1} are known, and only the parameters θ_e corresponding to each individual edge must be learned through experimentation. Note that, given these parameters, the marginal distribution of y_{t,e} | θ is identical to that of Example 4.1, though the joint distribution over y_t | θ differs.
The common factors induce correlations among travel times in the binomial bridge: η t models the impact of random events that influence traffic conditions everywhere, like the day's weather, while ν t, 0 and ν t, 1 each reflect events that bear influence only on traffic conditions along edges in half of the binomial bridge. Though mean edge travel times are independent under the prior, correlated observations induce dependencies in posterior distributions.
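The observation model of Example 4.2 can be simulated directly. In the sketch below, `sample_travel_times`, `lower_half` (the set of edges with ℓ(e) = 1), and `var_tilde` (for σ̃²) are our own names. Each shock is log-Gaussian with parameters −σ̃²/6 and σ̃²/3, so the product of the three shocks multiplying θ_e has log-variance σ̃² and mean one, matching the marginal distribution of Example 4.1.

```python
import math
import random

def sample_travel_times(theta, lower_half, var_tilde):
    """Draw y_{t,e} = zeta_{t,e} * eta_t * nu_{t,l(e)} * theta_e for every edge,
    with each shock independently log-Gaussian(-var_tilde/6, var_tilde/3)."""
    def shock():
        return math.exp(random.gauss(-var_tilde / 6.0, math.sqrt(var_tilde / 3.0)))
    eta = shock()            # common to all edges (e.g. the day's weather)
    nu = (shock(), shock())  # upper-half and lower-half factors
    return {e: shock() * eta * nu[1 if e in lower_half else 0] * theta[e]
            for e in theta}
```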
Conjugacy properties again facilitate efficient updating of posterior parameters. Let φ, z t ∈ R N be defined by
$$\phi _ { e } = \ln ( \theta _ { e } ) \quad \text { and } \quad z _ { t , e } = \left \{ \begin{array} { l l } { \ln ( y _ { t , e } ) } & { \text { if } e \in x _ { t } } \\ { 0 } & { \text { otherwise. } } \end{array} \right.$$
Note that it is with some abuse of notation that we index vectors and matrices using edge indices. Define a | x t | × | x t | covariance matrix ˜ Σ with elements
$$\tilde { \Sigma } _ { e , e ^ { \prime } } = \left \{ \begin{array} { l l } { \tilde { \sigma } ^ { 2 } } & { \text { for } e = e ^ { \prime } } \\ { 2 \tilde { \sigma } ^ { 2 } / 3 } & { \text { for } e \neq e ^ { \prime } , \, \ell ( e ) = \ell ( e ^ { \prime } ) } \\ { \tilde { \sigma } ^ { 2 } / 3 } & { \text { otherwise, } } \end{array} \right.$$
for e, e′ ∈ x_t, and an N × N concentration matrix
$$\tilde { C } _ { e , e ^ { \prime } } = \left \{ \begin{array} { l l } { \tilde { \Sigma } _ { e , e ^ { \prime } } ^ { - 1 } } & { \text { if } e , e ^ { \prime } \in x _ { t } } \\ { 0 } & { \text { otherwise, } } \end{array} \right.$$
for e, e′ ∈ E. Then, the posterior distribution of φ is Gaussian with a mean vector µ and covariance matrix Σ that can be updated according to
$$( 4 . 4 ) \quad \left ( \mu , \Sigma \right ) \leftarrow \left ( \left ( \Sigma ^ { - 1 } + \tilde { C } \right ) ^ { - 1 } \left ( \Sigma ^ { - 1 } \mu + \tilde { C } z _ { t } \right ) , \left ( \Sigma ^ { - 1 } + \tilde { C } \right ) ^ { - 1 } \right ) .$$
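Update (4.4) is a pair of standard Gaussian conjugacy formulas; assuming NumPy is available, a direct transcription (with our own function and argument names) is:

```python
import numpy as np

def update_correlated(mu, Sigma, C_tilde, z):
    """Posterior update (4.4) for phi = ln(theta):
    Sigma <- (Sigma^-1 + C~)^-1 and mu <- Sigma_new (Sigma^-1 mu + C~ z)."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_new = np.linalg.inv(Sigma_inv + C_tilde)
    mu_new = Sigma_new @ (Sigma_inv @ mu + C_tilde @ z)
    return mu_new, Sigma_new
```

In practice one would solve linear systems rather than form explicit inverses, but the transcription above mirrors (4.4) term by term; in the one-dimensional case it reduces to the scalar precision-weighted update of (4.3).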
TS (Algorithm 4.2) can again be applied in a computationally efficient manner. Each tth iteration begins with posterior parameters µ ∈ R^N and Σ ∈ R^{N×N}. The sample θ̂ can be drawn by first sampling a vector φ̂ from a Gaussian distribution with mean µ and covariance matrix Σ, and then setting θ̂_e = e^{φ̂_e} for each e ∈ E. An action x is selected to maximize E_{q_θ̂}[r(y_t) | x_t = x] = −∑_{e∈x_t} θ̂_e, using Dijkstra's algorithm or an alternative. After applying the selected action, an outcome y_t is observed, and belief distribution parameters (µ, Σ) are updated according to (4.4).
Figure 4.3: Performance of two versions of Thompson sampling in the shortest path problem with correlated travel times.
Figure 4.3 plots results from applying TS to Example 4.2, again with the binomial bridge, $\mu_e = -\frac{1}{2}$, $\sigma_e^2 = 1$, and $\tilde{\sigma}^2 = 1$. Each data point represents an average over ten thousand independent simulations. Despite model differences, an agent can pretend that observations made in this new context are generated by the model described in Example 4.1. In particular, the agent could maintain an independent log-Gaussian posterior for each $\theta_e$, updating parameters $(\mu_e, \sigma_e^2)$ as though each $y_{t,e} \mid \theta$ is independently drawn from a log-Gaussian distribution. As a baseline for comparison, Figure 4.3 additionally plots results from application of this approach, which we will refer to here as *misspecified TS*. The comparison demonstrates substantial improvement that results from
accounting for interdependencies among edge travel times, as is done by what we refer to here as *coherent TS*. Note that we have assumed here that the agent must select a path before initiating each trip. In particular, while the agent may be able to reduce travel times in contexts with correlated delays by adjusting the path during the trip based on delays experienced so far, our model does not allow this behavior.
## 5 Approximations
Conjugacy properties in the Bernoulli bandit and shortest path examples that we have considered so far facilitated simple and computationally efficient Bayesian inference. Indeed, computational efficiency can be an important consideration when formulating a model. However, many practical contexts call for more complex models for which exact Bayesian inference is computationally intractable. Fortunately, there are reasonably efficient and accurate methods that can be used to approximately sample from posterior distributions.
In this section we discuss four approaches to approximate posterior sampling: Gibbs sampling, Langevin Monte Carlo, sampling from a Laplace approximation, and the bootstrap. Such methods are called for when dealing with problems that are not amenable to efficient Bayesian inference. As an example, we consider a variation of the online shortest path problem.
Example 5.1. (Binary Feedback) Consider Example 4.2, except with deterministic travel times and noisy binary observations. Let the graph represent a binomial bridge with $M$ stages. Let each $\theta_e$ be independent and gamma-distributed with $\mathbb{E}[\theta_e] = 1$, $\mathbb{E}[\theta_e^2] = 1.5$, and observations be generated according to
$$y_t \mid \theta \sim \begin{cases} 1 & \text{with probability } \frac{1}{1 + \exp\left(\sum_{e \in x_t} \theta_e - M\right)} \\ 0 & \text{otherwise.} \end{cases}$$
We take the reward to be the rating $r_t = y_t$. This information structure could be used to model, for example, an Internet route recommendation service. Each day, the system recommends a route $x_t$ and receives feedback $y_t$ from the driver, expressing whether the route was desirable. When the realized travel time $\sum_{e \in x_t} \theta_e$ falls short of the prior expectation $M$, the feedback tends to be positive, and vice versa.
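The feedback model above can be simulated in a few lines. The following is a minimal sketch (function name, edge labels, and parameter values are ours, assuming numpy):

```python
import numpy as np

def binary_feedback(theta, path, M, rng):
    """Sample the driver's binary rating y_t for a recommended path x_t.

    theta : dict mapping edge -> mean travel time
    path  : iterable of edges making up the route
    M     : number of stages (the prior-mean travel time of a path)
    """
    travel_time = sum(theta[e] for e in path)
    # Positive feedback is likelier when realized travel time falls
    # short of the prior expectation M, and vice versa.
    p_positive = 1.0 / (1.0 + np.exp(travel_time - M))
    return int(rng.random() < p_positive)

rng = np.random.default_rng(0)
theta = {"e1": 0.5, "e2": 0.4}   # two fast edges on a 2-stage route
y = binary_feedback(theta, ["e1", "e2"], M=2, rng=rng)  # likely positive here
```

Here the realized travel time 0.9 is well below the prior expectation $M = 2$, so positive feedback arrives with probability about 0.75.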
This new model does not enjoy conjugacy properties leveraged in Section 4 and is not amenable to efficient exact Bayesian inference. However, the problem may be addressed via approximation methods. To illustrate, Figure 5.1 plots results from application of three approximate versions of TS to an online shortest path problem on a twenty-stage binomial bridge with binary feedback. The algorithms leverage Langevin Monte Carlo, the Laplace approximation, and the bootstrap, three approaches we will discuss, and the results demonstrate effective learning, in the sense that regret vanishes over time. Also plotted as a baseline for comparison are results from application of the greedy algorithm.
In the remainder of this section, we will describe several approaches to approximate TS. It is worth mentioning that we do not cover an exhaustive list, and further, our descriptions do not serve as comprehensive or definitive treatments of each approach. Rather, our intent is to offer simple descriptions that convey key ideas that may be extended or combined to serve needs arising in any specific application.
Throughout this section, let $f_{t-1}$ denote the posterior density of $\theta$ conditioned on the history $H_{t-1} = ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}))$ of observations. TS generates an action $x_t$ by sampling a parameter vector $\hat{\theta}$ from $f_{t-1}$ and solving for the optimal path under $\hat{\theta}$. The methods we describe generate a sample $\hat{\theta}$ whose distribution approximates the posterior $f_{t-1}$, which enables approximate implementations of TS when exact posterior sampling is infeasible.
Figure 5.1: Regret experienced by approximation methods applied to the path recommendation problem with binary feedback.
<details>
<summary>Image 9 Details</summary>

Line chart of per-period regret (y-axis, 0 to about 0.5) versus time period $t$ (x-axis, 0 to 1000) for four agents: Langevin TS (red), Laplace TS (grey), bootstrap TS (green), and greedy (purple). Regret for the three approximate TS agents decreases rapidly in early periods and continues to decline over time, with greedy included as a baseline.
</details>
## 5.1 Gibbs Sampling
Gibbs sampling is a general Markov chain Monte Carlo (MCMC) algorithm for drawing approximate samples from multivariate probability distributions. It produces a sequence of sampled parameters $(\hat{\theta}^n : n = 0, 1, 2, \ldots)$ forming a Markov chain whose stationary distribution is $f_{t-1}$. Under reasonable technical conditions, this Markov chain converges to its stationary distribution, and the distribution of $\hat{\theta}^n$ converges to $f_{t-1}$.
Gibbs sampling starts with an initial guess $\hat{\theta}^0$. Iterating over sweeps $n = 1, \ldots, N$, for each $n$th sweep, the algorithm iterates over the components $k = 1, \ldots, K$, for each $k$ generating a one-dimensional marginal distribution

$$f_{t-1}^{n,k}(\theta_k) \propto f_{t-1}\left(\left(\hat{\theta}_1^n, \ldots, \hat{\theta}_{k-1}^n, \theta_k, \hat{\theta}_{k+1}^{n-1}, \ldots, \hat{\theta}_K^{n-1}\right)\right),$$

and sampling the $k$th component according to $\hat{\theta}_k^n \sim f_{t-1}^{n,k}$. After $N$ sweeps, the prevailing vector $\hat{\theta}^N$ is taken to be the approximate posterior sample. We refer to (Casella and George, 1992) for a more thorough introduction to the algorithm.
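To make the sweep structure concrete, here is a minimal Gibbs sampler (names ours, assuming numpy) for a toy target where each one-dimensional conditional is available in closed form, a bivariate standard normal with correlation $\rho$:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_sweeps, rng):
    """Gibbs sampling for a bivariate standard normal with correlation rho.

    Each conditional theta_k | theta_{-k} is N(rho * theta_{-k}, 1 - rho^2),
    so every component update is a cheap one-dimensional draw.
    """
    theta = np.zeros(2)              # initial guess
    samples = []
    for _ in range(n_sweeps):
        theta[0] = rng.normal(rho * theta[1], np.sqrt(1.0 - rho**2))
        theta[1] = rng.normal(rho * theta[0], np.sqrt(1.0 - rho**2))
        samples.append(theta.copy())
    return np.array(samples)

rng = np.random.default_rng(1)
draws = gibbs_bivariate_normal(rho=0.8, n_sweeps=5000, rng=rng)
corr = np.corrcoef(draws[1000:].T)[0, 1]   # should be near 0.8
```

After discarding early sweeps as burn-in, the empirical correlation of the draws approaches the target's correlation, even though no joint draw was ever taken.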
Gibbs sampling applies to a broad range of problems, and is often computationally viable even when sampling from f t -1 is not. This is because sampling from a one-dimensional distribution is simpler. That
said, for complex problems, Gibbs sampling can still be computationally demanding. This is the case, for example, with our path recommendation problem with binary feedback. In this context, it is easy to implement a version of Gibbs sampling that generates a close approximation to a posterior sample within well under a minute. However, running thousands of simulations each over hundreds of time periods can be quite time-consuming. As such, we turn to more efficient approximation methods.
## 5.2 Laplace Approximation
We now discuss an approach that approximates a potentially complicated posterior distribution by a Gaussian distribution. Samples from this simpler Gaussian distribution can then serve as approximate samples from the posterior distribution of interest. Chapelle and Li (2011) proposed this method to approximate TS in a display advertising problem with a logistic regression model of ad click-through rates.
Let $g$ denote a probability density function over $\mathbb{R}^K$ from which we wish to sample. If $g$ is unimodal, and its log density $\ln(g(\phi))$ is strictly concave around its mode $\overline{\phi}$, then $g(\phi) = e^{\ln(g(\phi))}$ is sharply peaked around $\overline{\phi}$. It is therefore natural to consider approximating $g$ locally around its mode. A second-order Taylor approximation to the log-density gives
$$\ln(g(\phi)) \approx \ln(g(\overline{\phi})) - \frac{1}{2}(\phi - \overline{\phi})^\top C (\phi - \overline{\phi}),$$

where

$$C = -\nabla^2 \ln(g(\overline{\phi})).$$

As an approximation to the density $g$, we can then use

$$\tilde{g}(\phi) \propto e^{-\frac{1}{2}(\phi - \overline{\phi})^\top C (\phi - \overline{\phi})}.$$

This is proportional to the density of a Gaussian distribution with mean $\overline{\phi}$ and covariance $C^{-1}$, and hence

$$\tilde{g}(\phi) = \sqrt{|C/2\pi|}\, e^{-\frac{1}{2}(\phi - \overline{\phi})^\top C (\phi - \overline{\phi})}.$$
We refer to this as the Laplace approximation of $g$. Since there are efficient algorithms for generating Gaussian-distributed samples, this offers a viable means of approximately sampling from $g$.
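The two steps, locating the mode and sampling from the fitted Gaussian, can be sketched as follows (assuming numpy; the one-dimensional toy log-density, a standard-normal prior times a single logistic likelihood term, is ours, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy log-density: ln g(phi) = -phi^2/2 - ln(1 + e^{-phi}) + const,
# which is smooth and strictly concave.
def grad_log_g(phi):
    return -phi + 1.0 - sigmoid(phi)

def hess_log_g(phi):
    return -1.0 - sigmoid(phi) * (1.0 - sigmoid(phi))

# Step 1: locate the mode by Newton's method.
phi_bar = 0.0
for _ in range(25):
    phi_bar -= grad_log_g(phi_bar) / hess_log_g(phi_bar)

# Step 2: C = -Hessian at the mode; approximate g by N(phi_bar, 1/C).
C = -hess_log_g(phi_bar)
rng = np.random.default_rng(2)
sample = rng.normal(phi_bar, np.sqrt(1.0 / C))   # approximate draw from g
```

In higher dimensions the same recipe applies with the gradient, Hessian, and a multivariate Gaussian draw in place of their scalar counterparts.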
As an example, let us consider application of the Laplace approximation to Example 5.1. Bayes' rule implies that the posterior density $f_{t-1}$ of $\theta$ satisfies
$$f _ { t - 1 } ( \theta ) \varpropto f _ { 0 } ( \theta ) \prod _ { \tau = 1 } ^ { t - 1 } \left ( \frac { 1 } { 1 + \exp \left ( \sum _ { e \in x _ { \tau } } \theta _ { e } - M \right ) } \right ) ^ { y _ { \tau } } \left ( \frac { \exp \left ( \sum _ { e \in x _ { \tau } } \theta _ { e } - M \right ) } { 1 + \exp \left ( \sum _ { e \in x _ { \tau } } \theta _ { e } - M \right ) } \right ) ^ { 1 - y _ { \tau } } .$$
The mode $\overline{\theta}$ can be efficiently computed by maximizing $f_{t-1}$, which is log-concave. An approximate posterior sample $\hat{\theta}$ is then drawn from a Gaussian distribution with mean $\overline{\theta}$ and covariance matrix $(-\nabla^2 \ln(f_{t-1}(\overline{\theta})))^{-1}$.
Laplace approximations are well suited to Example 5.1 because the log-posterior density is strictly concave and its gradient and Hessian can be computed efficiently. Indeed, more broadly, Laplace approximations tend to be effective for posterior distributions with smooth densities that are sharply peaked around their mode. They tend to be computationally efficient when one can efficiently compute the posterior mode and efficiently form the Hessian of the log-posterior density.
The behavior of the Laplace approximation is not invariant to a substitution of variables, and it can sometimes be helpful to apply such a substitution. To illustrate this point, let us revisit the online shortest path problem of Example 4.2. For this problem, the posterior distributions of the components of $\theta$ are log-Gaussian. However, the distribution of $\phi$, where $\phi_e = \ln(\theta_e)$ for each edge $e \in E$, is Gaussian. As such, if the Laplace approximation approach is applied to generate a sample $\hat{\phi}$ from the posterior distribution of $\phi$, the Gaussian approximation is no longer an approximation, and, letting $\hat{\theta}_e = \exp(\hat{\phi}_e)$ for each $e \in E$, we obtain a sample $\hat{\theta}$ drawn exactly from the posterior distribution of $\theta$. In this case, through a variable substitution, we can sample in a manner that makes the Laplace approximation exact. More broadly, for any given problem, it may be possible to introduce variable substitutions that enhance the efficacy of the Laplace approximation.
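A quick numerical check of this substitution (assuming numpy; the parameter values $\mu = -1/2$, $\sigma = 1$ match Example 4.2): sampling $\hat{\phi}$ from the Gaussian posterior in $\phi = \ln(\theta)$ coordinates and exponentiating yields exact log-Gaussian samples of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = -0.5, 1.0   # edge parameters from Example 4.2

# In phi = ln(theta) coordinates the posterior is exactly Gaussian, so a
# "Laplace approximation" formed there is exact; exponentiating the sample
# maps it back to an exact log-Gaussian draw of theta.
phi_hat = rng.normal(mu, sigma, size=200_000)
theta_hat = np.exp(phi_hat)

# Sanity check: E[theta] = exp(mu + sigma^2 / 2) = exp(0) = 1,
# and the median of theta is exp(mu).
```

Applying the Laplace machinery directly in $\theta$ coordinates would instead fit a Gaussian to a skewed log-Gaussian density and introduce avoidable error.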
To produce the computational results reported in Figure 5.1, we applied Newton's method with a backtracking line search to maximize $\ln(f_{t-1})$. Though regret decays and should eventually vanish, it is easy to see from the figure that, for our example, the performance of the Laplace approximation falls short of Langevin Monte Carlo, which we will discuss in the next section. This is likely due to the fact that the posterior distribution is not sufficiently close to Gaussian. It is interesting that, despite serving as a popular approach in practical applications of TS (Chapelle and Li, 2011; Gómez-Uribe, 2016), the Laplace approximation can leave substantial value on the table.
## 5.3 Langevin Monte Carlo
We now describe an alternative Markov chain Monte Carlo method that uses gradient information about the target distribution. Let g ( φ ) denote a log-concave probability density function over R K from which we wish to sample. Suppose that ln( g ( φ )) is differentiable and its gradients are efficiently computable. Arising first in physics, Langevin dynamics refer to the diffusion process
$$d\phi_t = \nabla \ln(g(\phi_t))\, dt + \sqrt{2}\, dB_t \quad\quad (5.1)$$
where B t is a standard Brownian motion process. This process has g as its unique stationary distribution, and under reasonable technical conditions, the distribution of φ t converges rapidly to this stationary distribution (Roberts and Tweedie, 1996; Mattingly et al. , 2002). Therefore simulating the process (5.1) provides a means of approximately sampling from g .
Typically, one instead implements an Euler discretization of this stochastic differential equation,

$$\phi_{n+1} = \phi_n + \epsilon \nabla \ln(g(\phi_n)) + \sqrt{2\epsilon}\, W_n, \quad n \in \mathbb{N}, \quad\quad (5.2)$$
where $W_1, W_2, \ldots$ are i.i.d. standard Gaussian random variables and $\epsilon > 0$ is a small step size. Like a gradient ascent method, under this method $\phi_n$ tends to drift in directions of increasing density $g(\phi_n)$. However, random Gaussian noise $W_n$ is injected at each step so that, for large $n$, the position of $\phi_n$ is random and captures the uncertainty in the distribution $g$. A number of papers establish rigorous guarantees for the rate at which this Markov chain converges to its stationary distribution (Roberts and Rosenthal, 1998; Bubeck et al., 2018; Durmus and Moulines, 2016; Cheng and Bartlett, 2018). These papers typically require that $\epsilon$ be sufficiently small, or that a decaying sequence of step sizes $(\epsilon_1, \epsilon_2, \ldots)$ be used.
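A minimal sketch of the discretized chain above (function names ours, assuming numpy), targeting a standard Gaussian so that $\nabla \ln(g(\phi)) = -\phi$:

```python
import numpy as np

def langevin_chain(grad_log_g, phi0, eps, n_steps, rng):
    """Euler-discretized Langevin dynamics:
    phi_{n+1} = phi_n + eps * grad_log_g(phi_n) + sqrt(2 * eps) * W_n."""
    phi, path = phi0, []
    for _ in range(n_steps):
        phi = phi + eps * grad_log_g(phi) + np.sqrt(2.0 * eps) * rng.standard_normal()
        path.append(phi)
    return np.array(path)

# Target: standard Gaussian, so grad ln g(phi) = -phi.
rng = np.random.default_rng(3)
path = langevin_chain(lambda p: -p, phi0=5.0, eps=0.05, n_steps=20_000, rng=rng)
tail = path[2_000:]   # discard burn-in; marginal should be close to N(0, 1)
```

For this Gaussian target the discretized chain is an AR(1) process whose stationary variance slightly exceeds 1, illustrating why the guarantees cited above require a small or decaying step size.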
We make two standard modifications to this method to improve computational efficiency. First, following recent work (Welling and Teh, 2011), we implement stochastic gradient Langevin Monte Carlo, which uses sampled minibatches of data to compute approximate rather than exact gradients. Our implementation uses a minibatch size of 100; this choice seems to be effective but has not been carefully optimized. When fewer than 100 observations are available, we follow the Markov chain (5.2) with exact gradient computation. When more than 100 observations have been gathered, we follow (5.2) but use an estimated gradient $\nabla \ln(\hat{g}_n(\phi_n))$ at each step based on a random subsample of 100 data points. Some work provides rigorous guarantees for stochastic gradient Langevin Monte Carlo by arguing that the cumulative impact of the noise in gradient estimation is second order relative to the additive Gaussian noise (Teh et al., 2016).
Our second modification involves the use of a preconditioning matrix to improve the mixing rate of the Markov chain (5.2). For the path recommendation problem in Example 5.1, we have found that the log posterior density becomes ill-conditioned in later time periods. For this reason, gradient ascent converges very slowly to the posterior mode, and effective optimization methods must leverage second-order information. Similarly, due to poor conditioning, we may need to choose an extremely small step size $\epsilon$, causing the Markov chain (5.2) to mix slowly. We have found that preconditioning substantially improves performance. Langevin MCMC can be implemented with a symmetric positive definite preconditioning matrix $A$ by simulating the Markov chain
$$\phi_{n+1} = \phi_n + \epsilon A \nabla \ln(g(\phi_n)) + \sqrt{2\epsilon}\, A^{1/2} W_n, \quad n \in \mathbb{N},$$
where $A^{1/2}$ denotes the matrix square root of $A$. In our implementation, we take $\phi_0 = \arg\max_\phi \ln(g(\phi))$, so the chain is initialized at the posterior mode, computed via means discussed in Section 5.2, and take the preconditioning matrix $A = -(\nabla^2 \ln(g(\phi))|_{\phi = \phi_0})^{-1}$ to be the negative inverse Hessian at that point. It may be possible to improve
computational efficiency by constructing an incremental approximation to the Hessian, as we will discuss in Subsection 5.6, but we do not explore that improvement here.
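A sketch of the preconditioned chain on a deliberately ill-conditioned two-dimensional Gaussian target (all names and values ours, assuming numpy); taking $A$ to be the negative inverse Hessian rescales both the drift and the injected noise:

```python
import numpy as np

# Ill-conditioned 2-D Gaussian target: precision matrix Lambda,
# so that grad ln g(phi) = -Lambda @ phi.
Lambda = np.array([[100.0, 0.0],
                   [0.0,   1.0]])

# Precondition with A = negative inverse Hessian at the mode = Lambda^{-1}.
A = np.linalg.inv(Lambda)
A_sqrt = np.linalg.cholesky(A)   # a matrix square root of A

rng = np.random.default_rng(4)
eps, phi = 0.05, np.array([1.0, 1.0])
samples = []
for _ in range(20_000):
    grad = -Lambda @ phi
    phi = phi + eps * A @ grad + np.sqrt(2.0 * eps) * A_sqrt @ rng.standard_normal(2)
    samples.append(phi.copy())
samples = np.array(samples)[2_000:]   # discard burn-in
```

With $A$ in place, the stiff and loose coordinates mix at the same rate; without it, the step size would have to shrink to accommodate the stiffest direction, slowing the whole chain.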
## 5.4 Bootstrapping
As an alternative, we discuss an approach based on the statistical bootstrap, which accommodates even very complex densities. Use of the bootstrap for TS was first considered in (Eckles and Kaptein, 2014), though the version studied there applies to Bernoulli bandits and does not naturally generalize to more complex problems. There are many other versions of the bootstrap approach that can be used to approximately sample from a posterior distribution. For concreteness, we introduce a specific one that is suitable for examples we cover in this tutorial.
Like the Laplace approximation approach, our bootstrap method assumes that $\theta$ is drawn from a Euclidean space $\mathbb{R}^K$. Consider first a standard bootstrap method for evaluating the sampling distribution of the maximum likelihood estimate of $\theta$. The method generates a hypothetical history $\hat{H}_{t-1} = ((\hat{x}_1, \hat{y}_1), \ldots, (\hat{x}_{t-1}, \hat{y}_{t-1}))$, which is made up of $t-1$ action-observation pairs, each sampled uniformly with replacement from $H_{t-1}$. We then maximize the likelihood of $\theta$ under the hypothetical history, which for our shortest path recommendation problem is given by
$$\begin{array} { r } { \hat { L } _ { t - 1 } ( \theta ) = \prod _ { \tau = 1 } ^ { t - 1 } \left ( \frac { 1 } { 1 + \exp \left ( \sum _ { e \in \hat { x } _ { \tau } } \theta _ { e } - M \right ) } \right ) ^ { \hat { y } _ { \tau } } \left ( \frac { \exp \left ( \sum _ { e \in \hat { x } _ { \tau } } \theta _ { e } - M \right ) } { 1 + \exp \left ( \sum _ { e \in \hat { x } _ { \tau } } \theta _ { e } - M \right ) } \right ) ^ { 1 - \hat { y } _ { \tau } } . } \end{array}$$
The randomness in the maximizer of $\hat{L}_{t-1}$ reflects the randomness in the sampling distribution of the maximum likelihood estimate. Unfortunately, this method does not take the agent's prior into account. A more severe issue is that it grossly underestimates the agent's real uncertainty in initial periods. The modification described here is intended to overcome these shortcomings in a simple way.
The method proceeds as follows. First, as before, we draw a hypothetical history $\hat{H}_{t-1} = ((\hat{x}_1, \hat{y}_1), \ldots, (\hat{x}_{t-1}, \hat{y}_{t-1}))$, made up of $t-1$ action-observation pairs, each sampled uniformly with replacement from $H_{t-1}$. Next, we draw a sample $\theta^0$ from the prior distribution $f_0$. Let $\Sigma$ denote the covariance matrix of the prior $f_0$. Finally, we solve the maximization problem
$$\hat{\theta} = \underset{\theta \in \mathbb{R}^K}{\arg\max}\; e^{-(\theta - \theta^0)^\top \Sigma (\theta - \theta^0)}\, \hat{L}_{t-1}(\theta)$$
and treat $\hat{\theta}$ as an approximate posterior sample. This can be viewed as maximizing a randomized approximation $\hat{f}_{t-1}$ to the posterior density, where $\hat{f}_{t-1}(\theta) \propto e^{-(\theta - \theta^0)^\top \Sigma (\theta - \theta^0)} \hat{L}_{t-1}(\theta)$ is what the posterior density would be if the prior were Gaussian with mean $\theta^0$ and covariance matrix $\Sigma$, and the history of observations were $\hat{H}_{t-1}$. When very little data has been gathered, the randomness in the samples mostly stems from the randomness in the prior sample $\theta^0$. This random prior sample encourages the agent to explore in early periods. When $t$ is large, so that a lot of data has been gathered, the likelihood typically overwhelms the prior sample, and randomness in the samples mostly stems from the random selection of the history $\hat{H}_{t-1}$.
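A sketch of one draw of this procedure (names ours, assuming numpy), for a toy one-dimensional model with a Gaussian prior and unit-variance Gaussian observations, chosen so that the penalized maximization has a closed form; we write the prior penalty in standard Gaussian form:

```python
import numpy as np

def bootstrap_sample(history_y, prior_mean, prior_var, rng):
    """One approximate posterior sample via the randomized-prior bootstrap.

    Toy model: theta ~ N(prior_mean, prior_var) and each y ~ N(theta, 1),
    so the penalized maximization below is a precision-weighted mean.
    """
    n = len(history_y)
    # 1. Resample a hypothetical history uniformly with replacement.
    y_hat = rng.choice(history_y, size=n, replace=True) if n else np.array([])
    # 2. Draw theta0 from the prior.
    theta0 = rng.normal(prior_mean, np.sqrt(prior_var))
    # 3. Maximize exp(-(theta - theta0)^2 / (2 * prior_var)) * L_hat(theta).
    return (theta0 / prior_var + y_hat.sum()) / (1.0 / prior_var + n)

rng = np.random.default_rng(5)
history = rng.normal(2.0, 1.0, size=50)   # data generated with theta = 2
draws = [bootstrap_sample(history, 0.0, 1.0, rng) for _ in range(500)]
```

With no data, each draw reduces to the prior sample $\theta^0$, giving the early-period exploration described above; with 50 observations, the resampled likelihood dominates and the draws concentrate near the data mean.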
In the context of the shortest path recommendation problem, $\hat{f}_{t-1}(\theta)$ is log-concave and can therefore be efficiently maximized. Again, to produce the computational results reported in Figure 5.1, we applied Newton's method with a backtracking line search to maximize $\ln(\hat{f}_{t-1})$. Even when it is not possible to efficiently maximize $\hat{f}_{t-1}$, however, the bootstrap approach can be applied with heuristic optimization methods that identify local or approximate maxima.
As can be seen from Figure 5.1, for our example, bootstrapping performs about as well as the Laplace approximation. One advantage of the bootstrap is that it is nonparametric, and may work reasonably well regardless of the functional form of the posterior distribution, whereas the Laplace approximation relies on a Gaussian approximation and Langevin Monte Carlo relies on log-concavity and other regularity assumptions. That said, bootstrap approaches currently lack theoretical justification; it is not even known whether there are nontrivial problem classes for which they are guaranteed to perform well.
## 5.5 Sanity Checks
Figure 5.1 demonstrates that Laplace approximation, Langevin Monte Carlo, and bootstrap approaches, when applied to the path recommendation problem, learn from binary feedback to improve performance over time. This may leave one wondering, however, whether exact TS would offer substantially better performance. Since we do not have a tractable means of carrying out exact TS for this problem, in this section, we apply our approximation methods to problems for which exact TS is tractable. This enables comparisons between performance of exact and approximate methods.
Recall the three-armed beta-Bernoulli bandit problem for which results from application of greedy and TS algorithms were reported in Figure 3.2(b). For this problem, components of θ are independent under posterior distributions, and as such, Gibbs sampling yields exact posterior samples. Hence, the performance of an approximate version that uses Gibbs sampling would be identical to that of exact TS. Figure 5.2a plots results from applying Laplace approximation, Langevin Monte Carlo, and bootstrap approaches. For this problem, our approximation methods offer performance that is qualitatively similar to exact TS, though the Laplace approximation performs marginally worse than alternatives in this setting.
Next, consider the online shortest path problem with correlated edge delays. Regret experienced by TS applied to such a problem was reported in Figure 4.3a. As discussed in Section 5.2, applying the Laplace approximation approach with an appropriate variable substitution leads to the same results as exact TS. Figure 5.2b compares those results to what is generated by Gibbs sampling, Langevin Monte Carlo, and bootstrap approaches. Again, the approximation methods yield competitive results, although bootstrapping is marginally less effective than the others.
It is easy to verify that, for the online shortest path problem and the specific choices of step size $\epsilon = 1/2$ and preconditioning matrix $A = \Sigma_t$, a single Langevin Monte Carlo iteration offers an exact posterior sample. However, our simulations do not use this step size and carry out multiple iterations. The point here is not to optimize results for our specific problem but rather to offer a sanity check for the approach.
Figure 5.2: Regret of approximation methods versus exact Thompson sampling.
<details>
<summary>Image 10 Details</summary>

Two line charts of per-period regret versus time period $t$: (a) the Bernoulli bandit ($t$ from 0 to 1000, regret from 0 to about 0.1) with agents Laplace TS, Langevin TS, exact TS, and bootstrap TS; (b) the online shortest path problem ($t$ from 0 to 500, regret from 0 to about 4) with agents Gibbs TS, Langevin TS, exact TS, and bootstrap TS. In both charts regret decreases over time for all agents, and the approximate methods track exact TS closely.
</details>
## 5.6 Incremental Implementation
For each of the three approximation methods we have discussed, the computation time required per time period grows as time progresses. This is because each past observation must be accessed to generate the next action. This differs from exact TS algorithms we discussed earlier, which maintain parameters that encode a posterior distribution, and update these parameters over each time period based only on the most recent observation.
In order to keep the computational burden manageable, it can be important to consider incremental variants of our approximation methods. We refer to an algorithm as incremental if it operates with fixed rather than growing per-period compute time. There are many ways to design incremental variants of approximate posterior sampling algorithms we have presented. As concrete examples, we consider here particular incremental versions of Laplace approximation and bootstrap approaches.
For each time $t$, let $\ell_t(\theta)$ denote the likelihood of $y_t$ conditioned on $x_t$ and $\theta$. Hence, conditioned on $H_{t-1}$, the posterior density satisfies
$$f _ { t - 1 } ( \theta ) \varpropto f _ { 0 } ( \theta ) \prod _ { \tau = 1 } ^ { t - 1 } \ell _ { \tau } ( \theta ) .$$
Let $g_0(\theta) = \ln(f_0(\theta))$ and $g_t(\theta) = \ln(\ell_t(\theta))$ for $t > 0$. To identify the mode of $f_{t-1}$, it suffices to maximize $\sum_{\tau=0}^{t-1} g_\tau(\theta)$.
Consider an incremental version of the Laplace approximation. The algorithm maintains statistics $H_t$ and $\overline{\theta}_t$, initialized with $\overline{\theta}_0 = \operatorname{argmax}_\theta g_0(\theta)$ and $H_0 = \nabla^2 g_0(\overline{\theta}_0)$, and updated according to
$$H _ { t } = H _ { t - 1 } + \nabla ^ { 2 } g _ { t } ( \overline { \theta } _ { t - 1 } ) , \\ \bar { \theta } _ { t } = \bar { \theta } _ { t - 1 } - H _ { t } ^ { - 1 } \nabla g _ { t } ( \overline { \theta } _ { t - 1 } ) .$$
This algorithm is a type of online Newton method for computing the posterior mode $\overline{\theta}_{t-1}$ that maximizes $\sum_{\tau=0}^{t-1} g_\tau(\theta)$. Note that if each function $g_t$ is strictly concave and quadratic, as would be the case if the prior is Gaussian and observations are linear in $\theta$ and perturbed only by Gaussian noise, each pair $\overline{\theta}_{t-1}$ and $H_{t-1}^{-1}$ represents the mean and covariance matrix of $f_{t-1}$. More broadly, these iterates can be viewed as the mean and covariance matrix of a Gaussian approximation to the posterior, and used to generate an approximate posterior sample $\hat{\theta} \sim N(\overline{\theta}_{t-1}, H_{t-1}^{-1})$. It is worth noting that for linear and generalized linear models, the matrix $\nabla^2 g_t(\overline{\theta}_{t-1})$ has rank one, and therefore $H_t^{-1} = (H_{t-1} + \nabla^2 g_t(\overline{\theta}_{t-1}))^{-1}$ can be updated incrementally using the Sherman-Morrison-Woodbury formula. This incremental version of the Laplace approximation is closely related to the extended Kalman filter, which has been explored in greater depth by Gómez-Uribe (2016) as a means for incremental approximate TS with exponential families of distributions.
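As a concrete sketch of this incremental Laplace approach (our own illustration, not code from the tutorial), consider a Bayesian logistic regression model $y \sim \text{Bernoulli}(\sigma(x^\top\theta))$ with a standard Gaussian prior. Note one sign convention difference: the code below tracks $H$ as the *negative* Hessian of the log posterior, so that it is positive definite and $H^{-1}$ can serve directly as the Gaussian covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch: incremental Laplace approximation for Bayesian logistic
# regression.  theta_bar tracks the online Newton iterate (approximate
# posterior mode); H accumulates the negative Hessian (the precision), so
# an approximate posterior sample is N(theta_bar, H^{-1}).

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d = 3
theta_bar = np.zeros(d)            # mode of the N(0, I) prior
H = np.eye(d)                      # prior precision

def update(theta_bar, H, x, y):
    """One incremental Newton step on the new log-likelihood term g_t."""
    p = sigmoid(x @ theta_bar)
    grad = (y - p) * x                    # gradient of g_t at theta_bar
    hess = p * (1 - p) * np.outer(x, x)   # negative Hessian of g_t (rank one)
    H = H + hess
    theta_bar = theta_bar + np.linalg.solve(H, grad)
    return theta_bar, H

def sample(theta_bar, H):
    """Approximate posterior sample theta_hat ~ N(theta_bar, H^{-1})."""
    return rng.multivariate_normal(theta_bar, np.linalg.inv(H))

# simulate a stream of observations from a fixed hypothetical parameter
theta_true = np.array([1.0, -0.5, 0.25])
for _ in range(200):
    x = rng.normal(size=d)
    y = float(rng.random() < sigmoid(x @ theta_true))
    theta_bar, H = update(theta_bar, H, x, y)

theta_hat = sample(theta_bar, H)   # used in place of an exact posterior draw
```

Because the per-observation term is rank one, the `solve` step could be replaced by a Sherman-Morrison update of $H^{-1}$ to keep per-period compute at $O(d^2)$.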
Another approach involves incrementally updating each of an ensemble of models to behave like a sample from the posterior distribution. The posterior can be interpreted as a distribution of 'statistically plausible' models, by which we mean models that are sufficiently consistent with prior beliefs and the history of observations. With this interpretation in mind, TS can be thought of as randomly drawing from the range of statistically plausible models. Ensemble sampling aims to maintain, incrementally update, and sample from a finite set of such models. In the spirit of particle filtering, this set of models approximates the posterior distribution. The workings of ensemble sampling are in some ways more intricate than conventional uses of particle filtering, however, because interactions between the ensemble of models and selected actions can skew the distribution. Ensemble sampling is presented in more depth
in (Lu and Van Roy, 2017), which draws inspiration from work on exploration in deep reinforcement learning (Osband et al. , 2016a).
There are multiple ways of generating suitable model ensembles. One builds on the aforementioned bootstrap method and involves fitting each model to a different bootstrap sample. To elaborate, consider maintaining $N$ models with parameters $(\overline{\theta}_t^n, H_t^n : n = 1, \ldots, N)$. Each is initialized with $\overline{\theta}_0^n$ sampled from the prior and $H_0^n = \nabla^2 g_0(\overline{\theta}_0^n)$, and updated according to
$$& H _ { t } ^ { n } = H _ { t - 1 } ^ { n } + z _ { t } ^ { n } \nabla ^ { 2 } g _ { t } ( \overline { \theta } _ { t - 1 } ^ { n } ) , \\ & \overline { \theta } _ { t } ^ { n } = \overline { \theta } _ { t - 1 } ^ { n } - z _ { t } ^ { n } ( H _ { t } ^ { n } ) ^ { - 1 } \nabla g _ { t } ( \overline { \theta } _ { t - 1 } ^ { n } ) ,$$
where each $z_t^n$ is an independent Poisson-distributed sample with mean one. Each $\overline{\theta}_t^n$ can be viewed as a random statistically plausible model, with randomness stemming from the initialization of $\overline{\theta}_0^n$ and the random weight $z_t^n$ placed on each observation. The variable $z_\tau^n$ can loosely be interpreted as a number of replicas of the data sample $(x_\tau, y_\tau)$ placed in a hypothetical history $\hat{H}_t^n$. Indeed, in a bootstrap resample of a data set of size $t$, the number of replicas of a particular data sample follows a Binomial$(t, 1/t)$ distribution, which is approximately Poisson$(1)$ when $t$ is large. With this view, each $\overline{\theta}_t^n$ is effectively fit to a different data set $\hat{H}_t^n$, distinguished by the random number of replicas assigned to each data sample. To generate an action $x_t$, $n$ is sampled uniformly from $\{1, \ldots, N\}$, and the action is chosen to maximize $\mathbb{E}[r_t \mid \theta = \overline{\theta}_{t-1}^n]$. Here, $\overline{\theta}_{t-1}^n$ serves as the approximate posterior sample. Note that the per-period compute time grows with $N$, which is an algorithm tuning parameter.
This bootstrap approach offers one mechanism for incrementally updating an ensemble of models. In Section 7.4, we will discuss another, which we apply to active learning with neural networks.
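A minimal sketch of this ensemble bootstrap (our own illustration; the logistic model and all names are assumptions for concreteness, since the text leaves the model family open) maintains $N$ models, each incrementally refit with independent Poisson(1) observation weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch: an ensemble of N logistic-regression models, each updated
# on a Poisson-weighted ("online bootstrap") version of the data stream.
# Sampling a model uniformly at random stands in for a posterior draw.

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d, N = 3, 10
ensemble = [{"theta": rng.normal(size=d),   # theta_0^n drawn from the prior
             "H": np.eye(d)}                # prior precision
            for _ in range(N)]

def update_ensemble(ensemble, x, y):
    for m in ensemble:
        z = rng.poisson(1.0)                # z_t^n: random replica count
        p = sigmoid(x @ m["theta"])
        grad = (y - p) * x
        hess = p * (1 - p) * np.outer(x, x)
        m["H"] = m["H"] + z * hess          # weighted incremental update
        m["theta"] = m["theta"] + z * np.linalg.solve(m["H"], grad)

def sample_model(ensemble):
    # TS step: pick one of the N statistically plausible models at random
    n = rng.integers(len(ensemble))
    return ensemble[n]["theta"]
```

Per-period compute is fixed at $O(N d^2)$ here, illustrating the incremental property: no past observation is revisited.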
## 6 Practical Modeling Considerations
Our narrative over previous sections has centered around a somewhat idealized view of TS, which ignored the process of prior specification and assumed a simple model in which the system and set of feasible actions are constant over time and there is no side information about the decision context. In this section, we provide greater perspective on the process of prior specification and on extensions of TS that serve practical needs arising in some applications.
## 6.1 Prior Distribution Specification
The algorithms we have presented require as input a prior distribution over model parameters. The choice of prior can be important, so let us now discuss its role and how it might be selected. In designing an algorithm for an online decision problem, unless the value of $\theta$ is known with certainty, it would not make sense to optimize performance for a single value, because that could lead to poor performance for other plausible values. Instead, one might design the algorithm to perform well on average across a collection of possibilities. The prior can be thought of as a distribution over plausible values, and its choice directs the algorithm to perform well on average over random samples from the prior.
For a practical example of prior selection, let us revisit the banner ad placement problem introduced in Example 1.1. There are K banner ads for a single product, with unknown click-through probabilities ( θ 1 , . . . , θ K ). Given a prior, TS can learn to select the most successful ad. We could use a uniform or, equivalently, a beta(1 , 1) distribution over each θ k . However, if some values of θ k are more likely than others, using a uniform prior sacrifices performance. In particular, this prior represents no understanding of the context, ignoring any useful knowledge from past experience. Taking knowledge into account reduces what must be learned and therefore reduces the time it takes for TS to identify the most effective ads.
Suppose we have a data set collected from experience with previous products and their ads, each distinguished by stylistic features such as language, font, and background, together with accurate estimates of click-through probabilities. Let us consider an empirical approach to prior selection that leverages this data. First, partition past ads into K sets, with each k th partition consisting of those with stylistic features most similar to the k th ad under current consideration. Figure 6.1 plots a hypothetical empirical cumulative distribution of click-through probabilities for ads in the k th set. It is then natural to consider as a prior a smoothed approximation of this distribution, such as the beta(1 , 100) distribution also plotted in Figure 6.1. Intuitively, this process assumes that click-through probabilities of past ads in set k represent plausible values of θ k . The resulting prior is informative; among other things, it virtually rules out click-through probabilities greater than 0 . 05.
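As a concrete sketch of this empirical prior-selection step (our own illustration: the click-through data below is hypothetical, and we fit the beta parameters by method of moments rather than by smoothing the empirical CDF as in Figure 6.1):

```python
import numpy as np

# Minimal sketch: fit a beta prior for theta_k to click-through
# probabilities observed for past ads in the k-th set, via method of
# moments.  The data values here are invented for illustration.

past_ctrs = np.array([0.004, 0.007, 0.009, 0.012, 0.015,
                      0.018, 0.022, 0.027, 0.033, 0.041])

m, v = past_ctrs.mean(), past_ctrs.var()
common = m * (1 - m) / v - 1       # method-of-moments for beta(a, b):
a = m * common                     #   mean  = a / (a + b)
b = (1 - m) * common               #   match sample mean and variance

# (a, b) parameterize an informative beta prior; with data like the above
# it places almost no mass on click-through probabilities above 0.05.
```

The resulting prior plays the role of the beta$(1, 100)$-style distribution in Figure 6.1: it encodes that past ads with similar stylistic features rarely achieved high click-through rates.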
A careful choice of prior can improve learning performance. Figure 6.2 presents results from simulations of a three-armed Bernoulli bandit. Mean rewards of the three actions are sampled from beta(1 , 50), beta(1 , 100), and beta(1 , 200) distributions, respectively. TS is applied with these as prior distributions and with a uniform prior distribution. We refer to the latter as a misspecified prior because it is not consistent with our understanding of the problem. A prior that is consistent in this sense is termed coherent . Each plot represents an average over ten thousand independent simulations, each with independently sampled
Figure 6.1: An empirical cumulative distribution and an approximating beta distribution.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Chart: Cumulative Distribution of Click-Through Probability
### Overview
The image presents a chart comparing the empirical distribution of click-through probability with a beta distribution. The chart displays cumulative probability on the y-axis against click-through probability on the x-axis. Two lines are plotted, representing the two distributions.
### Components/Axes
* **X-axis Title:** "click-through probability"
* Scale: 0.00 to 0.10, with markers at 0.02, 0.04, 0.06, 0.08.
* **Y-axis Title:** "cumulative probability"
* Scale: 0.00 to 1.00, with markers at 0.2, 0.4, 0.6, 0.8.
* **Legend:** Located in the bottom-right corner.
* "empirical distribution" - Blue line
* "beta distribution" - Green line
### Detailed Analysis
* **Empirical Distribution (Blue Line):** The line starts at approximately (0.00, 0.00) and rises rapidly.
* At click-through probability of 0.00, cumulative probability is approximately 0.00.
* At click-through probability of 0.02, cumulative probability is approximately 0.35.
* At click-through probability of 0.04, cumulative probability is approximately 0.65.
* At click-through probability of 0.06, cumulative probability is approximately 0.80.
* At click-through probability of 0.08, cumulative probability is approximately 0.90.
* At click-through probability of 0.10, cumulative probability is approximately 0.98.
* **Beta Distribution (Green Line):** The line also starts at approximately (0.00, 0.00) and rises, but with a slightly different curve.
* At click-through probability of 0.00, cumulative probability is approximately 0.00.
* At click-through probability of 0.02, cumulative probability is approximately 0.40.
* At click-through probability of 0.04, cumulative probability is approximately 0.70.
* At click-through probability of 0.06, cumulative probability is approximately 0.85.
* At click-through probability of 0.08, cumulative probability is approximately 0.95.
* At click-through probability of 0.10, cumulative probability is approximately 1.00.
### Key Observations
The empirical distribution and the beta distribution are very similar, especially at higher click-through probabilities. The beta distribution appears to slightly overestimate the cumulative probability at lower click-through probabilities (between 0.00 and 0.04) and slightly underestimate it at higher probabilities (between 0.06 and 0.10). Both distributions approach a cumulative probability of 1.0 as the click-through probability approaches 0.10.
### Interpretation
This chart demonstrates how well a beta distribution approximates the empirical distribution of click-through probabilities. The close alignment between the two curves suggests that the beta distribution is a reasonable model for the observed click-through data. The slight discrepancies could indicate that the parameters of the beta distribution are not perfectly tuned to the empirical data, or that the empirical data deviates slightly from a true beta distribution. This type of analysis is common in A/B testing and statistical modeling to understand user behavior and predict future outcomes. The chart suggests that the beta distribution provides a good fit for modeling click-through probabilities, which can be used for tasks like estimating confidence intervals or making predictions about future click-through rates.
</details>
mean rewards. Figure 6.2a plots expected regret, demonstrating that the misspecified prior increases regret. Figure 6.2b plots the evolution of the agent's conditional expectations of the mean rewards. For each algorithm, there are three curves corresponding to the best, second-best, and worst actions, and they illustrate how starting with a misspecified prior delays learning.
Figure 6.2: Comparison of TS for the Bernoulli bandit problem with coherent versus misspecified priors.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Charts: Regret and Expected Mean Rewards
### Overview
The image presents two line charts, labeled (a) "regret" and (b) "expected mean rewards". Both charts compare the performance of two agents: "TS" (Thompson Sampling) and "misspecified TS". The x-axis in both charts represents "time period (t)", ranging from 0 to 1000. The y-axis of chart (a) is "per-period regret" ranging from 0 to 0.020, while the y-axis of chart (b) is "expected mean reward" ranging from 0 to 0.05.
### Components/Axes
* **Chart (a): Regret**
* X-axis: time period (t) - Scale: 0 to 1000, with markers at 0, 250, 500, 750, and 1000.
* Y-axis: per-period regret - Scale: 0 to 0.020, with markers at 0, 0.005, 0.010, 0.015, and 0.020.
* Legend:
* "TS" - represented by a red line.
* "misspecified TS" - represented by a blue line.
* **Chart (b): Expected Mean Rewards**
* X-axis: time period (t) - Scale: 0 to 1000, with markers at 0, 250, 500, 750, and 1000.
* Y-axis: expected mean reward - Scale: 0 to 0.05, with markers at 0, 0.01, 0.02, 0.03, 0.04, and 0.05.
* Legend:
* "TS" - represented by a red line.
* "misspecified TS" - represented by a blue line.
### Detailed Analysis or Content Details
* **Chart (a): Regret**
* The "TS" line (red) starts at approximately 0.017 at t=0 and decreases rapidly to approximately 0.002 at t=1000. The line exhibits a steep negative slope initially, which gradually flattens out.
* The "misspecified TS" line (blue) starts at approximately 0.010 at t=0 and decreases more slowly than the "TS" line, reaching approximately 0.004 at t=1000. This line also exhibits a negative slope, but it is less steep than the "TS" line.
* **Chart (b): Expected Mean Rewards**
* The "TS" line (red) starts at approximately 0.012 at t=0, initially decreases to a minimum of approximately 0.008 at t=250, and then increases slightly to approximately 0.010 at t=1000.
* The "misspecified TS" line (blue) starts at approximately 0.045 at t=0 and decreases steadily to approximately 0.020 at t=1000. The line exhibits a consistent negative slope throughout the entire time period.
### Key Observations
* In the "regret" chart, "TS" consistently exhibits lower regret than "misspecified TS" across all time periods.
* In the "expected mean rewards" chart, "misspecified TS" starts with a significantly higher expected mean reward, but it decreases over time, eventually falling below the "TS" agent's reward.
* The "TS" agent's reward initially decreases and then stabilizes, suggesting a learning phase followed by consistent performance.
### Interpretation
The data suggests that while the "misspecified TS" agent may initially appear to perform better (higher expected mean reward), the "TS" agent ultimately achieves lower regret and maintains a more stable reward over time. This indicates that the "TS" agent is more efficient in learning the optimal strategy and minimizing long-term losses. The initial higher reward of the "misspecified TS" could be due to chance or a temporary advantage, but its higher regret demonstrates its sub-optimal performance in the long run. The difference in performance highlights the importance of correctly specifying the model for Thompson Sampling to achieve optimal results. The charts demonstrate a trade-off between initial reward and long-term regret, with the "TS" agent prioritizing minimizing regret, while the "misspecified TS" agent initially focuses on maximizing reward but suffers higher regret as a consequence.
</details>
## 6.2 Constraints, Context, and Caution
Though Algorithm 4.2, as we have presented it, treats a very general model, straightforward extensions accommodate even broader scope. One involves imposing time-varying constraints on the actions. In particular, there could be a sequence of admissible action sets X t that constrain actions x t . To motivate such an extension, consider our shortest path example. Here, on any given day, the drive to work may be constrained by announced road closures. If X t does not depend on θ except through possible dependence on the history of observations, TS (Algorithm 4.2) remains an effective approach, with the only required modification being to constrain the maximization problem in Line 6.
Another extension of practical import addresses contextual online decision problems . In such problems, the response y t to action x t also depends on an independent random variable z t that the agent observes prior to making her decision. In such a setting, the conditional distribution of y t takes the form p θ ( ·| x t , z t ). To motivate this, consider again the shortest path example, but with the agent observing a weather report z t from a news channel before selecting a path x t . Weather may affect delays along different edges differently, and the agent can take this into account before initiating her trip. Contextual problems of this flavor can be addressed through augmenting the action space and introducing time-varying constraint sets. In particular, if we view ˜ x t = ( x t , z t ) as the action and constrain its choice to X t = { ( x, z t ) : x ∈ X} , where X is the set from which x t must be chosen, then it is straightforward to apply TS to select actions ˜ x 1 , ˜ x 2 , . . . . For the shortest path problem, this can be interpreted as allowing the agent to dictate both the weather report and the path to traverse, but constraining the agent to provide a weather report identical to the one observed through the news channel.
In some applications, it may be important to ensure that expected performance exceeds some prescribed baseline. This can be viewed as a level of caution against poor performance. For example, we might want each action applied to offer expected reward of at least some level $r$. This can again be accomplished by constraining actions: in each $t$th time period, let the action set be $\mathcal{X}_t = \{x \in \mathcal{X} : \mathbb{E}[r_t \mid x_t = x] \geq r\}$. Using such an action set ensures that expected average reward exceeds $r$. When actions are related, an action that is initially omitted from the set can later be included if what is learned through experiments with similar actions increases the agent's expectation of reward from the initially omitted action.
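A minimal sketch of this cautious variant (our own illustration; the Bernoulli-bandit setting and all names are assumptions) filters actions by their posterior-expected reward before the usual TS step:

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch: Thompson sampling for a Bernoulli bandit with a caution
# constraint.  Actions whose posterior-expected reward falls below the
# baseline r_min are excluded from the admissible set X_t.

K, r_min = 5, 0.10
alpha = np.ones(K)                 # beta posterior parameters per action
beta = np.ones(K)

def select_action(alpha, beta):
    expected = alpha / (alpha + beta)       # E[r_t | x_t = k] under posterior
    admissible = np.flatnonzero(expected >= r_min)
    if admissible.size == 0:                # fall back if constraint empties X_t
        admissible = np.arange(K)
    # standard TS step, restricted to the admissible actions
    theta_hat = rng.beta(alpha[admissible], beta[admissible])
    return int(admissible[np.argmax(theta_hat)])
```

As the text notes, an arm excluded today can re-enter $\mathcal{X}_t$ later if observations of related arms raise its posterior-expected reward above $r_{\min}$.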
## 6.3 Nonstationary Systems
Problems we have considered involve model parameters $\theta$ that are constant over time. As TS homes in on an optimal action, the frequency of exploratory actions converges to zero. In many practical applications, however, the agent faces a nonstationary system, which is more appropriately modeled by time-varying parameters $\theta_1, \theta_2, \ldots$, such that the response $y_t$ to action $x_t$ is generated according to $p_{\theta_t}(\cdot \mid x_t)$. In such contexts, the agent should never stop exploring, since it needs to track changes as the system drifts. With minor modification, TS remains an effective approach so long as model parameters change little over durations that are sufficient to identify effective actions.
In principle, TS could be applied to a broad range of problems where the parameters θ 1 , θ 2 , θ 3 , ... evolve according to a stochastic process by using techniques from filtering and sequential Monte Carlo to generate posterior samples. Instead we describe below some much simpler approaches to such problems.
One simple approach to addressing nonstationarity involves ignoring historical observations made beyond some number τ of time periods in the past. With such an approach, at each time t , the agent would produce a posterior distribution based on the prior and conditioned only on the most recent τ actions and observations. Model parameters are sampled from this distribution, and an action is selected to optimize the associated model. The agent never ceases to explore, since the degree to which the posterior distribution can concentrate is limited by the number of observations taken into account. Theory supporting such an approach is developed in (Besbes et al. , 2014).
An alternative approach involves modeling the evolution of the belief distribution in a manner that discounts the relevance of past observations and tracks time-varying parameters $\theta_t$. We now consider such a model and a suitable modification of TS. Let us start with the simple context of a Bernoulli bandit. Take the prior for each $k$th mean reward to be beta$(\alpha, \beta)$. Let the algorithm update parameters to identify the belief distribution of $\theta_t$ conditioned on the history $H_{t-1} = ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}))$ according to (6.1):
$$\begin{array} { r } { ( \alpha _ { k } , \beta _ { k } ) \leftarrow \left \{ \begin{array} { l l } { \left ( ( 1 - \gamma ) \alpha _ { k } + \gamma \overline { \alpha } , ( 1 - \gamma ) \beta _ { k } + \gamma \overline { \beta } \right ) } & { x _ { t } \neq k } \\ { \left ( ( 1 - \gamma ) \alpha _ { k } + \gamma \overline { \alpha } + r _ { t } , ( 1 - \gamma ) \beta _ { k } + \gamma \overline { \beta } + 1 - r _ { t } \right ) } & { x _ { t } = k , } \end{array} } \end{array}$$
where $\gamma \in [0, 1]$ and $\overline{\alpha}, \overline{\beta} > 0$. This models a process for which the belief distribution converges to beta$(\overline{\alpha}, \overline{\beta})$ in the absence of observations: if $\gamma > 0$ then $(\alpha_k, \beta_k)$ converges to $(\overline{\alpha}, \overline{\beta})$. Intuitively, the process can be thought of as randomly perturbing model parameters in each time period, injecting uncertainty. The parameter $\gamma$ controls how quickly uncertainty is injected: at one extreme, when $\gamma = 0$, no uncertainty is injected; at the other extreme, $\gamma = 1$ and each $\theta_{t,k}$ is an independent beta$(\overline{\alpha}, \overline{\beta})$-distributed process. A modified version of Algorithm 3.2 can be applied to this nonstationary Bernoulli bandit problem, the differences being the additional arguments $\gamma$, $\overline{\alpha}$, and $\overline{\beta}$, and the formula used to update distribution parameters.
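A minimal sketch of this nonstationary variant (our own illustration; variable names are ours) applies the decaying update (6.1) before conditioning on each new observation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimal sketch: nonstationary TS for a Bernoulli bandit.  (alpha, beta)
# holds the current belief parameters; (a_bar, b_bar) are the stationary
# parameters toward which beliefs decay at rate gamma.

K, gamma = 3, 0.01
a_bar = b_bar = 1.0
alpha = np.ones(K)                 # prior beta(1, 1) for each arm
beta = np.ones(K)

def select_action(alpha, beta):
    theta_hat = rng.beta(alpha, beta)     # one posterior sample per arm
    return int(np.argmax(theta_hat))

def update(alpha, beta, x, r):
    # decay every arm's parameters toward the stationary values ...
    alpha = (1 - gamma) * alpha + gamma * a_bar
    beta = (1 - gamma) * beta + gamma * b_bar
    # ... then condition the played arm on the observed reward r in {0, 1}
    alpha[x] += r
    beta[x] += 1 - r
    return alpha, beta
```

Because the decay step caps how concentrated $(\alpha_k, \beta_k)$ can become, the sampled $\hat{\theta}$ values never collapse to a point and the agent never stops exploring.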
The more general form of TS presented in Algorithm 4.2 can be modified in an analogous manner. For concreteness, let us focus on the case where θ is restricted to a finite set; it is straightforward to extend things to infinite sets. The conditional distribution update in Algorithm 4.2 can be written as
$$p ( u ) \leftarrow \frac { p ( u ) q _ { u } ( y _ { t } | x _ { t } ) } { \sum _ { v } p ( v ) q _ { v } ( y _ { t } | x _ { t } ) } .$$
To model nonstationary model parameters, we can use the following alternative:
$$p ( u ) \leftarrow \frac { \overline { p } ^ { \gamma } ( u ) p ^ { 1 - \gamma } ( u ) q _ { u } ( y _ { t } | x _ { t } ) } { \sum _ { v } \overline { p } ^ { \gamma } ( v ) p ^ { 1 - \gamma } ( v ) q _ { v } ( y _ { t } | x _ { t } ) } .$$
This generalizes the formula provided earlier for the Bernoulli bandit case. Again, $\gamma$ controls the rate at which uncertainty is injected. The modified version of Algorithm 4.2, which we refer to as nonstationary TS, takes $\gamma$ and $\overline{p}$ as additional arguments and replaces the distribution update formula with the one above.
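For a finite parameter set, this tempered update is a one-liner; the sketch below (our own illustration; names are ours) makes the mechanics explicit:

```python
import numpy as np

# Minimal sketch: nonstationary posterior update over a finite parameter
# set.  The posterior p is pulled toward the stationary distribution p_bar
# (geometric mixing with exponent gamma) before the usual Bayes reweighting.

def nonstationary_update(p, p_bar, likelihoods, gamma):
    """p, p_bar, likelihoods: arrays indexed by parameter value u, with
    likelihoods[u] = q_u(y_t | x_t) for the observed (x_t, y_t)."""
    w = (p_bar ** gamma) * (p ** (1 - gamma)) * likelihoods
    return w / w.sum()

p = np.array([0.5, 0.3, 0.2])          # current posterior
p_bar = np.full(3, 1.0 / 3.0)          # stationary distribution
lik = np.array([0.9, 0.5, 0.1])        # q_u(y_t | x_t) for each u
p_new = nonstationary_update(p, p_bar, lik, gamma=0.1)
```

Setting $\gamma = 0$ recovers the standard conditional distribution update of Algorithm 4.2, while $\gamma = 1$ discards all accumulated evidence each period.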
Figure 6.3 illustrates potential benefits of nonstationary TS on a nonstationary Bernoulli bandit problem. In these simulations, belief distributions evolve according to Equation (6.1). The prior and stationary distributions are specified by $\alpha = \overline{\alpha} = \beta = \overline{\beta} = 1$. The decay rate is $\gamma = 0.01$. Each plotted point represents an average over 10,000 independent simulations. Regret here is defined by $\mathrm{regret}_t(\theta_t) = \max_k \theta_{t,k} - \theta_{t,x_t}$. While nonstationary TS updates its belief distribution in a manner consistent with the underlying system, TS pretends that the success probabilities are constant over time and updates its beliefs accordingly. As the system drifts over time, TS becomes less effective, while nonstationary TS retains reasonable performance. Note, however, that due to nonstationarity, no algorithm can promise regret that vanishes with time.
Figure 6.3: Comparison of TS versus nonstationary TS with a nonstationary Bernoulli bandit problem.
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Line Chart: Per-Period Regret vs. Time Period
### Overview
This image presents a line chart illustrating the per-period regret of two agents – one using stationary Thompson sampling (TS) and one using nonstationary TS – over a horizon of 1000 time periods. The chart compares the regret accumulation of the two algorithms.
### Components/Axes
* **X-axis:** "time period (t)", ranging from approximately 0 to 1000. The axis is linearly scaled.
* **Y-axis:** "per-period regret", ranging from approximately 0 to 0.25. The axis is linearly scaled.
* **Legend:** Located in the top-right corner, identifying the two data series:
* "nonstationary TS" (represented by a red line)
* "stationary TS" (represented by a blue line)
* **Gridlines:** A light gray grid is present to aid in reading values.
### Detailed Analysis
The chart displays two lines representing the per-period regret over time.
* **Nonstationary TS (Red Line):** The line starts at approximately 0.045 at t=0, rapidly decreases to a minimum of approximately 0.03 at t=100, and then fluctuates around 0.035-0.04 for the remainder of the time period, ending at approximately 0.036 at t=1000. The line exhibits a generally flat trend after the initial decrease.
* **Stationary TS (Blue Line):** The line begins at approximately 0.05 at t=0, decreases to a minimum of approximately 0.04 at t=100, and then gradually increases to approximately 0.055 by t=1000. The line shows a slight upward trend after the initial decrease.
### Key Observations
* Both agents experience a decrease in per-period regret initially.
* The nonstationary TS agent maintains a lower per-period regret throughout the entire time period compared to the stationary TS agent.
* The stationary TS agent's per-period regret shows a slight increasing trend over time, while the nonstationary TS agent's regret remains relatively stable after the initial decrease.
* The difference in per-period regret between the two agents becomes more pronounced as time progresses.
### Interpretation
The data suggests that the nonstationary TS agent achieves better regret than the stationary TS agent in this drifting environment. The initial decrease in regret for both agents likely represents a learning phase in which the agents adapt to the system. The subsequent stabilization of regret for the nonstationary agent indicates that it continues to track the changing dynamics of its environment. The gradual increase in regret for the stationary agent arises because its beliefs concentrate even as the underlying parameters keep drifting. The difference in performance highlights the importance of adapting to nonstationary environments in sequential decision-making. The chart demonstrates that an agent designed for nonstationarity can achieve lower long-term regret.
</details>
## 6.4 Concurrence
In many applications, actions are applied concurrently. As an example, consider a variation of the online shortest path problem of Example 4.1. In the original version of this problem, over each period, an agent selects and traverses a path from origin to destination, and upon completion, updates a posterior distribution based on observed edge traversal times.
Now consider a case in which, over each period, multiple agents travel between the same origin and destination, possibly along different paths, with the travel times experienced by agents along each edge $e$ conditionally independent given $\theta_e$. At the end of the period, agents update a common posterior distribution based on their collective experience. The paths represent concurrent actions, which should be selected in a manner that diversifies experience.
TS naturally suits this concurrent mode of operation. Given the posterior distribution available at the beginning of a time period, multiple independent samples can be drawn to produce paths for multiple agents. Figure 6.4 plots results from applying TS in this manner. Each simulation was carried out with $K$ agents navigating over each time period through a twenty-stage binomial bridge. Figure 6.4(a) demonstrates that the per-action regret experienced by each agent decays more rapidly with time as the number of agents grows, because each agent's learning is accelerated by shared observations. On the other hand, Figure 6.4(b) shows that per-action regret decays more slowly as a function of the number of actions taken so far by the collective of agents. This loss is due to the fact that the posterior distribution is updated only after $K$ concurrent actions are completed, so actions are not informed by observations generated by concurrent ones, as would be the case if the $K$ actions were applied sequentially.
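A minimal sketch of synchronous concurrent TS (our own illustration, using a Bernoulli bandit as a stand-in for the shortest-path problem; names are ours): each agent acts on its own independent posterior sample, and the shared posterior is updated once per period from all $K$ outcomes.

```python
import numpy as np

rng = np.random.default_rng(4)

# Minimal sketch: K agents share one beta-Bernoulli posterior.  Independent
# posterior draws diversify the agents' choices within a period; the common
# posterior is updated synchronously at the end of the period.

num_actions, K = 5, 10
alpha = np.ones(num_actions)
beta = np.ones(num_actions)

def concurrent_period(alpha, beta, true_probs):
    samples = rng.beta(alpha, beta, size=(K, num_actions))  # one draw per agent
    actions = samples.argmax(axis=1)                        # diversified actions
    rewards = (rng.random(K) < true_probs[actions]).astype(float)
    # synchronous update: posterior sees all K observations at period end
    for x, r in zip(actions, rewards):
        alpha[x] += r
        beta[x] += 1 - r
    return alpha, beta, actions
```

An asynchronous variant would instead sample and update on demand as each agent's observation arrives, as studied in (Kandasamy et al., 2018).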
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Chart: Per-Period and Per-Action Regret vs. Time/Actions
### Overview
The image presents two line charts comparing the regret of an agent under different values of 'K' (1, 10, 20, 50, 100). The left chart shows per-period regret against time period (t), while the right chart shows per-action regret against the number of actions. Both charts share the same color scheme for each 'K' value.
### Components/Axes
* **Left Chart:**
* X-axis: "time period (t)" ranging from 0 to 100.
* Y-axis: "per-period regret" ranging from 0 to 10.
* Legend (top-right): "agent" with labels:
* K = 1 (Red)
* K = 10 (Blue)
* K = 20 (Green)
* K = 50 (Light Blue)
* K = 100 (Orange)
* **Right Chart:**
* X-axis: "number of actions" ranging from 0 to 250.
* Y-axis: "per-action regret" ranging from 0 to 10.
* Legend (top-right): "agent", with K = 1 (red), K = 10 (blue), K = 20 (green), K = 50 (light blue), and K = 100 (orange), where K is the number of concurrent agents.
### Detailed Analysis
All curves in both charts start at a regret of approximately 9.5. In the left chart (per-period regret vs. time period), regret for K = 10, 20, and 50 drops to near zero within the first 10 to 20 periods, K = 100 declines somewhat more slowly, and K = 1 exhibits the highest and most persistent regret, still around 1.5 at t = 100. In the right chart (per-action regret vs. number of actions), the curves for K = 10 through K = 100 fall to near zero within the first 50 actions, while K = 1 continues to fluctuate between roughly 0.5 and 2.5 through 250 actions.
### Interpretation
Sharing observations across more concurrent agents drives per-period regret down more quickly, with diminishing returns beyond moderate values of K. Measured per action taken, performance across values of K is far more similar, and a single agent (K = 1) is the slowest to converge in both views.
</details>
- (a) per-period regret over time
- (b) per-action regret over actions
Figure 6.4: Performance of concurrent Thompson sampling.
As discussed in (Scott, 2010), concurrence plays an important role in web services, where at any time, a system may experiment by providing
different versions of a service to different users. Concurrent TS offers a natural approach for such contexts. The version discussed above involves synchronous action selection and posterior updating. In some applications, it is more appropriate to operate asynchronously, with actions selected on demand and the posterior distribution updated as data becomes available. The efficiency of synchronous and asynchronous variations of concurrent TS is studied in (Kandasamy et al. , 2018). There are also situations where an agent can alter an action based on recent experience of other agents, within a period before the action is complete. For example, in the online shortest path problem, an agent may decide to change course to avoid an edge if new observations made by other agents indicate a long expected travel time. Producing a version of TS that effectively adapts to such information while still exploring in a reliably efficient manner requires careful design, as explained in (Dimakopoulou and Van Roy, 2018).
## 7 Further Examples
As contexts for illustrating the workings of TS, we have presented the Bernoulli bandit and variations of the online shortest path problem. To more broadly illustrate the scope of TS and issues that arise in various kinds of applications, we present several additional examples in this section.
## 7.1 News Article Recommendation
Let us start with an online news article recommendation problem in which a website needs to learn to recommend personalized and context-sensitive news articles to its users, as has been discussed in (Li et al. , 2010) and (Chapelle and Li, 2011). The website interacts with a sequence of users, indexed by t ∈ { 1 , 2 , . . . } . In each round t , it observes a feature vector z t ∈ R d associated with the t th user, chooses a news article x t to display from among a set of k articles X = { 1 , . . . , k } , and then observes a binary reward r t ∈ { 0 , 1 } indicating whether the user liked this article.
The user's feature vector might, for example, encode the following information:
- The visiting user's recent activities, such as the news articles the user has read recently.
- The visiting user's demographic information, such as the user's gender and age.
- The visiting user's contextual information, such as the user's location and the day of week.
Interested readers can refer to Section 5.2.2 of (Li et al. , 2010) for an example of feature construction in a practical context.
Following Section 5 of (Chapelle and Li, 2011), we model the probability that a user with features z t likes a given article x t through a logit model. Specifically, each article x ∈ X is associated with a d-dimensional parameter vector θ x ∈ R d . Conditioned on x t , θ x t and z t , a positive review occurs with probability g ( z t T θ x t ), where g is the logistic function, given by g ( a ) = 1 / (1 + e^{-a} ). The per-period regret of this problem is defined by
$$\mathrm{regret}_t \left( \theta_1, \dots, \theta_K \right) = \max_{x \in \mathcal{X}} g( z_t^T \theta_x ) - g( z_t^T \theta_{x_t} ) \quad \forall t = 1, 2, \dots$$
and measures the gap in quality between the recommended article x t and the best possible recommendation that could be made based on the user's features. This model allows for generalization across users, enabling the website to learn to predict whether a user with given features z t will like a news article based on experience recommending that article to different users.
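To make the model concrete, here is a minimal Python sketch of the logit reward model and the per-period regret it induces; the function names are ours, chosen for illustration, not from the paper.

```python
import numpy as np

def logistic(a):
    """g(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def click_probability(z, theta_x):
    """Probability that a user with features z likes the article with parameter theta_x."""
    return logistic(z @ theta_x)

def per_period_regret(z, thetas, x_chosen):
    """regret_t = max_x g(z^T theta_x) - g(z^T theta_{x_chosen})."""
    values = [click_probability(z, th) for th in thetas]
    return max(values) - values[x_chosen]
```

Note that regret is zero exactly when the recommended article has the highest click probability for this user's features.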
As in the path recommendation problem treated in Section 5, this problem is not amenable to efficient exact Bayesian inference. Consequently, we applied two approximate Thompson sampling methods: one samples from a Laplace approximation of the posterior (see Section 5.2) and the other uses Langevin Monte Carlo to generate an approximate posterior sample (see Section 5.3). To offer a baseline, we also applied the ε-greedy algorithm, and searched over values of ε for the best performer.
We present simulation results for a simplified synthetic setting with K = |X| = 3 news articles and feature dimension d = 7. At each time t ∈ { 1 , 2 , · · · } , the feature vector z t has the constant 1 as its first component, and each of its other components is independently drawn from a Bernoulli distribution with success probability 1 / 6. Each component of z t could, for example, indicate presence of a particular feature, like whether the user is a woman or is accessing the site from within the United States, in which case the corresponding component of θ x would reflect whether users with this feature tend to enjoy article x more than other users, while the first component of θ x reflects the article's overall popularity.
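The synthetic feature distribution just described can be sketched as follows; the helper name is hypothetical, not code from the paper.

```python
import numpy as np

def sample_user_features(d=7, p=1.0 / 6.0, rng=None):
    """Feature vector z_t: first component is the constant 1,
    remaining d-1 components are independent Bernoulli(p)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.ones(d)
    z[1:] = rng.random(d - 1) < p  # booleans cast to 0.0 / 1.0
    return z
```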
Figure 7.1: Performance of different algorithms applied to the news article recommendation problem.
<details>
<summary>Image 15 Details</summary>

### Visual Description
A line chart of per-period regret (y-axis, 0 to 0.05) against the time period t (x-axis, 0 to 5000) for four agents: greedy (red), 0.01-greedy (orange), Langevin TS (light green), and Laplace TS (blue). All curves start near 0.048 and drop steeply at first. The greedy agent plateaus highest, fluctuating around 0.011; 0.01-greedy declines more gradually, reaching roughly 0.008 by t = 5000; Langevin TS and Laplace TS attain the lowest regret, reaching approximately 0.006 and 0.005 respectively by t = 5000.
### Interpretation
Both Thompson sampling approximations outperform the tuned ε-greedy baseline, which in turn outperforms the purely greedy agent, illustrating the value of exploration: the greedy agent exploits its current estimates and can get stuck with suboptimal recommendations, while the exploring agents continue to improve.
</details>
Figure 7.1 presents results from applying Laplace and Langevin Monte Carlo approximations of Thompson sampling, as well as the greedy and ε-greedy algorithms. The plots in Figure 7.1 are generated by averaging over 2,000 random problem instances. In each instance, the θ x 's were independently sampled from N (0 , I ), where I is the 7 × 7 identity matrix. Based on our simulations, the ε-greedy algorithm incurred the lowest regret with ε = 0.01. Even with this optimized value, it is substantially outperformed by Thompson sampling.
We conclude this section by discussing some extensions to the simplified model presented above. One major limitation is that the current model does not allow for generalization across news articles. The website needs to estimate θ x separately for each article x ∈ X , and can't leverage data on the appeal of other, related, articles when doing so. Since today's news websites have thousands or even millions
of articles, this is a major limitation in practice. Thankfully, alternative models allow for generalization across news articles as well as users. One such model constructs a feature vector z t,x that encodes features of the t th user, the article x , and possibly interactions between these. Because the feature vector also depends on x , it is without loss of generality to restrict attention to a parameter vector θ x = θ that is common across articles. The probability user t likes article x t is then given by g ( z t,x t ⊤ θ ). Such generalization models enable us to do 'transfer learning,' i.e. to use information gained by recommending one article to reduce the uncertainty about the weight vector of another article.
Another limitation of the model considered is that the news article set X is time-invariant. In practice, the set of relevant articles changes over time as fresh articles become available and existing articles become obsolete. With generalization across news articles, a time-varying article set, or both, the online news article recommendation problem remains a contextual bandit problem. As discussed in Section 6.2, all the algorithms discussed in this subsection remain applicable to those cases after suitable modifications.
## 7.2 Product Assortment
Let us next consider an assortment planning problem, in which an agent has an ample supply of each of n different products, indexed by i = 1 , 2 , . . . , n . The agent collects a profit of p i per unit sold of product i . In each period, the agent has the option of offering a subset of the products for sale. Products may be substitutes or complements, and therefore the demand for a product may be influenced by the other products offered for sale in the same period. In order to maximize her profit, the agent needs to carefully select the optimal set of products to offer in each period. We can represent the agent's decision variable in each period as a vector x ∈ { 0 , 1 } n where x i = 1 indicates that product i is offered and x i = 0 indicates that it is not. Upon offering an assortment containing product i in some period, the agent observes a random log-Gaussian-distributed demand d i . The mean of this log-Gaussian distribution depends on the entire assortment x and an uncertain matrix θ ∈ R n × n . In particular,
$$\log ( d _ { i } ) | \theta , x \sim N \left ( ( \theta x ) _ { i } , \sigma ^ { 2 } \right )$$
where σ 2 is a known parameter that governs the level of idiosyncratic randomness in realized demand across periods. For any product i contained in the assortment x ,
$$( \theta x ) _ { i } = \theta _ { i i } + \sum _ { j \neq i } x _ { j } \theta _ { i j } ,$$
where θ ii captures the demand rate for item i if it were the sole product offered and each θ ij captures the effect availability of product j has on demand for product i . When an assortment x is offered, the agent earns expected profit
$$\mathbb { E } \left [ \sum _ { i = 1 } ^ { n } p _ { i } x _ { i } d _ { i } | \theta , x \right ] = \sum _ { i = 1 } ^ { n } p _ { i } x _ { i } e ^ { ( \theta x ) _ { i } + \frac { \sigma ^ { 2 } } { 2 } } .$$
If θ were known, the agent would always select the assortment x that maximizes her expected profit in (7.1). However, when θ is unknown, the agent needs to learn to maximize profit by exploring different assortments and observing the realized demands.
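For small n, the optimal assortment under a known θ can be found by brute force over all 2^n subsets, using the fact that E[d_i] = exp((θx)_i + σ²/2) for log-Gaussian demand. A minimal sketch with hypothetical helper names:

```python
import itertools
import numpy as np

def expected_profit(x, theta, p, sigma2):
    """Expected profit of assortment x: sum_i p_i x_i exp((theta x)_i + sigma^2 / 2)."""
    mean_log = theta @ x
    return float(np.sum(p * x * np.exp(mean_log + sigma2 / 2.0)))

def best_assortment(theta, p, sigma2):
    """Exhaustive search over all 2^n assortments (feasible only for small n)."""
    n = len(p)
    best_x, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        v = expected_profit(x, theta, p, sigma2)
        if v > best_val:
            best_x, best_val = x, v
    return best_x, best_val
```

For instance, with two strongly substitutable products (large negative off-diagonal entries of θ), the search correctly offers only one of them rather than both.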
TS can be adopted as a computationally efficient solution to this problem. We assume the agent begins with a multivariate Gaussian prior over θ . Due to conjugacy properties of Gaussian and log-Gaussian distributions, the posterior distribution of θ remains Gaussian after any number of periods. At the beginning of each period t , the TS algorithm draws a sample ˆ θ t from this Gaussian posterior distribution. Then, the agent selects the assortment that would maximize her expected profit in period t if the sampled ˆ θ t were indeed the true parameter.
As in Examples 4.1 and 4.2, the mean and covariance matrix of the posterior distribution of θ can be updated in closed form. However, because θ is a matrix rather than a vector, the explicit form of the update is more complicated. To describe the update rule, we first introduce ¯ θ as the vectorized version of θ , generated by stacking the columns of θ on top of each other. Let x be the assortment selected in a period, let i 1 , i 2 , . . . , i k denote the products included in this assortment (i.e.,
supp( x ) = { i 1 , i 2 , . . . , i k } ) and z ∈ R k be defined element-wise as
$$z _ { j } = \ln ( d _ { i _ { j } } ) , \, j = 1 , 2 , \dots , k .$$
Let S be a k × n 'selection matrix' where S j,i j = 1 for j = 1 , 2 , . . . , k and all its other elements are 0. Also, define
$$W = x ^ { \top } \otimes S ,$$
where ⊗ denotes the Kronecker product of matrices. At the end of the current period, the posterior mean µ and covariance matrix Σ of ¯ θ are updated according to the following rules:
$$\mu \leftarrow \left( \Sigma^{-1} + \frac{1}{\sigma^2} W^\top W \right)^{-1} \left( \Sigma^{-1} \mu + \frac{1}{\sigma^2} W^\top z \right), \qquad \Sigma \leftarrow \left( \Sigma^{-1} + \frac{1}{\sigma^2} W^\top W \right)^{-1}.$$
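These update rules translate directly into code. The sketch below (names are ours) builds the selection matrix S and W = x^⊤ ⊗ S with numpy and applies one conjugate update to the mean and covariance of vec(θ); it inverts Σ explicitly for clarity, which is fine at this problem's small scale.

```python
import numpy as np

def posterior_update(mu, Sigma, x, log_demands, sigma2):
    """One conjugate Gaussian update of the posterior over vec(theta)
    after offering assortment x and observing log-demands of offered products.
    mu: current mean of vec(theta), columns stacked, length n^2.
    Sigma: current covariance, shape (n^2, n^2)."""
    n = len(x)
    offered = np.flatnonzero(x)             # supp(x) = {i_1, ..., i_k}
    k = len(offered)
    S = np.zeros((k, n))                    # selection matrix: S[j, i_j] = 1
    S[np.arange(k), offered] = 1.0
    W = np.kron(x.reshape(1, -1), S)        # W = x^T ⊗ S, so W vec(theta) = S theta x
    z = np.asarray(log_demands)             # z_j = log d_{i_j}
    prec = np.linalg.inv(Sigma) + (W.T @ W) / sigma2
    Sigma_new = np.linalg.inv(prec)
    mu_new = Sigma_new @ (np.linalg.inv(Sigma) @ mu + (W.T @ z) / sigma2)
    return mu_new, Sigma_new
```

The key identity is (x^⊤ ⊗ S) vec(θ) = S θ x, which extracts the mean log-demands of exactly the offered products.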
To investigate the performance of TS in this problem, we simulated a scenario with n = 6 and σ^2 = 0.04. We take the profit associated with each product i to be p i = 1 / 6. As the prior distribution, we assumed that the elements of θ are independent and Gaussian-distributed with mean 0, with the diagonal elements having a variance of 1 and the off-diagonal elements a variance of 0.2. To understand this choice, recall the impact of diagonal and off-diagonal elements of θ . The diagonal element θ ii controls the mean demand when only product i is available, and reflects the inherent quality or popularity of that item. The off-diagonal element θ ij captures the influence availability of product j has on mean demand for product i . Our choice of prior covariance encodes that the dominant effect on demand of a product is likely its own characteristics, rather than its interaction with any single other product. Figure 7.2 presents the performance of different learning algorithms in this problem. In addition to TS, we simulated the greedy and ε-greedy algorithms for various values of ε. We found that ε = 0.07 provides the best performance for ε-greedy in this problem.
As illustrated by this figure, the greedy algorithm performs poorly in this problem, while ε-greedy performs much better. We found that the performance of ε-greedy can be improved further by using an annealing exploration rate ε t = m / (m + t) in each period t ; our simulations suggest that m = 9 gives the best performance in this problem. Figure 7.2 shows that TS outperforms both variations of ε-greedy in this problem.
Figure 7.2: Regret experienced by different learning algorithms applied to product assortment problem.
<details>
<summary>Image 16 Details</summary>

### Visual Description
A line chart of per-period regret (y-axis, 0 to 30) against the time period t (x-axis, 0 to 500) for four agents: greedy (red), 0.07-greedy (gray), 9/(9+t)-greedy (green), and TS (purple). The greedy agent stays nearly flat at a regret of roughly 14.5 throughout. The 0.07-greedy agent declines from about 24 with large fluctuations, settling between 5 and 10. The annealed 9/(9+t)-greedy agent declines steadily, approaching a regret of about 0.5 by t = 500. TS drops fastest, reaching a regret of about 1 by t = 50 and remaining near zero thereafter.
### Interpretation
The constant exploration rate of 0.07-greedy prevents its regret from vanishing, while the decaying exploration rate of 9/(9+t)-greedy allows continued improvement; TS, which adapts its exploration to posterior uncertainty, converges to near-zero regret fastest of all.
</details>
## 7.3 Cascading Recommendations
We consider an online recommendation problem in which an agent learns to recommend a desirable list of items to a user . As a concrete example, the agent could be a search engine and the items could be web pages. We consider formulating this problem as a cascading bandit , in which user selections are governed by a cascade model , as is commonly used in the fields of information retrieval and online advertising (Craswell et al. , 2008).
A cascading bandit model is identified by a triple ( K,J,θ ), where K is the number of items, J ≤ K is the number of items recommended in each period, and θ ∈ [0 , 1] K is a vector of attraction probabilities . At the beginning of each t th period, the agent selects and presents to the user an ordered list x t ∈ { 1 , . . . , K } J . The user examines items in x t sequentially, starting from x t, 1 . Upon examining item x t,j , the user finds it attractive with probability θ x t,j . In the event that the user finds the item attractive, he selects the item and leaves the system. Otherwise, he carries on to examine the next item in the list, unless j = J , in
which case he has already considered all recommendations and leaves the system.
The agent observes y t = j if the user selects x t,j and y t = ∞ if the user does not click any item. The associated reward r t = r ( y t ) = 1 { y t ≤ J } indicates whether any item was selected. For each list x = ( x 1 , . . . , x J ) and θ ′ ∈ [0 , 1] K , let
$$\begin{array} { r } { h ( x , \theta ^ { \prime } ) = 1 - \prod _ { j = 1 } ^ { J } \left [ 1 - \theta _ { x _ { j } } ^ { \prime } \right ] . } \end{array}$$
Note that the expected reward at time t is E [ r t | x t , θ ] = h ( x t , θ ). The optimal solution x ∗ ∈ argmax x : | x | = J h ( x, θ ) consists of the J items with largest attraction probabilities. Per-period regret is given by regret t ( θ ) = h ( x ∗ , θ ) -h ( x t , θ ).
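A minimal sketch of the list attraction probability h and the optimal list; the helper names are ours, for illustration.

```python
import numpy as np

def h(x, theta):
    """Probability that at least one item in the ordered list x is clicked:
    h(x, theta) = 1 - prod_j (1 - theta_{x_j})."""
    return 1.0 - np.prod([1.0 - theta[k] for k in x])

def optimal_list(theta, J):
    """The J items with the largest attraction probabilities maximize h."""
    return list(np.argsort(theta)[::-1][:J])
```

Note that h depends only on which items appear in x, not their order, so any ordering of the top-J items is optimal under this reward.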
## Algorithm 7.1
CascadeUCB(K, J, α, β)
```
CascadeUCB(K, J, α, β)
1: for t = 1, 2, ... do
2:   #compute itemwise UCBs:
3:   for k = 1, ..., K do
4:     Compute UCB U_t(k) from (α_k, β_k)
5:   end for
6:
7:   #select and apply action:
8:   x_t ← argmax_{x : |x| = J} h(x, U_t)
9:   Apply x_t and observe y_t and r_t
10:
11:  #update sufficient statistics:
12:  for j = 1, ..., min{y_t, J} do
13:    α_{x_{t,j}} ← α_{x_{t,j}} + 1(j = y_t)
14:    β_{x_{t,j}} ← β_{x_{t,j}} + 1(j < y_t)
15:  end for
16: end for
```
## Algorithm 7.2
CascadeTS(K, J, α, β)
```
CascadeTS(K, J, α, β)
1: for t = 1, 2, ... do
2:   #sample model:
3:   for k = 1, ..., K do
4:     Sample θ̂_k ~ Beta(α_k, β_k)
5:   end for
6:
7:   #select and apply action:
8:   x_t ← argmax_{x : |x| = J} h(x, θ̂)
9:   Apply x_t and observe y_t and r_t
10:
11:  #update posterior:
12:  for j = 1, ..., min{y_t, J} do
13:    α_{x_{t,j}} ← α_{x_{t,j}} + 1(j = y_t)
14:    β_{x_{t,j}} ← β_{x_{t,j}} + 1(j < y_t)
15:  end for
16: end for
```
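For concreteness, one period of CascadeTS can be sketched in Python as follows; this is our illustrative translation of the pseudocode, with a simulated user whose clicks are driven by the true attraction probabilities.

```python
import numpy as np

def cascade_ts_step(alpha, beta, J, theta_true, rng):
    """One period of CascadeTS with independent Beta(alpha_k, beta_k) posteriors.
    theta_true drives the simulated user; alpha and beta are updated in place."""
    theta_hat = rng.beta(alpha, beta)              # sample a model
    x = np.argsort(theta_hat)[::-1][:J]            # top-J items by sampled probs
    y = np.inf                                     # user examines items in order
    for j, item in enumerate(x, start=1):
        if rng.random() < theta_true[item]:
            y = j                                  # click on the j-th item
            break
    for j, item in enumerate(x, start=1):          # update examined items only
        if j > min(y, J):
            break
        alpha[item] += 1 if j == y else 0          # click
        beta[item] += 1 if j < y else 0            # examined but not clicked
    return x, y
```

Looping this step over t = 1, 2, ... reproduces the pseudocode above; only items the user actually examined (positions 1 through min{y_t, J}) have their Beta statistics updated.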
Kveton et al. (2015) proposed learning algorithms for cascading bandits based on itemwise upper confidence bound (UCB) estimates. CascadeUCB (Algorithm 7.1) is a practical variant that allows for specification of prior parameters ( α, β ) that guide the early behavior of the algorithm. CascadeUCB computes a UCB U t ( k ) for each item k ∈ { 1 , . . . , K } and then chooses a list that maximizes h ( · , U t ), which represents an upper confidence bound on the list attraction probability. The list x t can be efficiently generated by choosing the J items
with highest UCBs. Upon observing the user's response, the algorithm updates the sufficient statistics ( α, β ), which count clicks and views for all the examined items. CascadeTS (Algorithm 7.2) is a Thompson sampling algorithm for cascading bandits. CascadeTS operates in a manner similar to CascadeUCB except that x t is computed based on the sampled attraction probabilities ˆ θ , rather than the itemwise UCBs U t .
In this section, we consider a specific form of UCB, which is defined by
$$U _ { t } ( k ) = \frac { \alpha _ { k } } { \alpha _ { k } + \beta _ { k } } + c \sqrt { \frac { 1 . 5 \log ( t ) } { \alpha _ { k } + \beta _ { k } } } ,$$
for k ∈ { 1 , . . . , K } , where α k / ( α k + β k ) represents the expected value of the attraction probability θ k , while the second term represents an optimistic boost that encourages exploration. Notice that the parameter c ≥ 0 controls the degree of optimism . When c = 1, the above-defined UCB reduces to the standard UCB1, which is considered and analyzed in the context of cascading bandits in (Kveton et al. , 2015). In practice, we can select c through simulations to optimize performance.
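This UCB is straightforward to compute; a minimal sketch with our own naming:

```python
import numpy as np

def itemwise_ucb(alpha, beta, t, c):
    """U_t(k) = alpha_k/(alpha_k + beta_k) + c * sqrt(1.5 log(t) / (alpha_k + beta_k)).
    c = 1 recovers UCB1; c = 0 reduces to the greedy posterior mean."""
    n = alpha + beta
    return alpha / n + c * np.sqrt(1.5 * np.log(t) / n)
```

Plugging U_t into h in place of a sampled θ̂ turns CascadeTS into CascadeUCB: the list x_t is formed from the J items with the highest values of U_t(k).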
Figure 7.3 presents results from applying CascadeTS and CascadeUCB based on UCB1. These results are generated by randomly sampling 1,000 cascading bandit instances with K = 1000 and J = 100, in each case sampling each attraction probability θ k independently from Beta(1, 40). For each instance, CascadeUCB and CascadeTS are applied over 20,000 time periods, initialized with ( α k , β k ) = (1 , 40). The plots show per-period regret averaged over the 1,000 simulations.
The results demonstrate that TS far outperforms this version of CascadeUCB. Why? An obvious reason is that h ( x, U t ) is far too optimistic. In particular, h ( x, U t ) represents the probability of a click if every item in x simultaneously takes on the largest attraction probability that is statistically plausible. However, due to the statistical independence of item attractions, the agent is unlikely to have substantially under-estimated the attraction probability of every item in x . As such, h ( x, U t ) tends to be far too optimistic. CascadeTS, on the other hand, samples components ˆ θ k independently across items. While any sample ˆ θ k might deviate substantially from its mean, it is unlikely that the
Figure 7.3: Comparison of CascadeTS and CascadeUCB with K = 1000 items and J = 100 recommendations per period.
<details>
<summary>Image 17 Details</summary>

### Visual Description
A line chart of per-period regret (y-axis, 0 to 0.105) against the time period t (x-axis, 0 to 20,000) for three agents: TS (red), UCB-best (blue), and UCB1 (green). TS drops rapidly from roughly 0.095 to about 0.01 within the first 1,000 periods and continues to decline thereafter. UCB-best, the tuned variant, also falls quickly and levels off at low regret. UCB1 starts near 0.085, briefly rises, and then declines only slowly, remaining well above the other two curves with a regret of roughly 0.045 at t = 20,000.
</details>
sampled attraction probability of every item in x greatly exceeds its mean. As such, the variability in h ( x, ˆ θ ) provides a much more accurate reflection of the magnitude of uncertainty.
The plot labeled 'UCB-best' in Figure 7.3 illustrates performance of CascadeUCB with c = 0.05, which approximately minimizes cumulative regret over 20,000 time periods. It is interesting that even after being tuned to the specific problem and horizon, the performance of CascadeUCB falls short of CascadeTS. A likely source of loss stems from the shape of confidence sets used by CascadeUCB. Note that the algorithm uses hyper-rectangular confidence sets, since the set of statistically plausible attraction probability vectors is characterized by a Cartesian product of item-level confidence intervals. However, the Bayesian central limit theorem suggests that 'ellipsoidal' confidence sets offer a more suitable choice. Specifically, as data is gathered, the posterior distribution over θ can be well approximated by a multivariate Gaussian, for which level sets are ellipsoidal. Losses due to the use of hyper-rectangular confidence sets have been studied through regret analysis in (Dani et al. , 2008) and through simple analytic examples in (Osband and Van Roy, 2017a).
It is worth noting that tuned versions of CascadeUCB do sometimes perform as well as or better than CascadeTS. Figure 7.4 illustrates an example of this. The setting is identical to that used to generate the results of Figure 7.3, except that K = 50 and J = 10, and cumulative regret is approximately optimized with c = 0.1. CascadeUCB with the optimally tuned c outperforms CascadeTS. This qualitative difference from the case of K = 1000 and J = 100 is likely due to the fact that hyper-rectangular sets offer poorer approximations of ellipsoids as the dimension increases. This phenomenon and its impact on regret align with theoretical results of (Dani et al., 2008). That said, CascadeUCB is somewhat advantaged in this comparison because it is tuned specifically for the setting and time horizon.
Figure 7.4: Comparison of CascadeTS and CascadeUCB with K = 50 items and J = 10 recommendations per period.
(Figure: per-period regret versus time period, over 5,000 periods, for TS, UCB-best, and UCB1.)
## 7.4 Active Learning with Neural Networks
Neural networks are widely used in supervised learning, where given an existing set of predictor-response data pairs, the objective is to produce a model that generalizes to accurately predict future responses conditioned on associated predictors. They are also increasingly being used to guide actions ranging from recommendations to robotic maneuvers. Active
learning is called for to close the loop by generating actions that do not solely maximize immediate performance but also probe the environment to generate data that accelerates learning. TS offers a useful principle upon which such active learning algorithms can be developed.
With neural networks or other complex model classes, computing the posterior distribution over models becomes intractable. Approximations are called for, and incremental updating is essential because fitting a neural network is a computationally intensive task in its own right. In such contexts, ensemble sampling offers a viable approach (Lu and Van Roy, 2017). In Section 5.6, we introduced a particular mechanism for ensemble sampling based on the bootstrap. In this section, we consider an alternative version of ensemble sampling and present results from (Lu and Van Roy, 2017) that demonstrate its application to active learning with neural networks.
To motivate our algorithm, let us begin by discussing how it can be applied to the linear bandit problem.
Example 7.1. (Linear Bandit) Let θ ∈ R^M be distributed according to a N(µ_0, Σ_0) prior. There is a set of K actions X ⊆ R^M. At each time t = 1, …, T, an action x_t ∈ X is selected, after which a reward r_t = y_t = θ^⊤ x_t + w_t is observed, where w_t ∼ N(0, σ_w²).
In this context, ensemble sampling is unwarranted, since exact Bayesian inference can be carried out efficiently via Kalman filtering. Nevertheless, the linear bandit offers a simple setting for explaining the workings of an ensemble sampling algorithm.
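For concreteness, the exact posterior N(µ_t, Σ_t) evolves according to the standard conjugate Gaussian update, which the ensemble update described next mirrors:

$$\Sigma _ { t + 1 } = \left ( \Sigma _ { t } ^ { - 1 } + x _ { t } x _ { t } ^ { \top } / \sigma _ { w } ^ { 2 } \right ) ^ { - 1 } , \qquad \mu _ { t + 1 } = \Sigma _ { t + 1 } \left ( \Sigma _ { t } ^ { - 1 } \mu _ { t } + x _ { t } y _ { t } / \sigma _ { w } ^ { 2 } \right ) .$$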
Consider maintaining a covariance matrix updated according to
$$\Sigma _ { t + 1 } = \left ( \Sigma _ { t } ^ { - 1 } + x _ { t } x _ { t } ^ { \top } / \sigma _ { w } ^ { 2 } \right ) ^ { - 1 } ,$$
and N models θ̄_t^1, …, θ̄_t^N, initialized with θ̄_1^1, …, θ̄_1^N each drawn independently from N(µ_0, Σ_0) and updated incrementally according to
$$\overline { \theta } _ { t + 1 } ^ { n } = \Sigma _ { t + 1 } \left ( \Sigma _ { t } ^ { - 1 } \overline { \theta } _ { t } ^ { n } + x _ { t } ( y _ { t } + \tilde { w } _ { t } ^ { n } ) / \sigma _ { w } ^ { 2 } \right ) ,$$
for n = 1, …, N, where (w̃_t^n : t = 1, …, T, n = 1, …, N) are independent N(0, σ_w²) random samples drawn by the updating algorithm. It is easy to show that the resulting parameter vectors satisfy
$$\overline { \theta } _ { t } ^ { n } = \arg \min _ { \nu } \left ( \frac { 1 } { \sigma _ { w } ^ { 2 } } \sum _ { \tau = 1 } ^ { t - 1 } ( y _ { \tau } + \tilde { w } _ { \tau } ^ { n } - x _ { \tau } ^ { \top } \nu ) ^ { 2 } + ( \nu - \overline { \theta } _ { 1 } ^ { n } ) ^ { \top } \Sigma _ { 0 } ^ { - 1 } ( \nu - \overline { \theta } _ { 1 } ^ { n } ) \right ) .$$
This admits an intuitive interpretation: each θ̄_t^n is a model fit to a randomly perturbed prior and randomly perturbed observations. As established in (Lu and Van Roy, 2017), for any deterministic sequence x_1, …, x_{t−1}, conditioned on the history, the models θ̄_t^1, …, θ̄_t^N are independent and identically distributed according to the posterior distribution of θ. In this sense, the ensemble approximates the posterior.
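The covariance and model updates above can be sketched in a few lines of code. The following is an illustrative implementation only; the problem sizes, the action set, and all variable names are our own choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem setup (hypothetical sizes).
M, K, N, T = 5, 20, 10, 1000            # dimension, actions, ensemble size, horizon
sigma_w = 1.0
mu0, Sigma0 = np.zeros(M), np.eye(M)
theta = rng.multivariate_normal(mu0, Sigma0)   # true parameter
X = rng.uniform(-1, 1, size=(K, M))            # action set

# Ensemble sampling: N models drawn from the prior, plus a shared covariance.
Sigma = Sigma0.copy()
models = rng.multivariate_normal(mu0, Sigma0, size=N)   # rows are theta_bar^n

for t in range(T):
    n = rng.integers(N)                        # sample a model uniformly at random
    x = X[np.argmax(X @ models[n])]            # act greedily w.r.t. the sampled model
    y = theta @ x + rng.normal(0, sigma_w)     # observe reward

    # Sigma_{t+1} = (Sigma_t^{-1} + x x^T / sigma_w^2)^{-1}
    Sigma_inv = np.linalg.inv(Sigma)
    new_Sigma = np.linalg.inv(Sigma_inv + np.outer(x, x) / sigma_w**2)

    # Each model is refit to the perturbed observation y + w_tilde^n.
    w_tilde = rng.normal(0, sigma_w, size=N)
    models = (new_Sigma @ (Sigma_inv @ models.T
                           + np.outer(x, y + w_tilde) / sigma_w**2)).T
    Sigma = new_Sigma
```

Note that selecting a uniformly random ensemble member and acting greedily with respect to it plays the role of posterior sampling in ordinary TS.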
The ensemble sampling algorithm we have described for the linear bandit problem motivates an analogous approach for the following neural network model.
Example 7.2. (Neural Network) Let g_θ : R^M → R^K denote a mapping induced by a neural network with weights θ. Suppose there are K actions X ⊆ R^M, which serve as inputs to the neural network, and the goal is to select inputs that yield desirable outputs. At each time t = 1, …, T, an action x_t ∈ X is selected, after which y_t = g_θ(x_t) + w_t is observed, where w_t ∼ N(0, σ_w² I). A reward r_t = r(y_t) is associated with each observation. Let θ be distributed according to a N(µ_0, Σ_0) prior. The idea here is that data pairs (x_t, y_t) can be used to fit a neural network model, while actions are selected to trade off between generating data pairs that reduce uncertainty in neural network weights and those that offer desirable immediate outcomes.
Consider an ensemble sampling algorithm that once again begins with N independent models with connection weights θ̄_1^1, …, θ̄_1^N sampled from a N(µ_0, Σ_0) prior. It could be natural here to let µ_0 = 0 and Σ_0 = σ_0² I for some variance σ_0² chosen so that the range of probable models spans plausible outcomes. To incrementally update parameters, at each time t, each nth model applies some number of stochastic gradient descent iterations to reduce a loss function of the form
$$\mathcal { L } _ { t } ( \nu ) = \frac { 1 } { \sigma _ { w } ^ { 2 } } \sum _ { \tau = 1 } ^ { t - 1 } ( y _ { \tau } + \tilde { w } _ { \tau } ^ { n } - g _ { \nu } ( x _ { \tau } ) ) ^ { 2 } + ( \nu - \overline { \theta } _ { 1 } ^ { n } ) ^ { \top } \Sigma _ { 0 } ^ { - 1 } ( \nu - \overline { \theta } _ { 1 } ^ { n } ) .$$
Figure 7.5 presents results from simulations involving a two-layer neural network, with a set of K actions X ⊆ R^M. The weights of the neural network, which we denote by w_1 ∈ R^{D×M} and w_2 ∈ R^D, are each drawn from N(0, λ). Let θ ≡ (w_1, w_2). The mean reward of an action x ∈ X is given by g_θ(x) = w_2^⊤ max(0, w_1 x). At each time step, we select an action x_t ∈ X and observe reward y_t = g_θ(x_t) + z_t, where z_t ∼ N(0, σ_z²). We used M = 100 for the input dimension, D = 50 for the dimension of the hidden layer, K = 100 actions, prior variance λ = 1, and noise variance σ_z² = 100. Each component of each action vector is sampled uniformly from [−1, 1], except for the last component, which is set to 1 to model a constant offset. All results are averaged over 100 realizations.
In our application of the ensemble sampling algorithm we have described, to facilitate gradient flow, we use leaky rectified linear units of the form max(0.01x, x) during training, though the target neural network is made up of regular rectified linear units as indicated above. In our simulations, each update was carried out with 5 stochastic gradient steps, with a learning rate of 10^{−3} and a minibatch size of 64.
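To make the perturbed-loss training step concrete, the following sketch performs manual gradient steps for a single ensemble member on a toy two-layer network. The dimensions, names, and minibatch here are our own simplified choices for illustration, not the exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions (much smaller than the M = 100, D = 50 of the text).
M, D = 8, 4
sigma_w, sigma_0 = 1.0, 1.0
lr = 1e-3

# One ensemble member: prior sample theta_bar_1^n = (W1_0, W2_0), current fit (W1, W2).
W1_0 = rng.normal(0, sigma_0, (D, M)); W2_0 = rng.normal(0, sigma_0, D)
W1, W2 = W1_0.copy(), W2_0.copy()

def forward(W1, W2, x):
    """Leaky-ReLU network used during training, as in the text."""
    z = W1 @ x
    h = np.maximum(0.01 * z, z)
    return W2 @ h, h

def sgd_step(W1, W2, batch):
    """One gradient step on the perturbed loss:
    (1/sigma_w^2) * sum (y + w_tilde - g(x))^2 plus the prior regularizer."""
    g1 = np.zeros_like(W1); g2 = np.zeros_like(W2)
    for x, y_pert in batch:                      # y_pert stands for y_tau + w_tilde_tau^n
        out, h = forward(W1, W2, x)
        dout = -2.0 * (y_pert - out) / sigma_w**2
        g2 += dout * h                           # gradient w.r.t. output weights
        dh = dout * W2 * np.where(W1 @ x > 0, 1.0, 0.01)
        g1 += np.outer(dh, x)                    # gradient w.r.t. input weights
    # Gradient of (nu - theta_bar_1)^T Sigma_0^{-1} (nu - theta_bar_1), Sigma_0 = sigma_0^2 I.
    g1 += 2.0 * (W1 - W1_0) / sigma_0**2
    g2 += 2.0 * (W2 - W2_0) / sigma_0**2
    return W1 - lr * g1, W2 - lr * g2

# Example: one step on a (hypothetical) perturbed minibatch of past observations.
batch = [(rng.uniform(-1, 1, M), rng.normal()) for _ in range(4)]
W1, W2 = sgd_step(W1, W2, batch)
```

In practice one would repeat such steps a few times per period for each of the N ensemble members, each with its own prior sample and its own perturbation noise.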
Figure 7.5: Bandit learning with an underlying neural network.
(Figure 7.5 panels, each plotting per-period regret against time period: (a) fixed ε-greedy for ε ∈ {0.01, 0.05, 0.1, 0.2, 0.3}; (b) annealing ε-greedy for ε = k/(k + t), k ∈ {10, 20, 30, 40, 50}; (c) ensemble TS with ensemble sizes {3, 10, 30, 100, 300}.)
Figure 7.5 illustrates the performance of several learning algorithms with an underlying neural network. Figure 7.5a demonstrates the performance of an ε-greedy strategy across various levels of ε. We find that we are able to improve performance with an annealing schedule ε = k/(k + t) (Figure 7.5b). However, we find that an ensemble sampling strategy outperforms even the best tuned ε-schedules (Figure 7.5c). Further, we see that ensemble sampling can perform well with remarkably few ensemble members. Ensemble sampling with fewer members leads to a greedier strategy, which can perform better over shorter horizons but is prone to premature and suboptimal convergence compared to true TS (Lu and Van Roy, 2017). In this problem, using an ensemble of as few as 30 members provides very good performance.
## 7.5 Reinforcement Learning in Markov Decision Processes
Reinforcement learning (RL) extends contextual online decision problems to allow for delayed feedback and long-term consequences (Sutton and Barto, 1998; Littman, 2015). Concretely, using the notation of Section 6.2, the response y_t to the action x_t depends on a context z_t; but we no longer assume that the evolution of the context z_{t+1} is independent of y_t. As such, the action x_t may affect not only the reward r(y_t) but also, through its effect on the context z_{t+1}, the rewards (r(y_{t′}))_{t′>t} of future periods. As a motivating example, consider a problem of sequential product recommendations x_t, where the customer response y_t is influenced not only by the quality of the product, but also by the history of past recommendations. The evolution of the context z_{t+1} is then directly affected by the customer response y_t; if a customer watched 'The Godfather' and loved it, then chances are probably higher they may enjoy 'The Godfather 2.'
Maximizing cumulative rewards in a problem with long-term consequences can require planning with regard to future rewards, rather than optimizing each period myopically. Similarly, efficient exploration in these domains can require balancing not only the information gained over a single period, but also the potential for future informative actions over subsequent periods. This sophisticated form of temporally-extended exploration, which can be absolutely critical for effective performance, is sometimes called deep exploration (Osband et al., 2017). TS can be applied successfully to reinforcement learning (Osband et al., 2013).
However, as we will discuss, special care must be taken with respect to the notion of a time period within TS to preserve deep exploration.
Consider a finite horizon Markov decision process (MDP) M = (S, A, R^M, P^M, H, ρ), where S is the state space, A is the action space, and H is the horizon. The agent begins in a state s_0, sampled from ρ, and over each timestep h = 0, …, H−1 the agent selects an action a_h ∈ A, receives a reward r_h ∼ R^M_{s_h,a_h}, and transitions to a new state s_{h+1} ∼ P^M_{s_h,a_h}. Here, R^M_{s_h,a_h} and P^M_{s_h,a_h} are probability distributions. A policy µ is a function mapping each state s ∈ S and timestep h = 0, …, H−1 to an action a ∈ A. The value function

$$V ^ { M } _ { \mu , h } ( s ) = \mathbb { E } \left [ \sum _ { j = h } ^ { H - 1 } r _ { j } ( s _ { j } , \mu ( s _ { j } , j ) ) \, \Big | \, s _ { h } = s \right ]$$

encodes the expected reward accumulated under µ over the remainder of the episode when starting from state s and timestep h. Finite horizon MDPs model delayed consequences of actions through the evolution of the state, but the scope of this influence is limited to within an individual episode.
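Equivalently, the value function can be computed by backward induction over timesteps, which is also how an optimal policy for a given MDP is obtained:

$$V ^ { M } _ { \mu , H } ( s ) = 0 , \qquad V ^ { M } _ { \mu , h } ( s ) = \mathbb { E } \left [ r _ { h } + V ^ { M } _ { \mu , h + 1 } ( s _ { h + 1 } ) \, \big | \, s _ { h } = s , a _ { h } = \mu ( s , h ) \right ] .$$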
Let us consider an episodic RL problem, in which an agent learns about R^M and P^M over episodes of interaction with an MDP. In each episode, the agent begins in a random state, sampled from ρ, and follows a trajectory, selecting actions and observing rewards and transitions over H timesteps. Immediately we should note that we have already studied a finite horizon MDP under different terminology in Example 1.2: the online shortest path problem. To see the connection, simply view each vertex as a state and the choice of edge as an action within a timestep. With this connection in mind, we can express the problem of maximizing the cumulative reward ∑_{k=1}^{K} ∑_{h=0}^{H−1} r(s_{kh}, a_{kh}) in a finite horizon MDP equivalently as an online decision problem over periods k = 1, 2, …, K, each involving the selection of a policy µ_k for use over an episode of interaction between the agent and the MDP. By contrast, a naive application of TS to reinforcement learning that samples a new policy for each timestep within an episode could be extremely inefficient, as it does not perform deep exploration.
Consider the example in Figure 7.6, where the underlying MDP is characterized by a long chain of states {s_{−N}, …, s_N} and only one of the far-left or far-right states is rewarding, each with equal probability; all other states produce zero reward, and the dynamics are known. Learning about the true dynamics of the MDP requires a consistent policy over
Figure 7.6: MDPs where TS with sampling at every timestep within an episode leads to inefficient exploration.
N steps right or N steps left; a variant of TS that resamples after each step would be exponentially unlikely to make it to either end within N steps (Osband et al. , 2017). By contrast, sampling only once prior to each episode and holding the policy fixed for the duration of the episode demonstrates deep exploration and results in learning the optimal policy within a single episode.
In order to apply TS to policy selection we need a way of sampling from the posterior distribution for the optimal policy. One efficient way to do this, at least with tractable state and action spaces, is to maintain a posterior distribution over the rewards R^M and the transition dynamics P^M at each state-action pair (s, a). To generate a sample for the optimal policy, simply take a single posterior sample of the rewards and transitions and then solve for the optimal policy for this sample. This is equivalent to sampling from the posterior distribution of the optimal policy, but may be computationally more efficient than maintaining that posterior distribution explicitly. Estimating a posterior distribution over rewards is no different from the setting of bandit learning that we have already discussed at length within this paper. The transition function looks a little different, but for transitions over a finite state space the Dirichlet distribution is a useful conjugate prior. It is a multidimensional generalization of the Beta distribution from Example 3.1. The Dirichlet prior over outcomes in S = {1, …, S} is specified by a positive vector of pseudo-observations α ∈ R^S_+; updates to the Dirichlet posterior can be performed simply by incrementing the appropriate component of α (Strens, 2000).
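The scheme just described can be sketched for a small finite MDP as follows. The instance, priors, and all names below are illustrative; for simplicity we assume unit-variance Gaussian reward noise with a standard normal prior on each mean reward, alongside the Dirichlet prior on transitions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, EPISODES = 5, 2, 6, 200

# A hypothetical true MDP, used only to generate feedback.
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # P_true[s, a] over next states
R_true = rng.normal(0, 1, size=(S, A))            # true mean rewards

# Posterior statistics: Dirichlet pseudo-counts for transitions; sum and
# count of observed rewards for the conjugate Gaussian posterior per (s, a).
alpha = np.ones((S, A, S))
r_sum = np.zeros((S, A)); r_cnt = np.zeros((S, A))

def solve(P, R):
    """Optimal policy for the sampled MDP (P, R) by backward induction."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                             # Q[s, a] = r(s,a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for _ in range(EPISODES):
    # Sample one MDP from the posterior, then follow its optimal policy all episode.
    P_hat = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                      for s in range(S)])
    R_hat = rng.normal(r_sum / (r_cnt + 1), 1.0 / np.sqrt(r_cnt + 1))
    policy = solve(P_hat, R_hat)

    s = rng.integers(S)                           # initial state (uniform rho here)
    for h in range(H):
        a = policy[h, s]
        r = R_true[s, a] + rng.normal()
        s_next = rng.choice(S, p=P_true[s, a])
        alpha[s, a, s_next] += 1                  # Dirichlet posterior update
        r_sum[s, a] += r; r_cnt[s, a] += 1        # Gaussian posterior update
        s = s_next
```

Because the sampled MDP is held fixed for the whole episode, the resulting policy is consistent over all H timesteps, which is exactly what deep exploration requires.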
In Figure 7.7 we present a computational comparison of TS with sampling per timestep versus per episode, applied to the example of Figure 7.6. Figure 7.7a compares the performance of sampling schemes where the agent has an informative prior that matches the true underlying system. As explained above, TS with sampling once per episode is guaranteed to learn the true MDP structure in a single episode. By contrast, sampling per timestep leads to uniformly random actions until either s_{−N} or s_N is visited, so the expected number of episodes before the first reward grows exponentially in N.
Figure 7.7: TS with sampling per timestep versus per episode.
(Figure 7.7 panels, each plotting log(episodes until average regret < 0.5) against log(chain length N) for TS with sampling per timestep versus per episode: (a) using an informed prior; (b) using an uninformed prior.)
The difference in performance demonstrated by Figure 7.7a is particularly extreme because the prior structure means that there is only value to deep exploration, and none to 'shallow' exploration (Osband et al., 2017). In Figure 7.7b we present results for TS on the same environment but with a uniform Dirichlet prior over transitions and a standard Gaussian prior over rewards for each state-action pair. With this prior structure, sampling per timestep is not as hopeless, but it still performs worse than sampling per episode. Once again, this difference increases with MDP problem size. Overall, Figure 7.7 demonstrates that the benefit of sampling per episode, rather than per timestep, can become arbitrarily large. As an additional benefit, this approach is also more computationally efficient, since we only need to solve for the optimal policy once per episode rather than at each timestep.
This more nuanced application of TS to RL is sometimes referred to as posterior sampling for reinforcement learning (PSRL) (Strens, 2000). Recent work has developed theoretical analyses of PSRL that guarantee strong expected performance over a wide range of environments (Osband et al. , 2013; Osband and Van Roy, 2014b; Osband and Van Roy, 2014a; Osband and Van Roy, 2017b; Ouyang et al. , 2017). This work builds on and extends theoretical results that will be discussed in Section 8.1.2. It is worth mentioning that PSRL fits in the broader family of Bayesian approaches to efficient reinforcement learning; we refer interested readers to the survey paper (Ghavamzadeh et al. , 2015).
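The per-episode sampling scheme behind PSRL can be sketched in a few lines. The following is a minimal illustration under stated assumptions (tabular MDP, known reward matrix, an independent Dirichlet posterior over each state-action transition row, finite-horizon episodes), not the exact procedure of any cited paper; all function and variable names are ours.

```python
import numpy as np

def solve_mdp(P, R, horizon):
    """Finite-horizon value iteration; returns a greedy policy per step."""
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V              # P has shape (S, A, S), so Q has shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def psrl_episode(alpha, R, env_step, s0, horizon, rng):
    """One PSRL episode: sample an MDP from the posterior once,
    solve it, then act greedily for the whole episode."""
    S, A = R.shape
    # Sample transition probabilities from the Dirichlet posteriors.
    P = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                  for s in range(S)])
    policy = solve_mdp(P, R, horizon)
    s = s0
    for h in range(horizon):
        a = policy[h, s]
        s_next = env_step(s, a, rng)
        alpha[s, a, s_next] += 1.0   # posterior update from the observed transition
        s = s_next
```

Sampling per timestep would instead redraw `P` and re-solve inside the inner loop, which is both statistically weaker (as Figure 7.7 illustrates) and more expensive.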
## 8 Why it Works, When it Fails, and Alternative Approaches
Earlier sections demonstrate that TS approaches can be adapted to address a number of problem classes of practical import. In this section, we provide intuition for why TS explores efficiently, and briefly review theoretical work that formalizes this intuition. We will then highlight problem classes for which TS is poorly suited, and refer to some alternative algorithms.
## 8.1 Why Thompson Sampling Works
To understand whether TS is well suited to a particular application, it is useful to develop a high-level understanding of why it works. As information is gathered, beliefs about action rewards are carefully tracked. By sampling actions according to the posterior probability that they are optimal, the algorithm continues to sample all actions that could plausibly be optimal, while shifting sampling away from those that are unlikely to be optimal. Roughly speaking, the algorithm tries all promising actions while gradually discarding those that are believed to underperform. This intuition is formalized in recent theoretical analyses of Thompson sampling, which we now review.
## 8.1.1 Regret Analysis for Classical Bandit Problems
Asymptotic Instance Dependent Regret Bounds. Consider the classical beta-Bernoulli bandit problem of Example 1.1. For this problem, sharp results on the asymptotic scaling of regret are available. The cumulative regret of an algorithm over T periods is
$$R e g r e t ( T ) = \sum _ { t = 1 } ^ { T } \left ( \max _ { 1 \leq k \leq K } \theta _ { k } - \theta _ { x _ { t } } \right ) ,$$
where K is the number of actions, x t ∈ { 1 , . . . , K } is the action selected at time t , and θ = ( θ 1 , . . . , θ K ) denotes action success probabilities. For each time horizon T , E [Regret( T ) | θ ] measures the expected T -period regret on the problem instance θ . The conditional expectation integrates over the noisy realizations of rewards and the algorithm's random action selection, holding fixed the success probabilities θ = ( θ 1 , . . . , θ K ). Though this is difficult to evaluate, one can show that
$$\lim _ { T \rightarrow \infty } \frac { \mathbb { E } [ R e g r e t ( T ) | \theta ] } { \log ( T ) } = \sum _ { k \neq k ^ { * } } \frac { \theta _ { k ^ { * } } - \theta _ { k } } { d _ { K L } ( \theta _ { k ^ { * } } | | \theta _ { k } ) } , \tag{8.1}$$
assuming that there is a unique optimal action k ∗ . Here, $d_{KL}(\theta \| \theta') = \theta \log\left(\frac{\theta}{\theta'}\right) + (1-\theta)\log\left(\frac{1-\theta}{1-\theta'}\right)$ is the Kullback-Leibler divergence between Bernoulli distributions. The fundamental lower bound of (Lai and Robbins, 1985) shows no algorithm can improve on this scaling, establishing a sense in which the algorithm is asymptotically optimal. That the regret of TS exhibits this scaling was first observed empirically by (Chapelle and Li, 2011). A series of papers provided proofs that formalize this finding (Agrawal and Goyal, 2012; Agrawal and Goyal, 2013a; Kaufmann et al. , 2012).
This result has been extended to cases where reward distributions are Gaussian or, more generally, members of a canonical one-dimensional exponential family (Honda and Takemura, 2014). It has also been extended to the case of Gaussian distributions with unknown variance by (Honda and Takemura, 2014), which further establishes that this result can fail to hold for a particular improper prior distribution. Although, intuitively, the effects of the prior distribution should wash out as T →∞ , all of these results apply to specific choices of uninformative
prior distributions. Establishing asymptotic optimality of TS for broader classes of prior distributions remains an interesting open issue.
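To make the asymptotic expression concrete, the coefficient of log( T ) in (8.1) can be computed directly. A minimal sketch, assuming a unique optimal action as in the text; function names are ours:

```python
import math

def d_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q) distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(theta):
    """Coefficient of log(T) in the asymptotic regret expression (8.1):
    sum over suboptimal arms of gap / KL divergence from the best arm."""
    best = max(theta)
    return sum((best - t) / d_kl(best, t) for t in theta if t != best)
```

Arms whose mean is close to the best contribute large terms, which already hints at the limitation for near-optimal actions discussed below.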
Instance-Independent Regret Bounds. While the results discussed above establish that the regret of TS is optimal in some sense, it is important to understand that this optimality is asymptotic. Focusing on asymptotic scaling enables sharp results, but even for problems with long time horizons, there are substantial performance differences among algorithms known to be asymptotically optimal in the sense of (8.1). The bound essentially focuses on a regime in which the agent is highly confident of which action is best but continues to occasionally explore in order to become even more confident. In particular, the bound suggests that for sufficiently large T , regret scales like
$$\mathbb { E } [ R e g r e t ( T ) | \theta ] \approx \sum _ { k \neq k ^ { * } } \frac { \theta _ { k ^ { * } } - \theta _ { k } } { d _ { K L } ( \theta _ { k ^ { * } } | | \theta _ { k } ) } \log ( T ) . \tag{8.2}$$
This becomes easier to interpret if we specialize to the case in which rewards, conditioned on θ , are Gaussian with unit variance, for which $d_{KL}(\theta \| \theta') = (\theta - \theta')^2 / 2$, and therefore,
$$\mathbb { E } [ R e g r e t ( T ) | \theta ] \approx \sum _ { k \neq k ^ { * } } \frac { 2 } { \theta _ { k ^ { * } } - \theta _ { k } } \log ( T ) .$$
The fact that the final expression is dominated by near-optimal actions reflects that in the relevant asymptotic regime other actions can be essentially ruled out using far fewer samples.
A more subtle issue is that O (log( T )) regret bounds like those described above become vacuous for problems with nearly-optimal actions, since the right-hand side of (8.2) can become arbitrarily large. This issue is particularly limiting for complex structured online decision problems, where there are often a large or even infinite number of near-optimal actions.
For the Bernoulli bandit problem of Example 1.1, (Agrawal and Goyal, 2013a) establishes that when TS is initialized with a uniform prior,
$$\max _ { \theta ^ { \prime } } \mathbb { E } [ R e g r e t ( T ) | \theta = \theta ^ { \prime } ] = O \left ( \sqrt { K T \log ( T ) } \right ) . \tag{8.3}$$
This regret bound holds uniformly over all problem instances, ensuring that there is no instance of a bandit problem with binary rewards that causes the regret of TS to explode. This bound is nearly order-optimal, in the sense that there exists a distribution over problem instances under which the expected regret of any algorithm is at least Ω( √ KT ) (Bubeck and Cesa-Bianchi, 2012).
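These guarantees can be probed empirically. Below is a minimal simulation sketch of beta-Bernoulli TS that estimates Regret( T ) for a single run; the Beta(1,1) priors match the uniform-prior setting of the bound above, and all names are ours.

```python
import numpy as np

def ts_bernoulli_regret(theta, T, rng):
    """Simulate Thompson sampling with Beta(1,1) priors on a Bernoulli
    bandit with success probabilities `theta`; return Regret(T)."""
    K = len(theta)
    alpha = np.ones(K)    # 1 + observed successes, per arm
    beta = np.ones(K)     # 1 + observed failures, per arm
    best = max(theta)
    regret = 0.0
    for _ in range(T):
        samples = rng.beta(alpha, beta)    # one posterior draw per arm
        x = int(samples.argmax())          # play the arm that looks best
        r = float(rng.random() < theta[x])
        alpha[x] += r
        beta[x] += 1.0 - r
        regret += best - theta[x]
    return regret
```

Averaging this quantity over many seeded runs approximates E[Regret( T ) | θ ], which can then be compared with the log( T ) and √ KT log( T ) scalings above.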
## 8.1.2 Regret Analysis for Complex Online Decision Problems
This tutorial has covered the use of TS to address an array of complex online decision problems. In each case, we first modeled the problem at hand, carefully encoding prior knowledge. We then applied TS, trusting it could leverage this structure to accelerate learning. The results described in the previous subsection are deep and interesting, but do not justify using TS in this manner.
We will now describe alternative theoretical analyses of TS that apply very broadly. These analyses point to TS's ability to exploit problem structure and prior knowledge, but also to settings where TS performs poorly.
## Problem Formulation
Consider the following general class of online decision problems. In each period t ∈ N , the agent selects an action x t ∈ X , observes an outcome y t , and associates this with a real-valued reward r ( y t ) that is a known function of the outcome. In the shortest path problem of Examples 4.1 and 4.2, x t is a path, y t is a vector encoding the time taken to traverse each edge in that path, and r t = r ( y t ) is the negative sum of these travel times. More generally, for each t , y t = g ( x t , θ, w t ) where g is some known function and ( w t : t ∈ N ) are i.i.d. and independent of θ . This can be thought of as a Bayesian model, where the random variable θ represents the uncertain true characteristics of the system and w t represents idiosyncratic randomness influencing the outcome in period t . Let
$$\mu ( x , \theta ) = \mathbb { E } [ r \left ( g ( x , \theta , w _ { t } ) \right ) | \theta ]$$
denote the expected reward generated by the action x under the parameter θ , where this expectation is taken over the disturbance w t . The agent's uncertainty about θ induces uncertainty about the identity of the optimal action x ∗ ∈ argmax x ∈X µ ( x, θ ).
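A generic TS step for this formulation can be sketched when the parameter space is a small finite grid, so that the posterior is an explicit weight vector. This is an illustrative sketch, not a general-purpose implementation; all names are ours, and the usage below instantiates the model with one known arm and one unknown Bernoulli arm.

```python
import numpy as np

def thompson_step(prior, thetas, actions, mu, likelihood, env_outcome, rng):
    """One period of TS for the general model y_t = g(x_t, theta, w_t).

    prior      : posterior weights over the rows of `thetas`
    mu         : mu(x, theta), expected reward under a parameter
    likelihood : P(y | x, theta), used for the Bayes update
    env_outcome: draws y from the true (unknown) system given x
    """
    # 1. Sample a parameter from the posterior.
    theta_hat = thetas[rng.choice(len(thetas), p=prior)]
    # 2. Act greedily with respect to the sampled parameter.
    x = max(actions, key=lambda a: mu(a, theta_hat))
    # 3. Observe an outcome and update the posterior by Bayes' rule.
    y = env_outcome(x, rng)
    posterior = prior * np.array([likelihood(y, x, th) for th in thetas])
    posterior /= posterior.sum()
    return x, y, posterior
```

For example, with action 0 paying Bernoulli(0.5) and action 1 paying Bernoulli(θ) for θ ∈ {0.2, 0.8}, the loop alternates sampling, acting, and updating exactly as in the formulation above.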
An algorithm is an adaptive, possibly randomized, rule for selecting an action as a function of the history of actions and observed outcomes. The expected cumulative regret of an algorithm over T periods is
$$\mathbb { E } \left [ R e g r e t ( T ) \right ] = \mathbb { E } \left [ \sum _ { t = 1 } ^ { T } \left ( \mu ( x ^ { * } , \theta ) - \mu ( x _ { t } , \theta ) \right ) \right ] .$$
This expectation is taken over draws of θ , the idiosyncratic noise terms ( w 1 , . . . , w T ), and the algorithm's internal randomization over actions. This is sometimes called the algorithm's Bayesian regret , since it is integrated over the prior distribution.
It is worth briefly discussing the interpretation of this regret measure. No single algorithm can minimize conditional expected regret E [Regret( T ) | θ = θ ′ ] for every problem instance θ ′ . As discussed in Section 6.1, one algorithm may have lower regret than another for one problem instance but have higher regret for a different problem instance. In order to formulate a coherent optimization problem, we must somehow scalarize this objective. We do this here by aiming to minimize integrated regret E [Regret( T )] = E [ E [Regret( T ) | θ ]]. Under this objective, the prior distribution over θ directs the algorithm to prioritize strong performance in more likely scenarios. Bounds on expected regret help certify that an algorithm has efficiently met this objective. An alternative choice is to bound worst-case regret max θ ′ E [Regret( T ) | θ = θ ′ ]. Certainly, bounds on worst-case regret imply bounds on expected regret, but targeting this objective rules out the use of flexible prior distributions, discarding one of TS's most useful features. In particular, designing an algorithm to minimize worst-case regret typically entails substantial sacrifice of performance with likely values of θ .
## Regret Bounds via UCB
One approach to bounding expected regret relies on the fact that TS shares a property of UCB algorithms that underlies many of their
theoretical guarantees. Let us begin by discussing how regret bounds are typically established for UCB algorithms.
A prototypical UCB algorithm generates a function U t based on the history H t -1 such that, for each action x , U t ( x ) is an optimistic but statistically plausible estimate of the expected reward, referred to as an upper-confidence bound. Then, the algorithm selects an action x t that maximizes U t . There are a variety of proposed approaches to generating U t for specific models. For example, (Kaufmann et al. , 2012) suggest taking U t ( x ) to be the (1 - 1 /t )th quantile of the posterior distribution of µ ( x, θ ). A simpler heuristic, which is nearly identical to the UCB1 algorithm presented and analyzed in (Auer et al. , 2002), selects actions to maximize $U_t(x) = \mathbb{E}[\mu(x, \theta) \,|\, H_{t-1}] + \sqrt{2\ln(t)/t_x}$, where t x is the number of times action x is selected prior to period t . If t x = 0, the confidence term is infinite, so each action is selected at least once. As experience with an action accumulates and ln( t ) /t x vanishes, U t ( x ) converges to E [ µ ( x, θ ) | H t -1 ], reflecting increasing confidence.
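The index just described can be written down in a few lines. This is a sketch of the heuristic as stated above, not of any particular library's implementation:

```python
import math

def ucb1_index(mean_estimate, t, t_x):
    """UCB1-style index U_t(x) = mean + sqrt(2 ln(t) / t_x).
    An untried action (t_x = 0) receives an infinite index, so every
    action is selected at least once; the confidence width shrinks as
    the action accumulates plays."""
    if t_x == 0:
        return math.inf
    return mean_estimate + math.sqrt(2.0 * math.log(t) / t_x)
```

A UCB algorithm then simply plays `argmax` of these indices each period, with `mean_estimate` taken from the posterior or an empirical average.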
With any choice of U t , regret over the period decomposes according to
$$\begin{array} { r l r } { \mu ( x ^ { * } , \theta ) - \mu ( \overline { x } _ { t } , \theta ) } & { = } & { \mu ( x ^ { * } , \theta ) - U _ { t } ( \overline { x } _ { t } ) + U _ { t } ( \overline { x } _ { t } ) - \mu ( \overline { x } _ { t } , \theta ) } \\ & { \leq } & { \underbrace { \mu ( x ^ { * } , \theta ) - U _ { t } ( x ^ { * } ) } _ { p e s s i m i s m } + \underbrace { U _ { t } ( \overline { x } _ { t } ) - \mu ( \overline { x } _ { t } , \theta ) } _ { w i d t h } . } \end{array}$$
The inequality follows from the fact that ¯ x t is chosen to maximize U t . If U t ( x ∗ ) ≥ µ ( x ∗ , θ ), which an upper-confidence bound should satisfy with high probability, the pessimism term is negative. The width term penalizes slack in the confidence interval at the selected action ¯ x t . For reasonable choices of U t , the width vanishes over time for actions that are selected repeatedly. Regret bounds for UCB algorithms are obtained by characterizing the rate at which this slack diminishes as actions are applied.
As established in (Russo and Van Roy, 2014b), expected regret bounds for TS can be produced in a similar manner. To understand why, first note that for any function U t that is determined by the history H t -1 ,
$$\mathbb { E } [ U _ { t } ( x _ { t } ) ] = \mathbb { E } [ \mathbb { E } [ U _ { t } ( x _ { t } ) | \mathbb { H } _ { t - 1 } ] ] = \mathbb { E } [ \mathbb { E } [ U _ { t } ( x ^ { * } ) | \mathbb { H } _ { t - 1 } ] ] = \mathbb { E } [ U _ { t } ( x ^ { * } ) ] . \tag{8.4}$$
The second equation holds because TS samples x t from the posterior distribution of x ∗ . Note that for this result, it is important that U t is determined by H t -1 . For example, although x t and x ∗ share the same marginal distribution, in general E [ µ ( x ∗ , θ )] ≠ E [ µ ( x t , θ )], since the joint distribution of ( x ∗ , θ ) is not identical to that of ( x t , θ ).
From Equation (8.4), it follows that
$$\begin{array} { r l } { \mathbb { E } \left [ \mu ( x ^ { * } , \theta ) - \mu ( x _ { t } , \theta ) \right ] } & { = } & { \mathbb { E } \left [ \mu ( x ^ { * } , \theta ) - U _ { t } ( x _ { t } ) \right ] + \mathbb { E } \left [ U _ { t } ( x _ { t } ) - \mu ( x _ { t } , \theta ) \right ] } \\ & { = } & { \underbrace { \mathbb { E } \left [ \mu ( x ^ { * } , \theta ) - U _ { t } ( x ^ { * } ) \right ] } _ { p e s s i m i s m } + \underbrace { \mathbb { E } \left [ U _ { t } ( x _ { t } ) - \mu ( x _ { t } , \theta ) \right ] } _ { w i d t h } . } \end{array}$$
If U t is an upper-confidence bound, the pessimism term should be negative, while the width term can be bounded by arguments identical to those that would apply to the corresponding UCB algorithm. Through this relation, many regret bounds that apply to UCB algorithms translate immediately to expected regret bounds for TS.
An important difference to take note of is that UCB regret bounds depend on the specific choice of U t used by the algorithm in question. With TS, on the other hand, U t plays no role in the algorithm and appears only as a figment of the regret analysis. This suggests that, while the regret of a UCB algorithm depends critically on the specific choice of upper-confidence bound, TS depends only on the best possible choice. This is a crucial advantage when there are complicated dependencies among actions, as designing and computing with appropriate upper-confidence bounds present significant challenges.
Several examples provided in (Russo and Van Roy, 2014b) demonstrate how UCB regret bounds translate to TS expected regret bounds. These include a bound that applies to all problems with a finite number of actions, as well as stronger bounds that apply when the reward function µ follows a linear model, a generalized linear model, or is sampled from a Gaussian process prior. As an example, suppose mean rewards follow the linear model µ ( x, θ ) = x ⊤ θ for x ∈ R d and θ ∈ R d and that reward noise is sub-Gaussian. It follows from the above relation that existing analyses (Dani et al. , 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al. , 2011) of UCB algorithms imply that under TS
$$\mathbb { E } \left [ R e g r e t ( T ) \right ] = O ( d \sqrt { T } \log ( T ) ) . \tag{8.5}$$
This bound applies for any prior distribution over a compact set of parameters θ . The big-O notation assumes several quantities are bounded by constants: the magnitude of feasible actions, the magnitude of θ realizations, and the variance proxy of the sub-Gaussian noise distribution. An important feature of this bound is that it depends on the complexity of the parameterized model through the dimension d , and not on the number of actions. Indeed, when there are a very large, or even infinite, number of actions, bounds like (8.3) are vacuous, whereas (8.5) may still provide a meaningful guarantee.
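For the linear model just discussed, TS admits a simple conjugate implementation. The sketch below assumes a Gaussian prior N(mean, cov) over θ and known Gaussian noise variance; names are ours, and the code is illustrative rather than optimized (it inverts the precision matrix at every step).

```python
import numpy as np

def linear_ts_step(mean, cov, actions, noise_var, env_reward, rng):
    """One period of TS for mu(x, theta) = x^T theta with a Gaussian
    posterior N(mean, cov) over theta and Gaussian reward noise."""
    theta_hat = rng.multivariate_normal(mean, cov)        # posterior sample
    x = max(actions, key=lambda a: float(a @ theta_hat))  # greedy w.r.t. sample
    r = env_reward(x, rng)
    # Conjugate Bayesian linear-regression update of the posterior.
    prec = np.linalg.inv(cov)
    prec_new = prec + np.outer(x, x) / noise_var
    cov_new = np.linalg.inv(prec_new)
    mean_new = cov_new @ (prec @ mean + x * (r / noise_var))
    return x, mean_new, cov_new
```

Note that the number of actions never enters the posterior computation, consistent with the bound's dependence on the dimension d rather than on |X|.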
In addition to providing a means for translating UCB to TS bounds, results of (Russo and Van Roy, 2014b; Russo and Van Roy, 2013) unify many of these bounds. In particular, it is shown that across a very broad class of online decision problems, both TS and well-designed UCB algorithms satisfy
$$\mathbb { E } \left [ R e g r e t ( T ) \right ] = \tilde { O } \left ( \sqrt { \underbrace { \dim _ { E } \left ( \mathcal { F } , T ^ { - 2 } \right ) } _ { \text{eluder dimension} } \underbrace { \log N \left ( \mathcal { F } , T ^ { - 2 } , \| \cdot \| _ { \infty } \right ) } _ { \text{log-covering number} } \, T } \right ) , \tag{8.6}$$
where F = { µ ( · , θ ) : θ ∈ Θ } is the set of possible reward functions, Θ is the set of possible parameter vectors θ , and ˜ O ignores logarithmic factors. This expression depends on the class of reward functions F through two measures of complexity. Each captures the approximate structure of the class of functions at a scale T -2 that depends on the time horizon. The first measures the growth rate of the covering numbers of F with respect to the maximum norm, and is closely related to measures of complexity that are common in the supervised learning literature. This quantity roughly captures the sensitivity of F to statistical overfitting. The second measure, the eluder dimension , captures how effectively the value of unobserved actions can be inferred from observed samples. This bound can be specialized to particular function classes. For example, when specialized to the aforementioned linear model, dim E ( F , T -2 ) = O ( d log( T )) and log ( N ( F , T -2 , ‖·‖ ∞ )) = O ( d log( T )), and it follows that
$$\mathbb { E } \left [ R e g r e t ( T ) \right ] = \tilde { O } ( d \sqrt { T } ) .$$
It is worth noting that, as established in (Russo and Van Roy, 2014b; Russo and Van Roy, 2013), notions of complexity common to the supervised learning literature such as covering numbers and Kolmogorov and Vapnik-Chervonenkis dimensions are insufficient for bounding regret in online decision problems. As such, the new notion of eluder dimension introduced in (Russo and Van Roy, 2014b; Russo and Van Roy, 2013) plays an essential role in (8.6).
## Regret Bounds via Information Theory
Another approach to bounding regret, developed in (Russo and Van Roy, 2016), leverages the tools of information theory. The resulting bounds more clearly reflect the benefits of prior knowledge, and the analysis points to shortcomings of TS and how they can be addressed by alternative algorithms. A focal point in this analysis is the notion of an information ratio , which for any model and online decision algorithm is defined by
$$\Gamma _ { t } = \frac { \left ( \mathbb { E } \left [ \mu ( x ^ { * } , \theta ) - \mu ( x _ { t } , \theta ) \right ] \right ) ^ { 2 } } { I \left ( x ^ { * } ; ( x _ { t } , y _ { t } ) | \mathbb { H } _ { t - 1 } \right ) } . \tag{8.7}$$
The numerator is the square of expected single-period regret, while in the denominator, the conditional mutual information I ( x ∗ ; ( x t , y t ) | H t -1 ) between the uncertain optimal action x ∗ and the impending observation ( x t , y t ) measures expected information gain. 1
The information ratio depends on both the model and algorithm and can be interpreted as an expected 'cost' incurred per bit of information acquired. If the information ratio is small, an algorithm can only incur large regret when it is expected to gain a lot of information about which action is optimal. This suggests that expected regret is bounded in terms of the maximum amount of information any algorithm could
1 An alternative definition of the information ratio - the squared expected regret $\left(\mathbb{E}[\mu(x^*, \theta) - \mu(x_t, \theta) \,|\, H_{t-1} = h_{t-1}]\right)^2$ divided by the mutual information $I(x^*; (x_t, y_t) \,|\, H_{t-1} = h_{t-1})$, both conditioned on a particular history h t -1 - was used in the original paper on this topic (Russo and Van Roy, 2016). That paper established bounds on the information ratio that hold uniformly over possible realizations of h t -1 . It was observed in (Russo and Van Roy, 2018b) that the same bounds apply with the information ratio defined as in (8.7), which integrates over h t -1 . The presentation here mirrors the more elegant treatment of these ideas in (Russo and Van Roy, 2018b).
expect to acquire, which is at most the entropy of the prior distribution of the optimal action. The following regret bound from (Russo and Van Roy, 2016), which applies to any model and algorithm, formalizes this observation:
$$\mathbb { E } \left [ R e g r e t ( T ) \right ] \leq \sqrt { \overline { \Gamma } H ( x ^ { * } ) T } , \tag{8.8}$$
where $\overline{\Gamma} = \max _ { t \in \{ 1 , \ldots , T \} } \Gamma _ { t }$. An important feature of this bound is its dependence on initial uncertainty about the optimal action x ∗ , measured in terms of the entropy H ( x ∗ ). This captures the benefits of prior information in a way that is missing from previous regret bounds.
A simple argument establishes bound (8.8):
$$\begin{array} { r c l } { \mathbb { E } \left [ R e g r e t ( T ) \right ] } & { = } & { \sum _ { t = 1 } ^ { T } \mathbb { E } \left [ \mu ( x ^ { * } , \theta ) - \mu ( x _ { t } , \theta ) \right ] } \\ & { \leq } & { \sqrt { \overline { \Gamma } \, T \sum _ { t = 1 } ^ { T } I \left ( x ^ { * } ; ( x _ { t } , y _ { t } ) | \mathbb { H } _ { t - 1 } \right ) } , } \end{array}$$
where the inequality follows from Jensen's inequality and the fact that $\Gamma_t \leq \overline{\Gamma}$. Intuitively, I ( x ∗ ; ( x t , y t ) | H t -1 ) represents the expected information gained about x ∗ , and the sum over periods cannot exceed the entropy H ( x ∗ ). Applying this relation, which is formally established in (Russo and Van Roy, 2016) via the chain rule of mutual information, we obtain (8.8).
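The quantities in this argument can be computed exactly for a toy problem. The sketch below considers two Bernoulli actions whose mean vector is either (0.7, 0.3) or (0.3, 0.7), and evaluates the period-1 expected regret, information gain, and information ratio of TS by direct enumeration; the numbers and names are ours, and the final check is consistent with the K/2 bandit bound of (Russo and Van Roy, 2016).

```python
import math

def toy_information_ratio(p):
    """Period-1 expected regret, information gain I(x*; (x_1, y_1)), and
    information ratio of TS for a two-action Bernoulli bandit whose mean
    vector is (0.7, 0.3) w.p. p and (0.3, 0.7) w.p. 1 - p."""
    models = [((0.7, 0.3), p), ((0.3, 0.7), 1.0 - p)]
    # Under TS, P(x_1 = k) equals the posterior probability that k is
    # optimal: x* = 0 under the first model and x* = 1 under the second.
    p_act = [p, 1.0 - p]
    regret = sum(w * p_act[x] * (max(means) - means[x])
                 for means, w in models for x in (0, 1))
    info = 0.0
    for x in (0, 1):
        for y in (0, 1):
            # joint[star] = P(x* = star, x_1 = x, y_1 = y); under TS the
            # action is drawn independently of theta given the history.
            joint = [w * p_act[x] * (means[x] if y else 1.0 - means[x])
                     for means, w in models]
            pxy = sum(joint)
            for star, (means, w) in enumerate(models):
                if joint[star] > 0.0:
                    info += joint[star] * math.log(joint[star] / (w * pxy))
    return regret, info, regret ** 2 / info
```

By construction the single-period regret equals the square root of the ratio times the information gain, mirroring the step used in the bound above.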
It may be illuminating to interpret the bound in the case of TS applied to a shortest path problem. Here, r t is the negative travel time of the path selected in period t and we assume the problem has been appropriately normalized so that r t ∈ [ -1 , 0] almost surely. For a problem with d edges, θ ∈ R d encodes the mean travel time along each edge, and x ∗ = x ∗ ( θ ) denotes the shortest path under θ . As established in (Russo and Van Roy, 2016), the information ratio can be bounded above by d/ 2, and therefore, (8.8) specializes to E [Regret( T )] ≤ √ dH ( x ∗ ) T/ 2 . Note that the number of actions in the problem is the number of paths, which can be exponential in the number of edges. This bound reflects
two ways in which TS is able to exploit the problem's structure to nevertheless learn efficiently. First, it depends on the number of edges d rather than the number of paths. Second, it depends on the entropy H ( x ∗ ) of the decision-maker's prior over which path is shortest. Entropy is never larger than the logarithm of the number of paths, but can be much smaller if the agent has an informed prior over which path is shortest. Consider for instance the discussion following Example 4.1, where the agent had knowledge of the distance of each edge and believed a priori that longer edges were likely to require greater travel time; this prior knowledge reduces the entropy of the agent's prior, and the bound formalizes that this prior knowledge improves performance. Stronger bounds apply when the agent receives richer feedback in each time period. At one extreme, the agent observes the realized travel time along every edge in that period, including those she did not traverse. In that case, (Russo and Van Roy, 2016) establishes that the information ratio is bounded by 1 / 2, and therefore, E [Regret( T )] ≤ √ H ( x ∗ ) T/ 2. The paper also defines a class of problems where the agent observes the time to traverse each individual edge along the chosen path and establishes that the information ratio is bounded by d/ 2 m and E [Regret( T )] ≤ √ dH ( x ∗ ) T/ 2 m , where m is the maximal number of edges in a path.
The three aforementioned bounds of d/ 2, 1 / 2 and d/ 2 m on information ratios reflect the impact of each problem's information structure on the regret-per-bit of information acquired by TS about the optimum. Subsequent work has established bounds on the information ratio for problems with convex reward functions (Bubeck and Eldan, 2016) and for problems with graph structured feedback (Liu et al. , 2017).
The bound of (8.8) can become vacuous as the number of actions increases due to the dependence on entropy. In the extreme, the entropy H ( x ∗ ) can become infinite when there are an infinite number of actions. It may be possible to derive alternative information-theoretic bounds that depend instead on a rate-distortion function. In this context, a rate-distortion function should capture the amount of information required to deliver near-optimal performance. Connections between rate-distortion theory and online decision problems have been established in (Russo and Van Roy, 2018b), which studies a variation of TS that aims to
learn satisficing actions. Use of rate-distortion concepts to analyze the standard version of TS remains an interesting direction for further work.
## Further Regret Analyses
Let us now discuss some alternatives to the regret bounds described above. For linear bandit problems, (Agrawal and Goyal, 2013b) provides an analysis of TS with an uninformative Gaussian prior. Their results yield a bound on worst-case expected regret of $\max _ { \theta ' : \| \theta ' \| _ 2 \leq 1 } \mathbb{E}[Regret(T) \,|\, \theta = \theta'] = \tilde{O}(d^{3/2}\sqrt{T})$. Due to technical challenges in the proof, this bound does not actually apply to TS with proper posterior updating, but instead to a variant that inflates the variance of posterior samples. This leads to an additional $d^{1/2}$ factor in this bound relative to that in (8.5). It is an open question whether a worst-case regret bound can be established for standard TS in this context, without requiring any modification to the posterior samples. Recent work has revisited this analysis and provided improved proof techniques (Abeille and Lazaric, 2017). Furthering this line of work, (Agrawal et al. , 2017) study an assortment optimization problem and provide worst-case regret bounds for an algorithm that is similar to TS but samples from a modified posterior distribution. Following a different approach, (Gopalan et al. , 2014) provides an asymptotic analysis of Thompson sampling for parametric problems with finite parameter spaces. Another recent line of theoretical work treats extensions of TS to reinforcement learning (Osband et al. , 2013; Gopalan and Mannor, 2015; Osband et al. , 2016b; Kim, 2017).
## 8.1.3 Why Randomize Actions
TS is a stationary randomized strategy : randomized in that each action is randomly sampled from a distribution and stationary in that this action distribution is determined by the posterior distribution of θ and otherwise independent of the time period. It is natural to wonder whether randomization plays a fundamental role or if a stationary deterministic strategy can offer similar behavior. The following example from (Russo and Van Roy, 2018a) sheds light on this matter.
Example 8.1. (A Known Standard) Consider a problem with two actions X = { 1 , 2 } and a binary parameter θ that is distributed Bernoulli( p 0 ). Rewards from action 1 are known to be distributed Bernoulli(1 / 2). The distribution of rewards from action 2 is Bernoulli(3 / 4) if θ = 1 and Bernoulli(1 / 4) if θ = 0.
Consider a stationary deterministic strategy for this problem. With such a strategy, each action x t is a deterministic function of p t -1 , the probability that θ = 1 conditioned on H t -1 . Suppose that for some p 0 > 0, the strategy selects x 1 = 1. Since the resulting reward is uninformative, p t = p 0 and x t = 1 for all t , and thus, expected cumulative regret grows linearly with time. If, on the other hand, x 1 = 2 for all p 0 > 0, then x t = 2 for all t , which again results in expected cumulative regret that grows linearly with time. It follows that, for any deterministic stationary strategy, there exists a prior probability p 0 such that expected cumulative regret grows linearly with time. As such, for expected cumulative regret to exhibit a sublinear horizon dependence, as is the case with the bounds we have discussed, a stationary strategy must randomize actions. Alternatively, one can satisfy such bounds via a strategy that is deterministic but nonstationary, as is the case with typical UCB algorithms.
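Example 8.1 is easy to simulate. The sketch below compares the deterministic greedy stationary rule with TS on the particular instance p 0 = 0.4 and θ = 1; these numbers and all names are ours. As argued above, the greedy rule locks onto action 1 and incurs regret T / 4, while TS randomizes and escapes.

```python
import numpy as np

def known_standard_regret(T, p0, theta, strategy, rng):
    """Cumulative regret in Example 8.1. Action 1 pays Bernoulli(1/2);
    action 2 pays Bernoulli(3/4) if theta = 1 and Bernoulli(1/4) if 0."""
    p = p0                                  # posterior P(theta = 1)
    optimal = 0.75 if theta == 1 else 0.5
    regret = 0.0
    for _ in range(T):
        if strategy == "greedy":            # deterministic stationary rule
            x = 2 if 0.25 + p / 2 > 0.5 else 1
        else:                               # TS: sample theta from posterior
            x = 2 if rng.random() < p else 1
        if x == 1:
            mean = 0.5                      # uninformative: p is unchanged
        else:
            mean = 0.75 if theta == 1 else 0.25
            y = float(rng.random() < mean)
            like1 = 0.75 if y == 1 else 0.25
            like0 = 0.25 if y == 1 else 0.75
            p = p * like1 / (p * like1 + (1 - p) * like0)
        regret += optimal - mean
    return regret
```

With p 0 = 0.4 the greedy rule's posterior mean for action 2 is 0.45 < 0.5, so it plays the uninformative action 1 forever, accumulating regret 0.25 per period.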
## 8.2 Limitations of Thompson Sampling
TS is effective across a broad range of problems, but there are contexts in which TS leaves a lot of value on the table. We now highlight four problem features that are not adequately addressed by TS.
## 8.2.1 Problems that do not Require Exploration
We start with the simple observation that TS is a poor choice for problems where learning does not require active exploration. In such contexts, TS is usually outperformed by greedier algorithms that do not invest in costly exploration. As an example, consider the problem of selecting a portfolio made up of publicly traded financial securities. This can be cast as an online decision problem. However, since historical returns
are publicly available, it is possible to backtest trading strategies, eliminating the need to engage in costly real-world experimentation. Active information gathering may become important, though, for traders who trade large volumes of securities over short time periods, substantially influencing market prices, or when information is more opaque, such as in dark pools.
In contextual bandit problems, even when actions influence observations, randomness of context can give rise to sufficient exploration so that additional active exploration incurs unnecessary cost. Results of (Bastani et al. , 2018) formalize conditions under which greedy behavior is effective because of passive exploration induced by contextual randomness. The following example captures the essence of this phenomenon.
Example 8.2. (Contextual Linear Bandit) Consider two actions X = { 1 , 2 } and parameters θ 1 and θ 2 that are independent and standard-Gaussian-distributed. A context z t is associated with each time period t and is drawn independently from a standard Gaussian. In period t , the agent selects an action x t based on the prevailing context z t , as well as observed history, and then observes a reward r t = z t θ x t + w t , where w t is i.i.d. zero-mean noise.
Consider selecting a greedy action x t for this problem. Given point estimates ˆ θ 1 and ˆ θ 2 , assuming ties are broken randomly, each action is selected with equal probability, with the choice determined by the random context. This probing of both actions alleviates the need for active exploration, which would decrease immediate reward. It is worth noting, though, that active exploration can again become essential if the context variables are binary-valued with z t ∈ { 0 , 1 } . In particular, if the agent converges on a point estimate ˆ θ 1 = θ 1 > 0, and action 2 is optimal but with an erroneous negative point estimate ˆ θ 2 < 0 < θ 2 , a greedy strategy may repeatedly select action 1 and never improve its estimate for action 2. The greedy strategy faces similar difficulties with a reward function of the form r t = z t,x t θ x t + θ x t + w t , which entails learning offset parameters θ 1 and θ 2 , even if context variables are standard-Gaussian-distributed. For example, if θ 1 < θ 2 and θ 2 is sufficiently underestimated, as the distributions of θ 1 and θ 2 concentrate around 0, a greedy strategy takes increasingly long to recover. In the extreme
case where θ 1 = θ 2 = 0 with probability 1, the problem reduces to one with independent actions and Gaussian noise, and the greedy policy may never recover. It is worth noting that the news recommendation problem of Section 7.1 involves a contextual bandit that embodies both binary context variables and offset parameters.
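This passive-exploration effect is easy to see in simulation. The sketch below is our own illustrative toy, not code from the paper: a purely greedy agent with running least-squares estimates faces Example 8.2 with standard-Gaussian contexts. Because the sign of z t is random, both actions are played often and both estimates converge without any active exploration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(2)   # true parameters theta_1, theta_2
sz_r = np.zeros(2)               # per-action running sum of z * r
sz2 = np.zeros(2)                # per-action running sum of z^2
counts = np.zeros(2, dtype=int)

for t in range(5000):
    z = rng.standard_normal()    # fresh standard-Gaussian context
    # Least-squares point estimates (0 for an action never tried).
    theta_hat = np.where(sz2 > 0, sz_r / np.where(sz2 > 0, sz2, 1.0), 0.0)
    vals = z * theta_hat         # greedy: maximize estimated reward
    x = int(rng.integers(2)) if vals[0] == vals[1] else int(np.argmax(vals))
    r = z * theta[x] + 0.1 * rng.standard_normal()   # noisy reward
    sz_r[x] += z * r
    sz2[x] += z * z
    counts[x] += 1

theta_hat = sz_r / sz2           # final estimates for both actions
```

With a symmetric context distribution the greedy rule selects each action about half the time (the sign of z t decides which), so both parameters end up estimated accurately; with binary contexts z t ∈ { 0 , 1 } this self-correction disappears, as the example explains.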
## 8.2.2 Problems that do not Require Exploitation
At the other extreme, TS may also be a poor choice for problems that do not require exploitation. For example, consider a classic simulation optimization problem. Given a realistic simulator of some stochastic system, we would like to identify, among a finite set of actions, the best according to a given objective function. Simulation can be expensive, so we would like to intelligently and adaptively allocate simulation effort so that the best choice can be rapidly identified. Though this problem requires intelligent exploration, exploration need not be balanced against a desire to accrue high rewards while experimenting. This problem is called ranking and selection in the simulation optimization community and either best-arm identification or a pure-exploration problem in the multi-armed bandit literature. It is possible to perform much better than TS for such problems. The issue is that once TS is fairly confident of which action is best, it exploits this knowledge and plays that action in nearly all periods. As a result, it is very slow to refine its knowledge of alternative actions. Thankfully, as shown by (Russo, 2016), there is a simple modification to TS that addresses this issue. The resulting pure-exploration variant of TS dramatically outperforms standard TS, and is in some sense asymptotically optimal for this best-arm identification problem. It is worth highlighting that although TS is often applied to A/B testing problems, this pure-exploration variant of the algorithm may be a more appropriate choice.
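As a rough illustration of the kind of modification studied in (Russo, 2016), often called top-two Thompson sampling, here is a minimal Beta-Bernoulli sketch; the function name and parameter choices are our own. Instead of always playing the arm a posterior sample ranks best, it periodically forces a "challenger" arm to be measured, so knowledge of the leading alternatives keeps being refined.

```python
import numpy as np

def top_two_thompson_step(alpha, beta, psi=0.5, rng=None, max_resample=100):
    """Choose one arm by a top-two Thompson sampling rule (Beta-Bernoulli).

    With probability psi, play the arm a posterior sample ranks best;
    otherwise keep resampling until a *different* arm comes out on top
    and play that challenger, forcing continued measurement of the
    closest competitors instead of near-pure exploitation.
    """
    rng = rng if rng is not None else np.random.default_rng()
    best = int(np.argmax(rng.beta(alpha, beta)))
    if rng.random() < psi:
        return best
    for _ in range(max_resample):
        challenger = int(np.argmax(rng.beta(alpha, beta)))
        if challenger != best:
            return challenger
    return best  # posterior nearly degenerate: fall back to the leader

# Driver: identify the best of three Bernoulli arms.
rng = np.random.default_rng(1)
p = [0.3, 0.5, 0.7]
alpha, beta = np.ones(3), np.ones(3)
for _ in range(2000):
    x = top_two_thompson_step(alpha, beta, rng=rng)
    r = rng.random() < p[x]
    alpha[x] += r
    beta[x] += 1 - r
```

Unlike standard TS, which would eventually play the posterior-best arm in nearly every period, the challenger draws keep allocating measurements to the runner-up, which is what best-arm identification needs.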
## 8.2.3 Time Sensitivity
TS is effective at minimizing the exploration costs required to converge on an optimal action. It may perform poorly, however, in time-sensitive learning problems where it is better to exploit a high performing suboptimal action than to invest resources exploring actions that might offer
slightly improved performance. The following example from (Russo and Van Roy, 2018b) illustrates the issue.
Example 8.3. (Many-Armed Deterministic Bandit) Consider an action set X = { 1 , . . . , K } and a K -dimensional parameter vector θ with independent components, each distributed uniformly over [0 , 1]. Each action x results in reward θ x , which is deterministic conditioned on θ . As K grows, it takes longer to identify the optimal action x ∗ = argmax x ∈X θ x . Indeed, for any algorithm, P ( x ∗ ∈ { x 1 , . . . , x t } ) ≤ t/K . Therefore, no algorithm can expect to select x ∗ within time t ≪ K . On the other hand, by simply selecting actions in order, with x 1 = 1 , x 2 = 2 , x 3 = 3 , . . . , the agent can expect to identify an ε-optimal action within t = 1 /ε time periods, independent of K .
Applied to this example, TS is likely to sample a new action in each time period so long as t ≪ K . The problem with this is most pronounced in the asymptotic regime of K → ∞ , for which TS never repeats any action because, at any point in time, there will be actions better than those previously selected. It is disconcerting that TS can be so dramatically outperformed by a simple variation: settle for the first action x for which θ x ≥ 1 - ε.
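The settling variation just described is trivial to implement; the sketch below is our own illustration with a hypothetical function name. It scans arms in order and stops at the first one whose (deterministic) reward is good enough, taking about 1/ε pulls in expectation regardless of K.

```python
import numpy as np

def settle_for_good_enough(theta, eps):
    """Pull arms 1, 2, 3, ... and settle on the first eps-optimal-enough one.

    Rewards are deterministic given theta, so a single pull reveals theta_x.
    Since each theta_x is uniform on [0, 1], roughly 1/eps pulls suffice,
    independent of K = len(theta).
    """
    for t, reward in enumerate(theta, start=1):
        if reward >= 1 - eps:
            return t, reward          # commit to this arm from now on
    return len(theta), max(theta)     # no arm qualified (rare for large K)

rng = np.random.default_rng(0)
theta = rng.uniform(size=100_000)     # K = 100000 arms
t, reward = settle_for_good_enough(theta, eps=0.01)
```

For K this large, TS would keep sampling fresh arms for a long time, while the satisficing scan stops after roughly a hundred pulls.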
While stylized, the above example captures the essence of a basic dilemma faced in all decision problems and not adequately addressed by TS. The underlying issue is time preference. In particular, if an agent is only concerned about performance over an asymptotically long time horizon, it is reasonable to aim at learning x ∗ , while this can be a bad idea if shorter term performance matters and a satisficing action can be learned more quickly.
Related issues also arise in the nonstationary learning problems described in Section 6.3. As a nonstationary system evolves, past observations become irrelevant to optimizing future performance. In such cases, it may be impossible to converge on the current optimal action before the system changes substantially, and the algorithms presented in Section 6.3 might perform better if they are modified to explore less aggressively.
Interestingly, the information theoretic regret bounds described in the previous subsection also point to this potential shortcoming of TS.
Indeed, the regret bounds there depend on the entropy of the optimal action H ( A ∗ ), which may tend to infinity as the number of actions grows, reflecting the enormous quantity of information needed to identify the exact optimum. This issue is discussed further in (Russo and Van Roy, 2018b). That paper proposes and analyzes satisficing TS , a variant of TS that is designed to minimize exploration costs required to identify an action that is sufficiently close to optimal.
## 8.2.4 Problems Requiring Careful Assessment of Information Gain
TS is well suited to problems where the best way to learn which action is optimal is to test the most promising actions. However, there are natural problems where such a strategy is far from optimal, and efficient learning requires a more careful assessment of the information actions provide. The following example from (Russo and Van Roy, 2018a) highlights this point.
Example 8.4. (A Revealing Action) Suppose there are k + 1 actions { 0 , 1 , . . . , k } , and θ is an unknown parameter drawn uniformly at random from Θ = { 1 , . . . , k } . Rewards are deterministic conditioned on θ , and when played, action i ∈ { 1 , . . . , k } yields reward 1 if θ = i and 0 otherwise. Action 0 is a special 'revealing' action that yields reward 1 / (2 θ ) when played.
Note that action 0 is known to never yield the maximal reward, and is therefore never selected by TS. Instead, TS will select among actions { 1 , ..., k } , ruling out only a single action at a time until a reward 1 is earned and the optimal action is identified. A more intelligent algorithm for this problem would recognize that although action 0 cannot yield the maximal reward, sampling it is valuable because of the information it provides about other actions. Indeed, by sampling action 0 in the first period, the decision maker immediately learns the value of θ , and can exploit that knowledge to play the optimal action in all subsequent periods.
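The contrast is easy to make concrete. The toy sketch below (our own code, with a hypothetical function name) spends period 1 on the revealing action, decodes θ from the observed reward 1/(2θ), and exploits for the rest of the horizon, whereas TS would rule out actions one at a time.

```python
def play_revealing_first(theta, horizon):
    """Play the revealing action 0 once, decode theta, then exploit.

    Actions 1..k pay 1 if theta == i and 0 otherwise; action 0 pays
    1 / (2 * theta), which pins down theta exactly.
    """
    r0 = 1.0 / (2 * theta)            # period 1: revealing action's reward
    learned = round(1.0 / (2 * r0))   # invert the reward to recover theta
    # Periods 2..horizon: play the identified optimal action, earning 1 each.
    return r0 + sum(1.0 for _ in range(horizon - 1) if learned == theta)
```

Over a horizon of T periods this earns 1/(2θ) + (T - 1), while TS forgoes action 0 and expects to spend about k/2 periods earning 0 before locating the optimal arm.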
The shortcoming of TS in the above example can be interpreted through the lens of the information ratio (8.7). For this problem, the information ratio when actions are sampled by TS is far from the
minimum possible, reflecting that it is possible to acquire information at a much lower cost per bit. The following two examples, also from (Russo and Van Roy, 2018a), illustrate a broader range of problems for which TS suffers in this manner. The first illustrates issues that arise with sparse linear models.
Example 8.5. (Sparse Linear Model) Consider a linear bandit problem where X ⊂ R d and the reward from an action x ∈ X is x T θ , which is deterministic conditioned on θ . The true parameter θ is known to be drawn uniformly at random from the set of one-hot vectors Θ = { θ ′ ∈ { 0 , 1 } d : ‖ θ ′ ‖ 0 = 1 } . For simplicity, assume d is an integer power of two. The action set is taken to be the set of nonzero vectors in { 0 , 1 } d , normalized so that the components of each vector sum to one: X = { x/ ‖ x ‖ 1 : x ∈ { 0 , 1 } d , x ≠ 0 } .
Let i ∗ be the index for which θ i ∗ = 1. This bandit problem amounts to a search for i ∗ . When an action x t is selected, the observed reward r t = x t T θ is positive if i ∗ is in the support of x t and 0 otherwise. Given that actions in X can support any subset of indices, i ∗ can be found via a bisection search, which requires log 2 ( d ) periods. On the other hand, TS selects exclusively from the set of actions that could be optimal, which includes only the one-hot vectors. Each such action results in either ruling out one index or identifying i ∗ . As such, the search carried out by TS requires d/ 2 periods in expectation.
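The bisection search can be written out directly; the code below is our own illustration. Each probe plays the normalized indicator vector of a candidate set, and whether the reward is positive tells us which half of the set contains i*.

```python
import numpy as np

def bisection_find(theta):
    """Locate the one-hot index i* of theta using log2(d) probes.

    Each probe plays x = indicator(S) / |S|; the reward x @ theta is
    positive iff i* lies in S, so the candidate set halves every period.
    """
    d = len(theta)
    candidates = list(range(d))
    probes = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        x = np.zeros(d)
        x[half] = 1.0 / len(half)     # normalized indicator action
        probes += 1
        if x @ theta > 0:
            candidates = half         # i* is in the probed half
        else:
            candidates = candidates[len(half):]
    return candidates[0], probes

theta = np.zeros(16)
theta[11] = 1.0                        # i* = 11, d = 16
idx, probes = bisection_find(theta)
```

For d = 16 this finds i* in 4 probes; TS, restricted to one-hot actions, would need d/2 = 8 probes on average.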
Our final example involves an assortment optimization problem.
Example 8.6. (Assortment Optimization) Consider the problem of repeatedly recommending an assortment of products to a customer. The customer has unknown type θ ∈ Θ , where | Θ | = n . Each product is geared toward customers of a particular type, and the assortment of m products offered is characterized by the vector of product types x ∈ X = Θ m . We model customer responses through a random utility model in which customers are more likely to derive high value from a product geared toward their type. When offered an assortment of products x , the customer associates with the i th product the utility u θit ( x ) = β 1 θ ( x i ) + w it , where 1 θ indicates whether its argument equals θ , w it follows a standard Gumbel distribution, and β ∈ R is a known constant. This is a standard
multinomial logit discrete choice model. The probability a customer of type θ chooses product i is given by
$$\frac { \exp \left ( \beta 1 _ { \theta } ( x _ { i } ) \right ) } { \sum _ { j = 1 } ^ { m } \exp \left ( \beta 1 _ { \theta } ( x _ { j } ) \right ) } .$$
When an assortment x t is offered at time t , the customer makes a choice i t = argmax i u θit ( x t ) and leaves a review u θi t t ( x t ) indicating the utility derived from the product, both of which are observed by the recommendation system. The reward to the recommendation system is the normalized utility u θi t t ( x t ) /β .
If the type θ of the customer were known, the optimal recommendation would be x ∗ = ( θ, θ, . . . , θ ), which consists only of products targeted at the customer's type. Therefore, TS would only ever offer assortments consisting of a single type of product, and because of this, TS requires n samples in expectation to learn the customer's true type. However, as discussed in (Russo and Van Roy, 2018a), learning can be dramatically accelerated by offering diverse assortments. To see why, suppose that θ is drawn uniformly at random from Θ and consider the limiting case where β → ∞ . In this regime, the probability that a customer chooses a product of type θ when one is available tends to 1, and the normalized review u θi t t ( x ) /β tends to 1 θ ( x i t ), an indicator for whether the chosen product is of type θ . While the customer type remains unknown, offering a diverse assortment, consisting of m different and previously untested product types, maximizes both immediate expected reward and information gain, since such an assortment attains the highest probability of containing a product of type θ , and the customer's response almost perfectly indicates whether one of those items is of type θ . By continuing to offer such assortments until identifying the customer type, an algorithm can, with extremely high probability, learn the type within ⌈ n/m ⌉ periods. As such, diversification can accelerate learning by a factor of m relative to TS.
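For concreteness, the choice probabilities displayed above are a softmax over match indicators. The helper below is our own sketch, with hypothetical names; it also shows numerically that in the large-β regime a matching product is chosen almost surely, which is what makes diverse assortments so informative.

```python
import numpy as np

def mnl_choice_probs(assortment, customer_type, beta):
    """Choice probabilities for the multinomial logit model of Example 8.6.

    Product i has deterministic utility beta * 1{x_i == theta}; adding
    Gumbel noise and taking the argmax yields this softmax distribution.
    """
    match = np.array([beta if x == customer_type else 0.0 for x in assortment])
    w = np.exp(match - match.max())    # subtract max for numerical stability
    return w / w.sum()

# Diverse assortment of m = 4 distinct types; the customer's type is 2.
probs = mnl_choice_probs([1, 2, 3, 4], customer_type=2, beta=10.0)
```

Already at β = 10 the matching product absorbs nearly all the probability mass; as β → ∞ it is chosen with probability tending to one, so each diverse assortment effectively tests m candidate types at once.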
In each of the three examples of this section, TS fails to explore in any reasonably intelligent manner. Russo and Van Roy (2018a) propose an alternative algorithm - information-directed sampling - that samples actions in a manner that minimizes the information ratio, and this addresses the shortcomings of TS in these examples. It is worth
mentioning, however, that despite possible advantages, information-directed sampling requires more complex computations and may not be practical across the range of applications for which TS is well-suited.
## 8.3 Alternative Approaches
Much of the work on multi-armed bandit problems has focused on problems with a finite number of independent actions, like the beta-Bernoulli bandit of Example 3.1. For such problems, under the objective of maximizing expected discounted reward, the Gittins index theorem (Gittins and Jones, 1979) characterizes an optimal strategy. This strategy can be implemented by solving a dynamic program for each action, as explained in (Katehakis and Veinott, 1987), but this is computationally onerous relative to TS. For more complicated problems, the Gittins index theorem fails to hold, and computing optimal actions is typically infeasible. A thorough treatment of Gittins indices is provided in (Gittins et al. , 2011).
Upper-confidence-bound algorithms, as discussed in Section 8.1.2, offer another approach to efficient exploration. At a high level, these algorithms are similar to TS, in that they continue sampling all promising actions while gradually discarding those that underperform. Section 8.1.2 also discusses a more formal relation between the two approaches, as originally established in (Russo and Van Roy, 2014b). UCB algorithms have been proposed for a variety of problems, including bandit problems with independent actions (Lai and Robbins, 1985; Auer et al. , 2002; Cappé et al. , 2013; Kaufmann et al. , 2012), linear bandit problems (Dani et al. , 2008; Rusmevichientong and Tsitsiklis, 2010), bandits with continuous action spaces and smooth reward functions (Kleinberg et al. , 2008; Bubeck et al. , 2011; Srinivas et al. , 2012), and exploration in reinforcement learning (Jaksch et al. , 2010). As discussed, for example, in (Russo and Van Roy, 2014b; Osband and Van Roy, 2017a; Osband and Van Roy, 2017b), the design of upper-confidence bounds that simultaneously accommodate both statistical and computational efficiency often poses a challenge, leading to use of UCB algorithms that sacrifice statistical efficiency relative to TS.
Information-directed sampling (Russo and Van Roy, 2014a) aims to better manage the trade-off between immediate reward and the information acquired by sampling an action, through minimizing the information ratio. The knowledge gradient algorithm (Frazier et al. , 2008; Frazier et al. , 2009) and several other heuristics presented in (Francetich and Kreps, 2017a; Francetich and Kreps, 2017b) similarly aim to more carefully assess the value of information and also address time-sensitivity. Finally, there is a large literature on online decision problems in adversarial environments, which we will not review here; see (Bubeck and Cesa-Bianchi, 2012) for thorough coverage.
## Acknowledgements
This work was generously supported by a research grant from Boeing, a Marketing Research Award from Adobe, and Stanford Graduate Fellowships courtesy of Burt and Deedee McMurtry, PACCAR, and Sequoia Capital. We thank Stephen Boyd, Michael Jordan, Susan Murphy, David Tse, and the anonymous reviewers for helpful suggestions, and Roland Heller, Xiuyuan Lu, Luis Neumann, Vincent Tan, and Carrie Wu for pointing out typos.
## References
- Abbasi-Yadkori, Y., D. Pál, and C. Szepesvári. 2011. 'Improved algorithms for linear stochastic bandits'. In: Advances in Neural Information Processing Systems 24 . 2312-2320.
- Abeille, M. and A. Lazaric. 2017. 'Linear Thompson sampling revisited'. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics . 176-184.
- Agarwal, D. 2013. 'Computational advertising: the LinkedIn way'. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management . ACM. 1585-1586.
- Agarwal, D., B. Long, J. Traupman, D. Xin, and L. Zhang. 2014. 'Laser: a scalable response prediction platform for online advertising'. In: Proceedings of the 7th ACM international conference on Web search and data mining . ACM. 173-182.
- Agrawal, S., V. Avadhanula, V. Goyal, and A. Zeevi. 2017. 'Thompson sampling for the MNL-bandit'. In: Proceedings of the 30th Annual Conference on Learning Theory . 76-78.
- Agrawal, S. and N. Goyal. 2012. 'Analysis of Thompson sampling for the multi-armed bandit problem'. In: Proceedings of the 25th Annual Conference on Learning Theory . 39.1-39.26.
- Agrawal, S. and N. Goyal. 2013a. 'Further optimal regret bounds for Thompson sampling'. In: Proceedings of the 16th International Conference on Artificial Intelligence and Statistics . 99-107.
- Agrawal, S. and N. Goyal. 2013b. 'Thompson sampling for contextual bandits with linear payoffs'. In: Proceedings of The 30th International Conference on Machine Learning . 127-135.
- Auer, P., N. Cesa-Bianchi, and P. Fischer. 2002. 'Finite-time analysis of the multiarmed bandit problem'. Machine Learning . 47(2): 235-256.
- Bai, A., F. Wu, and X. Chen. 2013. 'Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search'. In: Advances in Neural Information Processing Systems 26 . 1646-1654.
- Bastani, H., M. Bayati, and K. Khosravi. 2018. 'Exploiting the natural exploration in contextual bandits'. arXiv preprint arXiv:1704.09011 .
- Besbes, O., Y. Gur, and A. Zeevi. 2014. 'Stochastic multi-armed-bandit problem with non-stationary rewards'. In: Advances in Neural Information Processing Systems 27 . 199-207.
- Bubeck, S., R. Munos, G. Stoltz, and C. Szepesvári. 2011. 'X-armed bandits'. Journal of Machine Learning Research . 12: 1655-1695.
- Bubeck, S. and N. Cesa-Bianchi. 2012. 'Regret analysis of stochastic and nonstochastic multi-armed bandit problems'. Foundations and Trends in Machine Learning . 5(1): 1-122.
- Bubeck, S. and R. Eldan. 2016. 'Multi-scale exploration of convex functions and bandit convex optimization'. In: Proceedings of the 29th Annual Conference on Learning Theory . 583-589.
- Bubeck, S., R. Eldan, and J. Lehec. 2018. 'Sampling from a log-concave distribution with projected Langevin Monte Carlo'. Discrete & Computational Geometry .
- Cappé, O., A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. 2013. 'Kullback-Leibler upper confidence bounds for optimal sequential allocation'. Annals of Statistics . 41(3): 1516-1541.
- Casella, G. and E. I. George. 1992. 'Explaining the Gibbs sampler'. The American Statistician . 46(3): 167-174.
- Chapelle, O. and L. Li. 2011. 'An empirical evaluation of Thompson sampling'. In: Advances in Neural Information Processing Systems 24 . 2249-2257.
- Cheng, X. and P. Bartlett. 2018. 'Convergence of Langevin MCMC in KL-divergence'. In: Proceedings of the 29th International Conference on Algorithmic Learning Theory . 186-211.
- Craswell, N., O. Zoeter, M. Taylor, and B. Ramsey. 2008. 'An experimental comparison of click position-bias models'. In: Proceedings of the 2008 International Conference on Web Search and Data Mining . ACM. 87-94.
- Dani, V., T. Hayes, and S. Kakade. 2008. 'Stochastic linear optimization under bandit feedback'. In: Proceedings of the 21st Annual Conference on Learning Theory . 355-366.
- Dimakopoulou, M. and B. Van Roy. 2018. 'Coordinated exploration in concurrent reinforcement learning'. arXiv preprint arXiv:1802.01282 .
- Durmus, A. and E. Moulines. 2016. 'Sampling from strongly log-concave distributions with the Unadjusted Langevin Algorithm'. arXiv preprint arXiv:1605.01559 .
- Eckles, D. and M. Kaptein. 2014. 'Thompson sampling with the online bootstrap'. arXiv preprint arXiv:1410.4009 .
- Ferreira, K. J., D. Simchi-Levi, and H. Wang. 2015. 'Online network revenue management using Thompson sampling'. Working Paper .
- Francetich, A. and D. M. Kreps. 2017a. 'Choosing a Good Toolkit: Bayes-Rule Based Heuristics'. preprint .
- Francetich, A. and D. M. Kreps. 2017b. 'Choosing a Good Toolkit: Reinforcement Learning'. preprint .
- Frazier, P., W. Powell, and S. Dayanik. 2009. 'The knowledge-gradient policy for correlated normal beliefs'. INFORMS Journal on Computing . 21(4): 599-613.
- Frazier, P., W. Powell, and S. Dayanik. 2008. 'A knowledge-gradient policy for sequential information collection'. SIAM Journal on Control and Optimization . 47(5): 2410-2439.
- Ghavamzadeh, M., S. Mannor, J. Pineau, and A. Tamar. 2015. 'Bayesian reinforcement learning: A survey'. Foundations and Trends in Machine Learning . 8(5-6): 359-483.
- Gittins, J. and D. Jones. 1979. 'A dynamic allocation index for the discounted multiarmed bandit problem'. Biometrika . 66(3): 561-565.
- Gittins, J., K. Glazebrook, and R. Weber. 2011. Multi-armed bandit allocation indices . John Wiley & Sons.
- Gómez-Uribe, C. A. 2016. 'Online algorithms for parameter mean and variance estimation in dynamic regression'. arXiv preprint arXiv:1605.05697v1 .
- Gopalan, A., S. Mannor, and Y. Mansour. 2014. 'Thompson sampling for complex online problems'. In: Proceedings of the 31st International Conference on Machine Learning . 100-108.
- Gopalan, A. and S. Mannor. 2015. 'Thompson sampling for learning parameterized Markov decision processes'. In: Proceedings of the 24th Annual Conference on Learning Theory . 861-898.
- Graepel, T., J. Candela, T. Borchert, and R. Herbrich. 2010. 'Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine'. In: Proceedings of the 27th International Conference on Machine Learning . 13-20.
- Hill, D. N., H. Nassif, Y. Liu, A. Iyer, and S. V. N. Vishwanathan. 2017. 'An efficient bandit algorithm for realtime multivariate optimization'. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . 1813-1821.
- Honda, J. and A. Takemura. 2014. 'Optimality of Thompson sampling for Gaussian bandits depends on priors'. In: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics . 375-383.
- Jaksch, T., R. Ortner, and P. Auer. 2010. 'Near-optimal regret bounds for reinforcement learning'. Journal of Machine Learning Research . 11: 1563-1600.
- Kandasamy, K., A. Krishnamurthy, J. Schneider, and B. Poczos. 2018. 'Parallelised Bayesian optimisation via Thompson sampling'. In: To appear in proceedings of the 22nd International Conference on Artificial Intelligence and Statistics .
- Katehakis, M. N. and A. F. Veinott Jr. 1987. 'The multi-armed bandit problem: decomposition and computation'. Mathematics of Operations Research . 12(2): 262-268.
- Kaufmann, E., N. Korda, and R. Munos. 2012. 'Thompson sampling: an asymptotically optimal finite time analysis'. In: Proceedings of the 24th International Conference on Algorithmic Learning Theory . 199-213.
- Kaufmann, E., O. Cappé, and A. Garivier. 2012. 'On Bayesian upper confidence bounds for bandit problems'. In: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics . 592-600.
- Kawale, J., H. H. Bui, B. Kveton, L. Tran-Thanh, and S. Chawla. 2015. 'Efficient Thompson sampling for online matrix-factorization recommendation'. In: Advances in Neural Information Processing Systems 28 . 1297-1305.
- Kim, M. J. 2017. 'Thompson sampling for stochastic control: the finite parameter case'. IEEE Transactions on Automatic Control . 62(12): 6415-6422.
- Kleinberg, R., A. Slivkins, and E. Upfal. 2008. 'Multi-armed bandits in metric spaces'. In: Proceedings of the 40th ACM Symposium on Theory of Computing . 681-690.
- Kveton, B., C. Szepesvari, Z. Wen, and A. Ashkan. 2015. 'Cascading bandits: learning to rank in the cascade model'. In: Proceedings of the 32nd International Conference on Machine Learning . 767-776.
- Lai, T. and H. Robbins. 1985. 'Asymptotically efficient adaptive allocation rules'. Advances in applied mathematics . 6(1): 4-22.
- Li, L., W. Chu, J. Langford, and R. E. Schapire. 2010. 'A contextual-bandit approach to personalized news article recommendation'. In: Proceedings of the 19th International Conference on World Wide Web . 661-670.
- Littman, M. L. 2015. 'Reinforcement learning improves behaviour from evaluative feedback'. Nature . 521(7553): 445-451.
- Liu, F., S. Buccapatnam, and N. Shroff. 2017. 'Information directed sampling for stochastic bandits with graph feedback'. arXiv preprint arXiv:1711.03198 .
- Lu, X. and B. Van Roy. 2017. 'Ensemble Sampling'. Advances in Neural Information Processing Systems 30 : 3258-3266.
- Mattingly, J. C., A. M. Stuart, and D. J. Higham. 2002. 'Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise'. Stochastic processes and their applications . 101(2): 185-232.
- Osband, I., D. Russo, and B. Van Roy. 2013. '(More) Efficient reinforcement learning via posterior sampling'. In: Advances in Neural Information Processing Systems 26 . 3003-3011.
- Osband, I., C. Blundell, A. Pritzel, and B. Van Roy. 2016a. 'Deep exploration via bootstrapped DQN'. In: Advances in Neural Information Processing Systems 29 . 4026-4034.
- Osband, I., D. Russo, Z. Wen, and B. Van Roy. 2017. 'Deep exploration via randomized value functions'. arXiv preprint arXiv:1703.07608 .
- Osband, I. and B. Van Roy. 2014a. 'Model-based reinforcement learning and the eluder dimension'. In: Advances in Neural Information Processing Systems 27 . 1466-1474.
- Osband, I. and B. Van Roy. 2014b. 'Near-optimal reinforcement learning in factored MDPs'. In: Advances in Neural Information Processing Systems 27 . 604-612.
- Osband, I. and B. Van Roy. 2017a. 'On optimistic versus randomized exploration in reinforcement learning'. In: Proceedings of The Multidisciplinary Conference on Reinforcement Learning and Decision Making .
- Osband, I. and B. Van Roy. 2017b. 'Why is posterior sampling better than optimism for reinforcement learning?' In: Proceedings of the 34th International Conference on Machine Learning . 2701-2710.
- Osband, I., B. Van Roy, and Z. Wen. 2016b. 'Generalization and exploration via randomized value functions'. In: Proceedings of The 33rd International Conference on Machine Learning . 2377-2386.
- Ouyang, Y., M. Gagrani, A. Nayyar, and R. Jain. 2017. 'Learning unknown Markov decision processes: A Thompson sampling approach'. In: Advances in Neural Information Processing Systems 30 . 1333-1342.
- Roberts, G. O. and J. S. Rosenthal. 1998. 'Optimal scaling of discrete approximations to Langevin diffusions'. Journal of the Royal Statistical Society: Series B (Statistical Methodology) . 60(1): 255-268.
- Roberts, G. O. and R. L. Tweedie. 1996. 'Exponential convergence of Langevin distributions and their discrete approximations'. Bernoulli : 341-363.
- Rusmevichientong, P. and J. Tsitsiklis. 2010. 'Linearly parameterized bandits'. Mathematics of Operations Research . 35(2): 395-411.
- Russo, D. and B. Van Roy. 2013. 'Eluder Dimension and the Sample Complexity of Optimistic Exploration'. In: Advances in Neural Information Processing Systems 26 . 2256-2264.
- Russo, D. and B. Van Roy. 2014a. 'Learning to optimize via information-directed sampling'. In: Advances in Neural Information Processing Systems 27 . 1583-1591.
- Russo, D. and B. Van Roy. 2014b. 'Learning to optimize via posterior sampling'. Mathematics of Operations Research . 39(4): 1221-1243.
- Russo, D. and B. Van Roy. 2016. 'An Information-Theoretic analysis of Thompson sampling'. Journal of Machine Learning Research . 17(68): 1-30.
- Russo, D. 2016. 'Simple Bayesian algorithms for best arm identification'. In: Proceedings of the 29th Annual Conference on Learning Theory . 1417-1418.
- Russo, D. and B. Van Roy. 2018a. 'Learning to optimize via information-directed sampling'. Operations Research . 66(1): 230-252.
- Russo, D. and B. Van Roy. 2018b. 'Satisficing in time-sensitive bandit learning'. arXiv preprint arXiv:1803.02855 .
- Schwartz, E. M., E. T. Bradlow, and P. S. Fader. 2017. 'Customer acquisition via display advertising using multi-armed bandit experiments'. Marketing Science . 36(4): 500-522.
- Scott, S. 2010. 'A modern Bayesian look at the multi-armed bandit'. Applied Stochastic Models in Business and Industry . 26(6): 639-658.
- Scott, S. L. 2015. 'Multi-armed bandit experiments in the online service economy'. Applied Stochastic Models in Business and Industry . 31(1): 37-45.
- Srinivas, N., A. Krause, S. Kakade, and M. Seeger. 2012. 'Information-theoretic regret bounds for Gaussian process optimization in the bandit setting'. IEEE Transactions on Information Theory . 58(5): 3250-3265.
- Strens, M. 2000. 'A Bayesian framework for reinforcement learning'. In: Proceedings of the 17th International Conference on Machine Learning . 943-950.
- Sutton, R. S. and A. G. Barto. 1998. Reinforcement learning: An introduction . Vol. 1. MIT Press, Cambridge.
- Teh, Y. W., A. H. Thiery, and S. J. Vollmer. 2016. 'Consistency and fluctuations for stochastic gradient Langevin dynamics'. Journal of Machine Learning Research . 17(7): 1-33.
- Thompson, W. R. 1935. 'On the theory of apportionment'. American Journal of Mathematics . 57(2): 450-456.
- Thompson, W. 1933. 'On the likelihood that one unknown probability exceeds another in view of the evidence of two samples'. Biometrika . 25(3/4): 285-294.
- Welling, M. and Y. W. Teh. 2011. 'Bayesian learning via stochastic gradient Langevin dynamics'. In: Proceedings of the 28th International Conference on Machine Learning . 681-688.
- Wyatt, J. 1997. 'Exploration and inference in learning from reinforcement'. PhD thesis . University of Edinburgh. College of Science and Engineering. School of Informatics.