## Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
Baihan Lin 1 , Djallel Bouneffouf 2 and Guillermo Cecchi 2
1 Columbia University, New York, USA
2 IBM Research, Yorktown Heights, NY, USA
baihan.lin@columbia.edu, djallel.bouneffouf@ibm.com, gcecchi@us.ibm.com
Abstract. As an important psychological and social experiment, the Iterated Prisoner's Dilemma (IPD) treats the choice to cooperate or defect as an atomic action. We propose to study the behaviors of online learning algorithms in the IPD game, where we investigate the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits and reinforcement learning. We evaluate them in a tournament of the iterated prisoner's dilemma where multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, and also to study the capacity of these algorithms to fit human behaviors. Results suggest that attending only to the current situation when making a decision is the worst strategy in this kind of social dilemma game. Multiple discoveries on online learning behaviors and clinical validations are reported, as an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions. 3
Keywords: Online learning, Bandits, Contextual bandits, Reinforcement learning, Iterated Prisoner's Dilemma, Behavioral modeling
## 1 Introduction
Social dilemmas expose tensions between cooperation and defection. Understanding the best way of playing the iterated prisoner's dilemma (IPD) has been of interest to the scientific community since the formulation of the game seventy years ago [5]. To evaluate algorithms, a round-robin computer tournament was proposed in which strategies competed against each other [2]; the winner was decided by the average score a strategy achieved. Using this framework, we propose here to focus on studying reward-driven online learning algorithms with different types of attention mechanisms, where we define attention "as the behavioral and cognitive process of selectively concentrating on a discrete stimulus while ignoring other perceivable stimuli" [21]. Following this definition, we
3 The data and codes to reproduce all the empirical results can be accessed at https://github.com/doerlbh/dilemmaRL.
analyze three algorithm classes: the no-attention-to-the-context online learning agents (multi-armed bandit algorithms) output an action but do not use any information about the state of the environment (context); contextual bandit algorithms extend this model by making the decision conditional on the current state of the environment; and reinforcement learning extends contextual bandits by making the decision conditional on both the current state and the next state of the unknown environment. This paper mainly focuses on answering two questions:
- Does attending to the context help an online learning algorithm maximize its rewards in an IPD tournament, and how do different attention biases shape behavior?
- Does attending to the context help an online learning algorithm mimic human behavior?
To answer these questions, we performed two experiments. (1) A tournament of the iterated prisoner's dilemma: since the seminal tournament in 1980 [5], a number of IPD tournaments have been undertaken [2,12,10]. In this work, we adopt a similar tournament setting, but also extend it to cases with more than two players. Empirically, we evaluated the algorithms in three settings of the Iterated Prisoner's Dilemma: a pairwise-agent tournament, a three-agent tournament, and a 'mental'-agent tournament. (2) A behavioral cloning prediction task: we train the three types of algorithms to mimic human behavior on a training set and then evaluate them on a test set. Our main results are the following:
- We observe that contextual bandits do not perform well in the tournament, which means that considering only the current situation when making a decision is the worst strategy in this kind of social dilemma game. One should either ignore the current situation entirely or take a longer history of situations into account, but not just the current one.
- We observe that bandit algorithms (without context) fit the human data best, which implies that humans may not consider the context when they play the iterated prisoner's dilemma.
This paper is organized as follows. We first review related work and introduce some background concepts. Then we explain the two experiments we performed. The experimental evaluation highlights the empirical results we obtained. Finally, the last section concludes the paper and points out possible directions for future work.
As far as we are aware, this is the first work to evaluate online learning algorithms in social gaming settings. Although the agents that we evaluate here are not newly proposed by us, we believe that, given this understudied information-asymmetry problem setting, our work helps the community understand how the inductive biases of different methods yield different behaviors in social agent settings (e.g., the iterated prisoner's dilemma), and thus provides a nontrivial contribution to the field, both in understanding machine learning algorithms and in studying mechanistic models of human behaviors in social settings.
## 2 Related Work
Much computational work has focused on understanding the strategy space and finding winning strategies in the iterated prisoner's dilemma. The authors of [22] present and discuss several improvements to the Q-Learning algorithm, allowing for an easy numerical measure of the exploitability of a given strategy. [18] propose a mechanism for achieving cooperation and communication in multi-agent reinforcement learning settings by intrinsically rewarding agents for obeying the commands of other agents. We are interested in investigating how algorithms behave and how they model human decisions in the IPD, with the larger goal of understanding human decision-making. For instance, the authors of [39] proposed an active modeling technique to predict the behavior of IPD players. The proposed method can model the opponent player's behavior while taking advantage of interactive game environments. The data showed that the observer was able to build, through direct actions, a more accurate model of an opponent's behavior than when the data were collected through random actions. The authors of [15] propose the first predictive model of human cooperation able to organize a number of different experimental findings that are not explained by the standard model, and they show that the model makes satisfactorily accurate quantitative predictions of population average behavior in one-shot social dilemmas. To the best of our knowledge, no study has explored the full spectrum of reinforcement learning agents (multi-armed bandits, contextual bandits and reinforcement learning) in social dilemmas.
## 3 Background
Multi-Armed Bandit (MAB): The multi-armed bandit (MAB) algorithm models a sequential decision-making process, where at each time point the algorithm selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time [23,3].
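As a concrete illustration, a minimal UCB1-style arm selection (a standard MAB strategy) can be sketched as follows; the exploration constant `c` and the function name are our own illustrative choices, not taken from this paper:

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """UCB1 rule: play each arm once, then pick the arm maximizing its
    empirical mean plus an optimism bonus that shrinks with more plays."""
    for k, n in enumerate(counts):
        if n == 0:
            return k  # initialization: try every arm once
    return max(range(len(counts)),
               key=lambda k: means[k] + math.sqrt(c * math.log(t) / counts[k]))
```

In the IPD, the two arms would simply be cooperate and defect; note the agent sees no context at all.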
Contextual Bandit Algorithm (CB). Following [24], this problem is defined as follows. At each time point (iteration) $t \in \{1, \ldots, T\}$, an agent is presented with a context (feature vector) $x_t \in \mathbb{R}^N$ before choosing an arm $k \in A = \{1, \ldots, K\}$. We denote by $X = \{X_1, \ldots, X_N\}$ the set of features (variables) defining the context. Let $r_t = (r_t^1, \ldots, r_t^K)$ denote a reward vector, where $r_t^k \in [0, 1]$ is the reward at time $t$ associated with arm $k \in A$. Herein, we primarily focus on the Bernoulli bandit with binary rewards, i.e. $r_t^k \in \{0, 1\}$. Let $\pi : X \to A$ denote a policy, and let $D_{c,r}$ denote a joint distribution over $(x, r)$. We assume that the expected reward is a linear function of the context, i.e. $\mathbb{E}[r_t^k \mid x_t] = \mu_k^T x_t$, where $\mu_k$ is an unknown weight vector associated with arm $k$.
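A minimal sketch of how a contextual bandit can exploit the linear-reward assumption above (a disjoint LinUCB-style arm; the class name and the `alpha` parameter are illustrative, not the paper's implementation):

```python
import numpy as np

class LinUCBArm:
    """One arm of a disjoint LinUCB model: a ridge-regression estimate of
    the unknown weight vector mu_k, plus an optimism bonus on the context."""
    def __init__(self, n_features, alpha=1.0):
        self.A = np.eye(n_features)    # accumulates x x^T (plus identity prior)
        self.b = np.zeros(n_features)  # accumulates r * x
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        mu_hat = A_inv @ self.b        # ridge estimate of mu_k
        return float(mu_hat @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x
```

At each round the agent computes `ucb(x_t)` for every arm and pulls the maximizer; here the context would encode the recent game history.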
Reinforcement Learning (RL). Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP) [46]. An MDP is defined by the tuple $(S, A, T, R, \gamma)$, where $S$ is a set of possible states, $A$ is a set of actions, $T$ is a transition function defined as $T(s, a, s') = \Pr(s' \mid s, a)$, where $s, s' \in S$ and $a \in A$, and $R : S \times A \times S \mapsto \mathbb{R}$ is a reward function; $\gamma$
Fig. 1: Success, Teamwork, Cooperation & Competition in two-agent tournament.
is a discount factor that decreases the impact of past rewards on the current action choice. Typically, the objective is to maximize the discounted long-term reward, assuming an infinite-horizon decision process, i.e. to find a policy function $\pi : S \mapsto A$ which specifies the action to take in a given state, so that the cumulative reward is maximized: $\max_\pi \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})$.
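For concreteness, the discounted objective above can be computed from a finite reward sequence by the standard backward recursion (a finite-horizon sketch of the infinite-horizon sum):

```python
def discounted_return(rewards, gamma=0.95):
    """Backward recursion for sum_t gamma^t * R_t:  G_t = R_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```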
## 4 Experimental Setup
Here, we describe the two main experiments we ran: the Iterated Prisoner's Dilemma (IPD) tournament and Behavioral Cloning with Demonstration Rewards (BCDR).
## 4.1 Iterated Prisoner's Dilemma (IPD)
The Iterated Prisoner's Dilemma (IPD) can be defined as a matrix game $G = [N, \{A_i\}_{i \in N}, \{R_i\}_{i \in N}]$, where $N$ is the set of agents, $A_i$ is the set of actions available to agent $i$ with $A$ being the joint action space $A_1 \times \cdots \times A_n$, and $R_i$ is the reward function for agent $i$. A special case of this generic multi-agent IPD is the classical two-agent case (Table 1). In this game, each agent has two actions: cooperate (C) and defect (D), and can receive one of four possible rewards: R (Reward), P (Penalty), S (Sucker), and T (Temptation). In the multi-agent setting, if all agents cooperate (C), they all receive Reward (R); if all agents defect (D), they all receive Penalty (P); if some agents cooperate and some defect, the cooperators receive Sucker (S) and the defectors receive Temptation (T). The four payoffs satisfy the inequalities $T > R > P > S$ and $2R > T + S$. The PD is a one-round game, but it is commonly studied in a manner where prior outcomes matter, in order to understand the evolution of cooperative behaviour from complex dynamics [6].
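The payoff rule above translates directly into code; a minimal sketch for the multi-agent case (the function name and the 'C'/'D' string encoding are our own illustrative choices):

```python
def ipd_payoffs(actions, T=5, R=3, P=1, S=0):
    """Per-agent PD payoffs for a joint action profile of 'C'/'D' choices:
    all cooperate -> R each; all defect -> P each; otherwise cooperators
    get the Sucker payoff S and defectors the Temptation payoff T."""
    assert T > R > P > S and 2 * R > T + S, "not a valid PD payoff matrix"
    if all(a == 'C' for a in actions):
        return [R] * len(actions)
    if all(a == 'D' for a in actions):
        return [P] * len(actions)
    return [S if a == 'C' else T for a in actions]
```

The defaults are the classical values used in the experiments below (T = 5, R = 3, P = 1, S = 0).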
## 4.2 Behavioral Cloning with Demonstration Rewards
Here we define a new type of multi-agent online learning setting, Behavioral Cloning with Demonstration Rewards (BCDR), and present a novel training procedure and agent for solving this problem. In this setting, and similar to [8,7,37], the agent first goes through a constraint-learning phase where it is allowed to query the actions and receives feedback $r^e_k(t) \in [0, 1]$ about whether or not the chosen decision matches the teacher's action (from demonstration). During the deployment (testing) phase, the goal of the agent is to maximize both $r_k(t) \in [0, 1]$, the reward of action $k$ at time $t$, and the (unobserved) $r^e_k(t) \in [0, 1]$, which models whether or not taking action $k$ matches the action the teacher would have taken. During the deployment phase, the agent receives no feedback on the value of $r^e_k(t)$; here we would like to observe how well the learned behavior captures the teacher's policy profile. In our specific problem, the human data plays the role of the teacher, and behavioral cloning aims to train our agents to mimic the human behaviors.
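The two-phase structure can be sketched with a toy learner; the class below is our own illustrative stand-in (a frequency-of-match bandit), not one of the paper's agents:

```python
class ImitationBandit:
    """Toy BCDR-style learner: during the constraint-learning phase it is
    driven only by the match signal r^e (did it pick the teacher's action?);
    at deployment it plays greedily and r^e is no longer observed."""
    def __init__(self, n_actions=2):
        self.matches = [1.0] * n_actions  # smoothed match tallies per action
        self.plays = [1.0] * n_actions

    def act(self):
        # greedy on the estimated match rate (deployment-phase behavior)
        return max(range(len(self.plays)),
                   key=lambda k: self.matches[k] / self.plays[k])

    def train_step(self, teacher_action):
        k = self.act()
        self.plays[k] += 1
        if k == teacher_action:           # observed match feedback r^e_k(t)
            self.matches[k] += 1

# constraint-learning phase against a teacher who always defects (action 1)
agent = ImitationBandit()
for _ in range(50):
    agent.train_step(teacher_action=1)
```

After training, the agent's greedy policy reproduces the teacher's preferred action, which is exactly what the deployment phase then measures.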
## 4.3 Online Learning Agents
We briefly outline the different types of online learning algorithms we used. Multi-Armed Bandit (MAB): The multi-armed bandit algorithm models a sequential decision-making process, where at each time point the algorithm selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time [23,3,13]. In the multi-armed bandit agent pool, we have Thompson Sampling (TS) [48], Upper Confidence Bound (UCB) [3], epsilon-Greedy (eGreedy) [46], EXP3 [4] and Human-Based Thompson Sampling (HBTS) [14].
Contextual Bandit (CB). Following [24], this problem is defined as follows. At each time point (iteration), an agent is presented with a context ( feature vector ) before choosing an arm. In the contextual bandit agent pool, we have Contextual Thompson Sampling (CTS) [1], LinUCB [26], EXP4 [11] and Split Contextual Thompson Sampling (SCTS) [29,33].
Reinforcement Learning (RL). Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP) [46]. An MDP is defined by a tuple with a set of possible states, a set of actions and a transition function. In the reinforcement learning agent pool, we have Q-Learning (QL), Double Q-Learning (DQL) [19], State-action-reward-state-action (SARSA) [44] and Split Q-Learning (SQL) [28,32]. We also selected the three most popular handcrafted policies for the Iterated Prisoner's Dilemma: 'Coop' stands for always cooperating, 'Dfct' stands for always defecting, and 'Tit4Tat' stands for playing whatever the opponent chose in the previous round (the winning approach in the 1980 IPD tournament [5]).
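As a concrete example of the RL pool's core mechanic, a single tabular Q-learning backup over history states might look like this (the state encoding and step size are illustrative, not the paper's exact configuration):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning backup:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Here a state s is the recent history of (self, opponent) action pairs."""
    best_next = max(Q[(s_next, a2)] for a2 in ('C', 'D'))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# after a mutual-cooperation history 'CC', we defect and receive T = 5
Q = defaultdict(float)
q_update(Q, s='CC', a='D', r=5.0, s_next='CD')
```

SARSA would replace the `max` over next actions with the action actually taken, and Double/Split Q-Learning would maintain more than one such table.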
The agents evaluated in this work are the most common online learning agents in bandits, contextual bandits and reinforcement learning (the three online learning classes). Competing them against one another, and competing the three online learning classes against one another, offers an interesting experiment for studying how the inductive biases of different methods yield different behaviors in social agent settings (e.g., the iterated prisoner's dilemma).
Fig. 2: Cumulative reward and cooperation rate averaged by class in two- and three-player setting.
Fig. 3: Reward feedbacks and cooperation rates in some two-player and the three-player settings.
## 5 Results: Algorithms' Tournament
Game settings. The payoffs are set as in the classical IPD game: T = 5, R = 3, P = 1, S = 0. Following [42], we create standardized payoff measures from the R, S, T, P values using two differences between payoffs associated with important game outcomes, both normalized by the difference between the temptation to defect and being a sucker when cooperating as the other defects.
State representations. In most of the IPD literature, the state is defined as the pair of previous actions of self and opponent. Studies suggest that only a single previous state is needed to define any prisoner's dilemma strategy [41]. However, as we are interested in understanding the role of three levels of information (no information, context without state, and both context and state), we expand the state representation to account for the past n pairs of actions as the history (or memory) of the agents. For CB algorithms, this history is their context. For RL algorithms, this history is their state representation. In the following sections, we present results in which the memory is set to the past 5 action pairs.
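For illustration, the memory of past action pairs can be flattened into a binary context/state vector as follows (the encoding and the zero-padding of early rounds are our own assumptions, not specified by the paper):

```python
def encode_history(history, mem=5):
    """Flatten the last `mem` (self, opponent) action pairs into a binary
    vector (C -> 1, D -> 0), zero-padded when fewer rounds have been played."""
    recent = history[-mem:]
    vec = [0] * (2 * (mem - len(recent)))  # assumed padding for early rounds
    for own, opp in recent:
        vec += [1 if own == 'C' else 0, 1 if opp == 'C' else 0]
    return vec
```

With mem = 5 this yields a length-10 vector, served to CB algorithms as the context and to RL algorithms as the state.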
Learning settings. In all experiments, the discount factor $\gamma$ was set to 0.95. Exploration is included via the $\epsilon$-greedy algorithm with $\epsilon$ set to 0.05 (except for the algorithms that already have an exploration mechanism).
Fig. 4: Behavioral Cloning: bandits modeled human data the best with the lowest prediction error.
The learning rate was polynomial, α_t(s, a) = 1/n_t(s, a)^0.8, where n_t(s, a) is the number of visits to the state-action pair (s, a) by time t; this schedule was shown in previous work to be better in theory and in practice [16]. All experiments were averaged over at least 100 runs, each spanning 50 steps of dueling actions from the initial state.
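As an illustration, the polynomial schedule can be dropped into a tabular Q-learning update; the sketch below is ours (the class and function names are hypothetical), not the paper's implementation:

```python
from collections import defaultdict

def polynomial_lr(n, exponent=0.8):
    """Learning rate alpha_t(s, a) = 1 / n_t(s, a)^0.8 for the n-th visit."""
    return 1.0 / (n ** exponent)

class TabularQ:
    """Minimal tabular Q-learner with a polynomial learning rate (a sketch)."""
    def __init__(self, gamma=0.95):
        self.gamma = gamma
        self.q = defaultdict(float)     # (state, action) -> value estimate
        self.visits = defaultdict(int)  # (state, action) -> visit count

    def update(self, s, a, r, s_next, actions):
        self.visits[(s, a)] += 1
        alpha = polynomial_lr(self.visits[(s, a)])
        best_next = max(self.q[(s_next, b)] for b in actions)
        self.q[(s, a)] += alpha * (r + self.gamma * best_next - self.q[(s, a)])
```

On the first visit to a pair the step size is 1, and it decays as 1/n^0.8 on later visits, faster than a constant rate but slower than 1/n.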
Reported measures. To capture the behavior of the algorithms, we report five measures: individual normalized rewards, collective normalized rewards, the difference of normalized rewards, the cooperation rate, and the normalized reward feedback at each round. The individual rewards matter because they are what each online learning agent effectively maximizes (its expected cumulative discounted reward). The collective rewards may offer insight into the teamwork of the participating agents. The difference between each player's reward and the average reward of all participating players can capture the internal competition within a team. We record the cooperation rate, the percentage of rounds in which an agent cooperates, since it is not only a probe for the emergence of strategies but also the standard measure in behavioral modeling for comparing human data and models [36,30,31]. Lastly, the reward feedback at each round serves as a diagnostic tool for understanding the specific strategy that emerges in each game. (The color codes throughout this paper are held constant for the 14 agents: handcrafted agents are green-ish, MAB agents red-ish, CB agents blue-ish and RL agents purple-ish.)
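A minimal sketch of how these measures could be computed from game logs (the function names and the min-max normalization convention are our assumptions, not the paper's exact pipeline):

```python
def cooperation_rate(actions):
    """Fraction of rounds in which the agent cooperated ('C')."""
    return sum(a == "C" for a in actions) / len(actions)

def normalized_rewards(rewards, r_min, r_max):
    """Scale per-round payoffs into [0, 1] given the payoff-matrix extremes."""
    return [(r - r_min) / (r_max - r_min) for r in rewards]

def reward_measures(per_agent_rewards):
    """Individual totals, the collective total, and each agent's difference
    from the team average (the internal-competition measure)."""
    individual = [sum(rs) for rs in per_agent_rewards]
    collective = sum(individual)
    team_avg = collective / len(individual)
    difference = [ind - team_avg for ind in individual]
    return individual, collective, difference
```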
## 5.1 Multi-Agent Tournament
Results for two-agent tournament. We record the behaviors of the agents playing against each other (and against themselves). Figure 1 summarizes the reward and behavior patterns of the tournament. We first notice that MAB and RL algorithms learn to cooperate when their opponent is Coop , yielding a high mutual reward, while CB algorithms mostly defect against Coop to exploit its trust. From the cooperation heatmap, we also observe that RL algorithms appear more defective when facing an MAB or CB algorithm than when facing another RL algorithm, and MAB algorithms are more defective when facing a CB algorithm than when facing an RL or another MAB algorithm. The adversarial algorithms EXP3 and EXP4 fail to learn any distinctive policy. We also note interesting teamwork and competition behaviors in the heatmaps of collective and relative rewards: CB algorithms are the best team players, yielding the highest overall collective rewards, followed by RL; RL algorithms are the most competitive opponents, yielding the highest overall relative rewards, followed by MAB.
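For concreteness, one two-agent match of the tournament can be sketched as follows. The payoff values here are the textbook IPD choices (T=5, R=3, P=1, S=0) and are an assumption, since Table 1 is outside this section; the `coop` and `tit4tat` agents are illustrative stand-ins for the handcrafted strategies:

```python
# Hypothetical payoff matrix (T=5, R=3, P=1, S=0); the paper's Table 1 may differ.
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # cooperator is exploited
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection
}

def play_match(agent_a, agent_b, rounds=50):
    """Run one two-agent IPD match; each agent maps the opponent's last
    move (None on the first round) to 'C' or 'D'."""
    last_a = last_b = None
    score_a = score_b = 0
    history = []
    for _ in range(rounds):
        move_a, move_b = agent_a(last_b), agent_b(last_a)
        r_a, r_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + r_a, score_b + r_b
        history.append((move_a, move_b))
        last_a, last_b = move_a, move_b
    return score_a, score_b, history

coop = lambda last: "C"                               # always cooperate
tit4tat = lambda last: "C" if last in (None, "C") else "D"
```

Playing `coop` against `tit4tat` yields mutual cooperation on every round, the high-mutual-reward pattern the MAB and RL agents converge to against Coop.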
Figure 2 summarizes the averaged reward and cooperation for each agent class: handcrafted algorithms perform best, followed by RL and then MAB, while CB algorithms receive the lowest final rewards of the four classes. Surprisingly, this also suggests that a lower cooperation rate does not imply a higher reward: the most cooperative learning class is CB, followed by RL, while MAB, the most defective class, does not score the highest.
Detailed probing into specific games (Figure 3) uncovers more diverse strategies than those revealed by the cooperation rates alone. For instance, in the game of QL vs. CTS, CTS converges to a fixed cooperation rate within the first few rounds and stays constant thereafter, while QL gradually decays its cooperation rate. In UCB1 vs. DQL, UCB1 oscillates between a high and a low cooperation rate in the first few rounds (because it is built to explore all actions first), while DQL gradually decays its cooperation rate. In DQL vs. Tit4Tat, DQL seems to mimic tit-for-tat-like behavior. In SARSA vs. LinUCB, LinUCB converges to a fixed cooperation rate within the first few rounds and stays constant thereafter, while SARSA slowly decays its cooperation rate. Within the first few rounds, the behavior appears universal within each of the three classes.
Cognitive interpretations of these learning systems. The main distinctions between the three classes of algorithms are the complexity of their learning mechanisms and the cognitive systems they adopt. In the MAB setting, there is no attention to any context: the agents aim to most efficiently allocate a fixed, limited set of cognitive resources between competing (alternative) choices in a way that maximizes their expected gain. In the CB setting, the agents apply an attention mechanism to the current context, and aim to collect enough information about how the context vectors and rewards relate to each other that they can predict the next best action from the feature vectors. In the RL setting, the agents not only attend to the current context, but also apply the attention mechanism to multiple contexts related to different states, and aim to use past experience to find out which actions lead to higher cumulative rewards. Our results suggest that in the Iterated Prisoner's Dilemma between two learning systems, an optimal learning policy should hold memory for different state representations and allocate attention to different contexts across the states, which explains the overall best performance of the RL algorithms. This further suggests that in repeated social dilemmas like the IPD, participating learning systems tend to undergo multiple states. The overall underperformance of CB suggests that attention to only the current context is not sufficient without a state representation, because the learning system may mix the context-dependent reward mappings of multiple states, which can oversimplify the policy and potentially mislead the learning as an interfering effect. MAB algorithms, on the other hand, ignore the context information entirely, so they are not susceptible to the interfering effect from the representations of different contexts. Their learned policies, however, do not exhibit any flexibility to account for a major change in state (e.g., the opponent may have just finished a major learning episode and switched strategies).
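The three attention regimes can be contrasted in a minimal sketch (these are illustrative update rules in our own notation, not the exact algorithms benchmarked in the tournament):

```python
import numpy as np

# MAB: no attention to context -- one running value estimate per action.
def mab_update(values, counts, action, reward):
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

# CB: attention to the current context -- a linear reward model per action.
def cb_predict(theta, context):
    return context @ theta  # expected reward of an action given the context

# RL: attention across states -- values indexed by (state, action),
# bootstrapped from the next state's value.
def rl_update(q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * max(q.get((s_next, b), 0.0) for b in ("C", "D"))
        - q.get((s, a), 0.0)
    )
```

The MAB rule carries no state at all; the CB rule conditions on the current feature vector only; the RL rule is the only one whose estimates propagate value between states.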
Results for three-agent tournament. Here we wish to understand how all three classes of algorithms interact in the same arena. For each game, we pick one algorithm from each class (one MAB, one CB and one RL) to form the player pool. We observe in Figure 2 a pattern very similar to the two-player case: RL agents demonstrate the best performance (highest final rewards), followed by MAB, while CB performs the worst. In the three-agent setting, however, although CB is still the most cooperative, RL becomes the most defective. More detailed probing into specific games (Figure 3) reveals more diverse strategies than those suggested by the cooperation rates. Take the game UCB1 vs. LinUCB vs. QL as an example: the MAB algorithm starts off as the most defective but cooperates more in later rounds, while the RL algorithm becomes more and more defective; CB stays cooperative at a relatively high rate in both cases.
## 6 Behavioral Cloning with Human Data
We collate human data comprising 168,386 individual decisions from many human-subject experiments [2,12,10] that used real financial incentives and transparently conveyed the rules of the game to the subjects. As is standard procedure in experimental economics, subjects anonymously interact with each other, and their decisions to cooperate or defect at each time period of each interaction are recorded. They receive payoffs proportional to the outcomes, under the same or a similar payoff structure as the one we use in Table 1. Following preprocessing steps similar to [36,30,31], we construct a comprehensive collection of game structures and individual decisions from the descriptions of the experiments in the published papers and the publicly available data sets. This dataset consists of behavioral trajectories of different time horizons, ranging from 2 to 30 rounds, but most of the experimental data hold full historical information for at most the past 9 actions. We therefore select only the trajectories with this full historical information, which comprise 8,257 behavioral trajectories. We randomly select 8,000 of them as the training set and the other 257 as the test set.
In the training phase, all agents are trained sequentially on the trajectories in the training set, with the demonstration rewards as feedback. In the testing phase, we pause all learning, test on the 257 held-out trajectories independently, and record each agent's cooperation rate. In each test trajectory, we compare the evolution of the agent's cooperation rate to that of the human data and compute a prediction error.
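The test-phase comparison can be sketched as follows, assuming cooperation rates are tracked cumulatively per round; the exact error definition used in the paper may differ from this signed per-round average:

```python
def trajectory_coop_rates(actions):
    """Cumulative cooperation rate after each round of one trajectory."""
    rates, coops = [], 0
    for t, a in enumerate(actions, start=1):
        coops += (a == "C")
        rates.append(coops / t)
    return rates

def prediction_error(agent_actions, human_actions):
    """Mean per-round gap between the agent's and the humans' cooperation-rate
    curves on one held-out trajectory (signed: negative means the agent
    cooperates less than the humans did)."""
    agent = trajectory_coop_rates(agent_actions)
    human = trajectory_coop_rates(human_actions)
    return sum(a - h for a, h in zip(agent, human)) / len(human)
```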
Figure 4 summarizes the testing results of all the agents in predicting the actions and cooperation rates from the human data. From the heatmap of the cooperation rates, we observe that the behavioral policy each agent clones from the data varies by class. RL algorithms all seem to learn to defect at all costs ('tragedy of the commons'). CB algorithms mostly converge to a policy with a fixed cooperation rate. Compared with the other two classes, MAB algorithms learn more diverse cooperation rates across test cases. The line plot on the right confirms this picture: the cooperation rate of the real humans (the black curve) tends to decline slowly from around 70% to around 40%. UCB1 and epsilon-greedy both capture this decaying property, mimicking the strategy of the human actions. The prediction error analysis matches this intuition: UCB1 and epsilon-greedy (and MAB algorithms in general) appear to capture human cooperation best.
## 7 Clinical Evidences and Implications
Evidence has linked dopamine function to reinforcement learning via midbrain neurons and their connections to the basal ganglia, limbic regions, and cortex. Neural firing rates computationally represent reward magnitude, expectancy, and violations of expectancy (prediction error), among other value-based signals [45], allowing an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy that maximizes outcomes by approaching/choosing cues with higher expected value and avoiding cues associated with loss or punishment. This is similar to reinforcement learning as widely used in computing and robotics [46], suggesting mechanistic overlap in humans and AI. Evidence of Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates [9] and in human striatum via blood-oxygen-level-dependent (BOLD) imaging [38].
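The reward prediction error described above is commonly formalized as the temporal-difference (TD) error; a minimal illustrative sketch (the function names and parameters are ours, not from the cited studies):

```python
def td_error(r, v_next, v_current, gamma=0.95):
    """Temporal-difference (reward prediction) error:
    delta = r + gamma * V(s') - V(s). A positive delta signals a
    better-than-expected outcome; a negative delta a worse one."""
    return r + gamma * v_next - v_current

def td_update(v, s, r, s_next, alpha=0.1, gamma=0.95):
    """State-value update driven by the prediction error, mirroring how
    dopaminergic signals are thought to train value expectations."""
    delta = td_error(r, v.get(s_next, 0.0), v.get(s, 0.0), gamma)
    v[s] = v.get(s, 0.0) + alpha * delta
    return delta
```

Once the value of a cue is learned, the same reward produces a near-zero delta, matching the finding that dopamine responses shift from rewards to the cues that predict them [45].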
The literature on reward processing abnormalities in particular neurological and psychiatric disorders is quite extensive; below we summarize some recent developments in this fast-growing field. It is well known that the neuromodulator dopamine plays a key role in reinforcement learning processes. Parkinson's disease (PD) patients, who have depleted dopamine in the basal ganglia, tend to have impaired performance on tasks that require learning from trial and error. For example, [17] demonstrate that off-medication PD patients are better at learning to avoid choices that lead to negative outcomes than they are at learning from positive outcomes, while the dopamine medication typically used to treat PD symptoms reverses this bias. Alzheimer's disease (AD) is the most common cause of dementia in the elderly and, besides memory impairment, is associated with a variable degree of executive function and visuospatial impairment. As discussed in [40], AD patients show decreased pursuit of rewarding behaviors, including loss of appetite; these changes are often secondary to apathy, associated with diminished reward system activity. Moreover, poor performance on certain tasks is associated with memory impairments. Behavioral-variant frontotemporal dementia (bvFTD) usually involves a progressive change in personality and behavior including disinhibition, apathy, eating changes, repetitive or compulsive behaviors, and loss of empathy [40], and it is hypothesized that those changes are associated with abnormalities in reward processing. For instance, alterations in eating habits, with a preference for sweet, carbohydrate-rich foods and overeating in bvFTD
patients can be associated with an abnormally increased reward representation for food, or with impairment in the negative (punishment) signal associated with fullness. The authors in [34] suggest that the strength of the association between a stimulus and the corresponding response is more susceptible to degradation in attention-deficit/hyperactivity disorder (ADHD) patients, which suggests problems with storing stimulus-response associations. Among other functions, storing these associations requires working memory capacity, which is often impaired in ADHD patients. [43] demonstrated that patients suffering from addictive behavior have heightened stimulus-response associations, resulting in enhanced reward-seeking behavior for the stimulus that generated such associations. [47] suggested that chronic pain can elicit a hypodopaminergic (low-dopamine) state that impairs motivated behavior, resulting in a reduced drive in chronic pain patients to pursue rewards. Reduced reward response may underlie a key system mediating the anhedonia and depression that are common in chronic pain.
## 8 Discussion
The broader motivation of this work is to increase the two-way traffic between artificial intelligence and neuropsychiatry, in the hope that a deeper understanding of brain mechanisms revealed by how they function ('neuro') and dysfunction ('psychiatry') can provide for better AI models, and conversely AI can help to conceptualize the otherwise bewildering complexity of the brain.
The behavioral cloning results suggest that bandit algorithms (without context) are the best in terms of fitting the human data, which opens the hypothesis that humans do not consider the context when they play the iterated prisoner's dilemma. This discovery motivates new modeling efforts on human studies in the bandit framework, and points to future experimental designs that incorporate these new parametric settings and control conditions. In particular, we propose that our approach may be relevant to studying reward processing in different mental disorders, for which some mechanistic insights are available. A body of recent literature has demonstrated that a spectrum of neurological and psychiatric disease symptoms are related to biases in learning from positive and negative feedback [35]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance state value and incur pathological reward-seeking behavior, like gambling or substance use. Conversely, enhanced aversive error signals result in dampened reward experience, thereby causing symptoms like apathy, social withdrawal, fatigue, and depression. Both genetic predispositions and experiences during critical periods of development can predispose an individual to learn from positive or negative outcomes, making them more or less at risk for brain-based illnesses [20]. This highlights the need to understand how intelligent systems learn from rewards and punishments, and how experience sampling may impact reinforcement learning during influential training periods. Simulation results of the mental variants match many of the clinical implications presented here, but also point to other complications from the social setting that deserve future investigation.
The approach proposed in the present manuscript, we hope, will contribute to expand and deepen the dialogue between AI and neuropsychiatry.
## 9 Conclusion
In this work, we explore the full spectrum of online learning agents: multi-armed bandits, contextual bandits and reinforcement learning. To quantitatively study their behaviors, we evaluate them in a series of iterated prisoner's dilemma tournaments. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, where we observe that the contextual bandit does not perform well in the tournament, which means that considering only the current situation to make decisions is the worst strategy in this kind of game: one should either ignore the current situation or take more situations into account, but not attend to the current one alone. We have also studied the capacity of these algorithms to fit human behavior. We observed that bandit algorithms (without context) are the best in terms of fitting the human data, which opens the hypothesis that humans do not consider the context when they play the IPD. Next steps include extending our evaluations to other sequential social dilemma environments with more complicated and mixed incentive structures, such as the fruit Gathering game and the Wolfpack hunting game [25,49], comparing these mechanistic decision-making models with predictive surrogate models [30,31], and building reinforcement-learning-based recommendation systems that model properties of human decision making [27].
## References
1. Agrawal, S., Goyal, N.: Thompson sampling for contextual bandits with linear payoffs. In: ICML (3). pp. 127-135 (2013)
2. Andreoni, J., Miller, J.H.: Rational cooperation in the finitely repeated prisoner's dilemma: Experimental evidence. Econ. J. 103 (1993)
3. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), 235-256 (2002)
4. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1) (2002)
5. Axelrod, R.: Effective choice in the prisoner's dilemma. Journal of conflict resolution 24 (1980)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science (1981)
7. Balakrishnan, A., Bouneffouf, D., Mattei, N., Rossi, F.: Incorporating behavioral constraints in online AI systems. In: Proceedings of AAAI (2019)
8. Balakrishnan, A., Bouneffouf, D., Mattei, N., Rossi, F.: Using multi-armed bandits to learn ethical priorities for online ai systems. IBM Journal of Research and Development 63 (2019)
9. Bayer, H.M., Glimcher, P.W.: Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron 47 (1), 129-141 (jul 2005)
10. Bereby-Meyer, Y., Roth, A.E.: The speed of learning in noisy games: Partial reinforcement and the sustainability of cooperation. American Economic Review 96 (4), 1029-1042 (2006)
11. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.: Contextual bandit algorithms with supervised learning guarantees. In: AISTATS (2011)
12. BΓ³, P.D.: Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. American economic review 95 (2005)
13. Bouneffouf, D., Rish, I.: A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040 (2019)
14. Bouneffouf, D., Rish, I., Cecchi, G.A.: Bandit models of human behavior: Reward processing in mental disorders. In: AGI. Springer (2017)
15. Capraro, V.: A model of human cooperation in social dilemmas. PloS one 8 (8) (2013)
16. Even-Dar, E., Mansour, Y.: Learning rates for q-learning. Journal of Machine Learning Research 5 (Dec), 1-25 (2003)
17. Frank, M.J., Seeberger, L.C., O'reilly, R.C.: By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306 (5703), 1940-1943 (2004)
18. Gupta, G.: Obedience-based multi-agent cooperation for sequential social dilemmas (2020)
19. Hasselt, H.V.: Double q-learning. In: NIPS (2010)
20. Holmes, A.J., Patrick, L.M.: The Myth of Optimality in Clinical Neuroscience. Trends in Cognitive Sciences 22 (3), 241-257 (feb 2018)
21. Johnson, A., Proctor, R.W.: Attention: Theory and practice. Sage (2004)
22. Kies, M.: Finding best answers for the iterated prisoner's dilemma using improved q-learning. Available at SSRN 3556714 (2020)
23. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4-22 (1985)
24. Langford, J., Zhang, T.: The epoch-greedy algorithm for multi-armed bandits with side information. In: NIPS (2008)
25. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint (2017)
26. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: WSDM (2011)
27. Lin, B.: Supervisorbot: Nlp-annotated real-time recommendations of psychotherapy treatment strategies with deep reinforcement learning. arXiv preprint (2022)
28. Lin, B., Bouneffouf, D., Cecchi, G.: Split q learning: reinforcement learning with two-stream rewards. In: Proceedings of the 28th IJCAI (2019)
29. Lin, B., Bouneffouf, D., Cecchi, G.: Unified models of human behavioral agents in bandits, contextual bandits, and rl. arXiv preprint arXiv:2005.04544 (2020)
30. Lin, B., Bouneffouf, D., Cecchi, G.: Predicting human decision making in psychological tasks with recurrent neural networks. PloS one 17 (5), e0267907 (2022)
31. Lin, B., Bouneffouf, D., Cecchi, G.: Predicting human decision making with lstm. In: 2022 International Joint Conference on Neural Networks (IJCNN) (2022)
32. Lin, B., Bouneffouf, D., Reinen, J., Rish, I., Cecchi, G.: A story of two streams: Reinforcement learning models from human behavior and neuropsychiatry. In: Proceedings of the 19th AAMAS. pp. 744-752 (2020)
33. Lin, B., Cecchi, G., Bouneffouf, D., Reinen, J., Rish, I.: Models of human behavioral agents in bandits, contextual bandits and rl. In: International Workshop on Human Brain and Artificial Intelligence. pp. 14-33. Springer (2021)
34. Luman, M., Van Meel, C.S., Oosterlaan, J., Sergeant, J.A., Geurts, H.M.: Does reward frequency or magnitude drive reinforcement-learning in attentiondeficit/hyperactivity disorder? Psychiatry research 168 (3) (2009)
35. Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience 14 (2), 154-162 (2011)
36. Nay, J.J., Vorobeychik, Y.: Predicting human cooperation. PloS one (2016)
37. Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K.R., Campbell, M., Singh, M., Rossi, F.: Teaching AI agents ethical values using reinforcement learning and policy orchestration. In: Proceedings of the 28th IJCAI. pp. 6377-6381 (2019)
38. O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable Roles of Ventral and Dorsal Striatum in Instrumental. Science 304 (16 April), 452-454 (2004)
39. Park, H., Kim, K.J.: Active player modeling in the iterated prisoner's dilemma. Computational intelligence and neuroscience 2016 (2016)
40. Perry, D.C., Kramer, J.H.: Reward processing in neurodegenerative disease. Neurocase 21 (1), 120-133 (2015)
41. Press, W.H., Dyson, F.J.: Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. PNAS 109 (26), 10409-10413 (2012)
42. Rapoport, A., Chammah, A.M., Orwant, C.J.: Prisoner's dilemma: A study in conflict and cooperation, vol. 165. University of Michigan press (1965)
43. Redish, A.D., Jensen, S., Johnson, A., Kurth-Nelson, Z.: Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychological review 114 (3), 784 (2007)
44. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems, vol. 37. University of Cambridge, Cambridge, England (1994)
45. Schultz, W., Dayan, P., Montague, P.R.: A Neural Substrate of Prediction and Reward. Science 275 (5306), 1593-1599 (mar 1997)
46. Sutton, R.S., Barto, A.G., et al.: Introduction to reinforcement learning, vol. 135. MIT Press (1998)
47. Taylor, A.M., Becker, S., Schweinhardt, P., Cahill, C.: Mesolimbic dopamine signaling in acute and chronic pain: implications for motivation, analgesia, and addiction. Pain 157 (6), 1194 (2016)
48. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika (1933)
49. Wang, W., Hao, J., Wang, Y., Taylor, M.: Towards cooperation in sequential prisoner's dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162 (2018)