## Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
Baihan Lin 1,∗, Djallel Bouneffouf 2 and Guillermo Cecchi 2
1 Columbia University, New York, USA
2 IBM Research, Yorktown Heights, NY, USA
baihan.lin@columbia.edu, djallel.bouneffouf@ibm.com, gcecchi@us.ibm.com
Abstract. As an important psychological and social experiment, the Iterated Prisoner's Dilemma (IPD) treats the choice to cooperate or defect as an atomic action. We propose to study the behaviors of online learning algorithms in the IPD game, where we investigate the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits and reinforcement learning. We evaluate them based on a tournament of iterated prisoner's dilemma where multiple agents can compete in a sequential fashion. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, and also allows us to study the capacity of these algorithms to fit human behaviors. Results suggest that considering only the current situation to make a decision is the worst strategy in this kind of social dilemma game. Multiple discoveries on online learning behaviors and clinical validations are stated, as an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions. 3
Keywords: Online learning, Bandits, Contextual bandits, Reinforcement learning, Iterated Prisoner's Dilemma, Behavioral modeling
## 1 Introduction
Social dilemmas expose tensions between cooperation and defection. Understanding the best way of playing the iterated prisoner's dilemma (IPD) has been of interest to the scientific community since the formulation of the game seventy years ago [5]. To evaluate algorithms, a round-robin computer tournament was proposed, in which algorithms competed against each other [2]. The winner was decided by the average score a strategy achieved. Using this framework, we propose here to focus on studying reward-driven online learning algorithms with different types of attention mechanisms, where we define attention "as the behavioral and cognitive process of selectively concentrating on a discrete stimulus while ignoring other perceivable stimuli" [21]. Following this definition, we
3 The data and code to reproduce all the empirical results can be accessed at https://github.com/doerlbh/dilemmaRL.
analyze three classes of algorithms: the no-attention-to-the-context online learning agents (the multi-armed bandit algorithms), which output an action but do not use any information about the state of the environment (context); the contextual bandit algorithms, which extend the model by making the decision conditional on the current state of the environment; and finally reinforcement learning, which extends contextual bandits by making decisions conditional on both the current state and the next state of the unknown environment. This paper mainly focuses on answering two questions:
- Does attending to the context help an online learning algorithm maximize its rewards in an IPD tournament, and how do different attention biases shape behavior?
- Does attending to the context help an online learning algorithm mimic human behavior?
To answer these questions, we performed two experiments: (1) A tournament of the iterated prisoner's dilemma: since the seminal tournament in 1980 [5], a number of IPD tournaments have been undertaken [2,12,10]. In this work, we adopt a similar tournament setting, but also extend it to cases with more than two players. Empirically, we evaluate the algorithms in three settings of the Iterated Prisoner's Dilemma: the pairwise-agent tournament, the three-agent tournament, and the 'mental'-agent tournament. (2) A behavioral cloning prediction task: we train the three types of algorithms to mimic human behavior on a training set and then evaluate them on a test set. Our main results are the following:
- We observe that contextual bandits do not perform well in the tournament, which means that considering only the current situation to make a decision is the worst strategy in this kind of social dilemma game. One should either ignore the current situation or take more situations into account, but not just the current one.
- We observe that bandit algorithms (without context) are the best in terms of fitting the human data, which implies that humans may not consider the context when they play the iterated prisoner's dilemma.
This paper is organized as follows. We first review related work and introduce some background concepts. Then we explain the two experiments we performed. The experimental evaluation highlights our empirical results. Finally, the last section concludes the paper and points out possible directions for future work.
As far as we are aware, this is the first work to evaluate online learning algorithms in social gaming settings. Although the agents that we evaluate here are not newly proposed by us, we believe that, given this understudied information-asymmetry problem setting, our work helps the community understand how the inductive biases of different methods yield different behaviors in social agent settings (e.g. the iterated prisoner's dilemma), and thus provides a nontrivial contribution to the field, both in understanding machine learning algorithms and in studying mechanistic models of human behaviors in social settings.
## 2 Related Work
There is much computational work focused on understanding the strategy space and finding winning strategies in the iterated prisoner's dilemma. The authors in [22] present and discuss several improvements to the Q-Learning algorithm, allowing for an easy numerical measure of the exploitability of a given strategy. [18] proposes a mechanism for achieving cooperation and communication in multi-agent reinforcement learning settings by intrinsically rewarding agents for obeying the commands of other agents. We are interested in investigating how algorithms behave and also how they model human decisions in the IPD, with the larger goal of understanding human decision-making. For instance, the authors in [39] propose an active modeling technique to predict the behavior of IPD players. The proposed method can model the opponent player's behavior while taking advantage of interactive game environments. The data showed that the observer was able to build, through direct actions, a more accurate model of an opponent's behavior than when the data were collected through random actions. [15] proposes the first predictive model of human cooperation able to organize a number of different experimental findings that are not explained by the standard model, and shows that the model makes satisfactorily accurate quantitative predictions of population average behavior in one-shot social dilemmas. To the best of our knowledge, no study has explored the full spectrum of reinforcement learning agents (multi-armed bandits, contextual bandits and reinforcement learning) in social dilemmas.
## 3 Background
Multi-Armed Bandit (MAB): The multi-armed bandit (MAB) algorithm models a sequential decision-making process, where at each time point the algorithm selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time [23,3].
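As a concrete illustration of this setting, below is a minimal UCB1-style sketch; the simulated Bernoulli arms and all names are our own, for illustration only.

```python
import math
import random

def ucb1(n_arms, horizon, pull):
    """Minimal UCB1 sketch: pull(arm) -> reward in [0, 1]; returns per-arm pull counts."""
    counts = [0] * n_arms          # number of times each arm was chosen
    values = [0.0] * n_arms        # running mean reward of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:            # play every arm once first
            arm = t - 1
        else:                      # then pick the arm with the highest UCB index
            arm = max(range(n_arms),
                      key=lambda k: values[k] + math.sqrt(2 * math.log(t) / counts[k]))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update
    return counts

# Example: arm 1 pays off more often, so UCB1 should pull it most of the time.
probs = [0.3, 0.7]
counts = ucb1(2, 500, lambda a: 1.0 if random.random() < probs[a] else 0.0)
```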
Contextual Bandit Algorithm (CB). Following [24], this problem is defined as follows. At each time point (iteration) $t \in \{1, \dots, T\}$, an agent is presented with a context (feature vector) $x_t \in \mathbf{R}^N$ before choosing an arm $k \in A = \{1, \dots, K\}$. We will denote by $X = \{X_1, \dots, X_N\}$ the set of features (variables) defining the context. Let $\mathbf{r}_t = (r_t^1, \dots, r_t^K)$ denote a reward vector, where $r_t^k \in [0, 1]$ is the reward at time $t$ associated with arm $k \in A$. Herein, we primarily focus on the Bernoulli bandit with binary rewards, i.e. $r_t^k \in \{0, 1\}$. Let $\pi: X \to A$ denote a policy, and let $D_{c,r}$ denote a joint distribution over $(x, r)$. We assume that the expected reward is a linear function of the context, i.e. $E[r_t^k | x_t] = \mu_k^T x_t$, where $\mu_k$ is an unknown weight vector associated with arm $k$.
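Under this linear-reward assumption, a disjoint LinUCB-style sketch might look as follows; class and parameter names are ours, and this is a schematic rather than the exact implementation used in the experiments.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB sketch for the linear-reward assumption E[r_k | x] = mu_k^T x."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]      # per-arm Gram matrix (ridge prior)
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # per-arm reward-weighted context sums

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            mu_hat = A_inv @ b                              # ridge estimate of mu_k
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)     # exploration bonus
            scores.append(mu_hat @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += r * x
```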
Reinforcement Learning (RL). Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP) [46]. An MDP is defined by the tuple $(S, A, T, R, \gamma)$, where $S$ is a set of possible states, $A$ is a set of actions, $T$ is a transition function defined as $T(s, a, s') = \Pr(s' | s, a)$, where $s, s' \in S$ and $a \in A$, $R: S \times A \times S \mapsto \mathbf{R}$ is a reward function, and $\gamma$
Fig. 1: Success, Teamwork, Cooperation & Competition in two-agent tournament.
(Four heatmaps over all agent pairings: normalized reward, teamwork in rewards, percentage of cooperation, and competition in rewards.)
is a discount factor that decreases the impact of past rewards on the current action choice. Typically, the objective is to maximize the discounted long-term reward, assuming an infinite-horizon decision process, i.e. to find a policy function $\pi: S \mapsto A$ which specifies the action to take in a given state, so that the cumulative reward is maximized: $\max_\pi \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})$.
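For concreteness, a minimal tabular Q-learning update toward this objective might look as follows; this is a generic sketch, not the exact agents evaluated later, and the IPD-style state and action strings are only illustrative.

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # Q-values default to 0
q_learning_step(Q, s="CC", a="D", r=5.0, s_next="CD", actions=["C", "D"])
```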
## 4 Experimental Setup
Here, we describe the two main experiments we run: the Iterated Prisoner's Dilemma (IPD) tournament and Behavioral Cloning with Demonstration Rewards (BCDR).
## 4.1 Iterated Prisoner's Dilemma (IPD)
The Iterated Prisoner's Dilemma (IPD) can be defined as a matrix game $G = [N, \{A_i\}_{i \in N}, \{R_i\}_{i \in N}]$, where $N$ is the set of agents, $A_i$ is the set of actions available to agent $i$ with $A$ being the joint action space $A_1 \times \cdots \times A_n$, and $R_i$ is the reward function for agent $i$. A special case of this generic multi-agent IPD is the classical two-agent case (Table 1). In this game, each agent has two actions: cooperate (C) and defect (D), and can receive one of four possible payoffs: R (Reward), P (Penalty), S (Sucker), and T (Temptation). In the multi-agent setting, if all agents cooperate (C), they all receive Reward (R); if all agents defect (D), they all receive Penalty (P); if some agents cooperate and some defect, the cooperators receive Sucker (S) and the defectors receive Temptation (T). The four payoffs satisfy the inequalities $T > R > P > S$ and $2R > T + S$. The PD is a one-round game, but it is commonly studied in an iterated form, in which prior outcomes matter, to understand the evolution of cooperative behaviour from complex dynamics [6].
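To make this payoff rule concrete, here is a minimal sketch; the function name and action encoding are ours, using the classical values $T=5$, $R=3$, $P=1$, $S=0$ adopted later in Section 5.

```python
def ipd_payoffs(actions, R=3, P=1, S=0, T=5):
    """actions: list of 'C'/'D', one per agent; returns one payoff per agent."""
    if all(a == 'C' for a in actions):
        return [R] * len(actions)            # mutual cooperation -> Reward
    if all(a == 'D' for a in actions):
        return [P] * len(actions)            # mutual defection -> Penalty
    # mixed outcome: cooperators get Sucker, defectors get Temptation
    return [S if a == 'C' else T for a in actions]

assert ipd_payoffs(['C', 'D']) == [0, 5]
assert ipd_payoffs(['C', 'C', 'D']) == [0, 0, 5]
```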
## 4.2 Behavioral Cloning with Demonstration Rewards
Here we define a new type of multi-agent online learning setting, Behavioral Cloning with Demonstration Rewards (BCDR), and present a novel training procedure and agent for solving this problem. In this setting, and similar to [8,7,37], the agent first goes through a constraint-learning phase where it is allowed to query the actions and receives feedback $r^e_k(t) \in [0, 1]$ about whether or not the chosen decision matches the teacher's action (from demonstration). During the deployment (testing) phase, the goal of the agent is to maximize both $r_k(t) \in [0, 1]$, the reward of action $k$ at time $t$, and the (unobserved) $r^e_k(t) \in [0, 1]$, which models whether or not taking action $k$ matches the action the teacher would have taken. During the deployment phase, the agent receives no feedback on the value of $r^e_k(t)$, and we observe how well the behavior captures the teacher's policy profile. In our specific problem, the human data plays the role of the teacher, and the behavioral cloning aims to train our agents to mimic human behaviors.
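A schematic of the two phases as described above, with an assumed `act`/`update` agent interface and a 0/1 match feedback; all names here are our own sketch, not the paper's implementation.

```python
def bcdr_train(agent, demonstrations):
    """Constraint-learning phase: feedback is whether the agent matched the teacher."""
    for context, teacher_action in demonstrations:
        chosen = agent.act(context)
        r_e = 1.0 if chosen == teacher_action else 0.0   # demonstration reward, observed here
        agent.update(context, chosen, r_e)

def bcdr_test(agent, trajectories):
    """Deployment phase: no feedback on r_e; we only record the cloned behavior."""
    cooperation = []
    for context, _teacher_action in trajectories:
        cooperation.append(agent.act(context) == 'C')
    return sum(cooperation) / max(len(cooperation), 1)    # cooperation rate of the cloned policy
```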
## 4.3 Online Learning Agents
We briefly outline the different types of online learning algorithms we have used. Multi-Armed Bandit (MAB): The multi-armed bandit algorithm models a sequential decision-making process, where at each time point the algorithm selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time [23,3,13]. In the multi-armed bandit agent pool, we have Thompson Sampling (TS) [48], Upper Confidence Bound (UCB) [3], epsilon-Greedy (eGreedy) [46], EXP3 [4] and Human-Based Thompson Sampling (HBTS) [14].
Contextual Bandit (CB). Following [24], this problem is defined as follows. At each time point (iteration), an agent is presented with a context (feature vector) before choosing an arm. In the contextual bandit agent pool, we have Contextual Thompson Sampling (CTS) [1], LinUCB [26], EXP4 [11] and Split Contextual Thompson Sampling (SCTS) [29,33].
Reinforcement Learning (RL). Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP) [46]. An MDP is defined by a tuple with a set of possible states, a set of actions and a transition function. In the reinforcement learning agent pool, we have Q-Learning (QL), Double Q-Learning (DQL) [19], State-action-reward-state-action (SARSA) [44] and Split Q-Learning (SQL) [28,32]. We also selected the three most popular handcrafted policies for the Iterated Prisoner's Dilemma: 'Coop' stands for always cooperating, 'Dfct' stands for always defecting, and 'Tit4Tat' stands for playing what the opponent chose in the previous round (the winning approach in the 1980 IPD tournament [5]).
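For reference, the three handcrafted policies amount to the following minimal sketch, where `history` is a list of (own action, opponent action) pairs; the function names are ours.

```python
def coop(history):            # 'Coop': always cooperate
    return 'C'

def dfct(history):            # 'Dfct': always defect
    return 'D'

def tit4tat(history):         # 'Tit4Tat': cooperate first, then copy the opponent's last move
    return 'C' if not history else history[-1][1]
```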
The agents evaluated in this work are the most common online learning agents in bandits, contextual bandits and reinforcement learning (the three online learning classes). We reasoned that pitting them against one another, and pitting the three online learning classes against one another, would be an informative experiment for studying how the inductive biases of different methods yield different behaviors in social agent settings (e.g. the iterated prisoner's dilemma).
Fig. 2: Cumulative reward and cooperation rate averaged by class in two- and three-player setting.
Fig. 3: Reward feedbacks and cooperation rates in some two-player and the three-player settings.
## 5 Results: Algorithms' Tournament
Game settings. The payoffs are set as in the classical IPD game: $T = 5$, $R = 3$, $P = 1$, $S = 0$. Following [42], we create standardized payoff measures from the R, S, T, P values using two differences between payoffs associated with important game outcomes, both normalized by the difference between the temptation to defect and being a sucker when cooperating as the other defects.
State representations. In most of the IPD literature, the state is defined as the pair of previous actions of self and opponent. Studies suggest that only a single previous state is needed to define any prisoner's dilemma strategy [41]. However, as we are interested in understanding the role of three levels of information (no information, context without state, and both context and state), we expand the state representation to account for the past $n$ pairs of actions as the history (or memory) of the agents. For CB algorithms, this history is their context. For RL algorithms, this history is their state representation. In the following sections, we present results in which the memory is set to the past 5 action pairs.
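One possible memory-$n$ encoding is sketched below; the binary encoding and padding choice are our assumptions. The past $n$ action pairs are flattened into a vector that serves as the CB context or indexes the RL state.

```python
def history_features(history, n=5):
    """Encode the last n (own, opponent) action pairs as a flat 0/1 vector (C=1, D=0).
    Missing early rounds are padded; the padding value is an arbitrary choice here."""
    recent = history[-n:]
    pad = [('D', 'D')] * (n - len(recent))
    pairs = pad + recent
    return [1 if a == 'C' else 0 for own, opp in pairs for a in (own, opp)]

# Example: with memory 5 the context/state vector has 10 entries.
assert len(history_features([('C', 'D'), ('D', 'D')], n=5)) == 10
```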
Learning settings. In all experiments, the discount factor $\gamma$ was set to 0.95. Exploration is included via the $\epsilon$-greedy algorithm with $\epsilon$ set to 0.05 (except for the algorithms that already have an exploration mechanism).
Fig. 4: Behavioral Cloning: bandits modeled human data the best with the lowest prediction error.
(Left panel: cooperation rate over rounds for each agent compared with the human data; right panel: prediction error of each agent against the human cooperation rate, memory of 5.)
The learning rate was polynomial, $\alpha_t(s, a) = 1/n_t(s, a)^{0.8}$, which was shown in previous work to be better in theory and in practice [16]. All experiments were performed and averaged over at least 100 runs, and over 50 steps of dueling actions from the initial state.
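A minimal sketch of this polynomial learning-rate schedule; the visit counter and function name are ours.

```python
from collections import defaultdict

visits = defaultdict(int)    # n_t(s, a): number of visits to each (state, action) pair

def learning_rate(s, a, power=0.8):
    """alpha_t(s, a) = 1 / n_t(s, a)^0.8, counting the current visit."""
    visits[(s, a)] += 1
    return 1.0 / (visits[(s, a)] ** power)
```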
Reported measures. To capture the behavior of the algorithms, we report five measures: individual normalized rewards, collective normalized rewards, the difference of normalized rewards, the cooperation rate, and the normalized reward feedback at each round. We are interested in the individual rewards since they correspond to the expected cumulative discounted reward that online learning agents effectively aim to maximize. We are interested in the collective rewards because they may offer important insights into the teamwork of the participating agents. We are interested in the difference between each individual player's reward and the average reward of all participating players because it may capture the internal competition within a team. We record the cooperation rate as the percentage of cooperation across all rounds, since it is not only a probe for the emergence of strategies, but also the standard measure in behavioral modeling for comparing human data and models [36,30,31]. Lastly, we provide the reward feedback at each round as a diagnostic tool to understand the specific strategy that emerged in each game. (The color codes throughout this paper are kept constant for each of the 14 agents, such that all handcrafted agents have green-ish colors, MAB agents red-ish, CB agents blue-ish and RL agents purple-ish.)
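As an illustration, a minimal per-round computation of these measures; the linear normalization by $T-S$ is our assumption and may differ from the exact normalization used in the experiments.

```python
def round_measures(payoffs, actions, S=0, T=5):
    """payoffs/actions: one entry per agent for a single round."""
    norm = [(p - S) / (T - S) for p in payoffs]          # individual normalized rewards
    collective = sum(norm)                               # collective normalized reward
    mean = collective / len(norm)
    competition = [r - mean for r in norm]               # difference from the team average
    coop_rate = sum(a == 'C' for a in actions) / len(actions)
    return norm, collective, competition, coop_rate
```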
## 5.1 Multi-Agent Tournament
Results for two-agent tournament. We record the behaviors of the agents playing against each other (and with themselves). Figure 1 summarizes the reward and behavior patterns of the tournament. We first notice that MAB and RL algorithms learn to cooperate when their opponent is Coop , yielding a high mutual reward, while CB algorithms mostly decide to defect on Coop to exploit its trust. From the cooperation heatmap, we also observe that RL algorithms appear to be more defective when facing an MAB or CB algorithm than facing another RL algorithm. MAB algorithms are more defective when facing a CB algorithm than facing an RL or another MAB algorithm. Adversarial algorithms EXP3 and
EXP4 fail to learn any distinctive policy. We also note interesting teamwork and competition behaviors in the heatmaps of collective rewards and relative rewards: CB algorithms are the best team players, yielding overall the highest collective rewards, followed by RL; RL algorithms are the most competitive opponents, yielding overall the highest relative rewards, followed by MAB.
Figure 2 summarizes the averaged reward and cooperation for each agent class, where we observe that handcrafted algorithms perform the best, followed by RL algorithms and then MAB algorithms. CB algorithms receive the lowest final rewards among the four agent classes. Surprisingly, this also suggests that a lower cooperation rate does not imply a higher reward. The most cooperative learning algorithm class is CB, followed by RL. MAB, the most defective agents, do not score the highest.
Detailed probing into specific games (Figure 3) uncovers more diverse strategies than those revealed by the cooperation rates alone. For instance, in the game of QL vs. CTS, we observe that CTS converges to a fixed cooperation rate within the first few rounds and stays constant thereafter, while QL gradually decays its cooperation rate. In the game of UCB1 vs. DQL, UCB1 seems to oscillate between a high and a low cooperation rate within the first few rounds (because it is built to explore all actions first), while DQL gradually decays its cooperation rate. In DQL vs. Tit4Tat, we observe a seemingly mimicking effect of DQL toward tit-for-tat-like behavior. In the game of SARSA vs. LinUCB, LinUCB converges to a fixed cooperation rate within the first few rounds and stays constant thereafter, while SARSA slowly decays its cooperation rate. There seems to be a universality across the three classes within the first few rounds.
Cognitive interpretations of these learning systems. The main distinctions between the three classes of algorithms are the complexity of the learning mechanism and the cognitive system they adopt. In the MAB setting, there is no attention to any context, and the agents aim to most efficiently allocate a fixed limited set of cognitive resources between competing (alternative) choices in a way that maximizes their expected gain. In the CB setting, the agents apply an attention mechanism to the current context, and aim to collect enough information about how the context vectors and rewards relate to each other so that they can predict the next best action to play by looking at the feature vectors. In the RL setting, the agents not only pay attention to the current context, but also apply the attention mechanism to multiple contexts related to different states, and aim to use past experience to find out which actions lead to higher cumulative rewards. Our results suggest that in an Iterated Prisoner's Dilemma between two learning systems, an optimal learning policy should hold memory for different state representations and allocate attention to different contexts across the states, which explains the overall best performance by RL algorithms. This further suggests that in zero-sum games like the IPD, participating learning systems tend to undergo multiple states. The overall underperformance of CB suggests that attention to only the current context is not sufficient without the state representation, because the learning system might mix the context-dependent reward mappings of multiple states, which can oversimplify the policy and potentially mislead the
learning as an interfering effect. On the other hand, MAB ignores the context information entirely, so it is not susceptible to the interfering effect from the representations of different contexts. Its learned policies, however, do not exhibit any flexibility to account for major changes in the state (e.g., the opponent may have just finished a major learning episode and switched strategies).
Results for the three-agent tournament. Here we wish to understand how all three classes of algorithms interact in the same arena. For each game, we pick one algorithm from each class (one from MAB, one from CB and one from RL) to make our player pool. We observe in Figure 2 a very similar pattern to the two-player case: RL agents demonstrate the best performance (highest final rewards), followed by MAB, and CB performs the worst. However, in the three-agent setting, although CB is still the most cooperative, RL becomes the most defective. More detailed probing into the specific games (Figure 3) demonstrates more diverse strategies than those revealed by the cooperation rates. Take the game of UCB1 vs. LinUCB vs. QL as an example: the MAB algorithm starts off as the most defective but later starts to cooperate more in the following rounds, while the RL algorithm becomes more and more defective. CB in both cases stays cooperative at a relatively high rate.
## 6 Behavioral Cloning with Human Data
We collate human data comprising 168,386 individual decisions from many human-subject experiments [2,12,10] that used real financial incentives and transparently conveyed the rules of the game to the subjects. As is standard procedure in experimental economics, subjects anonymously interact with each other and their decisions to cooperate or defect at each time period of each interaction are recorded. They receive payoffs proportional to the outcomes under the same or a similar payoff structure as the one we use in Table 1. Following similar preprocessing steps as in [36,30,31], we construct a comprehensive collection of game structures and individual decisions from the descriptions of the experiments in the published papers and the publicly available data sets. This comprehensive dataset consists of behavioral trajectories of different time horizons, ranging from 2 to 30 rounds, but most of these experiments only record full historical information for at most the past 9 actions. We further select only those trajectories with this full historical information, which comprises 8,257 behavioral trajectories. We randomly select 8,000 of them as the training set and the other 257 as the test set.
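A sketch of this selection and split; the trajectory field name and the fixed random seed are our assumptions for illustration.

```python
import random

def select_and_split(trajectories, min_history=9, n_train=8000, seed=0):
    """Keep trajectories with full history of at least `min_history` past actions,
    then split them at random into training and test sets."""
    eligible = [t for t in trajectories if len(t['actions']) >= min_history]
    random.Random(seed).shuffle(eligible)
    return eligible[:n_train], eligible[n_train:]
```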
In the training phase, all agents are trained with the demonstration rewards as feedback, sequentially over the trajectories in the training set. In the testing phase, we pause all learning and test on the 257 test trajectories independently, recording their cooperation rates. In each test trajectory, we compare the evolution of the agent's cooperation rate to that of the human data and compute a prediction error.
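A sketch of this per-trajectory comparison; the signed difference in cooperation rates is our reading of the prediction error, consistent with the positive and negative errors shown in Figure 4.

```python
def cooperation_rate(actions):
    """Percentage of 'C' choices in a trajectory."""
    return 100.0 * sum(a == 'C' for a in actions) / len(actions)

def prediction_error(agent_actions, human_actions):
    """Signed error in percentage points: positive means the agent cooperates more than humans."""
    return cooperation_rate(agent_actions) - cooperation_rate(human_actions)
```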
Figure 4 summarizes the testing results of all the agents in predicting the actions and their cooperation rates from human data. From the heatmap of the cooperation rates, we observe that the behavioral policy that each agent cloned from the data varies by class. RL algorithms all seem to learn to defect at all
costs ('tragedy of the commons'). CB algorithms mostly converge to a policy that adopts a fixed cooperation rate. Compared with the other two classes, MAB algorithms learn more diverse cooperation rates across test cases. The line plot on the right confirms this understanding. The cooperation rate of the real humans (the black curve) tends to decline slowly from around 70% to around 40%. UCB1 and epsilon-Greedy both capture this decaying property, mimicking the strategy of the human actions. The prediction error analysis matches this intuition: UCB1 and epsilon-Greedy (and MAB algorithms in general) appear to best capture human cooperation.
## 7 Clinical Evidences and Implications
Evidence has linked dopamine function to reinforcement learning via midbrain neurons and connections to the basal ganglia, limbic regions, and cortex. Neuronal firing rates computationally represent reward magnitude, expectancy, and violations (prediction error), along with other value-based signals [45], allowing an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy to maximize outcomes by approaching/choosing cues with higher expected value and avoiding cues associated with loss or punishment. This is similar to reinforcement learning widely used in computing and robotics [46], suggesting mechanistic overlap in humans and AI. Evidence of Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates [9] and in the human striatum by blood-oxygen-level-dependent imaging (BOLD) [38].
The literature on reward processing abnormalities in particular neurological and psychiatric disorders is quite extensive; below we summarize some of the recent developments in this fast-growing field. It is well known that the neuromodulator dopamine plays a key role in reinforcement learning processes. Parkinson's disease (PD) patients, who have depleted dopamine in the basal ganglia, tend to have impaired performance on tasks that require learning from trial and error. For example, [17] demonstrate that off-medication PD patients are better at learning to avoid choices that lead to negative outcomes than they are at learning from positive outcomes, while dopamine medication typically used to treat PD symptoms reverses this bias. Alzheimer's disease (AD) is the most common cause of dementia in the elderly and, besides memory impairment, is associated with a variable degree of executive function impairment and visuospatial impairment. As discussed in [40], AD patients have decreased pursuit of rewarding behaviors, including loss of appetite; these changes are often secondary to apathy, associated with diminished reward system activity. Moreover, poor performance on certain tasks is associated with memory impairments. Behavioral-variant frontotemporal dementia (bvFTD) usually involves a progressive change in personality and behavior including disinhibition, apathy, eating changes, repetitive or compulsive behaviors, and loss of empathy [40], and it is hypothesized that these changes are associated with abnormalities in reward processing. For instance, alterations in eating habits with a preference for sweet, carbohydrate-rich foods and overeating in bvFTD
patients can be associated with an abnormally increased reward representation for food, or with impairment of the negative (punishment) signal associated with fullness. The authors in [34] suggest that the strength of the association between a stimulus and the corresponding response is more susceptible to degradation in attention-deficit/hyperactivity disorder (ADHD) patients, which suggests problems with storing stimulus-response associations. Among other functions, storing these associations requires working memory capacity, which is often impaired in ADHD patients. [43] demonstrated that patients suffering from addictive behavior have heightened stimulus-response associations, resulting in enhanced reward-seeking behavior for the stimulus that generated such an association. [47] suggested that chronic pain can elicit a hypodopaminergic (low-dopamine) state that impairs motivated behavior, resulting in a reduced drive in chronic pain patients to pursue rewards. Reduced reward responses may underlie a key system mediating the anhedonia and depression that are common in chronic pain.
## 8 Discussion
The broader motivation of this work is to increase the two-way traffic between artificial intelligence and neuropsychiatry, in the hope that a deeper understanding of brain mechanisms revealed by how they function ('neuro') and dysfunction ('psychiatry') can provide for better AI models, and conversely AI can help to conceptualize the otherwise bewildering complexity of the brain.
The behavioral cloning results suggest that bandit algorithms (without context) are the best in terms of fitting the human data, which opens the hypothesis that humans are not considering the context when they play the iterated prisoner's dilemma. This discovery motivates new modeling efforts on human studies within the bandit framework, and points to future experimental designs that incorporate these new parametric settings and control conditions. In particular, we propose that our approach may be relevant to studying reward processing in different mental disorders, for which some mechanistic insights are available. A body of recent literature has demonstrated that a spectrum of neurological and psychiatric disease symptoms is related to biases in learning from positive and negative feedback [35]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance state value and incur pathological reward-seeking behavior, like gambling or substance use. Conversely, enhanced aversive error signals result in dampened reward experience, thereby causing symptoms like apathy, social withdrawal, fatigue, and depression. Both genetic predispositions and experiences during critical periods of development can predispose an individual to learn from positive or negative outcomes, making them more or less at risk for brain-based illnesses [20]. This highlights the need to understand how intelligent systems learn from rewards and punishments, and how experience sampling may impact reinforcement learning during influential training periods. Simulation results of the mental variants match many of the clinical implications presented here, but also point to other complications from the social setting that deserve future investigation.
The approach proposed in the present manuscript, we hope, will contribute to expand and deepen the dialogue between AI and neuropsychiatry.
## 9 Conclusion
In this work, we explore the full spectrum of online learning agents: multi-armed bandits, contextual bandits and reinforcement learning. To quantitatively study their behaviors, we evaluate them in a series of tournaments of the iterated prisoner's dilemma. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, where we observe that the contextual bandit does not perform well in the tournament, which means that considering only the current situation to make a decision is the worst strategy in this kind of game. One should either ignore the current situation or take more situations into account, but not just the current one. We have also studied the capacity of these algorithms to fit human behavior. We observed that bandit algorithms (without context) are the best in terms of fitting the human data, which opens the hypothesis that humans are not considering the context when they play the IPD. Next steps include extending our evaluations to other sequential social dilemma environments with more complicated and mixed incentive structures, such as the fruit Gathering game and the Wolfpack hunting game [25,49], comparing these mechanistic decision-making models with predictive surrogate models [30,31], and building reinforcement learning-based recommendation systems that model properties of human decision making [27].
## References
1. Agrawal, S., Goyal, N.: Thompson sampling for contextual bandits with linear payoffs. In: ICML (3). pp. 127-135 (2013)
2. Andreoni, J., Miller, J.H.: Rational cooperation in the finitely repeated prisoner's dilemma: Experimental evidence. Econ. J. 103 (1993)
3. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), 235-256 (2002)
4. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1) (2002)
5. Axelrod, R.: Effective choice in the prisoner's dilemma. Journal of conflict resolution 24 (1980)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science (1981)
7. Balakrishnan, A., Bouneffouf, D., Mattei, N., Rossi, F.: Incorporating behavioral constraints in online AI systems. In: Proceedings of AAAI (2019)
8. Balakrishnan, A., Bouneffouf, D., Mattei, N., Rossi, F.: Using multi-armed bandits to learn ethical priorities for online ai systems. IBM Journal of Research and Development 63 (2019)
9. Bayer, H.M., Glimcher, P.W.: Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron 47 (1), 129-141 (jul 2005)
10. Bereby-Meyer, Y., Roth, A.E.: The speed of learning in noisy games: Partial reinforcement and the sustainability of cooperation. American Economic Review 96 (4), 1029-1042 (2006)
11. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.: Contextual bandit algorithms with supervised learning guarantees. In: AISTATS (2011)
12. Bó, P.D.: Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. American economic review 95 (2005)
13. Bouneffouf, D., Rish, I.: A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040 (2019)
14. Bouneffouf, D., Rish, I., Cecchi, G.A.: Bandit models of human behavior: Reward processing in mental disorders. In: AGI. Springer (2017)
15. Capraro, V.: A model of human cooperation in social dilemmas. PloS one 8 (8) (2013)
16. Even-Dar, E., Mansour, Y.: Learning rates for q-learning. Journal of Machine Learning Research 5 (Dec), 1-25 (2003)
17. Frank, M.J., Seeberger, L.C., O'reilly, R.C.: By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306 (5703), 1940-1943 (2004)
18. Gupta, Gaurav: Obedience-based multi-agent cooperation for sequential social dilemmas (2020)
19. Hasselt, H.V.: Double q-learning. In: NIPS (2010)
20. Holmes, A.J., Patrick, L.M.: The Myth of Optimality in Clinical Neuroscience. Trends in Cognitive Sciences 22 (3), 241-257 (feb 2018)
21. Johnson, A., Proctor, R.W.: Attention: Theory and practice. Sage (2004)
22. Kies, M.: Finding best answers for the iterated prisoner's dilemma using improved q-learning. Available at SSRN 3556714 (2020)
23. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4-22 (1985)
24. Langford, J., Zhang, T.: The epoch-greedy algorithm for multi-armed bandits with side information. In: NIPS (2008)
25. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint (2017)
26. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: WSDM (2011)
27. Lin, B.: Supervisorbot: Nlp-annotated real-time recommendations of psychotherapy treatment strategies with deep reinforcement learning. arXiv preprint (2022)
28. Lin, B., Bouneffouf, D., Cecchi, G.: Split q learning: reinforcement learning with two-stream rewards. In: Proceedings of the 28th IJCAI (2019)
29. Lin, B., Bouneffouf, D., Cecchi, G.: Unified models of human behavioral agents in bandits, contextual bandits, and rl. arXiv preprint arXiv:2005.04544 (2020)
30. Lin, B., Bouneffouf, D., Cecchi, G.: Predicting human decision making in psychological tasks with recurrent neural networks. PloS one 17 (5), e0267907 (2022)
31. Lin, B., Bouneffouf, D., Cecchi, G.: Predicting human decision making with lstm. In: 2022 International Joint Conference on Neural Networks (IJCNN) (2022)
32. Lin, B., Bouneffouf, D., Reinen, J., Rish, I., Cecchi, G.: A story of two streams: Reinforcement learning models from human behavior and neuropsychiatry. In: Proceedings of the 19th AAMAS. pp. 744-752 (2020)
33. Lin, B., Cecchi, G., Bouneffouf, D., Reinen, J., Rish, I.: Models of human behavioral agents in bandits, contextual bandits and rl. In: International Workshop on Human Brain and Artificial Intelligence. pp. 14-33. Springer (2021)
34. Luman, M., Van Meel, C.S., Oosterlaan, J., Sergeant, J.A., Geurts, H.M.: Does reward frequency or magnitude drive reinforcement-learning in attentiondeficit/hyperactivity disorder? Psychiatry research 168 (3) (2009)
35. Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience 14 (2), 154-162 (2011)
36. Nay, J.J., Vorobeychik, Y.: Predicting human cooperation. PloS one (2016)
37. Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K.R., Campbell, M., Singh, M., Rossi, F.: Teaching AI agents ethical values using reinforcement learning and policy orchestration. In: Proceedings of the 28th IJCAI. pp. 6377-6381 (2019)
38. O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable Roles of Ventral and Dorsal Striatum in Instrumental. Science 304 (16 April), 452-454 (2004)
39. Park, H., Kim, K.J.: Active player modeling in the iterated prisoner's dilemma. Computational intelligence and neuroscience 2016 (2016)
40. Perry, D.C., Kramer, J.H.: Reward processing in neurodegenerative disease. Neurocase 21 (1), 120-133 (2015)
41. Press, W.H., Dyson, F.J.: Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. PNAS 109 (26), 10409-10413 (2012)
42. Rapoport, A., Chammah, A.M., Orwant, C.J.: Prisoner's dilemma: A study in conflict and cooperation, vol. 165. University of Michigan press (1965)
43. Redish, A.D., Jensen, S., Johnson, A., Kurth-Nelson, Z.: Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychological review 114 (3), 784 (2007)
44. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems, vol. 37. University of Cambridge, Cambridge, England (1994)
45. Schultz, W., Dayan, P., Montague, P.R.: A Neural Substrate of Prediction and Reward. Science 275 (5306), 1593-1599 (mar 1997)
46. Sutton, R.S., Barto, A.G., et al.: Introduction to reinforcement learning, vol. 135. MIT Press (1998)
47. Taylor, A.M., Becker, S., Schweinhardt, P., Cahill, C.: Mesolimbic dopamine signaling in acute and chronic pain: implications for motivation, analgesia, and addiction. Pain 157 (6), 1194 (2016)
48. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika (1933)
49. Wang, W., Hao, J., Wang, Y., Taylor, M.: Towards cooperation in sequential prisoner's dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162 (2018)