## Learning Reward Machines in Cooperative Multi-Agent Tasks
Leo Ardon
Imperial College London
## ABSTRACT
This paper presents a novel approach to Multi-Agent Reinforcement Learning (MARL) that combines cooperative task decomposition with the learning of reward machines (RMs) encoding the structure of the sub-tasks. The proposed method helps deal with the non-Markovian nature of the rewards in partially observable environments and improves the interpretability of the learnt policies required to complete the cooperative task. The RMs associated with each sub-task are learnt in a decentralised manner and then used to guide the behaviour of each agent. By doing so, the complexity of a cooperative multi-agent problem is reduced, allowing for more effective learning. The results suggest that our approach is a promising direction for future research in MARL, especially in complex environments with large state spaces and multiple agents.
## KEYWORDS
Reinforcement Learning; Multi-agent; Reward Machine; NeuroSymbolic
## ACM Reference Format:
Leo Ardon, Daniel Furelos-Blanco, and Alessandra Russo. 2023. Learning Reward Machines in Cooperative Multi-Agent Tasks. In Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), London, United Kingdom, May 29 - June 2, 2023, IFAAMAS, 8 pages.
## 1 INTRODUCTION
With impressive advances in the past decade, the Reinforcement Learning (RL) [18] paradigm appears as a promising avenue in the quest to have machines learn autonomously how to achieve a goal. Originally evaluated in the context of a single agent interacting with the rest of the world, researchers have since extended the approach to the multi-agent setting (MARL), where multiple autonomous entities learn in the same environment. Multiple agents learning concurrently brings a set of challenges, such as partial observability, non-stationarity, and scalability; however, this setting is more realistic: as in many real-life scenarios, the actions of one individual often affect the way others act and should therefore be considered in the learning process.
A sub-field of MARL called Cooperative Multi-Agent Reinforcement Learning (CMARL) focuses on learning policies for multiple agents that must coordinate their actions to achieve a shared objective. With a 'divide and conquer' strategy, complex problems can thus be decomposed into smaller and simpler ones assigned to different agents acting in parallel in the environment. The problem of learning an optimal sequence of actions now also includes the need to strategically divide the task among agents, which must learn to coordinate and communicate in order to find the optimal way to accomplish the goal.

Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), A. Ricci, W. Yeoh, N. Agmon, B. An (eds.), May 29 - June 2, 2023, London, United Kingdom. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

Daniel Furelos-Blanco
Imperial College London

Alessandra Russo
Imperial College London
An emergent field of research in the RL community exploits finite-state machines to encode the structure of the reward function; ergo, the structure of the task at hand. This construct, introduced by Toro Icarte et al. [19] and called reward machine (RM), can model non-Markovian rewards and provides a symbolic interpretation of the stages required to complete a task. While the structure of the task is sometimes known in advance, this is rarely the case in practice. Learning the RM has thus been the object of several works in recent years [3, 9, 10, 13, 21, 22] as a way to reduce the human input. In the multi-agent setting, however, the challenges highlighted before make learning the RM encoding the global task particularly difficult. Instead, it is more efficient to learn the RMs of smaller tasks obtained from an appropriate task decomposition. The 'global' RM can later be reconstructed via a parallel composition of the RMs of the sub-tasks.
Inspired by the work of Neary et al. [16] using RMs to decompose and distribute a global task to a team of collaborative agents, we propose a method combining task decomposition with autonomous learning of the RMs. Our approach offers a more scalable and interpretable way to learn the structure of a complex task involving collaboration among multiple agents. We present an algorithm that learns in parallel the RMs associated with the sub-task of each agent and their associated RL policies whilst guaranteeing that their executions achieve the global cooperative task. We experimentally show that with our method a team of agents is able to learn how to solve a collaborative task in two different environments.
## 2 BACKGROUND
This section gives a summary of the relevant background to make the paper self-contained. Given a finite set $X$, $\Delta(X)$ denotes the probability simplex over $X$, $X^*$ denotes the set of (possibly empty) sequences of elements from $X$, and $X^+$ the set of non-empty sequences. The symbols $\bot$ and $\top$ denote false and true, respectively.
## 2.1 Reinforcement Learning
We consider $T$-episodic labeled Markov decision processes (MDPs) [22], characterized by a tuple $\langle \mathcal{S}, s_I, \mathcal{A}, p, r, T, \gamma, \mathcal{P}, l \rangle$ consisting of a set of states $\mathcal{S}$, an initial state $s_I \in \mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $p : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, a (not necessarily Markovian) reward function $r : (\mathcal{S} \times \mathcal{A})^+ \times \mathcal{S} \to \mathbb{R}$, the time horizon $T$ of each episode, a discount factor $\gamma \in [0, 1)$, a finite set of propositions $\mathcal{P}$, and a labeling function $l : \mathcal{S} \times \mathcal{S} \to 2^{\mathcal{P}}$.
A (state-action) history $h_t = \langle s_0, a_0, \ldots, s_t \rangle \in (\mathcal{S} \times \mathcal{A})^* \times \mathcal{S}$ is mapped into a label trace (or trace) $\lambda_t = \langle l(s_0, s_1), \ldots, l(s_{t-1}, s_t) \rangle \in (2^{\mathcal{P}})^+$ by applying the labeling function to each state transition in $h_t$. We assume that the reward function can be written in terms of a label trace, i.e. formally $r(h_{t+1}) = r(\lambda_{t+1}, s_{t+1})$. The goal is to find an optimal policy $\pi : (2^{\mathcal{P}})^+ \times \mathcal{S} \to \mathcal{A}$, mapping trace-state pairs to actions, that maximizes the expected cumulative discounted reward $v_t = \mathbb{E}_\pi \left[ \sum_{k=t}^{T} \gamma^{k-t}\, r(\lambda_{k+1}, s_{k+1}) \right]$.
At time $t$, the agent keeps a trace $\lambda_t \in (2^{\mathcal{P}})^+$ and observes state $s_t \in \mathcal{S}$. The agent chooses the action to execute with its policy, $a_t = \pi(\lambda_t, s_t) \in \mathcal{A}$, and the environment transitions to state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. As a result, the agent observes the new state $s_{t+1}$ and a new label $L_{t+1} = l(s_t, s_{t+1})$, and gets the reward $r_{t+1} = r(\lambda_{t+1}, s_{t+1})$. The trace is updated with the new label: $\lambda_{t+1} = \lambda_t \oplus L_{t+1}$.
In this work, we focus on the goal-conditioned RL problem [14], where the agent's task is to reach a goal state as rapidly as possible. We thus distinguish between two types of traces: goal traces and incomplete traces. A trace $\lambda_t$ is said to be a goal trace if the task's goal has been achieved by time $t$; otherwise, we say that the trace is incomplete. Formally, the MDP includes a termination function $\tau : (2^{\mathcal{P}})^+ \times \mathcal{S} \to \{\bot, \top\}$ indicating that the trace $\lambda_t$ at time $t$ is an incomplete trace if $\tau(\lambda_t, s_t) = \bot$ or a goal trace if $\tau(\lambda_t, s_t) = \top$. We assume the reward is 1 for goal traces and 0 otherwise. For ease of notation, we refer to a goal trace as $\lambda_t^\top$ and an incomplete trace as $\lambda_t^\bot$. Note that we assume the agent-environment interaction stops when the task's goal is achieved. This is without loss of generality, as it is equivalent to having a final absorbing state with a null reward and no label returned by the labeling function $l$.
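As an illustrative sketch (not part of the paper's algorithms), the episode protocol above can be written as follows. The environment interface (`reset`, `step`, `label`, `is_goal`) is a hypothetical stand-in for the transition, labeling, and termination functions:

```python
# Sketch of one episode in a labeled MDP under the goal-conditioned
# setting: the agent accumulates a label trace alongside the states,
# gets reward 1 on a goal trace, and stops when the goal is achieved.

def run_episode(env, policy, horizon):
    """Roll out a policy for at most `horizon` steps."""
    s = env.reset()                         # initial state
    trace = []                              # label trace (list of label sets)
    total_reward = 0.0
    for t in range(horizon):
        a = policy(tuple(trace), s)         # policy maps (trace, state) -> action
        s_next = env.step(s, a)             # environment transition
        trace.append(env.label(s, s_next))  # new label from the labeling function
        if env.is_goal(trace, s_next):      # termination function returned true
            total_reward += 1.0             # reward 1 for goal traces, 0 otherwise
            return trace, total_reward, True
        s = s_next
    return trace, total_reward, False       # incomplete trace
```

The trace, not just the current state, is passed to the policy, reflecting the non-Markovian nature of the reward.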
The single-agent framework above can be extended to the collaborative multi-agent setting as a Markov game. A cooperative Markov game of $N$ agents is a tuple $\mathcal{G} = \langle N, \mathbf{S}, \mathbf{s}_I, \mathbf{A}, \mathbf{p}, r, \tau, T, \gamma, \mathcal{P}, l \rangle$, where $\mathbf{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ is the set of joint states, $\mathbf{s}_I \in \mathbf{S}$ is the joint initial state, $\mathbf{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the set of joint actions, $\mathbf{p} : \mathbf{S} \times \mathbf{A} \to \Delta(\mathbf{S})$ is the joint transition function, $r : (\mathbf{S} \times \mathbf{A})^+ \times \mathbf{S} \to \mathbb{R}$ is the collective reward function, $\tau : (2^{\mathcal{P}})^+ \times \mathbf{S} \to \{\bot, \top\}$ is the collective termination function, and $l : \mathbf{S} \times \mathbf{S} \to 2^{\mathcal{P}}$ is an event labeling function. $T$, $\gamma$, and $\mathcal{P}$ are defined as for MDPs. Like Neary et al. [16], we assume each agent $A_i$'s dynamics are independently governed by a local transition function $p_i : \mathcal{S}_i \times \mathcal{A}_i \to \Delta(\mathcal{S}_i)$; hence, the joint transition function is $\mathbf{p}(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}) = \prod_{i=1}^{N} p_i(s_i' \mid s_i, a_i)$ for all $\mathbf{s}, \mathbf{s}' \in \mathbf{S}$ and $\mathbf{a} \in \mathbf{A}$. The objective is to find a team policy $\boldsymbol{\pi} : (2^{\mathcal{P}})^+ \times \mathbf{S} \to \mathbf{A}$, mapping pairs of traces and joint states to joint actions, that maximizes the expected cumulative collective reward.
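The independence assumption on the agents' dynamics can be illustrated with a short sketch, where the per-agent local transition functions are hypothetical callables returning probabilities:

```python
# Sketch of the factorized joint transition probability: under the
# independence assumption, the probability of a joint next state is the
# product of each agent's local transition probability.

def joint_transition_prob(local_ps, s, a, s_next):
    """P(s' | s, a) = prod_i p_i(s'_i | s_i, a_i)."""
    prob = 1.0
    for p_i, si, ai, si_next in zip(local_ps, s, a, s_next):
        prob *= p_i(si_next, si, ai)  # local probability p_i(s'_i | s_i, a_i)
    return prob
```

This factorization is what allows each agent to be trained in its own environment, since no agent's dynamics depend on its teammates' states.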
Example 2.1. We use the ThreeButtons task [16], illustrated in Figure 1a, as a running example. The task consists of three agents ($A_1$, $A_2$, and $A_3$) that must cooperate for agent $A_1$ to get to the $Goal$ position. To achieve this, agents must open doors that prevent other agents from progressing by pushing buttons $YB$, $GB$, and $RB$. The button of a given color opens the door of the same color and, since two agents are required to push $RB$ at once, synchronization is involved. Once a door has been opened, it remains open until the end of the episode. The order in which high-level actions must be performed is shown in the figure: (1) $A_1$ pushes $YB$, (2) $A_2$ pushes $GB$, (3) $A_2$ and $A_3$ push $RB$, and (4) $A_1$ reaches $Goal$. We can formulate this problem using the RL framework by defining a shared reward function returning 1 when the $Goal$ position is reached. States solely consist of the positions of the agents in the environment. The proposition set in this task is $\mathcal{P} = \{RB, YB, GB, A_2^{\neg RB}, A_2^{RB}, A_3^{\neg RB}, A_3^{RB}, Goal\}$, where (i) $RB$, $YB$ and $GB$ indicate that the red, yellow and green buttons have been pushed, respectively; (ii) $A_i^{RB}$ (resp. $A_i^{\neg RB}$), where $i \in \{2, 3\}$, indicates that agent $A_i$ is (resp. has stopped) pushing the red button $RB$; and (iii) $Goal$ indicates that the goal position has been reached. A minimal goal trace is $\langle \{\}, \{YB\}, \{GB\}, \{A_2^{RB}\}, \{A_3^{RB}\}, \{RB\}, \{Goal\} \rangle$.
## 2.2 Reward Machines
We here introduce reward machines, the formalism we use to express decomposable (multi-agent) tasks.
## 2.2.1 Definitions.
A reward machine (RM) [19, 20] is a finite-state machine that represents the reward function of an RL task. Formally, an RM is a tuple $M = \langle \mathcal{U}, \mathcal{P}, u_0, u_A, \delta_u, \delta_r \rangle$, where $\mathcal{U}$ is a set of states; $\mathcal{P}$ is a set of propositions; $u_0 \in \mathcal{U}$ is the initial state; $u_A \in \mathcal{U}$ is the final state; $\delta_u : \mathcal{U} \times 2^{\mathcal{P}} \to \mathcal{U}$ is a state-transition function such that $\delta_u(u, L)$ is the state that results from observing label $L \in 2^{\mathcal{P}}$ in state $u \in \mathcal{U}$; and $\delta_r : \mathcal{U} \times \mathcal{U} \to \mathbb{R}$ is a reward-transition function such that $\delta_r(u, u')$ is the reward obtained for transitioning from state $u \in \mathcal{U}$ to $u' \in \mathcal{U}$. We assume that (i) there are no outgoing transitions from $u_A$, and (ii) $\delta_r(u, u') = 1$ if $u' = u_A$ and 0 otherwise (following the reward assumption in Section 2.1). Given a trace $\lambda = \langle L_0, \ldots, L_n \rangle$, a traversal (or run) is a sequence of RM states $\langle v_0, \ldots, v_{n+1} \rangle$ such that $v_0 = u_0$ and $v_{i+1} = \delta_u(v_i, L_i)$ for $i \in [0, n]$.
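A minimal sketch of this definition, using a hypothetical dictionary-based encoding of the state-transition function and the assumed reward scheme (1 on reaching the final state, 0 otherwise):

```python
# Sketch of a reward machine: states U, initial state u0, final state uA
# (no outgoing edges), and a state-transition function encoded as a dict
# mapping (state, label) pairs to next states.

class RewardMachine:
    def __init__(self, states, u0, uA, delta_u):
        self.states = states      # U
        self.u0 = u0              # initial state
        self.uA = uA              # final state
        self.delta_u = delta_u    # dict: (u, frozenset(label)) -> u'

    def step(self, u, label):
        """State transition; self-loop when no edge matches the label."""
        u_next = self.delta_u.get((u, frozenset(label)), u)
        reward = 1.0 if u_next == self.uA and u != self.uA else 0.0
        return u_next, reward

    def traverse(self, trace):
        """Run a label trace from u0; return the sequence of visited states."""
        run = [self.u0]
        for label in trace:
            u_next, _ = self.step(run[-1], label)
            run.append(u_next)
        return run
```

A goal trace is covered by the machine exactly when its traversal ends in the final state.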
Ideally, RMs are such that (i) the cross product $\mathcal{S} \times \mathcal{U}$ of the MDP's states with the RM's makes the reward and termination functions Markovian, and (ii) traversals for goal traces end in the final state $u_A$, while traversals for incomplete traces do not.
Example 2.2. Figure 1b shows an RM for the ThreeButtons task. For simplicity, edges are labeled using the single proposition that triggers the transition instead of sets. Note that the goal trace introduced in Example 2.1 ends in the final state $u_A$.
## 2.2.2 RL Algorithm.
The Q-learning for RMs (QRM) algorithm [19, 20] exploits the task structure modeled through RMs. QRM learns a Q-function $q_u$ for each state $u \in \mathcal{U}$ of the RM. Given an experience tuple $\langle s, a, s' \rangle$, a Q-function $q_u$ is updated as follows:
$$q_u(s, a) \leftarrow q_u(s, a) + \alpha \left( \delta_r(u, u') + \gamma \max_{a' \in \mathcal{A}} q_{u'}(s', a') - q_u(s, a) \right)$$

where $u' = \delta_u(u, l(s, s'))$ and $\alpha$ is the learning rate. All Q-functions (or a subset of them) are updated at each step using the same experience tuple in a counterfactual manner. In the tabular case, QRM is guaranteed to converge to an optimal policy.
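The counterfactual update can be sketched as below, assuming an RM object exposing its state set, final state, and a `step(u, label)` method returning the next state and transition reward. The learning rate and the dictionary-of-dictionaries layout of the Q-functions are illustrative choices, not the authors' implementation:

```python
# Sketch of one QRM step: every RM state's Q-function is updated with the
# same experience (s, a, s') in a counterfactual manner, as if the agent
# had been in that RM state when the transition occurred.

def qrm_update(Q, rm, label, s, a, s_next, actions, alpha=0.1, gamma=0.9):
    """Q: dict mapping RM state u -> dict[(state, action)] -> value."""
    for u in rm.states:
        if u == rm.uA:
            continue                        # no outgoing transitions from uA
        u_next, reward = rm.step(u, label)  # u' and delta_r(u, u')
        if u_next == rm.uA:
            target = reward                 # episode ends at the final state
        else:
            target = reward + gamma * max(
                Q[u_next].get((s_next, ap), 0.0) for ap in actions)
        q_sa = Q[u].get((s, a), 0.0)
        Q[u][(s, a)] = q_sa + alpha * (target - q_sa)
```

Because all Q-functions share every experience tuple, progress through the task structure is learned even for RM states the agent did not actually occupy at that step.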
## 2.2.3 Multi-Agent Decomposition.
Neary et al. [16] recently proposed to decompose the RM for a multi-agent task into several RMs (one per agent) executed in parallel. The decomposition of the task into individual and independently learnable sub-tasks addresses the problem of non-stationarity inherent in the multi-agent setting. The use of RMs gives the high-level structure of the task each agent must solve. In this setting, each agent $A_i$ has its own RM $M_i$, local state space $\mathcal{S}_i$ (e.g., it solely observes its position in the grid), and propositions $\mathcal{P}_i$ such that $\mathcal{P} = \bigcup_i \mathcal{P}_i$. Besides, instead of having a single global labeling function, each agent $A_i$ employs its own labeling function $l_i : \mathcal{S}_i \times \mathcal{S}_i \to 2^{\mathcal{P}_i}$. Each labeling function is assumed to return at most one proposition per
Figure 1: Illustration of the ThreeButtons grid (a) and a reward machine modeling the task's structure (b) [16].
Figure 2: RMs for each of the agents in ThreeButtons [16].
agent per timestep, and together they should output the same label as the global labeling function $l$.
Given a global RM $M$ modeling the structure of the task at hand (e.g., that in Figure 1b) and each agent's proposition set $\mathcal{P}_i$, Neary et al. [16] propose a method for deriving each agent's RM $M_i$ by projecting $M$ onto $\mathcal{P}_i$. We refer the reader to Neary et al. [16] for a description of the projection mechanism and its guarantees, since we focus on learning these individual RMs instead (see Section 3).
Example 2.3. Given the local proposition sets $\mathcal{P}_1 = \{YB, RB, Goal\}$, $\mathcal{P}_2 = \{YB, GB, A_2^{RB}, A_2^{\neg RB}, RB\}$, and $\mathcal{P}_3 = \{GB, A_3^{RB}, A_3^{\neg RB}, RB\}$, Figure 2 shows the RMs that result from applying Neary et al.'s [16] projection algorithm.
Neary et al. [16] extend QRM and propose a decentralised training approach in which each agent is trained in isolation (i.e., in the absence of its teammates), exploiting its respective RM; crucially, the learnt policies should work when all agents interact simultaneously with the world. In both the individual and team settings, agents must synchronize whenever a shared proposition is observed (i.e., a proposition in the proposition sets of two or more agents). Specifically, an agent should check with all its teammates whether their labeling functions also returned the same proposition. In the individual setting, given that the rest of the team is not actually acting, synchronization is simulated with a fixed probability of occurrence.
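A sketch of the simulated synchronization check during individual training, where the fixed probability of occurrence is an assumed hyperparameter and the function name is hypothetical:

```python
# Sketch of simulated synchronization: during decentralised training, an
# agent acting alone cannot actually check its teammates' labels, so a
# shared proposition is kept with a fixed probability, simulating whether
# the (absent) teammates also observed it.

import random

def simulated_sync(label, shared_props, sync_prob, rng=random):
    """Keep each shared proposition in the label only if the simulated
    teammates also observed it; non-shared propositions always pass."""
    return {p for p in label
            if p not in shared_props or rng.random() < sync_prob}
```

In the team setting, the same hook would instead query the teammates' labeling functions directly.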
## 3 LEARNING REWARD MACHINES IN COOPERATIVE MULTI-AGENT TASKS
Decomposing a task using RMs is a promising approach to solving collaborative problems where coordination and synchronization among agents are required, as described in Section 2.2.3. In well-understood problems, such as the ThreeButtons task, one can manually engineer the RM encoding the sequence of sub-tasks needed to achieve the goal. However, this becomes a challenge for more complex problems, where the structure of the task is unknown. In fact, a human-provided RM introduces a strong bias in the learning mechanism of the agents and can even become detrimental if the human intuition about the structure of the task is incorrect. To alleviate this issue, we argue for learning each of the RMs automatically from traces collected via exploration instead of handcrafting them. In the following paragraphs, we describe two approaches to accomplish this objective.
## 3.1 Learn a Global Reward Machine
A naive approach to learning each agent's RM consists of two steps. First, a global RM $M$ is learnt from traces where each label is the union of the agents' labels. Different approaches [9, 21, 22] could be applied in this situation. The learnt RM is then decomposed into one RM per agent by projecting it onto the local proposition set $\mathcal{P}_i$ of each agent $A_i$ [16].
Despite its simplicity, this method is prone to scale poorly. It has been observed that learning a minimal RM (i.e., an RM with the fewest states) from traces becomes significantly more difficult as the number of constituent states grows [9]. Indeed, the problem of learning a minimal finite-state machine from a set of examples is NP-complete [12]. Intuitively, the more agents interacting with the world, the larger the global RM will become, hence impacting performance and making the learning task more difficult. Besides, having multiple agents potentially increases the number of observable propositions as well as the length of the traces, hence increasing the complexity of the problem further.
## 3.2 Learn Individual Reward Machines
We propose to decompose the global task into sub-tasks, each of which can be independently solved by one of the agents. Note that even though an agent is assigned a sub-task, it does not have any information about its structure nor how to solve it. This is in fact what the proposed approach intends to tackle: learning the RM encoding the structure of the sub-task. This should be simpler than learning a global RM since the number of constituent states becomes smaller.
Given a set of $N$ agents, we assume that the global task can be decomposed into $N$ sub-task MDPs. Similarly to Neary et al. [16], we assume each agent's MDP has its own state space $\mathcal{S}_i$, action space $\mathcal{A}_i$, transition function $p_i$, and termination function $\tau_i$. In addition, each agent has its own labeling function $l_i$; hence, the agent may only observe a subset $\mathcal{P}_i \subseteq \mathcal{P}$. Note that the termination function is particular to each agent; for instance, agent $A_2$ in ThreeButtons will complete its sub-task by pressing the green button, then the red button until it observes the signal $RB$ indicating that $A_3$ is also pressing it. Ideally, the parallel composition of the learnt RMs captures the collective goal.
We propose an algorithm that interleaves the induction of the RMs from a collection of label traces and the learning of the associated Q-functions, akin to that by Furelos-Blanco et al. [9] in the single-agent setting. The decentralised learning of the Q-functions is performed using the method by Neary et al. [16], briefly outlined in Section 2.2.3. The induction of the RMs is done using a state-of-the-art inductive logic programming system called ILASP [15], which learns the transition function of an RM as a set of logic rules from example traces observed by the agent.
Example 3.1. The leftmost edge in Figure 2a corresponds to the rule $\delta(u_0^1, u_1^1, T)$ :- $prop(YB, T)$. The $\delta(X, Y, T)$ predicate expresses that the transition from $X$ to $Y$ is satisfied at time $T$, while the $prop(P, T)$ predicate indicates that proposition $P$ is observed at time $T$. The rule as a whole expresses that the transition from $u_0^1$ to $u_1^1$ is satisfied at time $T$ if $YB$ is observed at that time. The traces provided to the RM learner are expressed as sets of $prop$ facts; for instance, the trace $\langle \{YB\}, \{RB\}, \{Goal\} \rangle$ is mapped into $\{prop(YB, 0).\ prop(RB, 1).\ prop(Goal, 2).\}$.
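The mapping from label traces to timestamped `prop` facts can be sketched as follows (an illustrative helper, not ILASP's actual input pipeline):

```python
# Sketch of encoding a label trace as timestamped prop facts for the RM
# learner: each proposition observed at step t becomes a fact prop(P, t).

def trace_to_facts(trace):
    """Render every observed proposition as a timestamped prop fact."""
    return " ".join(f"prop({p},{t})."
                    for t, label in enumerate(trace)
                    for p in sorted(label))
```

Empty labels contribute no facts, so only the observed events constrain the learned transition function.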
Algorithm 1 shows the pseudocode describing how the reinforcement and RM learning processes are interleaved. The algorithm starts by initializing, for each agent $A_i$: the set of states $\mathcal{U}_i$ of the RM, the RM $M_i$ itself, the associated Q-functions $Q_i$, and the sets of example goal ($\Lambda_\top^i$) and incomplete ($\Lambda_\bot^i$) traces (l.1-5). Note that, initially, each agent's RM solely consists of two unconnected states: the initial and final states (i.e., the agent loops in the initial state forever). Next, numEpisodes episodes are run for each agent. Before running any steps, each environment is reset to the initial state $s_I^i$, each RM $M_i$ is reset to its initial state $u_0^i$, each agent's trace $\lambda^i$ is empty, and we indicate that no agents have yet completed their episodes (l.8-9). Then, while there are agents that have not yet completed their episodes (l.10), a training step is performed for each of the agents, as described in the following paragraphs. Note that we show a sequential implementation of the approach, but the steps could be performed in parallel to improve performance.
Algorithm 1: Multi-Agent QRM with RM Learning
The TrainAgent routine shows how the reinforcement learning and RM learning processes are interleaved for a given agent $A_i$ operating in its own environment. From l.17 to l.25, the QRM (i.e., RL) steps are performed. The agent first selects the next action $a$ to execute in its current state $s_t$ given the Q-function $q_{u_t}$ associated with the current RM state $u_t$. The action is then applied, and the agent observes the next state $s_{t+1}$ and whether it has achieved its goal (l.18). The next label $L_{t+1}$ is then determined (l.19) and used to (i) extend the episode trace $\lambda_t$ (l.20), and (ii) obtain the reward $r$ and the next RM state $u_{t+1}$ (l.21). The Q-function $q$ is updated both for the RM state $u_t$ the agent is in at time $t$ (l.22) and, in a counterfactual manner, for all the other states of the RM (l.25).
The learning of the RMs occurs from l.27 to l.42. Remember that there are two sets of example traces for each agent: one for goal traces, $\Lambda_\top$, and one for incomplete traces, $\Lambda_\bot$. Crucially, the learnt RMs must be such that (i) traversals for goal traces end in the final state, and (ii) traversals for incomplete traces do not end in the final state. Given the trace $\lambda_{t+1}$ and the RM state $u_{t+1}$ at time $t+1$, the example sets are updated in three cases:
- (1) If $\lambda_{t+1}$ is a goal trace (l.27), $\lambda_{t+1}$ is added to $\Lambda_\top$. We consider as incomplete all sub-traces of $\lambda_\top$: $\{\langle L_0, \ldots, L_i \rangle : i \in [0, |\lambda_\top| - 1]\}$ (see Example 3.2). Note that if a sub-trace were a goal trace, the episode would have ended before. This optimization enables capturing incomplete traces faster; that is, it avoids waiting for them to appear as counterexamples (see the next case).
- (2) If $\lambda_{t+1}$ is an incomplete trace and $u_{t+1}$ is the final state of the RM (l.32), then $\lambda_{t+1}$ is a counterexample, since reaching the final state of the RM should be associated with completing the task. In this case, we add $\lambda_{t+1}$ to $\Lambda_\bot$.
- (3) If $\lambda_{t+1}$ is an incomplete trace and the maximum number of steps per episode $T$ has been reached (l.36), we add $\lambda_{t+1}$ to $\Lambda_\bot$.
In the first two cases, a new RM is learnt once the example sets have been updated. The LearnRM routine (l.44-49) finds the RM with the fewest states that covers the example traces. The RM learner (here, ILASP) is initially called using the current set of RM states $\mathcal{U}$ (l.45); then, if the RM learner finds no solution covering the provided set of examples, the number of states is increased by 1. This approach guarantees that the RMs learnt at the end of the process are minimal (i.e., consist of the fewest possible states) [9]. Finally, like Furelos-Blanco et al. [9], we reset the Q-functions associated with an RM when it is relearnt; importantly, the Q-functions depend on the RM structure and, hence, are not easy to keep when the RM changes (l.42). In all three enumerated cases, the episode is interrupted.
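A sketch of the LearnRM driver loop and the prefix-based generation of incomplete examples; `learn_rm_with` is a hypothetical wrapper around the ILASP call that returns `None` when no covering machine exists for the given number of states:

```python
# Sketch of the LearnRM loop: try the current number of RM states and
# grow it by one until a machine covering all goal and incomplete traces
# is found. Trying sizes in increasing order yields a minimal RM.

def learn_rm(goal_traces, incomplete_traces, num_states, learn_rm_with):
    while True:
        rm = learn_rm_with(num_states, goal_traces, incomplete_traces)
        if rm is not None:        # the learner found a covering machine
            return rm, num_states
        num_states += 1           # no solution: enlarge the machine

def incomplete_prefixes(goal_trace):
    """Strict non-empty prefixes of a goal trace, added as incomplete
    examples when the goal trace is recorded (cf. case (1) above)."""
    return [goal_trace[:i] for i in range(1, len(goal_trace))]
```

Restarting from the current number of states rather than from two avoids re-exploring sizes already known to be infeasible.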
Example 3.2. Given the sub-task assigned to agent $A_3$ in the ThreeButtons environment, described in Figure 2c, we consider a valid goal trace $\lambda = \langle \{\}, \{GB\}, \{A_3^{RB}\}, \{A_3^{\neg RB}\}, \{A_3^{RB}\}, \{RB\} \rangle$. The set of generated incomplete traces is:

- $\langle \{\} \rangle$,
- $\langle \{\}, \{GB\} \rangle$,
- $\langle \{\}, \{GB\}, \{A_3^{RB}\} \rangle$,
- $\langle \{\}, \{GB\}, \{A_3^{RB}\}, \{A_3^{\neg RB}\} \rangle$,
- $\langle \{\}, \{GB\}, \{A_3^{RB}\}, \{A_3^{\neg RB}\}, \{A_3^{RB}\} \rangle$.
We have described in this section a method to learn the RM associated with each sub-task. In the next section, we evaluate how this approach performs on two collaborative tasks.
## 4 EXPERIMENTS
We evaluate our approach by looking at two metrics as training progresses: (1) the collective reward obtained by the team of agents, and (2) the number of steps required for the team to achieve the goal. The first metric indicates whether the team learns how to solve the task by getting a reward of 1 upon completion. The second estimates the quality of the solution, i.e. the lower the number of steps, the more efficient the agents. Although training is performed in isolation for each agent, the results presented here were evaluated with all the agents acting together in a shared environment. Additionally, we inspect the RM learnt by each agent and perform a qualitative assessment of their quality. The results show the mean value and the 95%-confidence interval of the metrics evaluated over 5 different random seeds in two grid-based environments of size 7x7. We ran the experiments on an EC2 c5.2xlarge instance on AWS using Python 3.7.0. We use the state-of-the-art inductive logic programming system ILASP v4.2.0 [15] to learn the RMs from the sets of traces, with a timeout of 1 h.
## 4.1 ThreeButtons Task
We evaluate our approach in the ThreeButtons task described in Example 2.1. The number of steps required to complete the task and the collective reward obtained by the team of agents throughout training are shown in Figure 3. The green curve shows the performance of the team when the local RMs are manually engineered and provided. In pink, we show the performance of the team when the RMs are learnt. Finally, in yellow, we show the performance of the team when each agent learns independently to perform its task without using the RM framework.¹ We observe that regardless of whether the local RMs are handcrafted or learnt, the agents learn policies that, when combined, lead to the completion of the global task. Indeed, the collective reward converges to the maximum reward, indicating that all the agents have learnt to execute their sub-tasks and coordinate in order to achieve the common goal. Learning the RMs also helps speed up the learning for each of the agents, significantly reducing the number of training iterations required to achieve the global task. We note, however, that the policy learnt without the RM framework requires fewer steps than when the RMs are used.
We present in Figure 4 the RM learnt by A_2 using our interleaved learning approach. We compare it to the 'true' RM shown in Figure 2b. The back transition from u_3^2 to u_2^2 associated with the label A_2^{¬R_B} is missing from the learnt RM. This can be explained by the fact that this transition does not contribute towards the goal: when deriving the minimal RM from the set of example traces, this transition is not deemed important and is thus ignored by the RM learner.
¹ This is equivalent to using a 2-state RM, which reaches the final state only when the task is completed.
Figure 3: Comparison between handcrafted RMs (RM Provided) and our approach learning the RMs from traces (RM Learnt) in the ThreeButtons environment.
Figure 4: Learnt RM for A_2.
Learning the global RM was also attempted in this environment. Unfortunately, learning a minimal RM for a problem of this size is hard, and the RM learner timed out after the 1h limit.
## 4.2 Rendezvous Task
The second environment in which we evaluate our approach is the 2-agent Rendezvous task [16] presented in Figure 5a. This task is composed of 2 agents acting in a grid with the ability to go up, down, left, right, or do nothing. The task requires the agents to first meet in an RDV location, i.e., all agents need to simultaneously be in that location for at least one timestep. Then, each of the agents must reach its individual goal location for the global task to be completed.
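The synchronization condition of the Rendezvous task can be stated compactly. This is an illustrative sketch of the rendezvous predicate only, with hypothetical position tuples; it is not the environment's actual code.

```python
def rendezvous_reached(positions, rdv):
    """True iff all agents simultaneously occupy the RDV cell.
    `positions` is a list of (row, col) tuples, one per agent;
    the agents must satisfy this for at least one timestep before
    pursuing their individual goal locations."""
    return all(pos == rdv for pos in positions)
```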
We present in Figure 6 the number of steps (6a) to complete the task and the collective reward received by the team of agents (6b) throughout training. In green, we show the performance when the RMs are known a priori and provided to the agents. This scenario, where we have perfect knowledge of the structure of each sub-task, is used as an upper bound for the results of our approach. In pink, we show the performance of our approach, where the local RMs are learnt. The yellow curve shows the performance of the team when each agent learns to perform its individual task without the RM construct.
The collective reward obtained while interleaving the learning of the policies and the learning of the local RMs converges to the maximum reward after a few iterations. In this task, the RM framework proves crucial to help the team solve the task in a timely manner: without the RMs (either known a priori or learnt), the team of agents does not succeed at solving the task.
We also attempted to learn the global RM directly, but the task could not be solved (i.e., a collective goal trace was never observed). The collective reward remained null throughout training, which we manually stopped after 10⁶ timesteps.
## 5 RELATED WORK
Since the recent introduction of reward machines as a way to model non-Markovian rewards [19, 20], several lines of research have based their work on this concept (or similar finite-state machines). Most of the work has focused on how to derive them from logic specifications [1] or demonstrations [2], as well as learning them using different methods [3, 9-11, 13, 21, 22].
In this work, based on the multi-agent with RMs work by Neary et al. [16], the RMs of the different agents are executed in parallel. Recent works have considered other ways of composing RMs, such as merging the state and reward transition functions [6] or arranging them hierarchically by enabling RMs to call each other [10]. The latter is a natural way of extending this work since agents could share subroutines in the form of callable RMs.
The policy learning algorithm we employ for exploiting RMs is an extension of the QRM algorithm (see Sections 2.2.2-2.2.3). However, algorithms based on making decisions at two hierarchical levels [9, 19] or more [10] have been proposed. In the simplest case (i.e., two levels), the agent first decides which transition to satisfy from a given RM state, and then decides which low-level actions to take to satisfy that transition. For example, if agent A_2 gets to state u_3^2 in Figure 2b, the decisions are: (1) whether to satisfy the transition labeled A_2^{¬R_B} or the one labeled R_B, and (2) given the chosen transition, act towards satisfying it. Unlike QRM, this method is not guaranteed to learn optimal policies; however, it enables lower-level sub-tasks to be reused, while QRM does not. Note that the Q-functions learnt by QRM depend on the structure of the whole RM and, hence, we learn them from scratch every time a new RM is induced. While other methods attempt to leverage the previously learnt Q-functions [22], the hierarchical approach
Figure 5: Example of the Rendezvous task where 2 agents must meet on the RDV point (green) before reaching their goal states G_1 and G_2 for agents A_1 and A_2, respectively.
Figure 6: Comparison between handcrafted RMs (RM Provided) and our approach learning the RMs from traces (RM Learnt) in the Rendezvous environment.
outlined here does not need to relearn all functions each time a new RM is induced.
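The two-level decision scheme described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `label_q` stands in for a learnt high-level value function over outgoing RM transitions, and `policies` maps each transition label to the low-level policy trained to satisfy it (in practice the high-level choice would be e.g. epsilon-greedy rather than purely greedy).

```python
def two_level_step(rm_state, label_q, policies):
    """Sketch of a two-level hierarchical step: (1) pick the outgoing
    RM transition with the highest high-level Q-value, then (2) query
    the low-level policy trained to satisfy that transition's label."""
    q = label_q[rm_state]                  # Q-values over outgoing labels
    label = max(q, key=q.get)              # high-level decision
    action = policies[label](rm_state)     # low-level action toward the label
    return label, action
```

Because each low-level policy is tied to a single transition label rather than to the whole RM, it can be reused unchanged when the RM is relearnt, which is the advantage over QRM noted above.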
In the multi-agent community, the use of finite-state machines for task decomposition in collaborative settings has been studied in [4, 8], where a top-down approach is adopted to construct sub-machines from a global task. Unlike our work, these methods focus on the decomposition of the task but not on how to execute it. The algorithm presented in this paper interleaves both the decomposition and the learning of the policies associated with each state of the RM.
The MARL community studying the collaborative setting has proposed different approaches to decompose the task among the agents. For example, QMIX [17] tackles the credit assignment problem in MARL by assuming that the team's Q-function can be factorized, and performs a value-iteration method to centrally train policies that can be executed in a decentralised fashion. While this approach has shown impressive empirical results, the resulting policies are harder to interpret than the structure of the task made explicit by an RM.
The use of RMs in the MARL literature is starting to emerge. The work of Neary et al. [16] was the first to propose the use of RMs for task decomposition among multiple agents. As already mentioned in this paper, this work assumes that the structure of the task is known 'a priori', which is often untrue. With a different purpose, Dann et al. [5] propose to use RMs to help an agent predict the next actions the other agents in the system will perform. Instead of modeling the other agents' plan (or program), the authors argue that RMs provide a more permissive way to accommodate variations in the behaviour of the other agents by providing a higher-level mechanism to specify the structure of the task. In this work as well, the RMs are pre-defined and provided to the agent.
Finally, the work of Eappen and Jagannathan [7] uses a finite-state machine called a 'task monitor', built from temporal logic specifications that encode the reward signal of the agents in the system. A task monitor is akin to an RM, with some subtle differences such as the use of registers for memory.
## 6 CONCLUSION AND FUTURE WORK
Dividing a problem into smaller parts that are easier to solve is a natural approach performed instinctively by humans. It is also particularly suited for the multi-agent setting where all the available resources can be utilized to solve the problem or when a complementary set of skills is required.
In this work, we extend the approach by Neary et al. [16], who leverage RMs for task decomposition in a multi-agent setting. We propose a novel approach where the RMs are learnt instead of manually engineered. The method presented in this paper interleaves the learning of the RM from the traces collected by the agent and the learning of the set of policies used to achieve the task's goal. Learning the RMs associated with each sub-task not only helps deal with non-Markovian rewards but also provides a more interpretable way to understand the structure of the task. We show experimentally in the ThreeButtons and the Rendezvous environments that our approach converges to the maximum reward a team of agents can obtain by collaborating to achieve a goal. Finally, we validated that the learnt RM indeed corresponds to the RM needed to achieve each sub-task.
While our approach lifts the assumption about knowing the structure of the task 'a priori', a set of challenges remain and could be the object of further work:
- (1) A labeling function is assumed to exist for each of the agents and to be known by it. In other words, each agent knows the labels relevant to its task and any labels used for synchronization with the other team members.
- (2) If the labeling functions are noisy, agents may fail to synchronize and hence not be able to progress in their sub-tasks. In this scenario, the completion of the cooperative task is compromised. Defining or learning a communication protocol among the agents could be the object of future work.
- (3) Our approach assumes that the global task has been automatically divided into several tasks, each to be completed by one of the agents. Dynamically creating these sub-tasks and assigning them to each of the agents is crucial to having more autonomous agents and would make an interesting topic for further research.
- (4) We leverage task decomposition to learn simpler RMs that when run in parallel represent a more complex global RM. Task decomposition can be driven further through task hierarchies, which could be represented through hierarchies of RMs [10]. Intuitively, each RM in the hierarchy should be simpler to learn and enable reusability and further parallelization among the different agents.
The use of RMs in the multi-agent context is still in its infancy and there are several exciting avenues for future work. We have demonstrated in this paper that it is possible to learn them under certain assumptions that could be relaxed in future work.
## ACKNOWLEDGMENTS
Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-222-0243. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorised to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
## REFERENCES
- [1] Alberto Camacho, Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Anthony Valenzano, and Sheila A. McIlraith. 2019. LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 6065-6073.
- [2] Alberto Camacho, Jacob Varley, Andy Zeng, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. 2021. Reward Machines for Vision-Based Robotic Manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) . 14284-14290.
- [3] Phillip J. K. Christoffersen, Andrew C. Li, Rodrigo Toro Icarte, and Sheila A. McIlraith. 2020. Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior. In Proceedings of the Knowledge Representation and Reasoning Meets Machine Learning (KR2ML) Workshop at the Advances in Neural Information Processing Systems (NeurIPS) Conference .
- [4] Jin Dai and Hai Lin. 2014. Automatic synthesis of cooperative multi-agent systems. In Proceedings of the IEEE Conference on Decision and Control (CDC) . 6173-6178.
- [5] Michael Dann, Yuan Yao, Natasha Alechina, Brian Logan, and John Thangarajah. 2022. Multi-Agent Intention Progression with Reward Machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 215-222.
- [6] Giuseppe De Giacomo, Marco Favorito, Luca Iocchi, Fabio Patrizi, and Alessandro Ronca. 2020. Temporal Logic Monitoring Rewards via Transducers. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR) . 860-870.
- [7] Joe Eappen and Suresh Jagannathan. 2022. DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems. arXiv preprint arXiv:2206.13754 (2022).
- [8] Ayman Elshenawy Elsefy. 2020. A task decomposition using (HDec-POSMDPs) approach for multi-robot exploration and fire searching. International Journal of Robotics and Mechatronics 7, 1 (2020), 22-30.
- [9] Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, and Alessandra Russo. 2021. Induction and Exploitation of Subgoal Automata for Reinforcement Learning. J. Artif. Intell. Res. 70 (2021), 1031-1116.
- [10] Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, and Alessandra Russo. 2022. Hierarchies of Reward Machines. arXiv preprint arXiv:2205.15752 (2022).
- [11] Maor Gaon and Ronen I. Brafman. 2020. Reinforcement Learning with Non-Markovian Rewards. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) . 3980-3987.
- [12] E. Mark Gold. 1978. Complexity of Automaton Identification from Given Data. Inf. Control 37, 3 (1978), 302-320.
- [13] Mohammadhosein Hasanbeig, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. 2021. DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) . 7647-7656.
- [14] Leslie Pack Kaelbling. 1993. Learning to Achieve Goals. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 1094-1099.
- [15] Mark Law, Alessandra Russo, and Krysia Broda. 2015. The ILASP System for Learning Answer Set Programs. https://www.doc.ic.ac.uk/~ml1909/ILASP
- [16] Cyrus Neary, Zhe Xu, Bo Wu, and Ufuk Topcu. 2021. Reward Machines for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) . 934-942.
- [17] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2020. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 21, 1 (2020), 7234-7284.
- [18] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction . MIT Press.
- [19] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. 2018. Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML) . 2107-2116.
- [20] Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. 2022. Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning. J. Artif. Intell. Res. 73 (2022), 173-208.
- [21] Rodrigo Toro Icarte, Ethan Waldie, Toryn Q. Klassen, Richard Anthony Valenzano, Margarita P. Castro, and Sheila A. McIlraith. 2019. Learning Reward Machines for Partially Observable Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) Conference . 15497-15508.
- [22] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. 2020. Joint Inference of Reward Machines and Policies for Reinforcement Learning. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS) . 590-598.