2303.14061v4
## Learning Reward Machines in Cooperative Multi-Agent Tasks
Leo Ardon
Imperial College London
## ABSTRACT
This paper presents a novel approach to Multi-Agent Reinforcement Learning (MARL) that combines cooperative task decomposition with the learning of reward machines (RMs) encoding the structure of the sub-tasks. The proposed method helps deal with the non-Markovian nature of the rewards in partially observable environments and improves the interpretability of the learnt policies required to complete the cooperative task. The RMs associated with each sub-task are learnt in a decentralised manner and then used to guide the behaviour of each agent. By doing so, the complexity of a cooperative multi-agent problem is reduced, allowing for more effective learning. The results suggest that our approach is a promising direction for future research in MARL, especially in complex environments with large state spaces and multiple agents.
## KEYWORDS
Reinforcement Learning; Multi-agent; Reward Machine; NeuroSymbolic
## ACM Reference Format:
Leo Ardon, Daniel Furelos-Blanco, and Alessandra Russo. 2023. Learning Reward Machines in Cooperative Multi-Agent Tasks. In Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), London, United Kingdom, May 29 - June 2, 2023, IFAAMAS, 8 pages.
## 1 INTRODUCTION
With impressive advances in the past decade, the Reinforcement Learning (RL) [18] paradigm appears as a promising avenue in the quest to have machines learn autonomously how to achieve a goal. Originally studied in the context of a single agent interacting with its environment, the approach has since been extended to the multi-agent setting (MARL), where multiple autonomous entities learn in the same environment. Multiple agents learning concurrently brings a set of challenges, such as partial observability, non-stationarity, and scalability; however, this setting is more realistic. As in many real-life scenarios, the actions of one individual often affect how others act and should therefore be considered in the learning process.
A sub-field of MARL called Cooperative Multi-Agent Reinforcement Learning (CMARL) focuses on learning policies for multiple agents that must coordinate their actions to achieve a shared objective. With a 'divide and conquer' strategy, complex problems can thus be decomposed into smaller and simpler ones assigned to different agents acting in parallel in the environment. The problem of learning an optimal sequence of actions now also includes the need to strategically divide the task among agents that must learn to coordinate and communicate in order to find the optimal way to accomplish the goal.

Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), A. Ricci, W. Yeoh, N. Agmon, B. An (eds.), May 29 - June 2, 2023, London, United Kingdom. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

Daniel Furelos-Blanco
Imperial College London

Alessandra Russo
Imperial College London
An emergent field of research in the RL community exploits finite-state machines to encode the structure of the reward function and, hence, the structure of the task at hand. This construct, introduced by Toro Icarte et al. [19] and called reward machine (RM), can model non-Markovian rewards and provides a symbolic interpretation of the stages required to complete a task. While the structure of the task is sometimes known in advance, this is rarely the case in practice. Learning the RM has thus been the object of several works in recent years [3, 9, 10, 13, 21, 22] as a way to reduce the human input. In the multi-agent setting, however, the challenges highlighted before make learning the RM encoding the global task particularly difficult. Instead, it is more efficient to learn the RMs of smaller tasks obtained from an appropriate task decomposition. The 'global' RM can later be reconstructed as a parallel composition of the RMs of the sub-tasks.
Inspired by the work of Neary et al. [16] using RMs to decompose and distribute a global task to a team of collaborative agents, we propose a method combining task decomposition with autonomous learning of the RMs. Our approach offers a more scalable and interpretable way to learn the structure of a complex task involving collaboration among multiple agents. We present an algorithm that learns in parallel the RMs associated with the sub-task of each agent and their associated RL policies whilst guaranteeing that their executions achieve the global cooperative task. We experimentally show that with our method a team of agents is able to learn how to solve a collaborative task in two different environments.
## 2 BACKGROUND
This section gives a summary of the relevant background to make the paper self-contained. Given a finite set $X$, $\Delta(X)$ denotes the probability simplex over $X$, $X^*$ denotes the set of (possibly empty) sequences of elements from $X$, and $X^+$ the set of non-empty ones. The symbols $\bot$ and $\top$ denote false and true, respectively.
## 2.1 Reinforcement Learning
We consider $T$-episodic labeled Markov decision processes (MDPs) [22], characterized by a tuple $\langle \mathcal{S}, s_I, \mathcal{A}, p, r, T, \gamma, \mathcal{P}, l \rangle$ consisting of a set of states $\mathcal{S}$, an initial state $s_I \in \mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $p : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, a not necessarily Markovian reward function $r : (\mathcal{S} \times \mathcal{A})^+ \times \mathcal{S} \to \mathbb{R}$, the time horizon $T$ of each episode, a discount factor $\gamma \in [0, 1)$, a finite set of propositions $\mathcal{P}$, and a labeling function $l : \mathcal{S} \times \mathcal{S} \to 2^{\mathcal{P}}$.
A (state-action) history $h_t = \langle s_0, a_0, \ldots, s_t \rangle \in (\mathcal{S} \times \mathcal{A})^* \times \mathcal{S}$ is mapped into a label trace (or trace) $\lambda_t = \langle l(s_0, s_1), \ldots, l(s_{t-1}, s_t) \rangle \in (2^{\mathcal{P}})^+$ by applying the labeling function to each state transition in $h_t$. We assume that the reward function can be written in terms of a label trace, i.e. formally $r(h_{t+1}) = r(\lambda_{t+1}, s_{t+1})$. The goal is to find an optimal policy $\pi : (2^{\mathcal{P}})^+ \times \mathcal{S} \to \mathcal{A}$, mapping trace-state pairs to actions, in order to maximize the expected cumulative discounted reward $v_t = \mathbb{E}_\pi \left[ \sum_{k=t}^{T} \gamma^{k-t}\, r(\lambda_{k+1}, s_{k+1}) \right]$.
At time $t$, the agent keeps a trace $\lambda_t \in (2^{\mathcal{P}})^+$ and observes state $s_t \in \mathcal{S}$. The agent chooses the action to execute with its policy, $a_t = \pi(\lambda_t, s_t) \in \mathcal{A}$, and the environment transitions to state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. As a result, the agent observes the new state $s_{t+1}$ and a new label $L_{t+1} = l(s_t, s_{t+1})$, and gets the reward $r_{t+1} = r(\lambda_{t+1}, s_{t+1})$. The trace is updated with the new label: $\lambda_{t+1} = \lambda_t \oplus L_{t+1}$.
In this work, we focus on the goal-conditioned RL problem [14], where the agent's task is to reach a goal state as rapidly as possible. We thus distinguish between two types of traces: goal traces and incomplete traces. A trace $\lambda_t$ is a goal trace if the task's goal has been achieved by time $t$; otherwise, the trace is incomplete. Formally, the MDP includes a termination function $\tau : (2^{\mathcal{P}})^+ \times \mathcal{S} \to \{\bot, \top\}$ indicating that the trace $\lambda_t$ at time $t$ is an incomplete trace if $\tau(\lambda_t, s_t) = \bot$ or a goal trace if $\tau(\lambda_t, s_t) = \top$. We assume the reward is 1 for goal traces and 0 otherwise. For ease of notation, we refer to a goal trace as $\lambda^\top_t$ and an incomplete trace as $\lambda^\bot_t$. Note that we assume the agent-environment interaction stops when the task's goal is achieved. This is without loss of generality, as it is equivalent to having a final absorbing state with a null reward and no label returned by the labeling function $l$.
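The episode structure above can be sketched as a short loop. This is a minimal illustration only, assuming a hypothetical `env` object that exposes `reset`, `step`, a labeling function `label`, and a termination function `tau` (these names are not from the paper):

```python
from typing import FrozenSet, List, Tuple

Label = FrozenSet[str]


def run_episode(env, policy, horizon: int):
    """Roll out one episode, building the label trace as in Section 2.1.

    `env` is assumed to expose reset() -> state, step(a) -> next state,
    label(s, s') -> frozenset of propositions, and tau(trace, s) -> bool
    (True exactly for goal traces). Returns the trace and the reward,
    which is 1 for goal traces and 0 for incomplete ones.
    """
    s = env.reset()
    trace: List[Label] = []
    for _ in range(horizon):
        a = policy(tuple(trace), s)            # policy maps (trace, state) to action
        s_next = env.step(a)
        trace.append(env.label(s, s_next))     # extend trace with the new label
        s = s_next
        if env.tau(tuple(trace), s):           # goal trace: stop, reward 1
            return tuple(trace), 1.0
    return tuple(trace), 0.0                   # horizon reached: incomplete trace
```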
The single-agent framework above can be extended to the collaborative multi-agent setting as a Markov game. A cooperative Markov game of $N$ agents is a tuple $\mathcal{G} = \langle N, \mathbf{S}, \mathbf{s}_I, \mathbf{A}, p, r, \tau, T, \gamma, \mathcal{P}, l \rangle$, where $\mathbf{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ is the set of joint states, $\mathbf{s}_I \in \mathbf{S}$ is a joint initial state, $\mathbf{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the set of joint actions, $p : \mathbf{S} \times \mathbf{A} \to \Delta(\mathbf{S})$ is a joint transition function, $r : (\mathbf{S} \times \mathbf{A})^+ \times \mathbf{S} \to \mathbb{R}$ is the collective reward function, $\tau : (2^{\mathcal{P}})^+ \times \mathbf{S} \to \{\bot, \top\}$ is the collective termination function, and $l : \mathbf{S} \times \mathbf{S} \to 2^{\mathcal{P}}$ is an event labeling function. $T$, $\gamma$, and $\mathcal{P}$ are defined as for MDPs. Like Neary et al. [16], we assume each agent $A_i$'s dynamics are independently governed by a local transition function $p_i : \mathcal{S}_i \times \mathcal{A}_i \to \Delta(\mathcal{S}_i)$; hence, the joint transition function factorizes as $p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}) = \prod_{i=1}^{N} p_i(s'_i \mid s_i, a_i)$ for all $\mathbf{s}, \mathbf{s}' \in \mathbf{S}$ and $\mathbf{a} \in \mathbf{A}$. The objective is to find a team policy $\pi : (2^{\mathcal{P}})^+ \times \mathbf{S} \to \mathbf{A}$, mapping pairs of traces and joint states to joint actions, that maximizes the expected cumulative collective reward.
Example 2.1. We use the ThreeButtons task [16], illustrated in Figure 1a, as a running example. The task consists of three agents ($A_1$, $A_2$, and $A_3$) that must cooperate for agent $A_1$ to get to the $Goal$ position. To achieve this, agents must open doors that prevent other agents from progressing by pushing buttons $Y_B$, $G_B$, and $R_B$. The button of a given color opens the door of the same color, and two agents are required to push $R_B$ at once, hence involving synchronization. Once a door has been opened, it remains open until the end of the episode. The order in which high-level actions must be performed is shown in the figure: (1) $A_1$ pushes $Y_B$, (2) $A_2$ pushes $G_B$, (3) $A_2$ and $A_3$ push $R_B$, and (4) $A_1$ reaches $Goal$. We can formulate this problem using the RL framework by defining a shared reward function returning 1 when the $Goal$ position is reached. States solely consist of the positions of the agents in the environment. The proposition set in this task is $\mathcal{P} = \{R_B, Y_B, G_B, A_2^{\neg R_B}, A_2^{R_B}, A_3^{\neg R_B}, A_3^{R_B}, Goal\}$, where (i) $R_B$, $Y_B$ and $G_B$ indicate that the red, yellow and green buttons have been pushed, respectively, (ii) $A_i^{R_B}$ (resp. $A_i^{\neg R_B}$), where $i \in \{2, 3\}$, indicates that agent $A_i$ is pushing (resp. has stopped pushing) the red button $R_B$, and (iii) $Goal$ indicates that the goal position has been reached. A minimal goal trace is $\langle \{\}, \{Y_B\}, \{G_B\}, \{A_2^{R_B}\}, \{A_3^{R_B}\}, \{R_B\}, \{Goal\} \rangle$.
## 2.2 Reward Machines
We here introduce reward machines, the formalism we use to express decomposable (multi-agent) tasks.
## 2.2.1 Definitions.
A reward machine (RM) [19, 20] is a finite-state machine that represents the reward function of an RL task. Formally, an RM is a tuple $M = \langle U, \mathcal{P}, u_0, u_A, \delta_u, \delta_r \rangle$ where $U$ is a set of states; $\mathcal{P}$ is a set of propositions; $u_0 \in U$ is the initial state; $u_A \in U$ is the final state; $\delta_u : U \times 2^{\mathcal{P}} \to U$ is a state-transition function such that $\delta_u(u, L)$ is the state that results from observing label $L \in 2^{\mathcal{P}}$ in state $u \in U$; and $\delta_r : U \times U \to \mathbb{R}$ is a reward-transition function such that $\delta_r(u, u')$ is the reward obtained for transitioning from state $u \in U$ to $u' \in U$. We assume that (i) there are no outgoing transitions from $u_A$, and (ii) $\delta_r(u, u') = 1$ if $u' = u_A$ and 0 otherwise (following the reward assumption in Section 2.1). Given a trace $\lambda = \langle L_0, \ldots, L_n \rangle$, a traversal (or run) is a sequence of RM states $\langle v_0, \ldots, v_{n+1} \rangle$ such that $v_0 = u_0$ and $v_{i+1} = \delta_u(v_i, L_i)$ for $i \in [0, n]$.
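The definition above can be sketched as a small data structure. This is an illustrative sketch, not the authors' implementation: unlisted (state, label) pairs self-loop, and the reward function $\delta_r$ is folded into `step` following assumption (ii):

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple

Label = FrozenSet[str]


@dataclass
class RewardMachine:
    """Minimal reward machine <U, P, u0, uA, delta_u, delta_r> (Section 2.2.1)."""
    u0: str
    uA: str
    delta_u: Dict[Tuple[str, Label], str] = field(default_factory=dict)

    def step(self, u: str, label: Label) -> Tuple[str, float]:
        """Apply delta_u; reward is 1 exactly when entering the final state uA."""
        u_next = self.delta_u.get((u, label), u)   # unlisted labels self-loop
        reward = 1.0 if (u_next == self.uA and u != self.uA) else 0.0
        return u_next, reward

    def traversal(self, trace: Iterable[Label]) -> List[str]:
        """Run of the RM on a trace: <v0, ..., v_{n+1}> with v0 = u0."""
        states = [self.u0]
        for label in trace:
            states.append(self.step(states[-1], label)[0])
        return states
```

For instance, the RM for agent $A_1$ in Figure 2a can be encoded by listing its three transitions ($Y_B$, then $R_B$, then $Goal$) in `delta_u`; a traversal of a goal trace then ends in `uA`.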
Ideally, RMs are such that (i) the cross product $\mathcal{S} \times U$ of the MDP's states with the RM's makes the reward and termination functions Markovian, and (ii) traversals for goal traces end in the final state $u_A$ while traversals for incomplete traces do not.
Example 2.2. Figure 1b shows an RM for the ThreeButtons task. For simplicity, edges are labeled using the single proposition that triggers the transition instead of sets. Note that the goal trace introduced in Example 2.1 ends in the final state $u_A$.
## 2.2.2 RL Algorithm.
The Q-learning for RMs (QRM) algorithm [19, 20] exploits the task structure modeled through RMs. QRM learns a Q-function $q_u$ for each state $u \in U$ of the RM. Given an experience tuple $\langle s, a, s' \rangle$, a Q-function $q_u$ is updated as follows:

$$q_u(s, a) \leftarrow q_u(s, a) + \alpha \left[ \delta_r(u, u') + \gamma \max_{a' \in \mathcal{A}} q_{u'}(s', a') - q_u(s, a) \right],$$

where $u' = \delta_u(u, l(s, s'))$. All Q-functions (or a subset of them) are updated at each step using the same experience tuple in a counterfactual manner. In the tabular case, QRM is guaranteed to converge to an optimal policy.
## 2.2.3 Multi-Agent Decomposition.
Neary et al. [16] recently proposed to decompose the RM for a multi-agent task into several RMs (one per agent) executed in parallel. The decomposition of the task into individual, independently learnable sub-tasks addresses the problem of non-stationarity inherent in the multi-agent setting, and the use of RMs gives the high-level structure of the task each agent must solve. In this setting, each agent $A_i$ has its own RM $M_i$, local state space $\mathcal{S}_i$ (e.g., it solely observes its position in the grid), and propositions $\mathcal{P}_i$ such that $\mathcal{P} = \bigcup_i \mathcal{P}_i$. Besides, instead of having a single global labeling function, each agent $A_i$ employs its own labeling function $l_i : \mathcal{S}_i \times \mathcal{S}_i \to 2^{\mathcal{P}_i}$. Each labeling function is assumed to return at most one proposition per
Figure 1: Illustration of the ThreeButtons grid (a) and a reward machine modeling the task's structure (b) [16].
Figure 2: RMs for each of the agents in ThreeButtons [16].
agent per timestep, and together they should output the same label as the global labeling function $l$.
Given a global RM $M$ modeling the structure of the task at hand (e.g., that in Figure 1b) and each agent's proposition set $\mathcal{P}_i$, Neary et al. [16] propose a method for deriving each agent's RM $M_i$ by projecting $M$ onto $\mathcal{P}_i$. We refer the reader to Neary et al. [16] for a description of the projection mechanism and its guarantees, since we focus on learning these individual RMs instead (see Section 3).
Example 2.3. Given the local proposition sets $\mathcal{P}_1 = \{Y_B, R_B, Goal\}$, $\mathcal{P}_2 = \{Y_B, G_B, A_2^{R_B}, A_2^{\neg R_B}, R_B\}$ and $\mathcal{P}_3 = \{G_B, A_3^{R_B}, A_3^{\neg R_B}, R_B\}$, Figure 2 shows the RMs that result from applying Neary et al.'s [16] projection algorithm.
Neary et al. [16] extend QRM and propose a decentralised training approach that trains each agent in isolation (i.e., in the absence of its teammates) exploiting its respective RM; crucially, the learnt policies should work when all agents interact with the world simultaneously. In both the individual and team settings, agents must synchronize whenever a shared proposition is observed (i.e., a proposition in the proposition sets of two or more agents). Specifically, an agent should check with all its teammates whether their labeling functions also returned the same proposition. In the individual setting, given that the rest of the team is not actually acting, synchronization is simulated with a fixed probability of occurrence.
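The synchronization step can be sketched as a filter over each agent's label. This is a loose illustration of the mechanism just described, under assumed names: `shared_props`, `query_teammates`, and the probability `p_sync` are inventions of this sketch, not parameters from Neary et al. [16]:

```python
import random


def filter_label(label, shared_props, solo_training, query_teammates=None,
                 p_sync=0.3, rng=random):
    """Keep only the propositions that 'take effect' this step.

    Private propositions always pass through. A shared proposition only
    takes effect if every relevant teammate observed it in the same step;
    during decentralised solo training the teammates are absent, so their
    confirmation is simulated with a fixed probability `p_sync`.
    """
    kept = set()
    for prop in label:
        if prop not in shared_props:
            kept.add(prop)                    # private proposition: keep
        elif solo_training:
            if rng.random() < p_sync:         # simulated teammate confirmation
                kept.add(prop)
        elif query_teammates(prop):           # team setting: ask teammates
            kept.add(prop)
    return frozenset(kept)
```

When the simulated check fails, the shared proposition is dropped for that step, so the agent's RM does not advance until synchronization succeeds.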
## 3 LEARNING REWARD MACHINES IN COOPERATIVE MULTI-AGENT TASKS
Decomposing a task using RMs is a promising approach to solving collaborative problems where coordination and synchronization among agents are required, as described in Section 2.2.3. In well-understood problems, such as the ThreeButtons task, one can manually engineer the RM encoding the sequence of sub-tasks needed to achieve the goal. However, this becomes challenging for more complex problems, where the structure of the task is unknown. In fact, a human-provided RM introduces a strong bias in the agents' learning mechanism and can even become adversarial if the human's intuition about the task structure is incorrect. To alleviate this issue, we argue for learning each of the RMs automatically from traces collected via exploration instead of handcrafting them. In the following paragraphs, we describe two approaches to accomplish this objective.
## 3.1 Learn a Global Reward Machine
A naive approach to learning each agent's RM consists of two steps. First, a global RM $M$ is learnt from traces where each label is the union of the agents' labels. Different approaches [9, 21, 22] could be applied in this situation. The learnt RM is then decomposed into one RM per agent by projecting it onto the local proposition set $\mathcal{P}_i$ of each agent $A_i$ [16].
Despite its simplicity, this method scales poorly. Learning a minimal RM (i.e., an RM with the fewest states) from traces has been observed to become significantly harder as the number of constituent states grows [9]; indeed, the problem of learning a minimal finite-state machine from a set of examples is NP-complete [12]. Intuitively, the more agents interact with the world, the larger the global RM becomes, hence impacting performance and making the learning task more difficult. Besides, having multiple agents potentially increases both the number of observable propositions and the length of the traces, further increasing the complexity of the problem.
## 3.2 Learn Individual Reward Machines
We propose to decompose the global task into sub-tasks, each of which can be independently solved by one of the agents. Note that even though an agent is assigned a sub-task, it has no information about the sub-task's structure or how to solve it. This is precisely what the proposed approach tackles: learning the RM encoding the structure of each sub-task. This should be simpler than learning a global RM since the number of constituent states is smaller.
Given a set of $N$ agents, we assume that the global task can be decomposed into $N$ sub-task MDPs. Similarly to Neary et al. [16], we assume each agent's MDP has its own state space $\mathcal{S}_i$, action space $\mathcal{A}_i$, transition function $p_i$, and termination function $\tau_i$. In addition, each agent has its own labeling function $l_i$; hence, the agent may only observe a subset $\mathcal{P}_i \subseteq \mathcal{P}$. Note that the termination function is particular to each agent; for instance, agent $A_2$ in ThreeButtons completes its sub-task by pressing the green button, then the red button until it observes the signal $R_B$ indicating that $A_3$ is also pressing it. Ideally, the parallel composition of the learnt RMs captures the collective goal.
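The parallel-composition idea can be sketched as follows: each agent's RM is advanced with that agent's own label trace, and the global task counts as achieved once every RM has reached its final state. This is a simplified illustration that ignores the timing of inter-agent synchronization; the interface names are assumptions of this sketch:

```python
def composed_run(rm_steps, finals, traces, u0="u0"):
    """Run each agent's RM on its own trace; the team succeeds iff all
    RMs end in their final state.

    rm_steps[i] -- function (u, label) -> next RM state for agent i
    finals[i]   -- final state of agent i's RM
    traces[i]   -- agent i's label trace (list of label sets)
    """
    done = []
    for i, trace in enumerate(traces):
        u = u0
        for label in trace:
            u = rm_steps[i](u, label)          # advance agent i's RM
        done.append(u == finals[i])
    return all(done)                           # collective goal achieved?
```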
We propose an algorithm that interleaves the induction of the RMs from a collection of label traces with the learning of the associated Q-functions, akin to that of Furelos-Blanco et al. [9] in the single-agent setting. The decentralised learning of the Q-functions is performed using the method of Neary et al. [16], briefly outlined in Section 2.2.3. The induction of the RMs is done using a state-of-the-art inductive logic programming system called ILASP [15], which learns the transition function of an RM as a set of logic rules from example traces observed by the agent.
Example 3.1. The leftmost edge in Figure 2a corresponds to the rule $\delta(u^1_0, u^1_1, \mathrm{T})$ :- $\mathrm{prop}(Y_B, \mathrm{T})$. The $\delta(\mathrm{X}, \mathrm{Y}, \mathrm{T})$ predicate expresses that the transition from X to Y is satisfied at time T, while the $\mathrm{prop}(\mathrm{P}, \mathrm{T})$ predicate indicates that proposition P is observed at time T. The rule as a whole expresses that the transition from $u^1_0$ to $u^1_1$ is satisfied at time T if $Y_B$ is observed at that time. The traces provided to the RM learner are expressed as sets of prop facts; for instance, the trace $\langle \{Y_B\}, \{R_B\}, \{Goal\} \rangle$ is mapped into $\{\mathrm{prop}(Y_B, 0).\ \mathrm{prop}(R_B, 1).\ \mathrm{prop}(Goal, 2).\}$.
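The trace-to-facts encoding in the example above is mechanical and can be sketched in a few lines. This is an illustrative encoder, not the paper's actual ILASP pipeline; note that ASP constants must be lowercase, so proposition names are assumed to be given in that form:

```python
def trace_to_asp_facts(trace):
    """Encode a label trace as prop/2 facts, as in Example 3.1:
    each proposition p observed at step t becomes the fact prop(p, t)."""
    facts = []
    for t, label in enumerate(trace):
        for p in sorted(label):              # sort for deterministic output
            facts.append("prop({}, {}).".format(p, t))
    return " ".join(facts)
```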
Algorithm 1 shows the pseudocode describing how the reinforcement and RM learning processes are interleaved. The algorithm starts by initializing, for each agent $A_i$: the set of RM states $U_i$, the RM $M_i$ itself, the associated Q-functions $Q_i$, and the sets of example goal ($\Lambda^\top_i$) and incomplete ($\Lambda^\bot_i$) traces (l.1-5). Note that, initially, each agent's RM solely consists of two unconnected states, the initial and final states (i.e., the agent loops in the initial state forever). Next, $numEpisodes$ episodes are run for each agent. Before running any steps, each environment is reset to the initial state $s^i_I$, each RM $M_i$ is reset to its initial state $u^i_0$, each agent's trace $\lambda_i$ is emptied, and we indicate that no agent has yet completed its episode (l.8-9). Then, while there are agents that have not yet completed their episodes (l.10), a training step is performed for each of the agents, which we describe in the following paragraphs. Note that we show a sequential implementation of the approach, but the steps could be performed in parallel to improve performance.
Algorithm 1: Multi-Agent QRM with RM Learning
The TrainAgent routine shows how the reinforcement learning and RM learning processes are interleaved for a given agent $A_i$ operating in its own environment. From l.17 to l.25, the steps for QRM (i.e., the RL steps) are performed. The agent first selects the next action $a$ to execute in its current state $s_t$ given the Q-function $q_{u_t}$ associated with the current RM state $u_t$. The action is then applied, and the agent observes the next state $s_{t+1}$ and whether it has achieved its goal (l.18). The next label $L_{t+1}$ is then determined (l.19) and used to (i) extend the episode trace $\lambda_t$ (l.20), and (ii) obtain the reward $r$ and the next RM state $u_{t+1}$ (l.21). The Q-function $q$ is updated both for the RM state $u_t$ the agent is in at time $t$ (l.22) and, in a counterfactual manner, for all the other states of the RM (l.25).
The learning of the RMs occurs from l.27 to l.42. Remember that there are two sets of example traces for each agent: one for goal traces, $\Lambda^\top$, and one for incomplete traces, $\Lambda^\bot$. Crucially, the learnt RMs must be such that (i) traversals for goal traces end in the final state, and (ii) traversals for incomplete traces do not end in the final state. Given the trace $\lambda_{t+1}$ and the RM state $u_{t+1}$ at time $t+1$, the example sets are updated in three cases:
- (1) If $\lambda_{t+1}$ is a goal trace (l.27), $\lambda_{t+1}$ is added to $\Lambda^\top$. We consider as incomplete all strict prefixes of $\lambda^\top$: $\{\langle L_0, \ldots, L_i \rangle \mid i \in [0, |\lambda^\top| - 2]\}$ (see Example 3.2). Note that if such a prefix were a goal trace, the episode would have ended earlier. This optimization enables capturing incomplete traces faster; that is, it avoids waiting for them to appear as counterexamples (see the next case).
- (2) If $\lambda_{t+1}$ is an incomplete trace and $u_{t+1}$ is the final state of the RM (l.32), then $\lambda_{t+1}$ is a counterexample, since reaching the final state of the RM should be associated with completing the task. In this case, we add $\lambda_{t+1}$ to $\Lambda^\bot$.
- (3) If $\lambda_{t+1}$ is an incomplete trace and the maximum number of steps per episode $T$ has been reached (l.36), we add $\lambda_{t+1}$ to $\Lambda^\bot$.
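The three cases above can be sketched as a single bookkeeping function. This is an illustrative sketch (function and parameter names are not from the paper); it returns whether a new RM should be learnt, which happens in the first two cases:

```python
def update_example_sets(trace, is_goal, at_final_state, at_horizon,
                        goal_set, incomplete_set):
    """Update the goal/incomplete example sets for one finished step.

    trace          -- tuple of frozenset labels (hashable, so it can be stored)
    is_goal        -- case (1): the trace is a goal trace
    at_final_state -- case (2): incomplete trace, but the RM reached its final state
    at_horizon     -- case (3): incomplete trace and the episode timed out
    Returns True iff a new RM should be learnt (cases 1 and 2).
    """
    if is_goal:                                   # case (1)
        goal_set.add(trace)
        for i in range(1, len(trace)):            # strict prefixes are incomplete
            incomplete_set.add(trace[:i])
        return True
    if at_final_state:                            # case (2): counterexample
        incomplete_set.add(trace)
        return True
    if at_horizon:                                # case (3): no relearning needed
        incomplete_set.add(trace)
    return False
```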
In the first two cases, a new RM is learnt once the example sets have been updated. The LearnRM routine (l.44-49) finds the RM with the fewest states that covers the example traces. The RM learner (here, ILASP) is initially called with the current set of RM states $U$ (l.45); then, if the RM learner finds no solution covering the provided set of examples, the number of states is increased by 1 and the learner is called again. This approach guarantees that the RMs learnt at the end of the process are minimal (i.e., consist of the fewest possible states) [9]. Finally, like Furelos-Blanco et al. [9], we reset the Q-functions associated with an RM when it is relearnt: the Q-functions depend on the RM's structure and, hence, are not easy to keep when the RM changes (l.42). In all three enumerated cases, the episode is interrupted.
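The iterative-deepening search over the number of RM states can be sketched as follows. The `learner` callable stands in for an ILASP invocation and is assumed to return an RM or `None` when no covering solution exists; the `max_states` safety bound is an addition of this sketch:

```python
def learn_minimal_rm(learner, num_states, goal_traces, incomplete_traces,
                     max_states=10):
    """Sketch of the LearnRM routine: call the RM learner with the current
    number of states and, while it reports no solution covering the
    examples, retry with one more state. Because the state count only
    grows one at a time, the first solution found is state-minimal."""
    n = num_states
    while n <= max_states:
        rm = learner(n, goal_traces, incomplete_traces)
        if rm is not None:
            return rm, n                # first success => fewest states
        n += 1                          # no covering RM: allow one more state
    raise RuntimeError("no RM with at most {} states".format(max_states))
```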
Example 3.2. Given the sub-task assigned to agent A_3 in the ThreeButtons environment described in Figure 2c, we consider a valid goal trace λ = ⟨{}, {G_B}, {A_3^{R_B}}, {A_3^{¬R_B}}, {A_3^{R_B}}, {R_B}⟩. The set of generated incomplete traces is:
- ⟨{}⟩,
- ⟨{}, {G_B}⟩,
- ⟨{}, {G_B}, {A_3^{R_B}}⟩,
- ⟨{}, {G_B}, {A_3^{R_B}}, {A_3^{¬R_B}}⟩,
- ⟨{}, {G_B}, {A_3^{R_B}}, {A_3^{¬R_B}}, {A_3^{R_B}}⟩.
We have described in this section a method to learn the RM associated with each sub-task. In the next section, we evaluate how this approach performs on two collaborative tasks.
## 4 EXPERIMENTS
We evaluate our approach by looking at two metrics as the training progresses: (1) the collective reward obtained by the team of agents and (2) the number of steps required for the team to achieve the goal. The first metric indicates whether the team learns how to solve the task, receiving a reward of 1 upon completion. The second estimates the quality of the solution, i.e., the lower the number of steps, the more efficient the agents. Although training is performed in isolation for each agent, the results presented here were evaluated with all the agents acting together in a shared environment. Additionally, we inspect the RM learnt by each agent and perform a qualitative assessment of its quality. The results show the mean value and the 95%-confidence interval of the metrics evaluated over 5 different random seeds in two grid-based environments of size 7×7. We ran the experiments on an EC2 c5.2xlarge instance on AWS using Python 3.7.0. We use the state-of-the-art inductive logic programming system ILASP v4.2.0 [15] to learn the RMs from the sets of traces, with a timeout set to 1 h.
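The aggregation over seeds can be reproduced with a few lines; the paper does not state which interval estimator is used, so this sketch assumes a normal approximation:

```python
import math
import statistics

def mean_ci95(values):
    """Mean and half-width of a normal-approximation 95% confidence
    interval over per-seed results (e.g., rewards from 5 seeds)."""
    m = statistics.mean(values)
    if len(values) < 2:
        return m, 0.0
    half = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, half
```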
## 4.1 ThreeButtons Task
We evaluate our approach in the ThreeButtons task described in Example 2.1. The number of steps required to complete the task and the collective reward obtained by the team of agents throughout training are shown in Figure 3. The green curve shows the performance of the team of agents when the local RMs are manually engineered and provided. In pink, we show the performance of the team when the RMs are learnt. Finally, in yellow, we show the performance of the team when each agent learns independently to perform its task without using the RM framework¹. We observe that regardless of whether the local RMs are handcrafted or learnt, the agents learn policies that, when combined, lead to the completion of the global task. Indeed, the collective reward converges to the maximum reward, indicating that all the agents have learnt to execute their sub-tasks and coordinate in order to achieve the common goal. Learning the RMs also helps speed up the learning for each agent, significantly reducing the number of training iterations required to achieve the global task. We note, however, that the policy learnt without the RM framework requires fewer steps than when the RMs are used.
We present in Figure 4 the RM learnt by A_2 using our interleaved learning approach. We compare it to the 'true' RM shown in Figure 2b. The back transition from u_3^2 to u_2^2 associated with the label A_2^{¬R_B} is missing from the learnt RM. This can be explained by the fact that this transition does not contribute towards the goal. When deriving the minimal RM from the set of example traces, this transition is not deemed important and is thus ignored by the RM learner.
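The learnt RM in Figure 4 can be encoded as a transition table and traversed as below; the dictionary encoding and the label spelling `A2_RB` (for A_2^{R_B}) are our own conventions:

```python
# Learnt RM for agent A_2 (Figure 4), one entry per transition.
RM_A2 = {
    ("u0", "Y_B"): "u1",
    ("u1", "G_B"): "u2",
    ("u2", "A2_RB"): "u3",
    ("u3", "R_B"): "uA",
}

def traverse(rm, state, labels):
    """Advance the RM over a sequence of observed labels; labels with
    no outgoing transition leave the state unchanged (self-loop)."""
    for label in labels:
        state = rm.get((state, label), state)
    return state
```

Because the back transition on A_2^{¬R_B} was not learnt, that label simply self-loops in u3 under this encoding.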
¹ This is equivalent to using a 2-state RM that reaches the final state only when the task is completed.
Figure 3: Comparison between handcrafted RMs (RM Provided) and our approach learning the RMs from traces (RM Learnt) in the ThreeButtons environment.
<details>
<summary>Image 3 Details</summary>

Two line charts compare "RM Provided" (teal), "RM Learnt" (pink), and "Without RM" (yellow) over 60,000 training steps: (a) steps to reach the goal (0-300) and (b) average reward (0.0-1.0), each curve with a shaded confidence region. Both RM-based curves converge to roughly 20-25 steps and a reward of 1.0 within about 10,000 training steps, whereas "Without RM" fluctuates and only stabilises near the maximum reward after about 50,000 steps.
</details>
Figure 4: Learnt RM for π΄ 2 .
<details>
<summary>Image 4 Details</summary>

State diagram of the learnt RM for A_2: from the initial state u_0^2, transitions labelled Y_B, G_B, A_2^{R_B}, and R_B lead sequentially through u_1^2, u_2^2, and u_3^2 to the accepting state u_A^2 (double circle).
</details>
Learning the global RM was also attempted in this environment. Unfortunately, learning the minimal RM for a problem of this size is hard, and the RM learner timed out after the 1 h limit.
## 4.2 Rendezvous Task
The second environment in which we evaluate our approach is the 2-agent Rendezvous task [16], presented in Figure 5a. This task is composed of 2 agents acting in a grid with the ability to go up, down, left, right, or do nothing. The task requires the agents to first meet at an RDV location, i.e., all agents need to be in that location simultaneously for at least one timestep. Then, each agent must reach its individual goal location for the global task to be completed.
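The synchronisation condition can be made concrete with a small labelling sketch; the label names R_i/G_i and the coordinate-tuple encoding are assumptions of ours, chosen to mirror the RM in Figure 5b:

```python
def rendezvous_labels(positions, rdv, goals):
    """Return the labels observed at one timestep of the 2-agent
    Rendezvous task: R_i if agent i is on the RDV cell, G_i if it
    is on its own goal cell."""
    labels = set()
    for i, pos in enumerate(positions, start=1):
        if pos == rdv:
            labels.add(f"R{i}")
        if pos == goals[i - 1]:
            labels.add(f"G{i}")
    return labels
```

Observing both R1 and R2 in the same timestep is what counts as the rendezvous in this sketch.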
We present in Figure 6 the number of steps (6a) to complete the task and the collective reward received by the team of agents (6b) throughout training. In green, we show the performance when the RMs are known a priori and provided to the agents. This scenario, where we have perfect knowledge of the structure of each sub-task, is used as an upper bound for the results of our approach. In pink, we show the performance of our approach, where the local RMs are learnt. The yellow curve shows the performance of the team when each agent learns to perform its individual task without the RM construct.
The collective reward obtained while interleaving the learning of the policies and the learning of the local RMs converges to the maximum reward after a few iterations. In this task, the RM framework proves crucial to help the team solve the task in a timely manner. Indeed, without the RMs (either known a priori or learnt), the team of agents does not succeed in solving the task.
We also attempted to learn the global RM directly, but the task could not be solved (i.e., a collective goal trace was never observed). The collective reward remained null throughout training, which we manually stopped after 10^6 timesteps.
## 5 RELATED WORK
Since the recent introduction of reward machines as a way to model non-Markovian rewards [19, 20], several lines of research have based their work on this concept (or similar finite-state machines). Most of the work has focused on how to derive them from logic specifications [1] or demonstrations [2], as well as learning them using different methods [3, 9-11, 13, 21, 22].
In this work, based on the multi-agent with RMs work by Neary et al. [16], the RMs of the different agents are executed in parallel. Recent works have considered other ways of composing RMs, such as merging the state and reward transition functions [6] or arranging them hierarchically by enabling RMs to call each other [10]. The latter is a natural way of extending this work since agents could share subroutines in the form of callable RMs.
The policy learning algorithm we employ for exploiting RMs is an extension of the QRM algorithm (see Sections 2.2.2–2.2.3). However, algorithms based on making decisions at two hierarchical levels [9, 19] or more [10] have been proposed. In the simplest case (i.e., two levels), the agent first decides which transition to satisfy from a given RM state, and then decides which low-level actions to take to satisfy that transition. For example, if agent A_2 gets to state u_3^2 in Figure 2b, the decisions are: (1) whether to satisfy the transition labeled A_2^{¬R_B} or the one labeled R_B, and (2) given the chosen transition, act towards satisfying it. Unlike QRM, this method is not guaranteed to learn optimal policies; however, it enables lower-level sub-tasks to be reused, while QRM does not. Note that the Q-functions learnt by QRM depend on the structure of the whole RM and, hence, we learn them from scratch every time a new RM is induced. While other methods attempt to leverage the previously learnt Q-functions [22], the hierarchical approach
<details>
<summary>Image 5 Details</summary>

(a) A grid with the RDV cell (green) at the centre, agent positions A1 and A2 in opposite corners, goal locations Goal1 and Goal2 in the remaining corners, and dotted paths between them. (b) The global RM for the task: states u_0 to u_6 plus an accepting state u_A, with transitions labelled R, R_1, R_2, G_1, G_2 and their negations; states u_0-u_3 track the agents reaching (and possibly leaving) the RDV point before the R transition to u_4, after which the branches labelled G_1 and G_2, through u_5 and u_6, lead to u_A.
</details>
Figure 5: Example of the Rendezvous task where 2 agents must meet on the RDV point (green) before reaching their goal state πΊ 1 and πΊ 2 for agents π΄ 1 and π΄ 2 respectively.
<details>
<summary>Image 6 Details</summary>

Line chart of steps to reach the goal (0-300) versus training steps (up to 3.0 × 10^5) for "RM Provided" (teal), "RM Learnt" (pink), and "Without RM" (yellow), with shaded confidence regions. Both RM-based curves drop to roughly 15 steps within 0.5 × 10^5 training steps and remain flat, while "Without RM" stays around 300 steps throughout.
</details>
(a) Number of Steps to reach the Goal
Figure 6: Comparison between handcrafted RMs (RM Provided) and our approach learning the RMs from traces (RM Learnt) in the Rendezvous environment.
<details>
<summary>Image 7 Details</summary>

Line chart of average reward (0.0-1.0) versus training steps (up to 3.0 × 10^5) for the same three approaches. "RM Provided" and "RM Learnt" reach a reward of 1.0 by about 0.4 × 10^5 training steps, whereas "Without RM" fluctuates below 0.2 for the whole run.
</details>
outlined here does not need to relearn all functions each time a new RM is induced.
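The two-level decision process described above can be sketched as follows; the policy interfaces are illustrative assumptions of ours, not the API of [9, 19]:

```python
def hierarchical_step(rm, state, choose_transition, option_policies, obs):
    """Two-level decision sketch: (1) pick which outgoing RM
    transition to satisfy, (2) query that sub-task's low-level
    policy for a primitive action."""
    # Outgoing transitions of the current RM state, sorted for determinism.
    options = sorted(label for (s, label) in rm if s == state)
    target = choose_transition(state, options)
    action = option_policies[target](obs)
    return target, action
```

Because the low-level policies are indexed by transition label rather than by RM, they can be reused across machines.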
In the multi-agent community, the use of finite-state machines for task decomposition in collaborative settings has been studied in [4, 8] where a top-down approach is adopted to construct submachines from a global task. Unlike our work, these methods focus on the decomposition of the task but not on how to execute it. The algorithm presented in this paper interleaves both the decomposition and the learning of the policies associated with each state of the RM.
The MARL community studying the collaborative setting has proposed different approaches to decompose the task among all the different agents. For example, QMIX [17] tackles the credit assignment problem in MARL assuming that the team's Q-function can be factorized, and performs a value-iteration method to centrally train policies that can be executed in a decentralised fashion. While this approach has shown impressive empirical results, the resulting policies are harder to interpret than the task structure made explicit by an RM.
The use of RMs in the MARL literature is starting to emerge. The work of Neary et al. [16] was the first to propose the use of RMs for task decomposition among multiple agents. As already mentioned in this paper, that work assumes that the structure of the task is known a priori, which is often untrue. With a different purpose, Dann et al. [5] propose to use RMs to help an agent predict the next actions the other agents in the system will perform. Instead of modeling the other agents' plan (or program), the authors argue that RMs provide a more permissive way to accommodate variations in the behaviour of the other agents by providing a higher-level mechanism to specify the structure of the task. In this work too, the RMs are pre-defined and provided to the agent.
Finally, the work of Eappen and Jagannathan [7] uses a finite-state machine called a 'task monitor', which is built from temporal logic specifications that encode the reward signal of the agents in the system. A task monitor is akin to an RM with some subtle differences, like the use of registers for memory.
## 6 CONCLUSION AND FUTURE WORK
Dividing a problem into smaller parts that are easier to solve is a natural approach performed instinctively by humans. It is also particularly suited for the multi-agent setting where all the available resources can be utilized to solve the problem or when a complementary set of skills is required.
In this work, we extend the approach by Neary et al. [16], who leverage RMs for task decomposition in a multi-agent setting. We propose a novel approach where the RMs are learnt instead of manually engineered. The method presented in this paper interleaves the learning of the RM from the traces collected by the agent and the learning of the set of policies used to achieve the task's goal. Learning the RMs associated with each sub-task not only helps deal with non-Markovian rewards but also provides a more interpretable way to understand the structure of the task. We show experimentally in the ThreeButtons and the Rendezvous environments that our approach converges to the maximum reward a team of agents can obtain by collaborating to achieve a goal. Finally, we validated that the learnt RMs indeed correspond to the RMs needed to achieve each sub-task.
While our approach lifts the assumption about knowing the structure of the task 'a priori', a set of challenges remain and could be the object of further work:
- (1) A labeling function is assumed to exist for each of the agents and to be known by them. In other words, each agent knows the labels relevant to its task and any labels used for synchronization with the other team members.
- (2) If the labeling functions are noisy, agents may fail to synchronize and hence not be able to progress in their sub-tasks. In this scenario, the completion of the cooperative task is compromised. Defining or learning a communication protocol among the agents could be the object of future work.
- (3) Our approach assumes that the global task has been automatically divided into several tasks, each to be completed by one of the agents. Dynamically creating these sub-tasks and assigning them to each of the agents is crucial to having more autonomous agents and would make an interesting topic for further research.
- (4) We leverage task decomposition to learn simpler RMs that when run in parallel represent a more complex global RM. Task decomposition can be driven further through task hierarchies, which could be represented through hierarchies of RMs [10]. Intuitively, each RM in the hierarchy should be simpler to learn and enable reusability and further parallelization among the different agents.
The use of RMs in the multi-agent context is still in its infancy and there are several exciting avenues for future work. We have demonstrated in this paper that it is possible to learn them under certain assumptions that could be relaxed in future work.
## ACKNOWLEDGMENTS
Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-222-0243. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorised to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
## REFERENCES
- [1] Alberto Camacho, Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Anthony Valenzano, and Sheila A. McIlraith. 2019. LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 6065-6073.
- [2] Alberto Camacho, Jacob Varley, Andy Zeng, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. 2021. Reward Machines for Vision-Based Robotic Manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) . 14284-14290.
- [3] Phillip J. K. Christoffersen, Andrew C. Li, Rodrigo Toro Icarte, and Sheila A. McIlraith. 2020. Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior. In Proceedings of the Knowledge Representation and Reasoning Meets Machine Learning (KR2ML) Workshop at the Advances in Neural Information Processing Systems (NeurIPS) Conference .
- [4] Jin Dai and Hai Lin. 2014. Automatic synthesis of cooperative multi-agent systems. In Proceedings of the IEEE Conference on Decision and Control (CDC) . 6173-6178.
- [5] Michael Dann, Yuan Yao, Natasha Alechina, Brian Logan, and John Thangarajah. 2022. Multi-Agent Intention Progression with Reward Machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 215-222.
- [6] Giuseppe De Giacomo, Marco Favorito, Luca Iocchi, Fabio Patrizi, and Alessandro Ronca. 2020. Temporal Logic Monitoring Rewards via Transducers. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR) . 860-870.
- [7] Joe Eappen and Suresh Jagannathan. 2022. DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems. arXiv preprint arXiv:2206.13754 (2022).
- [8] Ayman Elshenawy Elsefy. 2020. A task decomposition using (HDec-POSMDPs) approach for multi-robot exploration and fire searching. International Journal of Robotics and Mechatronics 7, 1 (2020), 22-30.
- [9] Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, and Alessandra Russo. 2021. Induction and Exploitation of Subgoal Automata for Reinforcement Learning. J. Artif. Intell. Res. 70 (2021), 1031-1116.
- [10] Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, and Alessandra Russo. 2022. Hierarchies of Reward Machines. arXiv preprint arXiv:2205.15752 (2022).
- [11] Maor Gaon and Ronen I. Brafman. 2020. Reinforcement Learning with NonMarkovian Rewards. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) . 3980-3987.
- [12] E. Mark Gold. 1978. Complexity of Automaton Identification from Given Data. Inf. Control. 37, 3 (1978), 302-320.
- [13] Mohammadhosein Hasanbeig, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. 2021. DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) . 7647-7656.
- [14] Leslie Pack Kaelbling. 1993. Learning to Achieve Goals. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) . 1094-1099.
- [15] Mark Law, Alessandra Russo, and Krysia Broda. 2015. The ILASP System for Learning Answer Set Programs. https://www.doc.ic.ac.uk/~ml1909/ILASP
- [16] Cyrus Neary, Zhe Xu, Bo Wu, and Ufuk Topcu. 2021. Reward Machines for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) . 934-942.
- [17] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2020. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research 21, 1 (2020), 7234-7284.
- [18] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction . MIT Press.
- [19] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. 2018. Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML) . 2107-2116.
- [20] Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. 2022. Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning. J. Artif. Intell. Res. 73 (2022), 173-208.
- [21] Rodrigo Toro Icarte, Ethan Waldie, Toryn Q. Klassen, Richard Anthony Valenzano, Margarita P. Castro, and Sheila A. McIlraith. 2019. Learning Reward Machines for Partially Observable Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) Conference . 15497-15508.
- [22] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. 2020. Joint Inference of Reward Machines and Policies for Reinforcement Learning. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS) . 590-598.