## The PlayStation Reinforcement Learning Environment (PSXLE)
## Carlos Purves, Cătălina Cangea, Petar Veličković
Department of Computer Science and Technology, University of Cambridge
{cp614, catalina.cangea, petar.velickovic}@cst.cam.ac.uk
## Abstract
We propose a new benchmark environment for evaluating Reinforcement Learning (RL) algorithms: the PlayStation Learning Environment (PSXLE), a PlayStation emulator modified to expose a simple control API that enables rich game-state representations. We argue that the PlayStation serves as a suitable progression for agent evaluation and propose a framework for such an evaluation. We build an action-driven abstraction for a PlayStation game with support for the OpenAI Gym interface and demonstrate its use by running OpenAI Baselines.
## 1 Introduction
Reinforcement Learning (RL) describes a form of machine learning in which an agent learns how to interact with an environment through the acquisition of rewards that are chosen to encourage good behaviours and penalise harmful ones. The environment is described to the agent at each point in time by a state encoding. A well-trained agent should use this encoding to select its next action as one that maximises its long-term cumulative reward. This model of learning has proved effective in many real-world environments, including in self-driving cars [1], traffic control [2], advertising [3] and robotics [4].
An important advance in RL research came with the development of Deep Q-Networks (DQN), in which agents utilise deep neural networks to interpret complex state spaces. Increased attention towards RL in recent years has led to further advances such as Double DQN [5], Prioritised Experience Replay [6] and Duelling Architectures [7], each bringing improvements over DQN. Policy-based methods such as A3C [8] have brought further improvements and introduced asynchronous methods to RL.
In order to quantify the success of these learning algorithms and to demonstrate improvements in new approaches, a common evaluation methodology is needed. Computer games are typically used to fulfil this role, providing practical advantages over other types of environments: episodes are reproducible, due to the lack of uncontrollable stochasticity; they offer comparatively low-dimensional state encodings; and the notion of a 'score' translates naturally into one of a 'reward'. The use of computer games also serves a more abstract purpose: to court public interest in RL research. Describing the conclusions of research in terms of a player's achievement in a computer game makes the work more approachable and improves its comprehensibility, as people can utilise their own experience of playing games as a baseline for comparison.
In 2015, Mnih et al. [9] used the Atari-2600 console as an environment in which to evaluate DQN. Their agent outperformed human expert players in 22 out of the 49 games that were used in training. In 2016, OpenAI announced OpenAI Gym [10], which allows researchers and developers to interface with games and physical simulations in a standardised way through a Python library. Gym now represents the de-facto standard evaluation method for RL algorithms [11]. It includes, amongst others [12], several Atari-2600 games which utilise the Arcade Learning Environment (ALE) [13] to interface with an emulation of the console.
One of the most important considerations in developing successful RL methods in complex environments is the choice of state encoding. This describes the relationship between the state of an environment and the format of the data available to the agent. For a human playing a game, 'state' can mean many things, including: the position of a character in a world, how close enemies are, the remaining time on a clock or the set of previous actions. While these properties are easy for humans to quantify, RL environments usually do not encode them explicitly for two reasons. Firstly, doing so would simplify the problem too much, permitting reliance on a human understanding of the environment, something which should ideally be approximated through learning. Secondly, it does not allow agents to generalise, since each game's state will be described by different properties. Rather, game-based RL environments typically consider the 'state' to be an element of some common state space. Common examples of such spaces are the set of a console's possible display outputs or its RAM contents.
Until now, RL research has seen little exploration of the use of sound effects in state encodings. This is clearly not due to a lack of methods for processing audio data; there is substantial research precedent in the areas of speech recognition [14], audio in affective computing [15] and unsupervised approaches to music genre detection [16]. A discussion about richer state encodings is particularly pertinent given the success of existing RL approaches within conventional environments. A significant gap exists between the richness and complexity of such environments and those representing the eventual goal of RL: real-world situations with naturally high-dimensional state spaces.
To help narrow this gap, this paper introduces the PlayStation Reinforcement Learning Environment (PSXLE): a toolkit for training agents to play Sony PlayStation 1 games. PSXLE is designed to follow from the standard set by ALE and enable RL research using more complex environments and state encodings. It increases the complexity of the games that can be used within a framework such as OpenAI Gym, due to the significant hardware differences between the consoles. PSXLE utilises this additional complexity by exposing raw audio output from the console alongside RAM and display contents. We implement an OpenAI Gym interface for the PlayStation game Kula World and use OpenAI Baselines to evaluate the performance of two popular RL algorithms with it.
## 2 PlayStation
Figure 1: The PlayStation controller, highlighting its 14 buttons.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: PlayStation Controller Layout
### Overview
The image is a diagram of a PlayStation controller, labeling its various buttons and directional pad. The controller is light gray, and the labels point to specific features.
### Components/Axes
* **Controller Body:** Light gray, showing the general shape of the PlayStation controller.
* **Directional Pad:** Located on the bottom-left of the controller.
* **Select and Start Buttons:** Positioned in the center of the controller.
* **Triangle Button:** Located on the top-right of the controller; labeled "Triangle". The button is green.
* **Circle Button:** Located on the right of the controller; labeled "Circle". The button is red.
* **Cross Button:** Located on the bottom-right of the controller; labeled "Cross". The button is blue.
* **Square Button:** Located on the left of the controller; labeled "Square". The button is pink.
* **L1 and L2 Buttons:** Located on the top-left of the controller.
* **R1 and R2 Buttons:** Located on the top-right of the controller.
* **Sony PlayStation Logo:** Located in the center of the controller, between the Select and Start buttons.
### Detailed Analysis or Content Details
The diagram provides a clear visual representation of the PlayStation controller layout. Each button and the directional pad are labeled with arrows pointing to their respective locations. The labels are clear and easy to read.
### Key Observations
* The diagram focuses on the button layout and labeling.
* The positioning of the buttons is accurately depicted.
* The color of the Triangle, Circle, Cross, and Square buttons are also shown.
### Interpretation
The diagram serves as a simple and effective guide to the PlayStation controller layout. It is useful for identifying the location of each button and understanding its function. The diagram is straightforward and easy to understand, making it suitable for a wide audience.
</details>
The Sony PlayStation 1 (sometimes PSX or simply the PlayStation) is a console first released by Sony Computer Entertainment in 1994. It has 2 megabytes of RAM, 16.7 million displayable colours and a 33.9 MHz CPU, in contrast with the Atari-2600's 128 bytes of RAM, 128 displayable colours and 1.19 MHz CPU. Since its launch, the number of titles available for the PlayStation has grown to almost 8000 worldwide, far more than the roughly 500 available for the Atari-2600. PlayStation games are controlled using a handheld controller, shown in Figure 1. The controller has 14 buttons, with Start typically used to pause a game.
Figure 2: A visualisation of the Inter-Process Communication used in PSXLE. Pipes are coloured green, FIFO queues are coloured blue and Unix fopen calls are coloured grey. A is used to notify the PSXLE Python library that parts of memory have changed and that events like load\_state and save\_state have completed. B , representing standard input stdin , is used to communicate which regions of memory the console should watch. C is used to send simulated button presses to the console. D sends instructions to the console, such as loading and saving state or loading an ISO.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Diagram: Console and Python Library Interaction
### Overview
The image is a diagram illustrating the interaction between a "Console" and a "Python Library," mediated by "Shared Memory" and "Temp Files." The diagram shows data flow between components within the Console and Python Library, as well as communication pathways between them.
### Components/Axes
* **Title:** None explicitly provided, but the diagram depicts "Console and Python Library Interaction."
* **Console:** A rounded rectangle on the left, colored in light orange, containing the following components (from top to bottom):
* ProcThread
* FakeJoy
* GPU
* RAM
* CPU
* SPU
* **Python Library:** A rounded rectangle on the right, colored in light blue, containing the following components (from top to bottom):
* Console Handler
* Notification Handler
* **Shared Memory:** A rounded rectangle at the top, colored in light blue.
* **Temp Files:** A rounded rectangle at the bottom, colored in light blue.
* **Data Flow Arrows:**
* A: Green arrow from RAM in Console to Notification Handler in Python Library.
* B: Green arrow from RAM in Console to Console Handler in Python Library.
* C: Blue arrow from FakeJoy in Console to Console Handler in Python Library.
* D: Blue arrow from ProcThread in Console to Console Handler in Python Library.
* **Grey Lines:** Connect the Console and Python Library to Shared Memory and Temp Files.
### Detailed Analysis or Content Details
* **Console Components:**
* ProcThread: Located at the top of the Console block.
* FakeJoy: Located below ProcThread.
* GPU: Located below FakeJoy.
* RAM: Located below GPU.
* CPU: Located below RAM.
* SPU: Located at the bottom of the Console block.
* **Python Library Components:**
* Console Handler: Located at the top of the Python Library block.
* Notification Handler: Located below Console Handler.
* **Data Flow:**
* The green arrow labeled "A" originates from the RAM component of the Console and points to the Notification Handler component of the Python Library.
* The green arrow labeled "B" originates from the RAM component of the Console and points to the Console Handler component of the Python Library.
* The blue arrow labeled "C" originates from the FakeJoy component of the Console and points to the Console Handler component of the Python Library.
* The blue arrow labeled "D" originates from the ProcThread component of the Console and points to the Console Handler component of the Python Library.
* **Shared Memory and Temp Files:**
* The Console and Python Library are connected to "Shared Memory" via grey lines. The Console connects to Shared Memory from the top-left corner, and the Python Library connects to Shared Memory from the top-right corner.
* The Console and Python Library are connected to "Temp Files" via grey lines. The Console connects to Temp Files from the bottom-left corner, and the Python Library connects to Temp Files from the bottom-right corner.
### Key Observations
* The diagram illustrates data flow between specific components of the Console and the Python Library.
* RAM in the Console sends data to both Console Handler and Notification Handler in the Python Library.
* ProcThread and FakeJoy in the Console only send data to the Console Handler in the Python Library.
* Shared Memory and Temp Files are used for general communication between the Console and Python Library.
### Interpretation
The diagram depicts the architecture and data flow between a Console application and a Python Library. It highlights how different components within the Console (ProcThread, FakeJoy, RAM) interact with specific handlers within the Python Library (Console Handler, Notification Handler). The use of Shared Memory and Temp Files suggests a mechanism for broader communication and data exchange between the two systems. The diagram suggests that the RAM component in the Console is a central point for data dissemination to the Python Library, while ProcThread and FakeJoy have more specific interactions with the Console Handler.
</details>
## 3 Implementation
PSXLE is built using a fork of PCSX-R 1, an open-source emulator created in 2009. We modified the source of PCSX-R by adding simple Inter-Process Communication (IPC) tools, the structure of which is shown in Figure 2, to simulate controller inputs and read console output through an external interface. The PSXLE Python library uses these tools to provide a simple, game-agnostic PlayStation console interface. To specialise the environment to a particular game and implement an interface such as OpenAI Gym, a customised environment stack can be created that abstracts console-level functions across two layers. Figure 3 visualises the structure of such a stack. Specialisation to individual games occurs within the 'game abstraction' component, which uses the Console API to translate game actions (such as 'move forwards') into console functions (such as 'press up').
The Console API supports four primary forms of interaction:
- General:
  - run and kill control the executing process of the emulator;
  - freeze and unfreeze will freeze and unfreeze the emulator's execution, respectively;
  - speed is a property of Console which, when set, will synchronously set the speed of execution of the console, expressed as a percentage relative to the default speed.
- Controller:
  - hold\_button and release\_button simulate the press and release of a given controller button, referred to here as control events;
  - touch\_button holds a button, pauses for a specified amount of time and then releases it;
  - delay\_button adds a (millisecond-order) delay between successive control events.
- RAM:
  - read\_bytes and write\_byte directly read from and write to console memory;
  - add\_memory\_listener and clear\_memory\_listeners control which parts of the console's memory should have asynchronous listeners attached when the console runs;
  - sleep\_memory\_listener and wake\_memory\_listener tell the console which listeners are active.
- Audio/Visual:
  - start\_recording\_audio and stop\_recording\_audio control when the console should start and stop recording audio;
  - get\_screen synchronously returns an np.array of the console's instantaneous visual output.

1 Available at https://github.com/pcsxr/PCSX-Reloaded/
Example usage of the PSXLE interface can be found in Appendix A.
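As an illustration of how a game abstraction might drive this API, the following minimal sketch maps game actions onto button presses. The StubConsole here only records calls; the real PSXLE Console exposes the methods listed above, and the action-to-button mapping shown is illustrative rather than PSXLE's actual implementation.

```python
# Sketch: a game abstraction translating game-level actions into
# console-level button presses. StubConsole stands in for the PSXLE
# Console API and simply records the simulated control events.

class StubConsole:
    """Stand-in for the PSXLE Console; records simulated button presses."""
    def __init__(self):
        self.events = []

    def touch_button(self, button, duration_ms=100):
        # In PSXLE, touch_button holds, pauses, then releases a button;
        # here we only record the call.
        self.events.append((button, duration_ms))

class GameAbstraction:
    """Maps game actions (e.g. 'move_forwards') to controller inputs."""
    ACTION_TO_BUTTON = {
        "move_forwards": "up",
        "turn_left": "left",
        "turn_right": "right",
    }

    def __init__(self, console):
        self.console = console

    def perform(self, action):
        self.console.touch_button(self.ACTION_TO_BUTTON[action])

console = StubConsole()
game = GameAbstraction(console)
game.perform("move_forwards")
print(console.events)  # [('up', 100)]
```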
Figure 3: The proposed architecture for PlayStation RL environments includes three components: the interface , which allows agents to perform actions and receive feedback; the game abstraction , which translates game actions into controller inputs and console state (visual, audio and RAM) into state representations; and the console , which handles communication with the emulator.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Layered Diagram: Game System Architecture
### Overview
The image is a layered diagram illustrating the architecture of a game system. It depicts three main layers: Interface, Game Abstraction, and Console, each represented by a rectangular block. Arrows indicate the flow of information and control between these layers.
### Components/Axes
* **Layers (from top to bottom):**
* Interface (Orange)
* Game Abstraction (Pink)
* Console (Blue)
* **Arrows:**
* "Reset" (Red arrow pointing downwards, positioned above the Interface layer)
* "Step" (White arrow pointing downwards, positioned above the Interface layer)
* Arrows indicating communication between layers (Black arrows pointing left to right)
* **Text Descriptions:**
* Interface: "Implement 'reset' and 'step' methods using the abstraction."
* Game Abstraction: "Interpret RAM, visual and audio output as 'state'. Perform actions using controller inputs."
* Console: "Implement all console-interfacing functionality."
* Communication between Interface and Game Abstraction: "Send actions, receive reward, read state"
* Communication between Game Abstraction and Console: "Interface with hardware, read RAM"
### Detailed Analysis or ### Content Details
* **Interface Layer (Top Layer):**
* Color: Orange
* Function: Implements 'reset' and 'step' methods using the abstraction.
* Receives "Reset" and "Step" inputs (indicated by downward arrows).
* Sends actions, receives rewards, and reads state from the Game Abstraction layer.
* **Game Abstraction Layer (Middle Layer):**
* Color: Pink
* Function: Interprets RAM, visual, and audio output as 'state'. Performs actions using controller inputs.
* Communicates with both the Interface and Console layers.
* **Console Layer (Bottom Layer):**
* Color: Blue
* Function: Implements all console-interfacing functionality.
* Interfaces with hardware and reads RAM.
### Key Observations
* The diagram illustrates a hierarchical structure, with the Interface layer at the top and the Console layer at the bottom.
* The "Reset" and "Step" inputs are directed towards the Interface layer.
* The Game Abstraction layer acts as an intermediary between the Interface and Console layers.
* The arrows indicate a flow of information and control from top to bottom and vice versa.
### Interpretation
The diagram represents a simplified model of a game system's architecture. The Interface layer likely handles user input and high-level game logic. The Game Abstraction layer translates this logic into actions that can be executed by the Console layer. The Console layer interacts directly with the hardware, providing the necessary functionality for the game to run. The "Reset" and "Step" inputs suggest a control mechanism for the game's execution, possibly related to debugging or stepping through the game logic. The architecture promotes modularity and abstraction, allowing for easier development and maintenance of the game system.
</details>
## 4 Game abstraction
OpenAI Gym environments expose three methods: reset , which restarts an episode and returns the initial state; step , which takes an action as an argument and performs it within the environment; and render , which renders the current state to a window, or as text.
The step function takes an integer representing an action and returns a tuple containing: state , which is the value of the state of the system after an action has taken place; reward , which gives the reward gained by performing a certain action; done , which is a Boolean value indicating whether the episode has finished; and info , which gives extra information about the environment. Gym requires that these methods return synchronously. There are two possible approaches to deriving this synchrony with PlayStation games.
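The contract above can be made concrete with a toy environment (not the real Kula-v1 implementation) that returns the same four-element tuple from step:

```python
# Sketch: the Gym reset/step contract on a trivial counter environment.

class ToyEnv:
    """Illustrates the (state, reward, done, info) tuple returned by step."""
    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                    # initial state

    def step(self, action):
        self.t += 1
        state = self.t                   # new state after the action
        reward = float(action)           # toy reward: echo the action
        done = self.t >= self.horizon    # episode ends after `horizon` steps
        info = {}                        # extra diagnostic information
        return state, reward, done, info

env = ToyEnv()
state = env.reset()
state, reward, done, info = env.step(1)
```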
Firstly, the environment could exercise granular control over the execution of the console, choosing how many frames to skip for each move. This approach is common and is used to implement the
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Diagram: State Transition Diagram
### Overview
The image is a diagram illustrating state transitions. It shows a linear sequence of boxes, some shaded blue, representing states. Blue arrows indicate transitions between states, each transition adding +4. The diagram is divided into three sections labeled "State".
### Components/Axes
* **Boxes:** Represent individual states. Some boxes are shaded light blue.
* **Arrows:** Blue curved arrows indicate transitions between states. Each arrow is labeled with "+4".
* **State Labels:** Black curly braces with the label "State" underneath, indicating a group of states. There are three such labels.
### Detailed Analysis
The diagram consists of a horizontal sequence of 15 boxes. The boxes alternate between being shaded light blue and being unshaded. The first box is shaded light blue.
* **Transitions:** Each transition moves forward by 4 boxes. The arrows start above a shaded box and point down to another shaded box 4 positions to the right. Each arrow is labeled "+4".
* **State Groupings:** The 15 boxes are divided into three groups labeled "State". The first group contains 5 boxes, the second group contains 5 boxes, and the third group contains 5 boxes.
### Key Observations
* The diagram shows a consistent pattern of state transitions, each adding +4.
* The "State" labels group the boxes into equal segments.
* The shaded boxes are the targets of the state transitions.
### Interpretation
The diagram represents a system where the state advances by +4 units at each transition. The "State" labels likely indicate distinct phases or stages within the overall system. The alternating shaded and unshaded boxes might represent different properties or conditions within each state. The diagram suggests a cyclical or repetitive process, as the transitions consistently add +4 and the pattern of shaded boxes repeats.
</details>
(a) A visualisation of frame skip and frame stacking. In [9], only every fourth frame is considered and, of those, groups of four consecutive frames are combined (stacked) into a single state representation. New frames are requested by the Python library upon each action, making this approach synchronous.
<details>
<summary>Image 5 Details</summary>

### Visual Description
Icon/Small Image (771x51)
</details>
(b) An asynchronous approach to frame skipping. In environments where actions are long or have variable length, the state transition occurs asynchronously. The transition ends once the immediate effects of the associated action have ceased.
Figure 4: A comparison of the state-delimiting techniques used in [9] and those used in this paper.
Atari-2600 environments in Gym. In cases where a simple snapshot of the environment would leave ambiguity (such as when the motion of an object could be in several directions), consecutive frames may be stacked to produce a corresponding state encoding. Frame skipping works well for simple games, but is not always suitable if moves can take a variable amount of time to finish. If the skip is shorter than the number of frames a move takes to finish, the agent may choose its next action before the previous move has finished; in many games, this would lead to the chosen action not being completed properly. If the environment were to skip significantly more frames than required, the agent would incur an unnecessary delay in making moves.
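The synchronous skip-and-stack scheme can be sketched as follows. The skip and stack values of four follow [9]; the integer frames are stand-ins for real display output.

```python
# Sketch: frame skipping (keep every fourth frame) followed by frame
# stacking (combine each run of four kept frames into one state).

def skip_and_stack(frames, skip=4, stack=4):
    kept = frames[::skip]                  # frame skipping
    return [tuple(kept[i:i + stack])       # frame stacking
            for i in range(0, len(kept) - stack + 1, stack)]

frames = list(range(32))       # 32 raw frames, here just frame indices
states = skip_and_stack(frames)
print(states)  # [(0, 4, 8, 12), (16, 20, 24, 28)]
```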
A second approach would be to allow the console to run asynchronously for the duration of each move, with an associated condition that signifies the move being over. For example, a move that involves collecting a valuable item might be 'finished' once a player's score has changed. To the agent, the move would begin when step was called and end as soon as the score had increased. This approach is easy to implement with PSXLE, using memory listeners that respond to changes in RAM. These approaches are contrasted in Figure 4.
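A minimal sketch of this asynchronous delimiting is given below. The simulated console and its scripted score trace are illustrative; in PSXLE, a memory listener attached to the score's RAM location would play the role of the polling loop.

```python
# Sketch: a move is 'over' once a watched value (here, the score)
# changes. SimulatedConsole replays a scripted score trace in place of
# a real emulator with RAM memory listeners.

class SimulatedConsole:
    def __init__(self, score_trace):
        self.score_trace = iter(score_trace)
        self.score = 0

    def advance_frame(self):
        self.score = next(self.score_trace)

def run_move_until_score_change(console, max_frames=100):
    """Let the console run until the score changes, then return it."""
    start = console.score
    for _ in range(max_frames):
        console.advance_frame()
        if console.score != start:
            break
    return console.score

console = SimulatedConsole([0, 0, 0, 250])   # coin collected on frame 4
print(run_move_until_score_change(console))  # 250
```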
## 5 Kula World
Kula World is a game developed by Game Design Sweden A.B. and released in 1998. It involves controlling a ball within a world that contains a series of objects. The world consists of a platform on which the ball can move.
An object can be a coin, a piece of fruit or a key, each of which is collected when the player moves into it. The ball is controlled using the directional pad and the cross button shown in Figure 1. Pressing the right or left directional button rotates the direction of view 90 degrees clockwise or anti-clockwise about the line perpendicular to the platform. Pressing the up directional button moves the player forwards on the platform, in the direction that the camera is facing. Pressing the cross button makes the ball jump; pressing it simultaneously with the up directional button makes the ball jump forwards, moving the player two squares forwards, over the square immediately in front of it. If the player jumps onto a square that does not exist, the game ends.
## 5.1 Actions
The definition of the action space for Kula World is relatively simple. We omitted the jump action since this served no purpose within the levels that were tested. In total, the action space is given by:
$$\mathcal{A} = \{\text{move forwards},\ \text{turn left},\ \text{turn right},\ \text{jump forwards}\}$$
There is a clock on each level, which counts down in seconds from a level-specific start value. To complete a level, the player must pick up all of the keys and move to a goal square before the clock reaches zero. Collecting objects gains points, which are added to the player's score for the level.
It is not suitable to employ frame skipping in Kula World, since moves can vary in length. A jump, for example, takes roughly a second longer than a camera rotation. The duration of a move can also depend on the specific state of a level; for example, moving forwards to collect a coin takes longer than moving forwards without collecting one. This is a problem because a move cannot be carried out while another is taking place. Further, since the game logic runs on the console's CPU, it is not possible in general to predict the duration of a move before it finishes. Instead, the asynchronous approach described earlier is used. We did not use any kind of frame stacking, since a snapshot of the console's display does not contain any ambiguous motion.
## 5.2 Rewards
There are several ways of 'losing the game' in Kula World: falling off the edge of the platform, being 'spiked' by an object in the game or running out of time. We consider these to be identical events in terms of their reward, although the abstraction supports assigning different values to each way of losing. It also supports adding the player's remaining time to the state encoding, to ensure that agents are not misled by time running out in the level. If the remaining time of a level is low, agents will learn that moves are likely to yield negative rewards and modify their behaviour accordingly. A constant negative value is added to the reward incurred by all actions in order to discourage agents from making moves that do not lead to timely outcomes.
A user of this abstraction specifies a function score\_to\_reward , which takes a score change (such as 250 for a coin, 1000 for a key and 0 for a non-scoring action) and returns the instantaneous reward. In addition, they specify fixed rewards for winning and losing the game. While they can choose score\_to\_reward arbitrarily, most implementations will ensure that: the instantaneous reward increases for increasing changes in score, the reward of a non-scoring action is a small negative value and the discounted sum of these rewards is always bounded, to aid stability in learning.
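A minimal score\_to\_reward along these lines might look as follows. The score-to-reward values match those in Table 1; the -0.01 constant time penalty is an illustrative choice rather than the paper's default.

```python
# Sketch: map a score change to an instantaneous reward, adding a small
# constant negative value to every action to discourage untimely moves.
# Table values: coin +250 -> 0.2, key +1000 -> 0.4, fruit +2500 -> 0.6.

def score_to_reward(score_change, time_penalty=-0.01):
    table = {250: 0.2, 1000: 0.4, 2500: 0.6}
    return table.get(score_change, 0.0) + time_penalty

coin_reward = score_to_reward(250)        # scoring move
idle_reward = score_to_reward(0)          # non-scoring move, small negative
```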
## 5.3 State
The abstraction does not prescribe a state encoding; instead, it returns a tuple of relevant data after each move has finished. The contents of the tuple are:
- visual : an RGB array derived using the process shown in Figure 5.
- reward : the value of instantaneous reward resulting from the move that has just been executed.
- playing : a value indicating whether the player is still 'alive'.
- clock : the number of seconds that remain in which the player must complete the level.
- sound : one of either: None , if the practitioner has not instructed the abstraction to record the sound of moves; an array of Mel-frequency Cepstral Coefficients (MFCCs), if the abstraction was instantiated with use\_mfcc set to True ; or an array describing the raw audio output of the console over the duration of the move, otherwise.
- duration\_real : the amount of time the move took to complete.
- duration\_game : the amount of the player's remaining time that the move took to complete, relative to the in-game clock.
- score : the score that the player has achieved so far in the current episode.
When a move does not make a sound, sound will be an empty array. When a move makes a sound with a duration shorter than that of the move, silence from the recording will be removed from both its start and end. If a move's sound lasts longer than the duration of the move, the game will continue running until either the audio has ceased or the maximum recording time has been exceeded. The result of these features is that the audio output of moves is succinct, as shown in Appendix B, but that moves may take longer to execute when the abstraction is recording audio.
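The silence-trimming behaviour can be sketched as follows, assuming a simple amplitude threshold; the abstraction's actual silence-detection method is not specified here.

```python
# Sketch: remove leading and trailing silence from a recorded move,
# treating samples below a small amplitude threshold as silent.

def trim_silence(samples, threshold=0.01):
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1                       # drop silence at the start
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1                         # drop silence at the end
    return samples[start:end]

print(trim_silence([0.0, 0.0, 0.5, -0.3, 0.2, 0.0]))  # [0.5, -0.3, 0.2]
```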
Figure 5: Frame processing used for complex state representations. Images have their red, green and blue (RGB) channels separated.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Image Processing Diagram: Image Transformation Pipeline
### Overview
The image depicts a pipeline of image processing steps, starting with an original image and transforming it through cropping, resizing, and finally, separation into its RGB components.
### Components/Axes
(Image: the state-encoding preprocessing pipeline. The 640 x 480 console frame is cropped to a 512 x 256 region of interest, resized to 256 x 256, and split into its red, green and blue channels.)
</details>
## 6 Evaluation
Appendix A gives example usage of both the game abstraction and the OpenAI Gym interface. The Kula-v1 Gym environment supports Kula World levels 1 through 10, uses the console's screen as its state encoding and, by default, uses the reward function described in Table 1. This environment was used with deepq and ppo2 from OpenAI Baselines 2 . The ppo2 baseline is an implementation of Proximal Policy Optimization Algorithms from Schulman et al. [17]. The results are shown in Figures 8a and 8b.
2 Available at https://github.com/openai/baselines
Table 1: Reward function.
| Event | Score change | Reward |
|---------------|----------------|----------|
| Coin collect | +250 | 0.2 |
| Key collect | +1000 | 0.4 |
| Fruit collect | +2500 | 0.6 |
| Win level | - | 1 |
| Lose level | - | -1 |
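For concreteness, the mapping in Table 1 can be sketched as a lookup from score changes to shaped rewards (a minimal sketch, assuming events are detected from score deltas and win/lose signals; the function name is illustrative):

```python
# Score changes for each collectible event and their shaped rewards (Table 1).
SCORE_REWARDS = {250: 0.2, 1000: 0.4, 2500: 0.6}

def reward(score_delta, won=False, lost=False):
    """Map an observed score change (or a win/lose signal) to a reward."""
    if won:
        return 1.0
    if lost:
        return -1.0
    return SCORE_REWARDS.get(score_delta, 0.0)
```

Shaping the rewards into [-1, 1] in this way keeps the reward scale consistent across events of very different in-game score values.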
(a) Starting locations, as used in training.
(b) The evaluation episode start location.
Figure 6: Screenshots from Kula World showing each possible start position.
This approach to training is not particularly interesting, since the agent is presented with the same situation in each episode. As a result, the agent simply learns a fixed sequence of moves that wins the level in the shortest time and with the highest score. To make the problem more challenging, we propose varying the player's start position within the level and using multiple levels. The game does not offer this feature by default, but it can be implemented using the tools available in PSXLE: to add a starting position, simply save a PSXLE state with the player at the desired position. We chose to limit the time available to complete each level to 80 seconds, rather than the usual 100, so that every starting state allows the same amount of time to complete the level. The environment Kula-random-v1 takes this approach. It presents the agent with one of four different starting positions in each of the first three levels of the game. An additional starting position within Level 2 is reserved for validation episodes, in which no learning takes place and the agent picks moves from its learned policy. All of the starting positions are shown in Figure 6.
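A minimal sketch of how such an environment can select its start position on reset, assuming a console object that exposes PSXLE's load_state method (the save-state filenames here are hypothetical):

```python
import random

# Paths to PSXLE save-states, one per starting position. Filenames are
# hypothetical; each state is produced by saving the console with the
# player at the desired position within the level.
START_STATES = [
    "states/level%d_pos%d.state" % (level, pos)
    for level in (1, 2, 3)
    for pos in range(4)
]

def reset(console, rng=random):
    """Begin an episode from a uniformly chosen saved starting position."""
    state = rng.choice(START_STATES)
    console.load_state(state)  # restore the console to the saved position
    return state
```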
We implemented a simple DQN agent based on the network presented in Mnih et al. [9], for use with Kula-random-v1 . The results are summarised in Figure 7a and Figure 7b.
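The core of such an agent is the Q-learning update from Mnih et al. [9]; the Bellman targets that the network is regressed towards can be sketched with NumPy as follows (network architecture and replay buffer omitted, so this is a sketch of the update rule only):

```python
import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a'), with the
    bootstrap term zeroed for terminal transitions."""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# Example: two transitions, the second of which ends the episode.
q_next = np.array([[1.0, 2.0],   # Q-values of the next state, per action
                   [3.0, 0.0]])
targets = dqn_targets(q_next,
                      rewards=np.array([0.2, 1.0]),
                      dones=np.array([0.0, 1.0]),
                      gamma=0.9)
```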
Surprisingly, the agent was most successful in learning how to play Level 3, despite it being the most complex for human players. This may be because Level 3 introduces new physics that are not intuitive to humans but are easier for the agent to navigate. Level 2 requires players to jump between islands, meaning a jump can either be very profitable for a player (gaining access to more coins) or very costly (the player falls off the platform). This discrepancy means that agents must disambiguate a jump that is required to reach the goal square from one that will end the game. By comparison, Level 3 does not require any jumps to complete.
The agent performs similarly from the reserved start position as it does from the other available locations in Level 2. It is promising that the agent did not perform notably worse when given an unseen start state, though it is worth noting that its performance on Level 2 was fairly lacklustre in general.
In order to leverage the audio recording features of PSXLE, we developed another environment: Kula-audio-v1 . This environment includes a representation of the audio output of each move within its state encoding. The state encoding is the richest yet, containing the visual output, the audio output, the time remaining on the level and the score. Since Mel-Frequency Cepstral Coefficients (MFCCs) have been successful in machine learning contexts for classifying audio data, they are used here to represent the audio output of a move. The MFCC outputs for some moves are shown in Appendix B.
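As an illustration of this representation, a compact MFCC pipeline (framing, power spectrum, mel filter bank, log compression, then a DCT) can be sketched in NumPy; the parameter values below are illustrative rather than those used by Kula-audio-v1:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=22050, n_fft=512, hop=256, n_mels=26, n_coeffs=13):
    """MFCC-style features: framing, power spectrum, mel filter bank,
    log compression, then a DCT-II over the mel energies."""
    # Frame the signal with a Hann window and take each frame's power spectrum.
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.array([signal[i:i + n_fft] for i in starts]) * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filter bank spanning 0 Hz to the Nyquist frequency.
    points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    # Log mel energies, decorrelated by a DCT-II into cepstral coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (num_frames, n_coeffs)
```

The resulting (num_frames, n_coeffs) matrix is what the heatmaps in Appendix B visualise for each move.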
Finally, since the agent starts each episode from a specified state, the states in its memory are dominated by those that occur at the start of the level. For PlayStation games, which are usually quite complex and have long levels, this can result in lengthy training times. Prioritised experience replay [6] attempts to combat this by replaying the state transitions that the agent learned the most from. We propose a modification to this, prioritised state replay, in which an agent can resume play from a state that it has previously visited, allowing the agent to perform a different action. This can be trivially implemented using the save_state and load_state methods in PSXLE and could help to reduce the time required to learn complex games. No agents have yet been trained using this approach.
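A sketch of prioritised state replay under these assumptions (the console object and the choice of priority are illustrative; save_state and load_state are the PSXLE methods named above):

```python
import random

class PrioritisedStateReplay:
    """Sketch of prioritised *state* replay: resume play from a previously
    visited state, sampled in proportion to how surprising it was."""

    def __init__(self, console):
        self.console = console   # exposes PSXLE's save_state / load_state
        self.states = []         # saved-state identifiers
        self.priorities = []     # e.g. |TD error| observed at that state

    def visit(self, state_id, td_error):
        """Save the current console state and record its priority."""
        self.console.save_state(state_id)
        self.states.append(state_id)
        self.priorities.append(abs(td_error) + 1e-3)  # keep all states samplable

    def resume(self, rng=random):
        """Restore a priority-sampled state so a different action can be tried."""
        state = rng.choices(self.states, weights=self.priorities, k=1)[0]
        self.console.load_state(state)
        return state
```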
- (a) Agent proficiency on each level during training. Start locations within each level are chosen uniformly at random from those shown in Figure 6. The line shows a moving average over 10 training episodes.
- (b) The green line shows a moving average over 5 validation episodes for the reserved location; the red line is the same as in Figure 7a.
Figure 7: Evaluation of a simple DQN implementation for Kula-random-v1 .
Figure 8: A plot of the average reward over 100 episodes, as given by OpenAI Baselines, against the number of timesteps that the agent had played. The agents were trained on Level 1 of Kula World using a visual state encoding.
## 7 Conclusion
The results from the deepq and ppo2 baselines show how two different RL algorithms can perform to vastly different standards within the Kula World environment. From this, it is clear that PlayStation games can serve as suitable environments in which to evaluate the effectiveness of RL algorithms.
The approach shown in this paper for Kula World can be customised to work with many different PlayStation games. The design of PSXLE makes it simple to build abstractions that support interfaces like OpenAI Gym. Armed with such abstractions, researchers will be able to apply existing RL implementations to more complex environments with richer state spaces. The challenges that the audio and complex visual rendering in PlayStation games present to RL could help us to close the gap between what we want agents to achieve and the methodology that we use to evaluate them.
## Appendices
## A Library usage
To demonstrate the functionality of the work presented in this paper, example scripts are shown below. These are chosen to show how practitioners can interact with the environment at each level of abstraction.
## A.1 Console
This script creates a console running Kula World, loads Level 1 and moves the player forwards two steps (a sketch; the identifiers below are indicative rather than verbatim PSXLE API):

```
from psxle import Console  # indicative import; PSXLE exposes the console control API

console = Console("kula_world.bin")        # boot the emulator with the game image
console.load_state("states/level1.state")  # a PSXLE save-state at the start of Level 1

# Moves are issued as controller inputs.
console.press("up")
console.press("up")
```
## A.2 Game abstraction
This script creates a Game object for Kula World and plays it randomly (a sketch; the identifiers below are indicative rather than verbatim):

```
import random
from psxle.games import KulaWorld  # indicative import path

game = KulaWorld()
game.start_level(1)
while not game.finished():
    game.play(random.choice(game.available_moves()))
print(game.score())
```
## A.3 Interface
This script uses the OpenAI Gym implementation of Kula World to play randomly. Note that the methods being used are entirely game-agnostic; only the import that registers the environment (indicative here) is specific to PSXLE.

```
import gym
import psxle.gym  # indicative: registers the Kula-v1 environment with Gym

env = gym.make("Kula-v1")
observation = env.reset()
done = False
while not done:
    observation, reward, done, info = env.step(env.action_space.sample())
env.close()
```
## B Audio outputs
Figure 9: Raw audio output for a selection of moves within Kula World, obtained using PSXLE's audio recording feature. The abstraction employs cropping and extended move times to ensure that the full audio waveform for each move is captured.
Figure 10: MFCCs derived from the waveforms in Figure 9.
## References
- [1] S. Wang, D. Jia, and X. Weng, 'Deep Reinforcement Learning for Autonomous Driving,' arXiv e-prints , p. arXiv:1811.11329, Nov 2018.
- [2] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, 'Reinforcement learning-based multi-agent system for network traffic signal control,' IET Intelligent Transport Systems , vol. 4, pp. 128-135, June 2010.
- [3] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, 'Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising,' arXiv e-prints , p. arXiv:1802.09756, Feb 2018.
- [4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, 'End-to-End Training of Deep Visuomotor Policies,' arXiv e-prints , p. arXiv:1504.00702, Apr 2015.
- [5] H. van Hasselt, A. Guez, and D. Silver, 'Deep Reinforcement Learning with Double Q-Learning,' AAAI Conference on Artificial Intelligence , 2016.
- [6] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, 'Prioritized Experience Replay,' arXiv e-prints , p. arXiv:1511.05952, Nov 2015.
- [7] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, 'Dueling Network Architectures for Deep Reinforcement Learning,' Proceedings of The 33rd International Conference on Machine Learning , vol. 48, pp. 1995-2003, 20-22 Jun 2016.
- [8] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, 'Asynchronous Methods for Deep Reinforcement Learning,' Proceedings of The 33rd International Conference on Machine Learning , vol. 48, pp. 1928-1937, 20-22 Jun 2016.
- [9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou,
H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, 'Human-level control through deep reinforcement learning,' Nature , vol. 518, pp. 529-533, Feb 2015.
- [10] G. Brockman and J. Schulman, 'OpenAI Gym Beta.' https://openai.com/blog/openai-gym-beta/ . Accessed: 2019-01-12.
- [11] D. T. Behrens, 'Deep Reinforcement Learning and Autonomous Driving.' https://ai-guru.de/deep-reinforcement-learning-and-autonomous-driving/ . Accessed: 2019-03-04.
- [12] OpenAI, 'OpenAI Gym Environments.' https://gym.openai.com/envs/ . Accessed: 2019-01-12.
- [13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, 'The Arcade Learning Environment: An Evaluation Platform for General Agents,' arXiv e-prints , p. arXiv:1207.4708, Jul 2012.
- [14] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, 'Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,' IEEE Signal Processing Magazine , vol. 29, pp. 82-97, Nov 2012.
- [15] Z. Zeng, J. Tu, B. M. Pianfetti, and T. S. Huang, 'Audio-Visual Affective Expression Recognition Through Multistream Fused HMM,' IEEE Transactions on Multimedia , vol. 10, pp. 570-577, June 2008.
- [16] X. Shao, C. Xu, and M. S. Kankanhalli, 'Unsupervised classification of music genre using hidden Markov model,' in 2004 IEEE International Conference on Multimedia and Expo (ICME) , vol. 3, pp. 2023-2026, June 2004.
- [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, 'Proximal Policy Optimization Algorithms,' arXiv e-prints , p. arXiv:1707.06347, Jul 2017.