## Playing a 2D Game Indefinitely using NEAT and Reinforcement Learning
Jerin Paul Selvan Dept. of Computer Engineering Pune Institute of Computer Technology Pune, India jerinsprograms@gmail.com
Abstract -For over a decade now, robotics and the use of artificial agents have become commonplace. Testing the performance of new path-finding or search-space optimisation algorithms has also become a challenge, as they require a simulation or an environment in which to test them. Creating artificial environments with artificial agents is one method of testing such algorithms, and games have become a popular choice of environment. The performance of algorithms can be compared by observing artificial agents that behave according to each algorithm in the environment they are placed in. One performance parameter is how quickly the agent learns to differentiate between rewarding actions and hostile ones. This can be tested by placing the agent in an environment with different types of hurdles, where the goal of the agent is to travel the farthest by choosing actions that avoid all the obstacles. The environment chosen here is the game 'Flappy Bird'. The goal of the game is to make the bird fly through a series of pipe pairs of random heights. The bird must pass between the pipes and must not hit the top, the bottom, or the pipes themselves. The actions the bird can take are either to flap its wings or to drop down with gravity. The algorithms enforced on the artificial agents are NeuroEvolution of Augmenting Topologies (NEAT) and Reinforcement Learning. The NEAT algorithm starts with an initial population of 'N' artificial agents and follows a genetic algorithm, with an objective function, crossover, mutation, and augmenting topologies. Reinforcement Learning, on the other hand, uses a single agent and a Deep Q-learning Network to remember the state, the action taken at that state, and the reward received for that action. The performance of the NEAT algorithm improves as the initial population of artificial agents is increased.
Keywords -NeuroEvolution of Augmenting Topologies (NEAT), Artificial agent, Artificial environment, Game, Reinforcement Learning (RL)
Dr. P. S. Game Dept. of Computer Engineering Pune Institute of Computer Technology Pune, India psgame@pict.edu

## I. INTRODUCTION

An intelligent agent is anything that can perceive its surroundings, act independently to accomplish goals, and learn from experience or use knowledge to execute tasks better. The agent's surroundings are considered the environment in artificial intelligence. The agent receives information from the environment through sensors and sends its output back through actuators [11]. Environments can be classified along several dimensions: Fully Observable vs Partially Observable, Deterministic vs Stochastic, Competitive vs Collaborative, Single-agent vs Multi-agent, Static vs Dynamic, Discrete vs Continuous, Episodic vs Sequential, and Known vs Unknown. NeuroEvolution of Augmenting Topologies (NEAT) is an approach to machine learning that functions similarly to biological evolution. In its most basic form, NEAT [1] is a technique for creating networks that are capable of performing a certain activity, such as balancing a pole or operating a robot. Significantly, NEAT networks can learn using a reward function as opposed to back-propagation. Reinforcement learning is a feedback-based machine learning technique in which an agent learns how to behave in a given environment by executing actions and observing their outcomes. The agent receives positive feedback for each desirable action and is penalised, or given negative feedback, for each undesirable action. In contrast to supervised learning, reinforcement learning uses feedback to train the agent autonomously, without labelled data; since there is no labelled data, the agent can only learn from its own experience. In situations such as gaming and robotics, where decisions must be made sequentially with a long-term objective, RL provides a solution. The agent engages with the environment and explores it independently. In reinforcement learning, the agent's main objective is to maximise positive rewards while improving its behaviour.
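The sense-act-reward loop described above can be sketched in a few lines. The `ToyEnv` environment and the policies below are invented stand-ins for illustration, not the paper's game:

```python
class ToyEnv:
    """A stand-in environment: the state is a step counter, and an
    episode ends after 5 steps. Purely illustrative."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        # Reward +1 for action 1 ("act"), -1 otherwise; illustrative only.
        self.t += 1
        reward = 1 if action == 1 else -1
        done = self.t >= 5
        return self.t, reward, done

def run_episode(env, policy):
    """Generic sense-act loop: observe state, act, receive reward."""
    state, total = env.reset(), 0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total

total = run_episode(ToyEnv(), policy=lambda s: 1)
```

The same loop shape underlies both algorithms in this paper: only the policy (an evolved network for NEAT, a learned Q-function for RL) changes.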
## II. LITERATURE SURVEY
Games have frequently been used as environments for testing algorithms. Considerable research [3] has been done to create an AI bot that can challenge a player in a multi-player or two-player game. Neuroevolution and Reinforcement Learning are among the algorithms used to create such AI bots or artificial agents. [1], [7] and [8] have implemented a configuration of an ANN called Neuroevolution; the algorithm does not depend on the actions taken by the agents as a whole. [3], [4], [5], [6] and [7] use a Reinforcement Learning algorithm with Deep Q-Learning to train the agents.
The performance of the Neuroevolution algorithm depends on the objective function, the initial population, the mutation rate, the weights and biases added to the network, the activation function used, and the overall topology of the network. Authors in [2] discuss how the Neuroevolution algorithm compares with the traditional Reinforcement Learning approach based on Deep Q-Learning; Neuroevolution has the upper hand in the time taken by the artificial agent to train itself. There are other parameters to take into consideration when using a neural network, and the topology of the network plays a vital role in performance. Two strategies were proposed by Evgenia Papavasileiou (2021) [2]: using fixed topologies in the neural networks, and using augmented topologies. In the fixed case, the network topology is a single hidden layer of neurons, with each hidden neuron connected to every network input and every network output. Evolution searches the space of connection weights of this fully-connected topology by allowing high-performing networks to reproduce. The weight space is explored through the crossover of network weight vectors and through the mutation of single networks' weights. Thus, the goal of fixed-topology NE is to optimise the connection weights that determine the functionality of a network. The topology, or structure, of neural networks also affects their functionality, and modifying the network structure has been effective as part of supervised training.
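The fixed-topology weight search described here (crossover of weight vectors plus mutation of single weights) can be sketched as follows. The function names, mutation rate, and perturbation scale are illustrative assumptions, not values from [2]:

```python
import random

def crossover(w1, w2):
    """Uniform crossover: each child weight comes from one parent."""
    return [a if random.random() < 0.5 else b for a, b in zip(w1, w2)]

def mutate(weights, rate=0.1, scale=0.5):
    """Perturb each weight with probability `rate` (illustrative values)."""
    return [w + random.gauss(0, scale) if random.random() < rate else w
            for w in weights]

# Two parent weight vectors for the same fixed topology.
parent_a = [0.1, -0.4, 0.7]
parent_b = [0.9, 0.2, -0.3]
child = mutate(crossover(parent_a, parent_b))
```

Because the topology never changes, the whole search lives in this weight space; NEAT extends the same idea by also mutating the structure itself.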
There are two ways of making use of the environment. Authors in [3], [4], [6] and [7] use a DNN to extract features from the frames of the game, and these form the input to the agent. However, [1], [5] and [8] place the agent in the game itself to perceive its surroundings. Several combinations of Reinforcement Learning algorithms are possible, such as Deep Neural Networks (DNN), Long Short-Term Memory (LSTM), Deep Q-Networks (DQN) and the like. However, depending on the type of obstacle and the type of game, their performance varies.
A Reinforcement Learning algorithm with DNN and LSTM has been used in [3]. This algorithm addresses issues such as a vast search space, dependencies between the actions taken by the agent, the state and the environment, inputs and imperfect information. To reduce the complexity of the data generated by the perception of the agent, data skipping techniques are implemented. There is, however, a drawback with this algorithm: it takes a lot of time for the agent to train. Formally, at every discrete step, the agent receives a state belonging to a set S and sends an action from the set of actions A to the environment. The environment makes a transition from state St to St+1, and a gamma value in [0, 1] determines the preference for immediate reward over long-term reward. A self-playing method is used by storing the parameters of the network to create a pool of past agents, from which opponents are sampled. This method enables RL to learn the Nash equilibrium strategy. Data skipping techniques were proposed in this paper; this refers to the process of dropping certain data during training and evaluation. The techniques proposed are 'no-op' and 'maintain move decision'. The network is composed of an LSTM-based architecture, which has four heads with a shared state representation layer. An actor-critic off-policy learning algorithm was proposed.
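The role of the gamma value in trading immediate against long-term reward can be made concrete with the standard discounted-return computation; this is a generic sketch, not code from [3]:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence. gamma near 0 makes
    only the immediate reward count; gamma near 1 weights distant rewards
    almost as heavily as immediate ones."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end of the episode
        g = r + gamma * g
    return g
```

For example, `discounted_return([0, 0, 10], 0.5)` weights the late reward of 10 down to 2.5, whereas with gamma = 1 it would count in full.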
Botong Liu (2020) [4] has used Reinforcement Learning with DQN. The game was split into frames, and each game image was sequentially scaled, grayed, and adjusted for brightness. The Deep Q-Network algorithm was used to convert the game decision problem into a classification and recognition problem over multi-dimensional images and solve it using a CNN. Reinforcement learning works best for continuous decision-making problems. However, Deep Reinforcement Learning has a limitation of not converging, for which Neural Fitted Q-learning (NFQ) and DQN algorithms were used to overcome the issue. Since NFQ can work with numerical information only, the author suggests the use of DQN. Combining Q-learning with a CNN, the DQN can achieve self-learning. ReLU and maximum pooling layers are added to the CNN. Gradient descent (Adam optimizer) was used to train the DQN parameters.
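The scale-and-gray frame pipeline described in [4] can be approximated with a minimal sketch. The averaging grayscale and stride-based downsampling below are simplifying assumptions, not the author's exact preprocessing:

```python
def to_gray(frame):
    """frame: H x W x 3 nested lists of 0-255 RGB values -> H x W grayscale
    (naive channel average, a stand-in for a proper luminance formula)."""
    return [[sum(px) / 3.0 for px in row] for row in frame]

def downsample(img, step):
    """Keep every `step`-th pixel in each direction (nearest-neighbour)."""
    return [row[::step] for row in img[::step]]

# A tiny 4 x 4 "frame" of white pixels, purely for illustration.
frame = [[[255, 255, 255] for _ in range(4)] for _ in range(4)]
obs = downsample(to_gray(frame), step=2)
```

The resulting low-dimensional grayscale observation is what the CNN in a DQN pipeline would consume instead of the raw colour frame.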
Q-value-function-based algorithms are the focus of Aidar Shakerimov (2021) [5]. For DQN algorithms, performance improvements could be achieved by using a cumulative reward for training actions. To speed up training, RNN-ReLU was used instead of LSTM or GRU; LSTM and GRU perform better than RNN-ReLU but take about 7 times longer to train. Label smoothing was used to prevent vanishing gradients in RNN-ReLU. However, DQN is sensitive to seed randomization.
SARSA is a slight variation of the traditional Q-Learning algorithm. Authors in [6] use the SARSA and Q-Learning algorithms with modifications such as an ε-greedy policy, discretization, and backward updates. Some variants of Q-Learning were also implemented, such as a tabular approach, Q-value approximation using linear regression, and an NN. In their implementation, [6] find that the SARSA algorithm outperformed Q-Learning. The reward specification is a positive 5 for passing a pipe, a negative 1000 for hitting a pipe, and a positive 0.5 for surviving a frame. A feed-forward NN was used with a 3-neuron input layer, a 50-neuron hidden layer, a 20-neuron second hidden layer, and a 2-neuron output layer (ReLU activation function). A CNN was used with the input image preprocessed by removing the background, converting to grayscale, and resizing to 80 × 80; two CNN layers were used, one with sixteen 5 × 5 kernels with stride 2, and another with thirty-two 5 × 5 kernels with stride 2.
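The reward specification above and the SARSA/Q-Learning distinction can be sketched as standard tabular updates; the state encoding, alpha, and gamma values here are illustrative assumptions, not settings from [6]:

```python
# Reward scheme reported in [6]: +5 for passing a pipe, -1000 for hitting
# a pipe, +0.5 for surviving a frame.
REWARDS = {"pipe": 5.0, "crash": -1000.0, "frame": 0.5}

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action available in s2."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a2 actually taken in s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

# Toy two-state table with actions "flap"/"drop", purely for illustration.
Q = {0: {"flap": 0.0, "drop": 0.0}, 1: {"flap": 2.0, "drop": 1.0}}
q_learning_update(Q, 0, "flap", REWARDS["frame"], 1)
sarsa_update(Q, 0, "drop", REWARDS["frame"], 1, "drop")
```

The only difference between the two rules is the bootstrap term: the max over next-state actions (Q-Learning) versus the value of the action the policy actually chose (SARSA).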
[7] proposes the use of specific feature selection and represents the state by the bird's velocity and the difference between the bird's position and the next lower pipe. This reduces the feature space and eliminates the need for deeper models. The agent is provided with rational human-level inputs along with generic RL and a standard 3-layer NN with a genetic optimization algorithm. The reward for the agent is a positive 1 for every pipe crossed and a negative 100 if the agent dies. The Neuroevolution has the following characteristics: the NN weights and the number of hidden-layer units undergo changes, the mutation rate is kept at 0.3, and the initial population size is 200. [8] proposes the use of two levels for the Flappy Bird game. The fitness function is calculated from the distance traveled by the agent and the current distance to the closest gap. The mutation rate is kept at 0.2, and there are 5 neurons in the hidden layer.
## III. METHODOLOGY
The NEAT algorithm implementation depends on the objective function, crossover, mutation, and a population of agents. For a given position of the bird, say (x, y), there are two actions that the agent can take: either the bird flaps its wings or it does not. The vertical and horizontal distances traveled by the agent are determined by the following equations.
$$d_{vertical} = v_{jump} \cdot t + \frac{1}{2} a t^{2} \quad (1)$$

$$d_{horizontal} = v_{floor} \cdot t \quad (2)$$

$$d_{floor} = v_{floor} \cdot t \quad (3)$$

$$d_{pipe} = v_{pipe} \cdot t \quad (4)$$
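Equations (1)-(4) amount to simple kinematics evaluated each tick. The sketch below computes them for placeholder velocity and acceleration values; the game's actual constants are not given here:

```python
def displacements(t, v_jump, v_floor, v_pipe, a):
    """Evaluate Eqs. (1)-(4) at time t."""
    d_vertical = v_jump * t + 0.5 * a * t ** 2    # Eq. (1)
    d_horizontal = v_floor * t                    # Eq. (2)
    d_floor = v_floor * t                         # Eq. (3)
    d_pipe = v_pipe * t                           # Eq. (4)
    return d_vertical, d_horizontal, d_floor, d_pipe

# Placeholder values purely for illustration (upward jump, gravity down).
dv, dh, df, dp = displacements(t=2.0, v_jump=-10.0, v_floor=3.0,
                               v_pipe=3.0, a=9.8)
```

Note that the horizontal displacement of the agent equals the floor displacement here, since in Flappy Bird the bird stays horizontally fixed while the floor and pipes scroll past it.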
Eq. (1) determines the vertical displacement of the agent, where a is the acceleration, which is a constant [12]. As shown in
Fig. 1. Details of the game environment [10]
Fig. 2, the y coordinate of the agent, the distance between the top pipe and the agent (y - T'), and the distance between the bottom pipe and the agent (T') are the inputs to the neural network. The gap between the top and bottom pipes is fixed at 320 pixels, and the pipe heights are randomly generated. The distance between subsequent pipes is also kept constant. In the NEAT algorithm, the fitness of the agent is determined by the number of pipes the agent is able to cross without collision. As soon as the agent collides with a pipe, hits the roof, or falls to the ground, it is removed from the environment. The performance of the algorithm depends on the initial population taken into consideration. The activation function used is the hyperbolic tangent function. The mutation rate is kept at 0.03. The encoding of the chromosome is shown in Table I. The weight of each connection from a node in one layer to a node in another layer, as well as whether the connection is dropped, is part of the encoding. If the connection is to be dropped, it
Fig. 2. Parameters required as input to the NN
is encoded with the value 0; otherwise, it has the value 1. With reference to Fig. 3 and Table I, the edges between
Fig. 3. Diagrammatic view of the encoded chromosome in Table I
TABLE I ENCODING OF A CHROMOSOME BEFORE CROSSOVER AND MUTATION
| Weight | 0.25 | 2.31 | 1.55 | 0.98 | 5.11 | 1.17 | 0.07 |
|----------|--------|--------|--------|--------|--------|--------|--------|
| From | 1 | 2 | 3 | 1 | 3 | 4 | 2 |
| To | 2 | 3 | 2 | 3 | 4 | 3 | 4 |
| Enabled | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
the nodes are represented by the rows 'From' and 'To'. Table I shows the encoding of the network before mutation. After mutation, or rather after topology augmentation, the encoding of the edges is shown in Table II, and the resultant connections are shown in Fig. 4. The edges in red are those that were dropped, and the edges in green are those added as a result of the mutation. The crossover process happens between any two randomly selected parents. The next population is determined by the fitness of the individual agents.
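The connection-gene encoding of Table I, together with the enable-bit mutation that drops or restores edges, can be sketched as follows. Only the enable-bit flip is shown, and the helper name is an illustrative assumption:

```python
import random

# Each connection gene mirrors one column of Table I:
# (from_node, to_node, weight, enabled).
genome = [
    (1, 2, 0.25, 1), (2, 3, 2.31, 0), (3, 2, 1.55, 1), (1, 3, 0.98, 1),
    (3, 4, 5.11, 1), (4, 3, 1.17, 1), (2, 4, 0.07, 1),
]

def mutate_enable_bits(genome, rate=0.03, rng=random.random):
    """Flip the 'Enabled' bit of each gene with probability `rate`,
    i.e. drop or restore that connection (rate 0.03 as in the paper)."""
    return [(f, t, w, 1 - e if rng() < rate else e)
            for f, t, w, e in genome]

mutated = mutate_enable_bits(genome)
```

A full NEAT implementation would also mutate weights, add nodes and edges, and align genes by innovation number during crossover; this sketch covers only the drop/restore step visible in Tables I and II.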
Fig. 4. Diagrammatic view of the encoded chromosome in Table II
TABLE II ENCODING OF A CHROMOSOME AFTER CROSSOVER AND MUTATION
| Weight | 0.25 | 5.11 | 1.17 | 0.98 | 2.31 | 1.55 | 0.07 |
|----------|--------|--------|--------|--------|--------|--------|--------|
| From | 1 | 2 | 4 | 1 | 3 | 3 | 4 |
| To | 3 | 4 | 2 | 4 | 2 | 4 | 3 |
| Enabled | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
## IV. RESULTS
The implementation of the algorithm requires no historic data or dataset. The algorithm makes use of the sensory data perceived from the environment by the artificial agent as the program runs. The inputs to the algorithm are the y position of the agent, the vertical distance of the agent from the top pipe, and the vertical distance of the agent from the lower pipe. The output of the algorithm is the action the agent is to take, i.e., flap or drop down owing to gravity. The NEAT algorithm was implemented with different initial populations. Fig. 5, Fig. 6 and Fig. 7 show the average score and the scores reached in every generation when the game is played by the agents over 50 generations. The change in the average scores
Fig. 5. Gameplay when initial population is 80
over the change in the initial population is separately shown in
Fig. 6. Gameplay when initial population is 100
Fig. 7. Gameplay when initial population is 120
<details>
<summary>Image 7 Details</summary>

Line graph of "Score" (blue) and "Mean Score" (orange) over 50 generations (x-axis: Generation, 0-50; y-axis: Score, 0-500). The Score line fluctuates sharply between 0 and roughly 500, with spikes near generations 5, 15, 25, 35, 45, and 50 and troughs near generations 10, 20, 30, and 40, while the Mean Score increases gradually and steadily.
</details>
Fig. 8 shows the average scores for generations 30 to 50. The average score of the agent increases steadily as the initial population grows from 20 to 100, and the maximum score is observed when the population is 160. The average fitness value of the population is higher when the initial population size is 100, as shown in Fig. 9. The initial training phase lasts fewer than 5 generations. When the initial population has fewer agents, it takes more generations for the average score to spike, as can be observed from Fig. 10. Table III shows the average and maximum scores gained by the agent over 50 generations. A maximum score of 1025 is obtained when the initial population is 160 and the gameplay runs for 50 generations.
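The per-generation statistics reported in Table III and plotted as the "Score" and "Mean Score" curves can be gathered with a simple loop around the evaluation step. The sketch below is illustrative only: `evaluate_genome` is a hypothetical stand-in for running one agent through the game (in the real setup it would return the number of pipes passed before crashing), and its toy scoring merely mimics improvement across generations.

```python
import random

def evaluate_genome(genome, generation):
    # Hypothetical stand-in for one Flappy Bird run; the noise term mimics
    # the volatility of individual runs, the generation term mimics learning.
    return max(0, int(random.gauss(5 * generation, 10) + genome))

def run(initial_population=100, generations=50, seed=0):
    random.seed(seed)
    genomes = list(range(initial_population))  # placeholder genomes
    history = []
    for gen in range(generations):
        scores = [evaluate_genome(g, gen) for g in genomes]
        history.append({
            "generation": gen,
            "max_score": max(scores),                  # the volatile "Score" curve
            "mean_score": sum(scores) / len(scores),   # the smooth "Mean Score" curve
        })
    return history

history = run()
print(history[-1])
```

Averaging the `mean_score` column over all 50 generations, and taking the maximum of `max_score`, yields the two statistics tabulated per initial population size in Table III.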
## CONCLUSION AND FUTURE SCOPE
Using a 2D game, the performance of the algorithms can be evaluated efficiently. Unlike a simulation, creating the environment gives better control over it. Over successive iterations with varying initial population sizes, the average score gained by the agent increased. The initial population of agents also affects the training speed: the more agents there are, the faster the training completes. The highest
<details>
<summary>Image 8 Details</summary>

Line chart of fitness value (0-250) over generations 32.5-50.0 for six initial population sizes. Init Pop = 80 (green) stays highest, near 250; Init Pop = 100 (orange) and 120 (blue) follow in the 190-240 range; Init Pop = 60 (red) and 40 (purple) plateau near 190 and 170; Init Pop = 20 (brown) starts lowest, at about 100, but climbs most steeply, reaching about 160 by generation 50.
</details>
Fig. 8. Average scores over initial population change (Gen 30 - Gen 50)
<details>
<summary>Image 9 Details</summary>

Line graph of fitness value (0-300) over 50 generations for initial populations of 120 (blue) and 100 (orange). Both rise from near 0; the orange line leads until about generation 20, after which the blue line overtakes it, and both reach roughly 300 by generation 50.
</details>
Fig. 9. Average Fitness of the population over initial population change
<details>
<summary>Image 10 Details</summary>

Line graph of score (0-200) over 50 generations for initial populations of 20 (blue), 40 (orange), and 60 (green). The green line rises fastest, approaching 200 by generation 50; the orange line climbs steeply to about 125 by generation 10 and then stabilizes around 150-175; the blue line stays near 0 until generation 10 and surges after generation 30, reaching about 150 by generation 50.
</details>
Fig. 10. Speed of agents getting trained over initial population change
TABLE III. Average and maximum scores over 50 generations for varying initial population sizes
| Initial Population | Average Score | Max Score |
|----------------------|-----------------|-------------|
| 20 | 158.2 | 583 |
| 40 | 187.04 | 771 |
| 60 | 200.06 | 756 |
| 80 | 250.28 | 765 |
| 100 | 244.56 | 911 |
| 120 | 220.66 | 544 |
| 140 | 255.82 | 565 |
| 160 | 293.72 | 1025 |
average score is obtained when the initial population is set to 100 individuals. It can be concluded that the performance of the algorithm increases as the initial population grows. The implemented approach can be extended by applying Reinforcement Learning with multiple agents and by combining augmenting topologies with the Deep Q-Learning model.
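The Deep Q-Learning extension mentioned above rests on the standard Q-learning update, which a Deep Q-Network approximates with a neural network instead of a table. A minimal tabular sketch follows; the discretized state (the bird's vertical offset from the next pipe gap, in cells) and the reward of +1 per survived step are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

ACTIONS = [0, 1]  # 0 = fall with gravity, 1 = flap

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)  # zero-initialized Q-values
# One illustrative transition: the bird is 3 cells below the gap, flaps,
# receives +1 for surviving the step, and ends up 2 cells below the gap.
q_update(Q, state=3, action=1, reward=1.0, next_state=2)
print(Q[(3, 1)])  # 0.1 after a single update from a zero-initialized table
```

A DQN replaces the `Q` table with a network trained on the same target, which is what allows the method to scale to the continuous pixel or coordinate states of the game.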
## REFERENCES
- [1] M. G. Cordeiro, Paulo B. S. Serafim, Yuri B. Nogueira, 'A Minimal Training Strategy to Play Flappy Bird Indefinitely with NEAT', 2019 18th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), pp. 384-390, DOI: 10.1109/SBGames.2019.00014.
- [2] Evgenia Papavasileiou, Jan Cornelis and Bart Jansen, 'A Systematic Literature Review of the Successors of "NeuroEvolution of Augmenting Topologies"', Evolutionary Computation, vol. 29, no. 1, March 2021, pp. 1-73, DOI: 10.1162/evco_a_00282.
- [3] Inseok Oh, Seungeun Rho, Sangbin Moon and Seongho Son, 'Creating Pro-Level AI for a Real-Time Fighting Game Using Deep Reinforcement Learning', IEEE Transactions on Games, 2021, pp. 1-10, DOI: 10.1109/TG.2021.3049539.
- [4] Botong Liu, 'Implementing Game Strategies Based on Reinforcement Learning', ICRAI 2020: 2020 6th International Conference on Robotics and Artificial Intelligence, November 2020, pp. 53-56, DOI: 10.1145/3449301.3449311.
- [5] Evalds Urtans, Agris Nikitenko, 'Survey of Deep Q-Network variants in PyGame Learning Environment', ICDLT '18: Proceedings of the 2018 2nd International Conference on Deep Learning Technologies, June 2018, pp. 27-36, DOI: 10.1145/3234804.3234816.
- [6] Tai Vu, Leon Tran, 'FlapAI Bird: Training an Agent to Play Flappy Bird Using Reinforcement Learning Techniques', arXiv preprint, 2020, DOI: 10.48550/arXiv.2003.09579.
- [7] Andre Brandao, Pedro Pires, Petia Georgieva, 'Reinforcement Learning and Neuroevolution in Flappy Bird Game', Pattern Recognition and Image Analysis, 2019, pp. 225-236, DOI: 10.1007/978-3-030-31332-6_20.
- [8] Yash Mishra, Vijay Kumawat, Selvakumar Kamalanathan, 'Performance Analysis of Flappy Bird Playing Agent Using Neural Network and Genetic Algorithm', Information, Communication and Computing Technology, 2019, pp. 253-265, DOI: 10.1007/978-981-15-1384-8_21.
- [9] A. McIntyre, M. Kallada, C. G. Miguel, and C. F. da Silva, 'neat-python', https://github.com/CodeReclaimers/neat-python.
- [10] T. Ruscica and J. Keromnes, 'Flappy Bird game images', https://github.com/techwithtim/NEAT-Flappy-Bird/tree/master/imgs.
- [11] S. Kumar, 'Understand Types of Environments in Artificial Intelligence', https://www.aitude.com/understand-types-of-environments-in-artificial-intelligence/, 2020.
- [12] D. Zhu, 'How I Built an Intelligent Agent to Play Flappy Bird', Analytics Vidhya, 2020.
- [12] D. Zhu, 'How I Built an Intelligent Agent to Play Flappy Bird', Analytics Vidhya, 2020.