# Graph-Based Exploration for ARC-AGI-3 Interactive Reasoning Tasks
**Authors**: Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley
## Abstract
We present a training-free graph-based approach for solving interactive reasoning tasks in the ARC-AGI-3 benchmark. ARC-AGI-3 comprises game-like tasks where agents must infer task mechanics through limited interactions, and adapt to increasing complexity as levels progress. Success requires forming hypotheses, testing them, and tracking discovered mechanics. The benchmark has revealed that state-of-the-art LLMs are currently incapable of reliably solving these tasks. Our method combines vision-based frame processing with systematic state-space exploration using graph-structured representations. It segments visual frames into meaningful components, prioritizes actions based on visual salience, and maintains a directed graph of explored states and transitions. By tracking visited states and tested actions, the agent prioritizes actions that provide the shortest path to untested state-action pairs. On the ARC-AGI-3 Preview Challenge, this structured exploration strategy solves a median of 30 out of 52 levels across six games and ranks 3rd on the private leaderboard, substantially outperforming frontier LLM-based agents. These results demonstrate that explicit graph-structured exploration, even without learning, can serve as a strong baseline for interactive reasoning and underscore the importance of systematic state tracking and action prioritization in sparse-feedback environments where current LLMs fail to capture task dynamics. The code is open source and available at https://github.com/dolphin-in-a-coma/arc-agi-3-just-explore.
## Introduction
Introduced in 2019, the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) has become a fundamental benchmark for evaluating general fluid intelligence in artificial systems by posing novel tasks that require minimal prior knowledge (chollet_measure_2019). While the original ARC-AGI benchmarks focused on static grid-based reasoning tasks, ARC-AGI-3 represents a paradigm shift toward Interactive Reasoning Benchmarks (IRBs) that test broader capabilities, including on-the-fly learning, exploration, and memory through game-like environments where agents must perceive, plan, and act across multiple steps to achieve long-horizon goals (arc-agi-3).
ARC-AGI-3 introduces novel game environments designed to test the skill-acquisition efficiency of artificial systems: agents interact with the environments without instructions and must discover mechanics through exploration. Early results reveal a stark performance gap: frontier AI models scored 0% while human participants achieved 100% on the initial preview tasks (arc-agi-3). This dramatic disparity underscores fundamental limitations in current AI approaches to interactive reasoning and adaptive learning in novel environments.
The challenge of learning from sparse rewards has been central to reinforcement learning (RL) for decades. When rewards are rare and precise action sequences are required, random exploration fails to discover optimal policies. Exploration strategies have emerged to address this challenge. Curiosity-driven methods use prediction error as intrinsic motivation (pathak_curiosity-driven_2017), enabling agents to explore complex environments like Super Mario Bros without extrinsic rewards (pathak_curiosity-driven_2017). Go-Explore advances systematic exploration by maintaining archives of discovered states and decomposing exploration into phases: return to promising states, then explore from them (ecoffet_first_2021). This approach achieved breakthrough performance on Montezuma’s Revenge, scoring 25% higher than the human expert. For goal-conditioned tasks, Hindsight Experience Replay (HER) learns from failure by relabeling unsuccessful attempts as alternative goals, achieving sample-efficient learning without reward engineering (andrychowicz_hindsight_2017).
Model-based approaches have demonstrated remarkable sample efficiency by learning environment dynamics. MuZero combined learned latent dynamics with tree search, achieving superhuman performance on board games and Atari benchmarks without knowledge of game rules (schrittwieser_mastering_2020). EfficientZero extended this with self-supervised consistency losses, becoming the first algorithm to reach superhuman levels on Atari (194.3% mean human) with just 100k training samples (two hours of real-time experience) per game (ye_mastering_2021). BBF further improved Atari 100k results by scaling the value network while preserving sample efficiency (schwarzer_bigger_2023), in a completely model-free manner.
The family of Dreamer models (hafner_training_2025) takes an alternative approach, learning world models in latent space and training policies through imagined rollouts rather than via tree search, mastering over 150 diverse tasks from Atari to Minecraft with a single configuration (hafner_training_2025). Most recently, Axiom introduced object-centric world models that learn compositional representations by discovering and tracking entities, achieving competitive performance within minutes by targeting 10k-step solutions per environment (heins_axiom_2025).
Despite these advances, current approaches face fundamental limitations for few-shot discovery tasks like ARC-AGI-3. The benchmark provides only a single sparse reward signal, level completion, across no more than 10 levels per game. This scarcity of feedback severely constrains learning-based methods. The challenge is compounded by the fact that each level introduces new mechanics while retaining previous ones, creating a shifting distribution that prevents straightforward transfer learning. Curiosity-driven exploration offers no guarantee of correlation with task progress in truly novel environments where the notion of "most interesting states" may be orthogonal to goal-relevance. Sample-efficient approaches like Axiom assume object-centric compositional structure and require environments to exhibit consistent physical dynamics, assumptions that may not hold across ARC-AGI-3's abstract and diverse game mechanics.
ARC-AGI-3 is also relevant for understanding the behaviour of large language model (LLM) agents. Unlike static reasoning benchmarks, it requires agents to infer latent rules through interaction, maintain an evolving notion of state, and design multi-step probes under sparse feedback, making it a complementary testbed for studying how explicit structure and exploration strategies can support LLM-based reasoning.
In this work, we present a graph-based exploration method that combines systematic state-space tracking with visual priority heuristics to tackle ARC-AGI-3’s interactive reasoning challenges. Our approach maintains a directed graph representation of explored states and action transitions, prioritizing actions based on visual salience while ensuring comprehensive exploration through frontier-driven navigation. Unlike learning-based approaches that require extensive training, our method operates as a strong baseline that can make progress through structured exploration alone. We demonstrate that this approach achieves competitive performance on the ARC-AGI-3 benchmark, significantly outperforming state-of-the-art LLMs while providing insights into the nature of exploration required for interactive reasoning tasks.
## ARC-AGI-3
### Benchmark Overview
ARC-AGI-3 represents a significant evolution from the original ARC challenge, shifting from static grid-based reasoning to interactive game environments that test an agent’s ability to learn through exploration (ying_assessing_2025). The benchmark consists of 6 novel game environments, with 3 public games (ft09, ls20, vc33) released for development and 3 private games (sp80, lp85, as66) used to determine final leaderboard rankings. Each game contains between 8 and 10 levels, with each subsequent level introducing new mechanics. Figure 2 in the appendix shows example screenshots from the games.
The benchmark's evaluation criterion prioritizes both effectiveness and efficiency: agents are scored by the number of levels completed, with the total number of actions serving as a tiebreaker. This dual objective encourages solutions that not only discover winning strategies but do so with minimal exploration. For the final evaluation experiments by the ARC-AGI-3 organizers, each run was capped at 8 hours of wall-clock time and 10 environment steps per second (sps), shared across the three private games. Under these limits, a single game can receive at most 96,000 steps.
### Observation and Action Spaces
#### Visual Observations.
Agents receive visual observations as 64 $\times$ 64 pixel RGB frames with a discrete palette of 16 colors. Each frame contains both the game environment and a status bar displaying the number of steps remaining before an automatic level restart. When the step counter reaches zero, the current level resets to its initial state. In the majority of games, the number of levels passed is also displayed.
#### Action Spaces.
The benchmark features three control schemes. Games such as ls20 use arrow-based control with directional keyboard inputs (up, down, left, right), yielding an action space of size $|\mathcal{A}|=4$ . Games such as ft09, vc33, and lp85 employ click-based control, enabling spatial interaction by allowing the agent to click any pixel location in the frame, yielding an action space of size $|\mathcal{A}|=64\times 64=4{,}096$ . Private games (sp80 and as66) introduce combined control schemes that integrate both arrow and click inputs, resulting in action spaces of size $|\mathcal{A}|=4{,}100$ .
The dramatic difference in action space cardinality between control schemes poses a fundamental challenge: click-based games present over 1,000 times more possible actions at each state than arrow-based games, making exhaustive exploration intractable without intelligent action selection.
### Task Structure and Mechanics
Each game in ARC-AGI-3 embodies a distinct set of mechanics and objectives that agents must discover through interaction. The only feedback signal is level completion: the environment advances to the next level when the agent satisfies (unknown) winning conditions, or resets to the beginning when the step limit expires.
Within each game, levels progressively add new elements while retaining earlier ones. For example, level 1 of ls20 requires basic movement and the use of a transformer object to activate the exit door by adjusting the shape of a key; level 2 adds energy pellets that refill the number of steps remaining; level 3 introduces a color dimension to the key; and so forth, up to level 8, where the agent must cope with only partial observations. This progressive structure mirrors how humans naturally acquire skills in games, but poses challenges for algorithms: knowledge transfer between levels could accelerate learning, yet the levels are related only at a highly abstract level.
The released games operate deterministically: the same action taken from the same state always produces the same outcome. This property enables systematic state-space exploration strategies and graph-based representations of explored states. However, determinism does not imply simplicity; the complexity arises from the large state and action spaces and the lack of prior knowledge about which actions lead toward goal states.
## Methods
Our approach comprises two primary components: a Frame Processor for extracting key visual features and a Level Graph Explorer for systematic state-space exploration.
### Frame Processor
The Frame Processor reduces irrelevant visual variability and directs exploration toward actionable regions of the game environment through the following operations:
#### Image Segmentation.
Each frame is segmented into single-color connected components, establishing the foundation for identifying distinct visual elements that may constitute interactive objects.
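As a minimal sketch of this step, assuming the frame is a 2D grid of palette indices (the released implementation may differ), single-color connected components can be extracted with a plain flood fill:

```python
from collections import deque

def segment_frame(frame):
    """Split a 2D grid of color indices into single-color connected
    components (4-connectivity). Returns a list of (color, pixels) pairs,
    where pixels is a set of (row, col) coordinates."""
    h, w = len(frame), len(frame[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            color = frame[r][c]
            # Flood-fill all same-colored neighbors of (r, c).
            queue, pixels = deque([(r, c)]), set()
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and frame[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((color, pixels))
    return components
```

On a 64 $\times$ 64 frame with a 16-color palette, this runs in linear time in the number of pixels, and each resulting component is a candidate interactive object.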
#### Status Bar Detection and Masking.
To prevent conflation of environment features with user interface components, the processor identifies and masks probable status bars. Because the step counter in the status bar changes after nearly every action, masking it substantially reduces the number of distinct recognized states.
#### Priority-Based Action Grouping.
For click-controlled games, visual segments are stratified into five priority tiers based on their likelihood of representing interactive buttons or objects. Prioritization is determined by segment size, morphological features, and color salience. The lowest priority tier encompasses segments identified as probable status bars, ensuring their exploration only after exhausting higher-priority alternatives.
#### State Hashing.
The processor generates a hash representation of the masked image, serving as a unique identifier for the current game state. This hash facilitates efficient state tracking and duplicate detection during graph exploration.
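A minimal sketch of such a hash, assuming the masked frame is a grid of palette indices (the function name and digest length are illustrative, not the repository's exact choices):

```python
import hashlib

def state_hash(masked_frame):
    """Hash a masked frame (2D grid of color indices, status bar already
    masked out) into a short hex digest used as a state identifier."""
    flat = bytes(pixel for row in masked_frame for pixel in row)
    return hashlib.sha256(flat).hexdigest()[:16]
```

Identical masked frames map to identical identifiers, so revisiting a state is detected by a single dictionary lookup.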
### Level Graph Explorer
The Level Graph Explorer maintains a directed graph representation of the explored state space, where nodes correspond to unique game states and edges encode action-induced state transitions.
#### Graph Structure.
For each discovered state (graph node), the explorer maintains:
- The action space $\mathcal{A}$ (identifiers of connected components for spatial-interaction games such as ft09/vc33, keyboard inputs for games such as ls20)
- For each action $a\in\mathcal{A}$ : priority level $\pi(a)$ , exploration status, transition outcome, successor state, and minimal distance to the nearest unexplored frontier
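A hypothetical encoding of this per-node bookkeeping (all class and field names are illustrative, not the repository's actual types):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ActionRecord:
    """Per-action bookkeeping stored on each graph node."""
    priority: int                     # salience tier pi(a); lower explored first
    tested: bool = False              # executed from this state yet?
    successor: Optional[str] = None   # hash of the resulting state, once observed

@dataclass
class StateNode:
    """A node in the level graph: one unique (hashed) game state."""
    state_hash: str
    actions: Dict[str, ActionRecord] = field(default_factory=dict)
    frontier_distance: Optional[int] = None  # distance to nearest untested action

    def untested(self, max_priority):
        """Untried actions whose priority does not exceed the threshold."""
        return [a for a, rec in self.actions.items()
                if not rec.tested and rec.priority <= max_priority]
```

Edges are implied by each record's `successor` field, so the directed graph is stored as a dictionary from state hashes to `StateNode` objects.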
#### Action Selection Strategy.
The explorer implements a hierarchical action selection policy that progressively expands the search space, as shown in Algorithm 1.
Algorithm 1: Hierarchical Action Selection

**Input**: current state $s$, priority threshold $p$
1. **if** $\exists$ untested actions with priority $\pi(a)\leq p$ in state $s$ **then**
2. Select uniformly at random an untested action $a$ with $\pi(a)\leq p$ from $s$
3. Execute the action and update the graph with the observed transition
4. **else if** $\exists$ a reachable state $s^{\prime}$ with untested actions where $\pi(a)\leq p$ **then**
5. Select the action minimizing the distance to such a state $s^{\prime}$
6. Execute the selected action
7. **else**
8. Increment the priority threshold: $p\leftarrow p+1$
9. Recurse from the current state $s$ with the updated priority $p$
10. **end if**
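As a rough illustration, this selection policy can be sketched over a toy transition graph; the graph encoding, field names, and `max_priority` cap below are illustrative assumptions, not the released implementation:

```python
import random
from collections import deque

def select_action(graph, state, p, max_priority=5):
    """Hierarchical action selection over a toy graph.

    graph: state -> {action: {"priority": int, "tested": bool, "succ": state}}
    Returns (action, updated_threshold)."""
    untested = [a for a, rec in graph[state].items()
                if not rec["tested"] and rec["priority"] <= p]
    if untested:
        # Case 1: an untested action is available here -- take one at random.
        return random.choice(untested), p
    # Case 2: BFS over known transitions toward the nearest state that still
    # has an untested action at priority <= p; return the first step there.
    seen = {state}
    queue = deque((a, rec["succ"]) for a, rec in graph[state].items()
                  if rec["succ"] is not None)
    while queue:
        first_step, s2 = queue.popleft()
        if s2 in seen:
            continue
        seen.add(s2)
        if any(not rec["tested"] and rec["priority"] <= p
               for rec in graph[s2].values()):
            return first_step, p
        queue.extend((first_step, rec["succ"]) for rec in graph[s2].values()
                     if rec["succ"] is not None)
    # Case 3: nothing left at this threshold -- relax it and retry.
    if p < max_priority:
        return select_action(graph, state, p + 1, max_priority)
    return None, p
```

Because the BFS carries the first step of each path, the agent can immediately execute the move that shortens the distance to the chosen frontier state.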
This policy ensures systematic exploration of high-salience actions prior to considering lower-priority alternatives, thereby focusing computational resources on likely-relevant state-action pairs.
#### Frontier Management.
The explorer maintains the shortest-path distance from each explored state to the frontier states (those containing untested actions). These distances guide traversal toward the nearest unexplored regions.
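Since the games are deterministic and every observed transition is recorded, these distances can be computed by a multi-source breadth-first search over reversed edges, with all frontier states as sources. A sketch under a simplified graph encoding (the field layout is an assumption for illustration):

```python
from collections import deque

def frontier_distances(graph):
    """Shortest number of known transitions from each state to the nearest
    frontier state (one with at least one untested action).

    graph: state -> {action: (tested, successor_or_None)}"""
    # Build reverse adjacency over recorded transitions.
    rev = {s: [] for s in graph}
    for s, actions in graph.items():
        for tested, succ in actions.values():
            if succ is not None and succ in rev:
                rev[succ].append(s)
    # Frontier states are BFS sources at distance 0.
    dist = {s: 0 for s, actions in graph.items()
            if any(not tested for tested, _ in actions.values())}
    queue = deque(dist)
    while queue:
        s = queue.popleft()
        for prev in rev[s]:
            if prev not in dist:
                dist[prev] = dist[s] + 1
                queue.append(prev)
    return dist
```

States absent from the returned mapping cannot reach any frontier through known transitions, which signals that exploration at the current priority threshold is exhausted.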
## Baselines
We evaluate our approach against two baseline methods to demonstrate the effectiveness of structured exploration.
#### Random Agent.
A simple baseline that selects actions uniformly at random from the available action space at each step. This baseline provides a lower bound on performance and demonstrates the difficulty of solving tasks through undirected exploration alone.
#### LLM+DSL.
We compare against the best-performing LLM-based solution on the leaderboard (fluxon_arc_2025), which combines GPT-4.1 with domain-specific language (DSL) programming. The approach observes game frames and generates Python code to interact with the environment, attempting to discover game mechanics through programmatic reasoning. Despite using a frontier LLM, this approach demonstrates the current limitations of LLM-based methods for interactive reasoning tasks.
Because each environment step is gated by an LLM call, it is severely interaction-limited: within the evaluation budget, it produces only about 4,000 interactions per game, compared to the 96,000 steps that are in principle allowed. To avoid high LLM usage costs, we do not re-run this baseline; instead, we report the results from its official evaluation on the private games, with the limitation that only a single aggregate score is available and no results are reported on the public games.
## Results
We evaluated our graph-based exploration method on all six ARC-AGI-3 games. Figure 1 reports an incremental component-addition analysis: starting from a random agent, we cumulatively add components and measure the total levels solved across games; the LLM+DSL baseline is included for comparison. Here, to ensure a fair comparison with the LLM-based baseline, all methods are capped at 4,000 interactions per game. All non-LLM configurations report the median over 5 runs, whereas the LLM+DSL baseline is shown as a single result taken from the official challenge evaluation.
<details>
<summary>x1.png Details</summary>

Stacked bar chart of the number of solved levels (y-axis) for each agent configuration (x-axis: LLM + DSL, Random Agent, + Frame Segmentation, + Prioritize New Actions, + Graph Exploration). For each configuration, the bar above the horizontal axis stacks the three private games (as66, lp85, sp80; the LLM + DSL result is a single aggregate score), and the bar below the axis stacks the three public games (ft09, ls20, vc33).
</details>
Figure 1: Effect of progressively adding method components to a random agent, compared with the LLM+DSL baseline. For each configuration, the stacked bar above the horizontal axis shows the total number of solved levels across the three private games, and the stacked bar below shows the total across the three public games. Colors indicate how many levels are solved in each individual game. The rightmost bars correspond to the full method. All non-LLM configurations report the median over 5 runs, whereas the LLM+DSL baseline is shown as a single result taken from the official challenge evaluation.
The random agent and LLM+DSL baseline solve 6 and 5 levels on the private games, respectively, meaning that the LLM-based method underperforms even a random policy. The random agent also solves 3 levels across the public games.
Adding frame segmentation to random exploration slightly increases performance on the private games, making it possible to solve one level of lp85. It also significantly improves performance on the public games, solving 5 levels on vc33 and 2 levels on ft09.
When untested actions are favored in each state but without full state-graph exploration, performance decreases slightly on as66, where the method solves only 4 levels.
Our complete approach solves 19 levels with an interaction limit of 4,000: 2 on ft09, 2 on ls20, 5 on vc33, 1 on sp80, 2 on lp85, and 7 on as66.
In a full 8-hour run, across 5 independent runs, our method solves a median of 16 levels on the private games and 14 levels on the public games (see Figure 3 in the appendix). Per-level performance is reported in Tables 1 and 2 in the appendix.
On the official ARC-AGI-3 challenge evaluation, the submitted model solves 12 levels on the private games while still ranking 3rd by the number of solved levels. This discrepancy is due to an implementation bug in how reset-inducing actions are handled (see Discussion).
## Discussion
Our graph-based exploration method demonstrates that structured state-space navigation with visual prioritization significantly outperforms both random exploration and frontier LLMs with access to code writing and execution on ARC-AGI-3.
#### Performance Analysis.
The method excelled on games where visual salience aligned with interactive elements (vc33, as66). Performance degraded on games with extremely large state spaces (ft09 levels 6+, ls20 levels 3+), where exhaustive exploration becomes computationally intractable. The improvement over LLM+DSL baselines suggests that structured exploration provides a more reliable foundation for interactive reasoning than pure language-model-based approaches, which struggle to form and test hypotheses systematically.
The discrepancy between the official ARC-AGI-3 evaluation and our re-runs is due to an implementation bug in the handling of reset events. Actions that triggered a reset were not marked as tested in the game graph. Consequently, when such a state–action pair was the nearest remaining untested edge in the graph from the starting node, the agent repeatedly selected it, resetting the game and effectively entering a loop.
#### Limitations.
The method faces two fundamental constraints. First, computational requirements grow linearly with state space size, limiting scalability to levels with moderate complexity. Second, the approach assumes deterministic, fully observable environments and would fail under stochasticity or partial observability.
#### Future Directions.
While the first-place solution on the leaderboard (smit_driessmitarc3-solution_2025) achieved superior performance with a learning-based approach, it did not incorporate structured exploration strategies. A natural next step is to integrate our graph-based exploration framework with adaptive learning algorithms. Such hybrid approaches could leverage graph representations to guide model training and action selection, while learned world models or policies could improve sample efficiency through generalization. The key challenge remains the sparse reward signal and limited training data, making it essential to develop methods that can effectively transfer knowledge across levels while maintaining systematic exploration coverage.
## Appendix A: ARC-AGI-3 Games
|
<details>
<summary>Figures/vc33.png Details</summary>

### Visual Description
## Pixel Art: Abstract Scene
### Overview
The image is a pixel art scene, resembling a simplified, abstract representation of an environment. It features a grid-like background with various colored blocks forming structures and elements within the scene.
### Components/Axes
* **Background:** Predominantly gray, forming the main environment. The grid pattern is visible.
* **Ground Level:** White grid lines, with alternating blue and red blocks at the bottom.
* **Structures:** Two vertical black bars (columns or walls) of different heights.
* **Green Elements:** A horizontal green bar at the top, with small green squares and a larger green square on the right side.
* **Yellow Elements:** A small yellow square at the top.
* **White Squares:** A series of white squares at the top.
### Detailed Analysis
* **Background:** The gray background covers most of the image, providing a neutral backdrop.
* **Ground Level:** The bottom of the image consists of a white grid, with alternating blue and red blocks. The blue and red blocks appear to be at the base of the black structures.
* **Left Structure:** A tall, vertical black bar extends from the bottom to near the top of the image.
* **Right Structure:** A shorter, vertical black bar, with a green block on top of it.
* **Green Elements:**
* A horizontal green bar runs across the top of the image.
* Two small green squares are positioned on the top-left.
* A larger green square is located on the bottom-right.
* **Yellow Elements:** A small yellow square is positioned on the top.
* **White Squares:** A series of white squares are positioned on the top.
### Key Observations
* The scene is highly abstract and simplified, using basic shapes and colors.
* The black bars create a sense of depth and structure.
* The green elements add contrast and visual interest.
* The alternating blue and red blocks at the bottom provide a sense of rhythm.
### Interpretation
The image appears to be an abstract representation of a landscape or architectural scene. The black bars could represent buildings or walls, while the green elements could represent vegetation or other natural features. The white squares at the top could represent lights or other objects in the sky. The overall composition is simple yet visually engaging, inviting the viewer to interpret the scene in their own way. The use of pixel art gives the image a retro or nostalgic feel.
</details>
|
<details>
<summary>Figures/ls20.png Details</summary>

### Visual Description
## Pixelated Maze Diagram
### Overview
The image is a pixelated representation of a maze or game level. It features a grid-based layout with various colored blocks representing different elements within the maze. The maze is primarily composed of gray pathways and darker gray walls, with several colored blocks indicating specific locations or objects.
### Components/Axes
* **Grid:** The maze is laid out on a grid, providing a structured environment.
* **Walls:** Dark gray blocks represent the walls of the maze.
* **Pathways:** Light gray blocks represent the traversable pathways.
* **Colored Blocks:**
* **Purple:** Scattered throughout the maze, potentially indicating checkpoints or items.
* **Blue/White L-Shape:** Located at the bottom-left, possibly representing the starting point.
* **Green:** A row of green blocks at the bottom, possibly indicating a safe zone or objective area.
* **Blue/Orange:** Located near the center, potentially representing a key or objective.
* **Orange/White:** Located near the center, potentially representing a key or objective.
* **Black/Orange:** Located at the top, potentially representing an enemy or obstacle.
* **Red:** Located at the top-right, potentially representing the end point or a danger zone.
* **White/Light Blue:** Located on the left side, potentially representing a key or objective.
### Detailed Analysis or ### Content Details
* **Maze Layout:** The maze has a complex layout with multiple paths and dead ends. The walls are arranged to create a challenging navigation experience.
* **Purple Blocks:** There are 6 purple blocks scattered throughout the maze.
* **Starting Point:** The blue/white L-shaped block is located at the bottom-left corner. It is 2 blocks wide and 3 blocks tall.
* **Green Zone:** The green blocks are located at the bottom, directly above the starting point. There are 3 green blocks.
* **Blue/Orange Block:** The blue/orange block is located near the center of the maze.
* **Orange/White Block:** The orange/white block is located near the center of the maze, to the right of the blue/orange block.
* **Black/Orange Block:** The black/orange block is located at the top of the maze, slightly to the right of the center.
* **Red Blocks:** The red blocks are located at the top-right corner. There are 3 red blocks.
* **White/Light Blue Block:** The white/light blue block is located on the left side of the maze, slightly above the starting point.
### Key Observations
* The maze is designed to be navigated from the bottom-left to the top-right.
* The colored blocks likely represent key locations, items, or objectives within the maze.
* The maze has a complex layout with multiple paths and dead ends.
### Interpretation
The image represents a pixelated maze or game level, likely for a simple video game. The colored blocks serve as key markers or objectives within the maze, guiding the player through the environment. The maze's complexity suggests a moderate level of challenge for the player. The placement of the starting point (blue/white L-shape) and the potential end point (red blocks) indicates a clear direction for navigation. The other colored blocks (purple, blue/orange, orange/white, black/orange, white/light blue) likely represent items to collect, enemies to avoid, or puzzles to solve.
</details>
|
<details>
<summary>Figures/ft09.png Details</summary>

### Visual Description
## Pixel Art: Grid with Colored Blocks
### Overview
The image is a pixel art representation of a grid containing colored blocks. The grid is surrounded by a border of different colors. The central area contains a dark gray rectangle filled with red, blue, white, and gray blocks.
### Components/Axes
* **Outer Border:** The image is bordered by a multi-colored frame. The top and bottom borders consist of alternating white and gray pixels. The left border consists of alternating orange and gray pixels. The bottom border consists of alternating green and gray pixels. The right border consists of alternating orange and gray pixels.
* **Background:** The area surrounding the central colored blocks is filled with gray pixels.
* **Central Rectangle:** A dark gray rectangle is positioned in the center of the image.
* **Colored Blocks:** Within the dark gray rectangle, there are blocks of red, blue, white, and gray pixels arranged in a pattern.
### Detailed Analysis
The central dark gray rectangle contains a 4x4 grid of blocks, with one block missing in the top right corner. The arrangement of the colored blocks is as follows:
* **Row 1:** Red, Red, Red, Red, Red, Blue
* **Row 2:** Red, White/Gray, Blue, White/Gray, Red
* **Row 3:** Red, Red, Blue, Blue, Blue
* **Row 4:** Red, Red, Red, Red
The white/gray blocks in the second row appear to have a diagonal line of gray pixels.
### Key Observations
* The image is symmetrical in its border design.
* The colored blocks within the central rectangle form a distinct pattern.
* The white/gray blocks have a unique diagonal gray pixel arrangement.
### Interpretation
The image appears to be an abstract representation, possibly a stylized icon or a simplified visual representation of data. The arrangement of the colored blocks within the dark gray rectangle could represent a specific configuration or state. The diagonal gray pixels in the white/gray blocks might indicate a specific property or attribute. Without additional context, the exact meaning of the image is unclear, but the deliberate arrangement of colors and shapes suggests a specific purpose or message.
</details>
<details>
<summary>Figures/sp80.png Details</summary>

### Visual Description
## Pixel Art: Castle
### Overview
The image is a pixel art representation of a castle. It features a grid-like structure with different colored blocks forming the castle's components. The castle is primarily orange, with a green top border, a light blue base, yellow crenellations, and red, blue, black, and magenta details.
### Components/Axes
* **Colors:** The image uses a limited color palette: orange, green, light blue, yellow, red, blue, black, and magenta.
* **Grid:** The image is constructed on a grid, with each color block representing a pixel.
* **Castle Elements:** The castle consists of a main body, crenellations, and details that suggest windows or other architectural features.
### Detailed Analysis
* **Green Border:** A horizontal green band runs along the top of the image.
* **Black and Magenta Block:** A small black block is positioned above a magenta block, near the top-center of the image.
* **Red Rectangle:** A horizontal red rectangle is located below the black and magenta blocks, in the upper half of the image.
* **Orange Main Body:** The primary structure of the castle is orange, filling most of the image area.
* **Blue Rectangle:** A horizontal blue rectangle is positioned in the lower half of the orange main body.
* **Yellow Crenellations:** Yellow blocks form crenellations along the bottom edge of the orange main body. These are shaped like repeating "U"s.
* **Light Blue Base:** A light blue band forms the base of the castle at the bottom of the image.
### Key Observations
* The image is symmetrical.
* The pixelated style gives the image a retro, 8-bit aesthetic.
* The use of simple geometric shapes creates a recognizable castle structure.
### Interpretation
The image is a stylized representation of a castle using pixel art. The limited color palette and grid-based construction are characteristic of early computer graphics. The arrangement of colored blocks creates a simple yet recognizable depiction of a castle, evoking a sense of nostalgia for classic video games and digital art. The placement of the colored blocks suggests a deliberate design, with each element contributing to the overall composition and visual appeal of the image.
</details>
<details>
<summary>Figures/lp85.png Details</summary>

### Visual Description
## Pixel Art: Abstract Design
### Overview
The image is a pixel art design on a grid background. The design features a central shape resembling a stylized face or object, composed of various colored pixels. The background grid is gray, with a green bar on the left side and a series of colored bars at the top.
### Components/Axes
* **Background:** Gray grid.
* **Left Bar:** Vertical green bar on the left edge.
* **Top Bars:** A series of short horizontal bars at the top, colored green and black.
* **Central Design:** A symmetrical shape composed of colored pixels, including red, green, orange, yellow, blue, light blue, purple, gray, and white.
### Detailed Analysis
* **Background Grid:** The background is a uniform gray grid, providing a canvas for the pixel art.
* **Left Bar:** The green bar on the left edge is approximately 5 pixels wide.
* **Top Bars:** There are two green bars and eight black bars at the top.
* **Central Design:**
* The design is roughly symmetrical.
* At the bottom, there are two shapes that are each half red and half green.
* Above these, there are light blue and white pixels.
* The sides feature clusters of yellow/light blue and orange/light blue pixels.
* The top portion includes blue, light blue, and purple pixels.
* The center contains gray and purple pixels.
### Key Observations
* The design is abstract and does not clearly represent a specific object or scene.
* The use of various colors creates visual interest.
* The symmetry of the design suggests a deliberate composition.
### Interpretation
The image appears to be an abstract pixel art design, possibly intended as a decorative element or a stylized representation of an object. The symmetrical arrangement and color choices suggest a deliberate artistic intent. The design does not convey specific data or information beyond its visual composition.
</details>
<details>
<summary>Figures/as66.png Details</summary>

### Visual Description
## Pixel Art: Maze
### Overview
The image is a pixel art representation of a maze. The maze is enclosed within a border and contains a path, obstacles, and a starting/ending point. The image is composed of colored squares arranged on a grid.
### Components/Axes
* **Border:** The maze is enclosed by a border consisting of multiple layers of colored pixels. From the outside in, the border colors are dark gray, orange, gray, and light gray.
* **Maze Field:** The main area of the maze is filled with a purple color.
* **Path:** The path through the maze is represented by a white color.
* **Obstacles:** Obstacles within the maze are represented by a dark gray color.
* **Starting Point:** The starting point is represented by a yellow color.
* **Ending Point:** The ending point is represented by a dark red color, located within an orange square.
* **Green Strip:** A vertical green strip is located on the left side of the maze.
* **Red Strip:** A horizontal red strip is located on the top side of the maze.
### Detailed Analysis
* **Border:** The border surrounds the entire maze area. The outer layer is dark gray, followed by orange, gray, and a light gray inner layer.
* **Maze Field:** The purple area forms the main playing field of the maze.
* **Path:** The white path starts near the bottom center of the maze and extends upwards.
* **Obstacles:** The dark gray obstacles are scattered throughout the purple field, creating barriers along the path.
* **Starting Point:** The yellow starting point is located near the bottom-left of the maze.
* **Ending Point:** The dark red ending point is located within an orange square in the upper-center area of the maze.
* **Green Strip:** The green strip is positioned vertically on the left side of the maze, adjacent to the border.
* **Red Strip:** The red strip is positioned horizontally on the top side of the maze, adjacent to the border.
### Key Observations
* The maze has a clear starting and ending point.
* The obstacles are strategically placed to create a challenging path.
* The color scheme is simple and uses contrasting colors to distinguish between the path, obstacles, and background.
### Interpretation
The pixel art image represents a simple maze game. The goal is to navigate the white path from the yellow starting point to the dark red ending point, avoiding the dark gray obstacles. The green and red strips on the sides do not appear to be part of the maze itself, but rather decorative elements or indicators of the maze's boundaries. The orange square surrounding the ending point may indicate a target area. The maze is designed to be visually clear and easy to understand, despite its pixelated style.
</details>
Figure 2: Top row (Public set): vc33, ls20, ft09. Bottom row (Private set): sp80, lp85, as66.
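The as66 maze above is a concrete instance of the navigation problem that, per the main text, the agent solves via shortest-path queries over its state graph. As an illustrative sketch only (not the paper's implementation), breadth-first search suffices to find the shortest path on such a grid:

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """BFS over a grid maze: '.' is free, '#' is an obstacle.

    Toy analogue of the shortest-path queries the graph explorer runs
    over its state graph; not taken from the paper's repository.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal unreachable
```

For example, `shortest_path_length(["....", ".##.", "...."], (0, 0), (2, 3))` returns 5, routing around the obstacle block.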
## Appendix B: Per-Level Performance Statistics
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Chart: Levels Solved vs. Steps
### Overview
The image is a line chart comparing the performance of four different agents in solving levels, plotted against the number of steps taken. The x-axis (Steps) is on a logarithmic scale, ranging from 10^1 to 10^5. The y-axis (Levels solved) is a linear scale, ranging from 0 to 30. Each agent's performance is represented by a line with a distinct color, and shaded areas around the lines indicate variability or confidence intervals.
### Components/Axes
* **X-axis:** Steps (Logarithmic scale: 10^1, 10^2, 10^3, 10^4, 10^5)
* **Y-axis:** Levels solved (Linear scale: 0, 5, 10, 15, 20, 25, 30)
* **Legend (Top-Left):**
* Blue: Random Agent
* Orange: + Frame Segmentation
* Green: + New-Action Prioritization
* Red: Graph Exploration
### Detailed Analysis
* **Random Agent (Blue):** The line slopes upward, indicating an increase in levels solved as steps increase.
* At 10^1 steps, approximately 1 level solved.
* At 10^2 steps, approximately 2 levels solved.
* At 10^3 steps, approximately 6 levels solved.
* At 10^4 steps, approximately 11 levels solved.
* At 10^5 steps, approximately 13 levels solved.
* **+ Frame Segmentation (Orange):** The line slopes upward, indicating an increase in levels solved as steps increase.
* At 10^1 steps, approximately 1 level solved.
* At 10^2 steps, approximately 3 levels solved.
* At 10^3 steps, approximately 12 levels solved.
* At 10^4 steps, approximately 17 levels solved.
* At 10^5 steps, approximately 17 levels solved.
* **+ New-Action Prioritization (Green):** The line slopes upward, indicating an increase in levels solved as steps increase.
* At 10^1 steps, approximately 1 level solved.
* At 10^2 steps, approximately 4 levels solved.
* At 10^3 steps, approximately 10 levels solved.
* At 10^4 steps, approximately 17 levels solved.
* At 10^5 steps, approximately 17 levels solved.
* **Graph Exploration (Red):** The line slopes upward, indicating an increase in levels solved as steps increase.
* At 10^1 steps, approximately 1 level solved.
* At 10^2 steps, approximately 5 levels solved.
* At 10^3 steps, approximately 14 levels solved.
* At 10^4 steps, approximately 20 levels solved.
* At 10^5 steps, approximately 28 levels solved.
### Key Observations
* The "Graph Exploration" agent (red line) consistently solves more levels than the other agents across all step counts.
* The "Random Agent" (blue line) performs the worst, solving the fewest levels compared to the other agents.
* The "+ Frame Segmentation" (orange line) and "+ New-Action Prioritization" (green line) agents perform similarly, with neither variant consistently ahead of the other across step counts.
* The performance of "+ Frame Segmentation" and "+ New-Action Prioritization" plateaus after 10^4 steps.
* The shaded regions around each line indicate the variability in performance. The "Graph Exploration" agent has a wider shaded region, suggesting more variability in its performance.
### Interpretation
The chart demonstrates the effectiveness of different strategies for solving levels. The "Graph Exploration" agent, which likely uses a more sophisticated approach to explore the problem space, significantly outperforms the "Random Agent," which serves as a baseline. The "+ Frame Segmentation" and "+ New-Action Prioritization" agents show moderate improvements over the "Random Agent," suggesting that these techniques contribute to better performance, but not as significantly as "Graph Exploration." The plateauing of "+ Frame Segmentation" and "+ New-Action Prioritization" suggests that these strategies may have limitations in scaling to more complex levels or require further optimization. The variability in "Graph Exploration" performance could indicate sensitivity to specific level characteristics or randomness in the exploration process.
</details>
Figure 3: Levels solved as a function of environment steps for four methods: Random Agent, Random + Frame Segmentation, Random + Segmentation + New-Action Prioritization, and the full Graph Exploration method. The x-axis is logarithmic; each line shows the median over 5 runs and the shaded region shows the minimum–maximum range. Intermediate variants are shown up to 10,000 environment steps, while the Graph Explorer is plotted over the full evaluation budget.
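Curves of the kind shown in Figure 3 can be reconstructed from per-run solve data as a cumulative count of levels whose first-solve step falls within each budget. A minimal sketch, assuming a list of first-solve step counts with never-solved levels omitted (`levels_solved_curve` is a hypothetical helper, not from the released code):

```python
def levels_solved_curve(solve_steps, checkpoints):
    """Cumulative levels solved by each step budget.

    solve_steps: step count at which each level was first solved in a run
    (levels never solved within the budget are simply absent).
    checkpoints: step budgets at which to evaluate the curve, e.g.
    logarithmically spaced points as on Figure 3's x-axis.
    Hypothetical helper for illustration, not the paper's code.
    """
    return [sum(s <= t for s in solve_steps) for t in checkpoints]
```

For instance, a run that solved three levels at steps 9, 125, and 3200 yields `levels_solved_curve([9, 125, 3200], [10, 100, 1000, 10000]) == [1, 1, 2, 3]`; the figure's median line and min–max band then aggregate such curves over the 5 runs.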
Table 1: Per-level results on public games (ft09, ls20, vc33). For each game and level, we report the number of steps to solve the level (Stp), summarized as the median together with the minimum and maximum over 5 runs, and the solve rate (SR) over the same 5 runs. We use `NS` when a level is never solved within the step budget, and `-` when there is no such level for a given game.
| Level | ft09 Stp | ft09 SR | ls20 Stp | ls20 SR | vc33 Stp | vc33 SR |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 125 $[48;340]$ | 1 | 124 $[72;140]$ | 1 | 9 $[5;24]$ | 1 |
| 2 | 177 $[5;433]$ | 1 | $3.2\times 10^{3}$ $[1.9\times 10^{3};4.9\times 10^{3}]$ | 1 | 7 $[4;19]$ | 1 |
| 3 | $2.0\times 10^{4}$ $[3.0\times 10^{3};2.5\times 10^{4}]$ | 1 | NS | 0 | 36 $[9;96]$ | 1 |
| 4 | NS | 0 | NS | 0 | 321 $[298;541]$ | 1 |
| 5 | NS | 0 | NS | 0 | 287 $[260;349]$ | 1 |
| 6 | NS | 0 | NS | 0 | $6.9\times 10^{4}$ $[5.4\times 10^{4};8.3\times 10^{4}]$ | 0.8 |
| 7 | NS | 0 | NS | 0 | $4.7\times 10^{3}$ $[1.5\times 10^{3};5.5\times 10^{3}]$ | 0.8 |
| 8 | NS | 0 | NS | 0 | 917 $[627;929]$ | 0.8 |
| 9 | NS | 0 | - | - | NS | 0 |
| 10 | NS | 0 | - | - | - | - |
Table 2: Per-level results on private games (sp80, lp85, as66). Conventions as in Table 1.
| Level | sp80 Stp | sp80 SR | lp85 Stp | lp85 SR | as66 Stp | as66 SR |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 227 $[153;373]$ | 1 | 143 $[106;181]$ | 1 | 39 $[13;47]$ | 1 |
| 2 | $3.6\times 10^{4}$ $[2.5\times 10^{4};5.0\times 10^{4}]$ | 1 | $2.9\times 10^{3}$ $[1.1\times 10^{3};3.2\times 10^{4}]$ | 1 | 44 $[24;65]$ | 1 |
| 3 | $3.9\times 10^{4}$ $[3.6\times 10^{4};4.2\times 10^{4}]$ | 0.4 | $1.7\times 10^{4}$ $[1.0\times 10^{4};8.2\times 10^{4}]$ | 1 | 123 $[25;339]$ | 1 |
| 4 | NS | 0 | $1.6\times 10^{3}$ $[727;2.0\times 10^{4}]$ | 1 | 99 $[69;350]$ | 1 |
| 5 | NS | 0 | $4.6\times 10^{3}$ $[2.2\times 10^{3};1.4\times 10^{4}]$ | 0.8 | $2.2\times 10^{3}$ $[1.2\times 10^{3};2.9\times 10^{3}]$ | 1 |
| 6 | NS | 0 | $1.3\times 10^{4}$ $[1.1\times 10^{4};1.5\times 10^{4}]$ | 0.4 | $1.3\times 10^{3}$ $[112;1.6\times 10^{3}]$ | 1 |
| 7 | NS | 0 | 334.5 $[104;565]$ | 0.4 | 363 $[128;670]$ | 1 |
| 8 | NS | 0 | $9.9\times 10^{3}$ $[9.9\times 10^{3};9.9\times 10^{3}]$ | 0.2 | $1.3\times 10^{3}$ $[168;2.9\times 10^{3}]$ | 1 |
| 9 | - | - | - | - | $3.4\times 10^{3}$ $[361;8.7\times 10^{3}]$ | 1 |
| 10 | - | - | - | - | - | - |
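Each table cell (median steps with min–max range, plus solve rate over 5 runs) reduces to a few lines of aggregation. A minimal sketch, assuming runs that fail within the step budget are recorded as `None` (`summarize_level` is a hypothetical helper, not from the paper's repository):

```python
from statistics import median

def summarize_level(steps_per_run):
    """Summarize one (game, level) cell as in the per-level tables.

    steps_per_run: steps to solve per run, or None when the run never
    solved the level within the step budget (shown as 'NS' in the tables).
    Hypothetical helper for illustration, not the paper's code.
    """
    solved = [s for s in steps_per_run if s is not None]
    sr = len(solved) / len(steps_per_run)  # solve rate over runs
    if not solved:
        return "NS", sr                    # never solved within budget
    return (median(solved), min(solved), max(solved)), sr
```

For example, runs of `[48, 90, 125, 200, 340]` steps summarize to `((125, 48, 340), 1.0)`, i.e. a cell of the form 125 $[48;340]$ with SR 1; with an even number of solved runs `statistics.median` averages the middle pair, which is how fractional medians such as 334.5 arise.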