# Graph-Based Exploration for ARC-AGI-3 Interactive Reasoning Tasks
**Authors**: Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley
## Abstract
We present a training-free graph-based approach for solving interactive reasoning tasks in the ARC-AGI-3 benchmark. ARC-AGI-3 comprises game-like tasks where agents must infer task mechanics through limited interactions, and adapt to increasing complexity as levels progress. Success requires forming hypotheses, testing them, and tracking discovered mechanics. The benchmark has revealed that state-of-the-art LLMs are currently incapable of reliably solving these tasks. Our method combines vision-based frame processing with systematic state-space exploration using graph-structured representations. It segments visual frames into meaningful components, prioritizes actions based on visual salience, and maintains a directed graph of explored states and transitions. By tracking visited states and tested actions, the agent prioritizes actions that provide the shortest path to untested state-action pairs. On the ARC-AGI-3 Preview Challenge, this structured exploration strategy solves a median of 30 out of 52 levels across six games and ranks 3rd on the private leaderboard, substantially outperforming frontier LLM-based agents. These results demonstrate that explicit graph-structured exploration, even without learning, can serve as a strong baseline for interactive reasoning and underscore the importance of systematic state tracking and action prioritization in sparse-feedback environments where current LLMs fail to capture task dynamics. The code is open source and available at https://github.com/dolphin-in-a-coma/arc-agi-3-just-explore.
## Introduction
Introduced in 2019, the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) has become a fundamental benchmark for evaluating general fluid intelligence in artificial systems by posing novel tasks that require minimal prior knowledge (chollet_measure_2019). While the original ARC-AGI benchmarks focused on static grid-based reasoning tasks, ARC-AGI-3 represents a paradigm shift toward Interactive Reasoning Benchmarks (IRBs) that test broader capabilities, including on-the-fly learning, exploration, and memory through game-like environments where agents must perceive, plan, and act across multiple steps to achieve long-horizon goals (arc-agi-3).
ARC-AGI-3 introduces novel game environments designed to test the skill-acquisition efficiency of artificial systems, where agents interact with game environments without instructions, and must discover mechanics through exploration. Early results reveal a stark performance gap: frontier AI models scored 0% while human participants achieved 100% on the initial preview tasks (arc-agi-3). This dramatic disparity underscores fundamental limitations in current AI approaches to interactive reasoning and adaptive learning in novel environments.
The challenge of learning from sparse rewards has been central to reinforcement learning (RL) for decades. When rewards are rare and precise action sequences are required, random exploration fails to discover optimal policies. Exploration strategies have emerged to address this challenge. Curiosity-driven methods use prediction error as intrinsic motivation (pathak_curiosity-driven_2017), enabling agents to explore complex environments like Super Mario Bros without extrinsic rewards (pathak_curiosity-driven_2017). Go-Explore advances systematic exploration by maintaining archives of discovered states and decomposing exploration into phases: return to promising states, then explore from them (ecoffet_first_2021). This approach achieved breakthrough performance on Montezuma’s Revenge, scoring 25% higher than the human expert. For goal-conditioned tasks, Hindsight Experience Replay (HER) learns from failure by relabeling unsuccessful attempts as alternative goals, achieving sample-efficient learning without reward engineering (andrychowicz_hindsight_2017).
Model-based approaches have demonstrated remarkable sample efficiency by learning environment dynamics. MuZero combined learned latent dynamics with tree search, achieving superhuman performance on board games and Atari benchmarks without knowledge of game rules (schrittwieser_mastering_2020). EfficientZero extended this with self-supervised consistency losses, becoming the first algorithm to reach superhuman levels on Atari (194.3% mean human) with just 100k training samples (two hours of real-time experience) per game (ye_mastering_2021). BBF further improved Atari 100k results by scaling the value network while preserving sample efficiency, all in a completely model-free manner (schwarzer_bigger_2023).
The family of Dreamer models (hafner_training_2025) takes an alternative approach, learning world models in latent space and training policies through imagined rollouts rather than via tree search, mastering over 150 diverse tasks from Atari to Minecraft with a single configuration (hafner_training_2025). Most recently, Axiom introduced object-centric world models that learn compositional representations by discovering and tracking entities, achieving competitive performance within minutes by targeting 10k-step solutions per environment (heins_axiom_2025).
Despite these advances, current approaches face fundamental limitations for few-shot discovery tasks like ARC-AGI-3. The benchmark provides only a single sparse reward signal, level completion, across no more than 10 levels per game. This scarcity of feedback severely constrains learning-based methods. The challenge is compounded by the fact that each level introduces new mechanics while retaining previous ones, creating a shifting distribution that prevents straightforward transfer learning. Curiosity-driven exploration offers no guarantee of correlation with task progress in truly novel environments where the notion of "most interesting states" may be orthogonal to goal-relevance. Sample-efficient approaches like Axiom assume object-centric compositional structure and require environments to exhibit consistent physical dynamics, assumptions that may not hold across ARC-AGI-3's abstract and diverse game mechanics.
ARC-AGI-3 is also relevant for understanding the behaviour of large language model (LLM) agents. Unlike static reasoning benchmarks, it requires agents to infer latent rules through interaction, maintain an evolving notion of state, and design multi-step probes under sparse feedback, making it a complementary testbed for studying how explicit structure and exploration strategies can support LLM-based reasoning.
In this work, we present a graph-based exploration method that combines systematic state-space tracking with visual priority heuristics to tackle ARC-AGI-3’s interactive reasoning challenges. Our approach maintains a directed graph representation of explored states and action transitions, prioritizing actions based on visual salience while ensuring comprehensive exploration through frontier-driven navigation. Unlike learning-based approaches that require extensive training, our method operates as a strong baseline that can make progress through structured exploration alone. We demonstrate that this approach achieves competitive performance on the ARC-AGI-3 benchmark, significantly outperforming state-of-the-art LLMs while providing insights into the nature of exploration required for interactive reasoning tasks.
## ARC-AGI-3
### Benchmark Overview
ARC-AGI-3 represents a significant evolution from the original ARC challenge, shifting from static grid-based reasoning to interactive game environments that test an agent’s ability to learn through exploration (ying_assessing_2025). The benchmark consists of 6 novel game environments, with 3 public games (ft09, ls20, vc33) released for development and 3 private games (sp80, lp85, as66) used to determine final leaderboard rankings. Each game contains between 8 and 10 levels, with each subsequent level introducing new mechanics. Figure 2 in the appendix shows example screenshots from the games.
The benchmark’s evaluation criterion prioritizes both effectiveness and efficiency: agents are scored based on the number of levels completed, with the total number of actions required serving as a tiebreaker. This dual objective encourages solutions that not only discover winning strategies but do so with minimal exploration. For the final evaluation experiments by ARC-AGI-3 organizers, each run was capped at 8 hours of wall-clock time and 10 environment steps per second (sps), shared across the three private games. Under these limits, a single game can receive at most 96,000 steps (a third of the 8 h × 3,600 s/h × 10 sps = 288,000-step total).
### Observation and Action Spaces
#### Visual Observations.
Agents receive visual observations as 64 $\times$ 64 pixel RGB frames with a discrete palette of 16 colors. Each frame contains both the game environment and a status bar displaying the number of steps remaining before an automatic level restart. When the step counter reaches zero, the current level resets to its initial state. In the majority of games, the number of levels passed is also displayed.
#### Action Spaces.
The benchmark features three control schemes. Games such as ls20 use arrow-based control with directional keyboard inputs (up, down, left, right), yielding an action space of size $|\mathcal{A}|=4$ . Games such as ft09, vc33, and lp85 employ click-based control, enabling spatial interaction by allowing the agent to click any pixel location in the frame, yielding an action space of size $|\mathcal{A}|=64\times 64=4{,}096$ . Private games (sp80 and as66) introduce combined control schemes that integrate both arrow and click inputs, resulting in action spaces of size $|\mathcal{A}|=4{,}100$ .
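As a quick sanity check, the three cardinalities follow directly from the frame geometry (a small illustrative computation, not code from the released agent):

```python
# Action-space cardinality for the three ARC-AGI-3 control schemes.
GRID_SIZE = 64  # frames are 64x64 pixels

ARROW_ACTIONS = 4                       # up, down, left, right
CLICK_ACTIONS = GRID_SIZE * GRID_SIZE   # one click target per pixel
COMBINED_ACTIONS = ARROW_ACTIONS + CLICK_ACTIONS

print(ARROW_ACTIONS)     # 4     -- arrow-based games such as ls20
print(CLICK_ACTIONS)     # 4096  -- click-based games such as ft09, vc33, lp85
print(COMBINED_ACTIONS)  # 4100  -- combined-control games such as sp80, as66
```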
The dramatic difference in action space cardinality between control schemes poses a fundamental challenge: click-based games present over 1,000 times more possible actions at each state than arrow-based games, making exhaustive exploration intractable without intelligent action selection.
### Task Structure and Mechanics
Each game in ARC-AGI-3 embodies a distinct set of mechanics and objectives that agents must discover through interaction. The only feedback signal is level completion: the environment advances to the next level when the agent satisfies the (initially unknown) winning conditions, or resets to the beginning when the step limit expires.
Within each game, levels progressively add new elements while retaining earlier ones. For example, level 1 of ls20 requires basic movement and the use of the transformer object to activate the exit door by adjusting the shape of a key, level 2 adds energy pellets that refill the number of steps remaining, level 3 introduces a color dimension to the key, and so forth, up to level 8, where the agent must cope with only partial observations. This progressive structure mirrors how humans naturally acquire skills in games, but poses challenges for algorithms: knowledge transfer between levels could accelerate learning, but the levels are related only at a highly abstract level.
The released games operate deterministically: the same action taken from the same state always produces the same outcome. This property enables systematic state-space exploration strategies and graph-based representations of explored states. However, determinism does not imply simplicity; the complexity arises from the large state and action spaces and the lack of prior knowledge about which actions lead toward goal states.
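Determinism makes transitions memoizable: once a (state, action) pair has been resolved, its outcome never needs to be re-queried. A minimal sketch, assuming a hypothetical `env.step(state_hash, action)` interface rather than the benchmark's actual API:

```python
class TransitionCache:
    """Because the games are deterministic, each (state, action) pair
    resolves to exactly one outcome, which can be replayed from memory.
    Illustrative sketch with a hypothetical environment interface."""

    def __init__(self, env):
        self.env = env
        self.table = {}  # (state_hash, action) -> next_state_hash

    def step(self, state_hash, action):
        key = (state_hash, action)
        if key not in self.table:
            # First visit: query the environment and remember the result.
            self.table[key] = self.env.step(state_hash, action)
        return self.table[key]
```

In a real agent the environment cannot be teleported to an arbitrary state, so the cache would be consulted while replaying action sequences from the level start; the dictionary above captures only the core idea.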
## Methods
Our approach comprises two primary components: a Frame Processor for extracting key visual features and a Level Graph Explorer for systematic state-space exploration.
### Frame Processor
The Frame Processor reduces irrelevant visual variability and directs exploration toward actionable regions of the game environment through the following operations:
#### Image Segmentation.
Each frame is segmented into single-color connected components, establishing the foundation for identifying distinct visual elements that may constitute interactive objects.
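The single-color connected-component pass can be sketched as a breadth-first flood fill over the color grid (an illustrative implementation, not the repository's code):

```python
from collections import deque

def segment_frame(frame):
    """Split a frame (2-D list of color indices) into single-color
    4-connected components. Returns a list of components, each a list
    of (row, col) coordinates."""
    h, w = len(frame), len(frame[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            color = frame[r][c]
            queue, comp = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # BFS flood fill over same-colored neighbors
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and frame[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(comp)
    return components
```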
#### Status Bar Detection and Masking.
To prevent conflation of environment features with user interface components, the processor identifies and masks probable status bars. This preprocessing substantially reduces the number of recognized states.
#### Priority-Based Action Grouping.
For click-controlled games, visual segments are stratified into five priority tiers based on their likelihood of representing interactive buttons or objects. Prioritization is determined by segment size, morphological features, and color salience. The lowest priority tier encompasses segments identified as probable status bars, ensuring their exploration only after exhausting higher-priority alternatives.
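A tiering heuristic of this kind might look as follows; the cutoffs by relative segment size are illustrative placeholders, since the exact thresholds and morphological tests are not specified here:

```python
def priority_tier(segment, frame_area, is_status_bar=False):
    """Assign a click-priority tier (1 = highest, 5 = lowest) to a
    visual segment, given as a list of (row, col) pixel coordinates.
    Thresholds are illustrative placeholders, not the paper's values."""
    if is_status_bar:
        return 5                      # probable UI element: explore last
    frac = len(segment) / frame_area  # fraction of the frame covered
    if frac < 0.0005:                 # tiny specks: likely noise
        return 4
    if frac < 0.01:                   # small, button-sized blobs
        return 1
    if frac < 0.10:                   # medium-sized objects
        return 2
    return 3                          # large background-like regions
```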
#### State Hashing.
The processor generates a hash representation of the masked image, serving as a unique identifier for the current game state. This hash facilitates efficient state tracking and duplicate detection during graph exploration.
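The hashing step can be sketched as follows, assuming a boolean mask produced by the status-bar detector; the function name and data layout are hypothetical:

```python
import hashlib

def state_hash(frame, mask):
    """Hash a frame after zeroing out masked (status-bar) pixels,
    yielding a stable identifier for the game state. `frame` is a
    2-D list of color indices; `mask[r][c]` is True for ignored pixels."""
    flat = bytes(
        0 if mask[r][c] else frame[r][c] + 1  # shift colors so 0 = masked
        for r in range(len(frame))
        for c in range(len(frame[0]))
    )
    return hashlib.sha256(flat).hexdigest()
```

Two frames that differ only inside the masked region hash identically, which is exactly the duplicate detection the graph explorer relies on.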
### Level Graph Explorer
The Level Graph Explorer maintains a directed graph representation of the explored state space, where nodes correspond to unique game states and edges encode action-induced state transitions.
#### Graph Structure.
For each discovered state (graph node), the explorer maintains:
- The action space $\mathcal{A}$ (identifiers of connected components for spatial-interaction games such as ft09/vc33; keyboard inputs for games such as ls20)
- For each action $a\in\mathcal{A}$ : priority level $\pi(a)$ , exploration status, transition outcome, successor state, and minimal distance to the nearest unexplored frontier
#### Action Selection Strategy.
The explorer implements a hierarchical action selection policy that progressively expands the search space, as shown in Algorithm 1.
```
Algorithm 1: Hierarchical Action Selection
Require: current state s, priority threshold p

if ∃ untested actions with priority π(a) ≤ p in state s then
    select uniformly at random an untested action a with π(a) ≤ p from s
    execute the action and update the graph with the observed transition
else if ∃ reachable state s′ with untested actions where π(a) ≤ p then
    select the action minimizing the distance to a reachable state s′
        with untested actions at priority ≤ p
    execute the selected action
else
    increment the priority threshold: p ← p + 1
    recurse from the current state s with the updated priority p
end if
```
This policy ensures systematic exploration of high-salience actions prior to considering lower-priority alternatives, thereby focusing computational resources on likely-relevant state-action pairs.
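Algorithm 1 can be sketched over a toy graph representation; the `graph[s][a] = (priority, successor)` layout (with `None` marking an untested action) and the function name are assumptions for illustration, not the repository's actual API:

```python
import random
from collections import deque

def select_action(graph, state, p, max_priority=5, rng=random):
    """One step of the hierarchical policy: graph[s] maps
    action -> (priority, successor-or-None). Returns (action, p),
    or (None, p) if nothing remains untested at any priority."""
    while p <= max_priority:
        # Case 1: an untested action at priority <= p exists in this state.
        local = [a for a, (pr, succ) in graph[state].items()
                 if succ is None and pr <= p]
        if local:
            return rng.choice(local), p
        # Case 2: BFS toward the nearest state holding such an action.
        parent = {state: None}
        queue = deque([state])
        while queue:
            s = queue.popleft()
            if s != state and any(succ is None and pr <= p
                                  for pr, succ in graph[s].values()):
                while parent[s] != state:   # walk back to the first hop
                    s = parent[s]
                for a, (pr, succ) in graph[state].items():
                    if succ == s:           # action leading toward the target
                        return a, p
            for a, (pr, succ) in graph[s].items():
                if succ is not None and succ in graph and succ not in parent:
                    parent[succ] = s
                    queue.append(succ)
        # Case 3: nothing reachable at this threshold; relax it.
        p += 1
    return None, p
```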
#### Frontier Management.
The explorer maintains shortest-path distances from each explored state to frontier states, i.e., states containing untested actions. These distances guide traversal toward unexplored regions.
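These distances can be maintained with a multi-source breadth-first search over reversed edges, seeded at every frontier state; an illustrative sketch with an assumed `graph[s][a] = successor-or-None` layout:

```python
from collections import deque

def frontier_distances(graph):
    """Shortest-path distance from every state to the nearest frontier
    state (one with at least one untested action). graph[s] maps
    action -> successor hash, with None meaning untested."""
    # Reverse the tested edges so BFS can expand from frontiers inward.
    rev = {s: [] for s in graph}
    for s, actions in graph.items():
        for succ in actions.values():
            if succ is not None:
                rev.setdefault(succ, []).append(s)
    # Seed the BFS with all frontier states at distance 0.
    dist = {s: 0 for s, actions in graph.items()
            if any(succ is None for succ in actions.values())}
    queue = deque(dist)
    while queue:
        s = queue.popleft()
        for pred in rev.get(s, []):
            if pred not in dist:
                dist[pred] = dist[s] + 1
                queue.append(pred)
    return dist
```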
## Baselines
We evaluate our approach against two baseline methods to demonstrate the effectiveness of structured exploration.
#### Random Agent.
A simple baseline that selects actions uniformly at random from the available action space at each step. This baseline provides a lower bound on performance and demonstrates the difficulty of solving tasks through undirected exploration alone.
#### LLM+DSL.
We compare against the best-performing LLM-based solution on the leaderboard (fluxon_arc_2025), which combines GPT-4.1 with domain-specific language (DSL) programming. The approach observes game frames and generates Python code to interact with the environment, attempting to discover game mechanics through programmatic reasoning. Despite using a frontier LLM, this approach demonstrates the current limitations of LLM-based methods for interactive reasoning tasks.
Because each environment step is gated by an LLM call, it is severely interaction-limited: within the evaluation budget, it produces only about 4,000 interactions per game, compared to the 96,000 steps that are in principle allowed. To avoid high LLM usage costs, we do not re-run this baseline; instead, we report the results from its official evaluation on the private games, with the limitation that only a single aggregate score is available and no results are reported on the public games.
## Results
We evaluated our graph-based exploration method on all six ARC-AGI-3 games. Figure 1 reports an incremental component-addition analysis: starting from a random agent, we cumulatively add components and measure the total levels solved across games; the LLM+DSL baseline is included for comparison. Here, to ensure a fair comparison with the LLM-based baseline, all methods are capped at 4,000 interactions per game. All non-LLM configurations report the median over 5 runs, whereas the LLM+DSL baseline is shown as a single result taken from the official challenge evaluation.
![Figure 1: solved levels per configuration](x1.png)
Figure 1: Effect of progressively adding method components to a random agent, compared with the LLM+DSL baseline. For each configuration, the stacked bar above the horizontal axis shows the total number of solved levels across the three private games, and the stacked bar below shows the total across the three public games. Colors indicate how many levels are solved in each individual game. The rightmost bars correspond to the full method. All non-LLM configurations report the median over 5 runs, whereas the LLM+DSL baseline is shown as a single result taken from the official challenge evaluation.
The random agent and LLM+DSL baseline solve 6 and 5 levels on the private games, respectively, meaning that the LLM-based method underperforms even a random policy. The random agent also solves 3 levels across the public games.
Adding frame segmentation to random exploration slightly increases performance on the private games, making it possible to solve one level of lp85. It also significantly improves performance on the public games, solving 5 levels on vc33 and 2 levels on ft09.
When untested actions are favored in each state without full state-graph exploration, performance on as66 slightly decreases, and the method solves only 4 of its levels.
Our complete approach solves 19 levels with an interaction limit of 4,000: 2 on ft09, 2 on ls20, 5 on vc33, 1 on sp80, 2 on lp85, and 7 on as66.
In a full 8-hour run, across 5 independent runs, our method solves a median of 16 levels on the private games and 14 levels on the public games (see Figure 3 in the appendix). Per-level performance is reported in Tables 1 and 2 in the appendix.
On the official ARC-AGI-3 challenge evaluation, the submitted model solves 12 levels on the private games while still ranking 3rd by the number of solved levels. This discrepancy is due to an implementation bug in how reset-inducing actions are handled (see Discussion).
## Discussion
Our graph-based exploration method demonstrates that structured state-space navigation with visual prioritization significantly outperforms both random exploration and frontier LLMs with access to code writing and execution on ARC-AGI-3.
#### Performance Analysis.
The method excelled on games where visual salience aligned with interactive elements (vc33, as66). Performance degraded on games with extremely large state spaces (ft09 levels 6+, ls20 levels 3+), where exhaustive exploration becomes computationally intractable. The improvement over LLM+DSL baselines suggests that structured exploration provides a more reliable foundation for interactive reasoning than pure language-model-based approaches, which struggle to form and test hypotheses systematically.
The discrepancy between the official ARC-AGI-3 evaluation and our re-runs is due to an implementation bug in the handling of reset events. Actions that triggered a reset were not marked as tested in the game graph. Consequently, when such a state–action pair was the nearest remaining untested edge in the graph from the starting node, the agent repeatedly selected it, resetting the game and effectively entering a loop.
#### Limitations.
The method faces two fundamental constraints. First, computational requirements grow linearly with state space size, limiting scalability to levels with moderate complexity. Second, the approach assumes deterministic, fully observable environments and would fail under stochasticity or partial observability.
#### Future Directions.
While the first-place solution on the leaderboard (smit_driessmitarc3-solution_2025) achieved superior performance with a learning-based approach, it did not incorporate structured exploration strategies. A natural next step is to integrate our graph-based exploration framework with adaptive learning algorithms. Such hybrid approaches could leverage graph representations to guide model training and action selection, while learned world models or policies could improve sample efficiency through generalization. The key challenge remains the sparse reward signal and limited training data, making it essential to develop methods that can effectively transfer knowledge across levels while maintaining systematic exploration coverage.
## Appendix A: ARC-AGI-3 Games
|
<details>
<summary>Figures/vc33.png Details</summary>

### Visual Description
## Bar Chart: Unlabeled Value Comparison
### Overview
The image depicts a bar chart with two vertical bars on a grid background. The chart lacks a title but includes axis labels, a legend, and color-coded bars. The left bar is taller and entirely black, while the right bar is shorter with a green section in its midsection. A legend at the top maps colors to categories, though not all colors are used in the bars.
### Components/Axes
- **X-Axis**:
- Labels: "Category A" (left) and "Category B" (right).
- Position: Bottom of the chart, centered below each bar.
- **Y-Axis**:
- Label: "Value" (vertical text on the left edge).
- Scale: Implicit grid lines (no numerical ticks visible).
- **Legend**:
- Position: Top of the chart, above the bars.
- Colors/Labels:
- Green: "Subcategory X"
- Yellow: "Subcategory Y"
- White: "Subcategory Z"
- Black: "Primary Category"
- Note: Only green and black are used in the bars; yellow and white are unused.
### Detailed Analysis
1. **Left Bar (Category A)**:
- Color: Black (matches "Primary Category" in the legend).
- Height: Approximately 80% of the chart’s vertical range.
- Structure: Solid black with no subdivisions.
2. **Right Bar (Category B)**:
- Color: Black (bottom 70% of the bar) with a green section (top 30%).
- Height: Approximately 50% of the chart’s vertical range.
- Structure: Split into two segments:
- Black segment: 70% of the bar’s height.
- Green segment: 30% of the bar’s height (matches "Subcategory X" in the legend).
3. **Grid and Background**:
- Grid lines: Light gray, forming a 10x10 grid.
- Background: White, with a gray header at the top.
### Key Observations
- **Value Disparity**: Category A’s value is ~1.6x higher than Category B’s total value (80% vs. 50%).
- **Subcategory Representation**: The green section in Category B suggests a partial contribution from "Subcategory X" to its total value.
- **Legend Mismatch**: The legend includes unused colors (yellow, white), implying potential missing data or categories.
### Interpretation
The chart compares two categories (A and B) with distinct value distributions. Category A’s dominance (80% vs. 50%) suggests a significant difference in their primary metric. The green segment in Category B indicates that "Subcategory X" contributes ~30% of its total value, while the remaining 70% is attributed to the "Primary Category." The unused legend colors (yellow, white) may hint at incomplete data or a design oversight. The absence of a title and numerical scale limits precise interpretation, but the visual hierarchy emphasizes Category A’s superiority.
</details>
|
<details>
<summary>Figures/ls20.png Details</summary>

### Visual Description
## Screenshot: Retro-Style Maze Game Interface
### Overview
The image depicts a pixelated, grid-based maze environment characteristic of early arcade games (e.g., Pac-Man-inspired). The layout includes colored blocks, pathways, and game elements arranged in a structured grid. No textual labels, legends, or axis titles are visible.
### Components/Axes
- **Grid Structure**: A dark gray grid overlays the entire screen, dividing the maze into discrete cells.
- **Color-Coded Elements**:
- **Purple Squares**: Scattered throughout the maze (e.g., top-left, center, bottom-right).
- **Red Squares**: Clustered in the top-right corner.
- **Blue Square**: Located near the top-center, adjacent to a black block.
- **Orange Squares**: Positioned in the bottom-center, forming a small cluster.
- **White Squares**: Found in the bottom-left corner, adjacent to a blue square.
- **Green Rectangles**: Two horizontal bars at the bottom of the screen.
- **Player/Character**: A small white block with a blue square atop it, positioned near the bottom-left.
### Detailed Analysis
- **Grid Dimensions**: The maze spans approximately 15x15 grid cells (estimated based on visible rows/columns).
- **Pathways**: Black blocks form walls, creating a labyrinthine structure. Open paths are represented by gray grid cells.
- **Color Distribution**:
- Purple squares: 5 total (top-left, center, bottom-right, and two others).
- Red squares: 3 in the top-right cluster.
- Blue square: 1 near the top-center.
- Orange squares: 3 in the bottom-center.
- White squares: 2 in the bottom-left.
- Green rectangles: 2 at the bottom edge.
### Key Observations
- **Objective Indicators**: Purple squares may represent collectibles or objectives, common in maze games.
- **Hazards/Enemies**: Red squares could denote enemies or obstacles, given their clustering.
- **Player Position**: The white/blue block in the bottom-left suggests the player’s starting point or current location.
- **Green Rectangles**: Likely represent a status bar, score counter, or timer (common UI elements in retro games).
### Interpretation
The image represents a simplified maze game environment where the player (white/blue block) must navigate through pathways (gray cells) to collect purple squares while avoiding red squares (potential enemies). The green rectangles at the bottom likely track game metrics (e.g., score, lives). The absence of text suggests this is either a prototype, a minimalist design, or a screenshot from a game where UI elements are implied rather than explicitly labeled. The grid-based design emphasizes spatial reasoning and pathfinding, core mechanics in such games.
## No textual data (labels, legends, axis titles) is present in the image. All analysis is based on visual elements and positional relationships.
</details>
|
<details>
<summary>Figures/ft09.png Details</summary>

### Visual Description
## Heatmap: Color Distribution Grid
### Overview
The image depicts a 3x3 grid of colored squares (red, blue, gray) with a legend on the right. The grid is bordered in black, and the background is a uniform gray. The legend categorizes the colors with labels, and the grid contains a mix of solid and nested patterns.
### Components/Axes
- **Legend**:
- **Red**: Solid red squares.
- **Blue**: Solid blue squares.
- **Gray**: Nested gray squares (with smaller squares inside).
- **Grid Structure**:
- 3 rows × 3 columns = 9 cells.
- Each cell contains a single colored square (except the center cell, which has a nested pattern).
- **Background**: Uniform gray with a black border.
### Detailed Analysis
- **Color Distribution**:
- **Red**: 15 squares (7 in the first row, 5 in the second row, 3 in the third row).
- **Blue**: 5 squares (1 in the first row, 2 in the second row, 2 in the third row).
- **Gray**: 2 squares (both in the second row, center cell).
- **Pattern Details**:
- The center cell (row 2, column 2) contains a nested gray square with a smaller red square inside.
- The top-right cell (row 1, column 3) contains a nested gray square with a smaller blue square inside.
- **Legend Placement**: Positioned to the right of the grid, aligned vertically.
### Key Observations
1. **Dominance of Red**: Red squares occupy the majority of the grid (70% of total cells).
2. **Nested Patterns**: Only two cells (center and top-right) feature nested squares, suggesting a hierarchical or sub-category relationship.
3. **Color Symmetry**: Blue and gray squares are less frequent but strategically placed to balance the grid.
### Interpretation
The image likely represents a simplified data visualization where colors denote categories (e.g., status, priority, or type). The nested patterns in gray squares may indicate sub-categories or conditional states (e.g., "in progress" vs. "completed"). The high prevalence of red suggests it is the primary category, while blue and gray serve as secondary or tertiary classifications. The lack of textual labels or numerical values implies the image is a conceptual or abstract representation rather than a data-driven chart.
## Notes
- No textual labels, axis titles, or numerical data are present in the image.
- The grid’s structure and color distribution are the sole sources of information.
- The nested patterns in gray squares are the only deviation from the solid-color cells.
</details>
<details>
<summary>Figures/sp80.png Details</summary>

### Visual Description
## Screenshot: Breakout Game Interface
### Overview
The image depicts a minimalist, retro-style video game interface resembling the classic "Breakout" game. The screen features a grid-based layout with colored geometric shapes against an orange background. No textual labels, legends, or data points are present.
### Components/Axes
- **Background**: Solid orange grid pattern.
- **Top Section**:
- A thin horizontal green line spans the width of the screen.
- Centered above the grid:
- A small black square (1x1 grid unit).
- Directly below the black square: A pink square (1x1 grid unit).
- **Middle Section**:
- A horizontal red rectangle spanning ~1/3 of the screen width, positioned ~1/3 from the top.
- A horizontal blue rectangle spanning ~1/2 of the screen width, positioned ~2/3 from the top.
- **Bottom Section**:
- Three yellow U-shaped blocks (each composed of 3x2 grid units with a central void), spaced evenly across the bottom.
- A thin horizontal light blue line at the very bottom, matching the screen width.
### Detailed Analysis
- **No textual elements** (labels, axis titles, legends, or embedded text) are visible in the image.
- **Spatial grounding**:
- Green line: Top-center, above all other elements.
- Black and pink squares: Vertically aligned above the red rectangle.
- Red and blue rectangles: Horizontally centered, stacked vertically.
- Yellow U-shaped blocks: Horizontally aligned at the bottom, spaced ~1 grid unit apart.
- Light blue line: Bottom edge, grounding the yellow blocks.
### Key Observations
- The layout suggests a game environment where the player likely controls a paddle (blue rectangle) to deflect a ball (pink square) into the bricks (yellow blocks).
- The black square may represent a power-up or special item.
- No numerical values, scales, or data-driven elements are present.
### Interpretation
This image represents a static frame from a simplified Breakout-style game. The absence of text or data points indicates it is purely a visual representation of game mechanics rather than a data visualization. The colored shapes likely correspond to interactive elements (e.g., paddle, ball, bricks), but no quantitative analysis can be performed due to the lack of numerical or labeled data.
</details>
<details>
<summary>Figures/lp85.png Details</summary>

### Visual Description
## Screenshot: Retro Arcade Game Interface
### Overview
The image depicts a retro-style arcade game interface with a grid-based layout. The central focus is a symmetrical arrangement of colored blocks forming a pattern, a spaceship-like object at the bottom, and a score display at the top. The background is a dark gray grid, and the interface uses pixelated graphics typical of 1980s arcade games.
### Components/Axes
- **Grid**: A 10x10 matrix of squares (approximate) forming the play area.
- **Spaceship**: A red and green pixelated object positioned at the bottom center.
- **Blocks**: Colored squares (blue, purple, gray, yellow, orange, red, green) arranged in a symmetrical pattern.
- **Score Display**: Located in the top-left corner, showing "SCORE: 000000" and "LEVEL: 001".
- **Legend**: No explicit legend is present, but colors correspond to block types (e.g., red/green for the spaceship, blue/purple for blocks).
### Content Details
- **Score**: "000000" (no numerical value beyond the placeholder).
- **Level**: "001" (indicating the first level).
- **Block Pattern**: Symmetrical arrangement with alternating colors. The pattern includes:
- Blue, purple, and gray blocks forming a circular or spiral design.
- Yellow and orange blocks at the edges.
- Red and green blocks at the bottom (spaceship).
- **Spaceship**: Positioned centrally at the bottom, with a red and green color scheme.
### Key Observations
1. **Symmetry**: The block pattern is mirrored along the vertical axis, suggesting a puzzle or symmetry-based gameplay mechanic.
2. **Spaceship Placement**: The red/green spaceship is centered, likely the player-controlled object.
3. **Score/Level**: The score is at the starting value (000000), and the level is 001, indicating the game has just begun.
4. **Color Coding**: Blocks use distinct colors (blue, purple, gray, yellow, orange, red, green), but no legend explains their purpose.
### Interpretation
This image represents a classic arcade game setup, likely a puzzle or shooter game where the player interacts with the block pattern. The symmetrical design suggests the objective may involve matching, rotating, or destroying blocks. The spaceship’s position implies it may move horizontally to target blocks. The absence of a legend or instructions leaves the exact mechanics ambiguous, but the score and level indicators confirm progression tracking. The retro aesthetic and grid-based layout align with early arcade games like *Tetris* or *Breakout*, though the specific mechanics remain unclear without additional context.
</details>
<details>
<summary>Figures/as66.png Details</summary>

### Visual Description
## Pixel Art Composition: Abstract Geometric Design
### Overview
The image depicts a pixelated, abstract geometric design composed of colored blocks arranged in a grid-like structure. The composition features a layered border system and a central area dominated by purple, with scattered black, white, yellow, and red blocks. No textual elements, labels, or axis markers are present.
### Components/Axes
1. **Border System**:
- **Outer Border**: Orange (#FFA500) with a 1-pixel width.
- **Inner Border**: Gray (#808080) with a 2-pixel width, containing black (#000000) corner squares at each quadrant.
- **Vertical Green Stripe**: A 3-pixel-wide green (#00FF00) stripe on the left edge of the inner border.
- **Horizontal Gray Stripes**: Two 2-pixel-wide gray stripes at the top and bottom of the inner border.
2. **Central Grid**:
- **Background**: Solid purple (#800080) occupying ~80% of the image.
- **Block Elements**:
- **Black (#000000)**: Rectangular and square shapes scattered asymmetrically.
- **White (#FFFFFF)**: A single 2x2 block near the bottom center.
- **Yellow (#FFFF00)**: A 1x1 block near the bottom left.
- **Red (#FF0000)**: A 1x1 block near the top center.
### Detailed Analysis
- **Block Distribution**:
- Black blocks dominate the central area, forming irregular patterns.
- The white block is isolated at the bottom center, contrasting with the purple background.
- The yellow and red blocks are positioned near the edges of the central grid.
- **Symmetry**: No strict symmetry is observed; the design appears intentionally chaotic.
- **Color Contrast**: High contrast between the purple background and the bright accent colors (yellow, red, white).
### Key Observations
1. The green stripe on the left border may serve as a visual anchor or thematic element.
2. The absence of text or labels suggests the image is purely decorative or symbolic.
3. The red block near the top center could represent a focal point or intentional anomaly.
4. The white block’s placement at the bottom center creates a visual "balance" against the purple expanse.
### Interpretation
This image likely represents a digital art piece or a placeholder for a user interface element (e.g., a game tile, app icon, or abstract logo). The lack of textual information implies it is not data-driven but instead relies on visual symbolism. The scattered blocks might symbolize fragmentation, diversity, or modularity.
</details>
Figure 2: Top row (Public set): vc33, ls20, ft09. Bottom row (Private set): sp80, lp85, as66.
## Appendix B: Per-Level Performance Statistics
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Graph: Levels Solved vs. Steps for Different Agents
### Overview
The image is a line graph comparing the performance of four agents in solving levels over a logarithmic number of steps. The y-axis represents "Levels solved" (0–30), and the x-axis represents "Steps" (10¹ to 10⁵). Four agents are compared: Random Agent (blue), Frame Segmentation (orange), New-Action Prioritization (green), and Graph Exploration (red). Shaded regions around each line indicate variability in performance.
### Components/Axes
- **X-axis (Steps)**: Logarithmic scale from 10¹ to 10⁵.
- **Y-axis (Levels solved)**: Linear scale from 0 to 30.
- **Legend**: Located in the top-left corner, with colors matching the lines:
- Blue: Random Agent
- Orange: Frame Segmentation
- Green: New-Action Prioritization
- Red: Graph Exploration
### Detailed Analysis
1. **Graph Exploration (Red)**:
- Starts at ~2 levels at 10¹ steps.
- Reaches ~30 levels by 10⁵ steps.
- Steep upward trend with consistent growth.
- Shaded region widens significantly at higher steps, indicating increased variability.
2. **New-Action Prioritization (Green)**:
- Starts at ~1 level at 10¹ steps.
- Reaches ~18 levels by 10⁵ steps.
- Slower growth than Graph Exploration but steeper than Frame Segmentation.
- Shaded region is narrower, suggesting lower variability.
3. **Frame Segmentation (Orange)**:
- Starts at ~0.5 levels at 10¹ steps.
- Reaches ~15 levels by 10⁵ steps.
- Moderate growth rate, outperforming Random Agent but lagging behind New-Action Prioritization.
- Shaded region is moderately wide.
4. **Random Agent (Blue)**:
- Starts at ~0.1 levels at 10¹ steps.
- Reaches ~10 levels by 10⁵ steps.
- Slowest growth rate, with minimal improvement over steps.
- Shaded region is the narrowest, indicating low variability.
### Key Observations
- **Graph Exploration** dominates in performance, solving the most levels across all step ranges.
- **New-Action Prioritization** and **Frame Segmentation** show intermediate performance, with the former slightly outperforming the latter.
- **Random Agent** performs the worst, with negligible improvement despite the logarithmic increase in steps.
- The shaded regions suggest that Graph Exploration has the highest variability in performance, while Random Agent is the most consistent (but least effective).
### Interpretation
The data demonstrates that **Graph Exploration** is the most effective agent, achieving near-maximal level completion (30 levels) by 10⁵ steps. Its steep growth curve and high final performance suggest it efficiently leverages step increments to solve levels. In contrast, the Random Agent’s near-linear growth indicates random or suboptimal decision-making. The shaded regions imply that Graph Exploration’s performance is less predictable at scale, possibly due to complex environmental interactions or exploration strategies. New-Action Prioritization and Frame Segmentation offer middle-ground solutions, balancing performance and consistency. This graph highlights the importance of structured exploration (Graph Exploration) over random or segmented approaches in level-solving tasks.
</details>
Figure 3: Levels solved as a function of environment steps for four methods: Random Agent, Random + Frame Segmentation, Random + Segmentation + New-Action Prioritization, and the full Graph Exploration method. The x-axis is logarithmic; each line shows the median over 5 runs and the shaded region shows the minimum–maximum range. Intermediate variants are shown up to 10,000 environment steps, while the Graph Explorer is plotted over the full evaluation budget.
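The aggregation used in this figure (median line over 5 runs, min–max band, logarithmic step axis) can be sketched as follows. The per-run trajectories below are synthetic placeholders, not the paper's data; only the method names and axis conventions come from the caption.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
steps = np.logspace(1, 5, 50)  # 10^1 .. 10^5, matching the log-scaled x-axis

fig, ax = plt.subplots()
for name in ["Random Agent", "Frame Segmentation",
             "New-Action Prioritization", "Graph Exploration"]:
    # 5 placeholder runs: sorting makes each run's level count non-decreasing
    runs = np.sort(rng.integers(0, 31, size=(5, steps.size)), axis=1)
    med = np.median(runs, axis=0)
    ax.plot(steps, med, label=name)  # median over 5 runs
    ax.fill_between(steps, runs.min(axis=0), runs.max(axis=0),
                    alpha=0.2)       # min-max range as shaded band
ax.set_xscale("log")
ax.set_xlabel("Steps")
ax.set_ylabel("Levels solved")
ax.legend()
fig.savefig("levels_vs_steps.png")
```

By construction the median lies inside the shaded band at every step, which is why the bands in the figure always contain their lines.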
Table 1: Per-level results on public games (ft09, ls20, vc33). For each game and level, we report the number of steps to solve the level (Stp), summarized as the median together with the minimum and maximum over 5 runs, and the solve rate (SR) over the same 5 runs. We write "NS" when a level is never solved within the step budget, and "-" when the game has no such level.
| Level | ft09 Stp | ft09 SR | ls20 Stp | ls20 SR | vc33 Stp | vc33 SR |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 125 $[48;340]$ | 1 | 124 $[72;140]$ | 1 | 9 $[5;24]$ | 1 |
| 2 | 177 $[5;433]$ | 1 | $3.2\times 10^{3}$ $[1.9\times 10^{3};4.9\times 10^{3}]$ | 1 | 7 $[4;19]$ | 1 |
| 3 | $2.0\times 10^{4}$ $[3.0\times 10^{3};2.5\times 10^{4}]$ | 1 | NS | 0 | 36 $[9;96]$ | 1 |
| 4 | NS | 0 | NS | 0 | 321 $[298;541]$ | 1 |
| 5 | NS | 0 | NS | 0 | 287 $[260;349]$ | 1 |
| 6 | NS | 0 | NS | 0 | $6.9\times 10^{4}$ $[5.4\times 10^{4};8.3\times 10^{4}]$ | 0.8 |
| 7 | NS | 0 | NS | 0 | $4.7\times 10^{3}$ $[1.5\times 10^{3};5.5\times 10^{3}]$ | 0.8 |
| 8 | NS | 0 | NS | 0 | 917 $[627;929]$ | 0.8 |
| 9 | NS | 0 | - | - | NS | 0 |
| 10 | NS | 0 | - | - | - | - |
Table 2: Per-level results on private games (sp80, lp85, as66). Conventions as in Table 1.
| Level | sp80 Stp | sp80 SR | lp85 Stp | lp85 SR | as66 Stp | as66 SR |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 227 $[153;373]$ | 1 | 143 $[106;181]$ | 1 | 39 $[13;47]$ | 1 |
| 2 | $3.6\times 10^{4}$ $[2.5\times 10^{4};5.0\times 10^{4}]$ | 1 | $2.9\times 10^{3}$ $[1.1\times 10^{3};3.2\times 10^{4}]$ | 1 | 44 $[24;65]$ | 1 |
| 3 | $3.9\times 10^{4}$ $[3.6\times 10^{4};4.2\times 10^{4}]$ | 0.4 | $1.7\times 10^{4}$ $[1.0\times 10^{4};8.2\times 10^{4}]$ | 1 | 123 $[25;339]$ | 1 |
| 4 | NS | 0 | $1.6\times 10^{3}$ $[727;2.0\times 10^{4}]$ | 1 | 99 $[69;350]$ | 1 |
| 5 | NS | 0 | $4.6\times 10^{3}$ $[2.2\times 10^{3};1.4\times 10^{4}]$ | 0.8 | $2.2\times 10^{3}$ $[1.2\times 10^{3};2.9\times 10^{3}]$ | 1 |
| 6 | NS | 0 | $1.3\times 10^{4}$ $[1.1\times 10^{4};1.5\times 10^{4}]$ | 0.4 | $1.3\times 10^{3}$ $[112;1.6\times 10^{3}]$ | 1 |
| 7 | NS | 0 | 334.5 $[104;565]$ | 0.4 | 363 $[128;670]$ | 1 |
| 8 | NS | 0 | $9.9\times 10^{3}$ $[9.9\times 10^{3};9.9\times 10^{3}]$ | 0.2 | $1.3\times 10^{3}$ $[168;2.9\times 10^{3}]$ | 1 |
| 9 | - | - | - | - | $3.4\times 10^{3}$ $[361;8.7\times 10^{3}]$ | 1 |
| 10 | - | - | - | - | - | - |
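The per-cell summaries in Tables 1 and 2 (median step count with min–max range, plus solve rate over 5 runs) follow a simple convention that can be reproduced with a small helper. The `summarize` function and the sample run data below are illustrative, not the authors' code.

```python
from statistics import median

def summarize(steps_per_run, budget):
    """Summarize one (game, level) cell over repeated runs.

    steps_per_run: steps taken in each run, or None if the level was
    never solved within the step budget.
    Returns a (summary string, solve rate) pair.
    """
    solved = [s for s in steps_per_run if s is not None and s <= budget]
    sr = len(solved) / len(steps_per_run)
    if not solved:
        return "NS", 0.0  # never solved within budget
    return f"{median(solved):g} [{min(solved):g};{max(solved):g}]", sr

# Example: 5 runs of one level, one of which never finished.
print(summarize([627, 917, 929, 850, None], budget=10**5))
# -> ('883.5 [627;929]', 0.8)
```

Note that when some runs fail, the median and range are taken over the successful runs only, while the solve rate counts failures against all 5 runs.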