# Experience-based Knowledge Correction for Robust Planning in Minecraft
Corresponding author: Jungseul Ok (jungseul@postech.ac.kr)
## Abstract
Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models. Project page: https://sjlee-me.github.io/XENON
## 1 Introduction
Large Language Model (LLM)-based planning has advanced the development of embodied AI agents that tackle long-horizon goals in complex, real-world-like environments (Szot et al., 2021; Fan et al., 2022). Among such environments, Minecraft has emerged as a representative testbed for evaluating planning capability, as it captures much of this complexity (Wang et al., 2023b; c; Zhu et al., 2023; Yuan et al., 2023; Feng et al., 2024; Li et al., 2024b). Success in these environments often depends on agents acquiring planning knowledge, including the dependencies among goal items and the valid actions needed to obtain them. For instance, to obtain an iron nugget, an agent must first possess an iron ingot, which can only be obtained through the smelt action.
However, LLMs often begin with flawed priors about these dependencies and actions. This issue is critical, since a lack of knowledge for a single goal can invalidate all subsequent plans that depend on it (Guss et al., 2019; Lin et al., 2021; Mao et al., 2022). We find several failure cases stemming from these flawed priors, a problem that is particularly pronounced for the lightweight LLMs suitable for practical embodied agents. First, an LLM often fails to predict planning knowledge accurately enough to generate a successful plan (Figure 1b), resulting in a complete halt in progress toward more challenging goals. Second, an LLM cannot robustly correct its flawed knowledge, even when prompted to self-correct with failure feedback (Shinn et al., 2023; Chen et al., 2024), often repeating the same errors (Figures 1c and 1d). To improve self-correction, one can employ more advanced techniques that leverage detailed reasons for failure (Zhang et al., 2024; Wang et al., 2023a). Nevertheless, LLMs often stubbornly adhere to their erroneous parametric knowledge (i.e., knowledge implicitly stored in model parameters), as evidenced by Stechly et al. (2024) and Du et al. (2024).
<details>
<summary>x3.png Details</summary>

Figure panels: (a) the true dependency graph; (b) the LLM-predicted graph, containing correct, missed, and redundant dependencies plus a hallucinated item; (c) LLM self-correction for dependencies and (d) for actions, where the LLM repeats its prior prediction and fails despite the correction prompt.
</details>
Figure 1: An LLM exhibits flawed planning knowledge and fails at self-correction. (b) The dependency graph predicted by Qwen2.5-VL-7B (Bai et al., 2025) contains multiple errors (e.g., missed dependencies, hallucinated items) compared to (a) the ground truth. (c, d) The LLM fails to correct its flawed knowledge about dependencies and actions from failure feedback, often repeating the same errors. See Appendix B for the full prompts and the LLM's self-correction examples.
In response, we propose XENON (eXpErience-based kNOwledge correctioN), an agent that robustly learns planning knowledge from only binary success/failure feedback. To this end, instead of relying on an LLM for correction, XENON algorithmically and directly revises its external knowledge memory using its own experience, which in turn guides its planning. XENON learns this planning knowledge through two synergistic components. The first component, Adaptive Dependency Graph (ADG), revises flawed dependency knowledge by leveraging successful experiences to propose plausible new required items. The second component, Failure-aware Action Memory (FAM), builds and corrects its action knowledge by exploring actions upon failures. In the challenging yet practical setting of using only binary feedback, FAM enables XENON to disambiguate the cause of a failure, distinguishing between flawed dependency knowledge and invalid actions, which in turn triggers a revision in ADG for the former.
Extensive experiments in three Minecraft testbeds show that XENON excels at both knowledge acquisition and planning. XENON outperforms prior agents in learning knowledge, showing unique robustness to LLM hallucinations and modified ground-truth environmental rules. Furthermore, with only a 7B LLM, XENON significantly outperforms prior agents that rely on much larger proprietary models like GPT-4 in solving diverse long-horizon goals. These results suggest that robust algorithmic knowledge management can be a promising direction for developing practical embodied agents with lightweight LLMs (Belcak et al., 2025).
Our contributions are as follows. First, we propose XENON, an LLM-based agent that robustly learns planning knowledge from experience via algorithmic knowledge correction, instead of relying on the LLM to self-correct its own knowledge. We realize this idea through two synergistic mechanisms that explicitly store planning knowledge and correct it: Adaptive Dependency Graph (ADG) for correcting dependency knowledge based on successes, and Failure-aware Action Memory (FAM) for correcting action knowledge and disambiguating failure causes. Second, extensive experiments demonstrate that XENON significantly outperforms prior state-of-the-art agents in both knowledge learning and long-horizon goal planning in Minecraft.
## 2 Related work
### 2.1 LLM-based planning in Minecraft
Prior work has often addressed LLMs' flawed planning knowledge in Minecraft using impractical methods. For example, such methods typically involve directly injecting knowledge through LLM fine-tuning (Zhao et al., 2023; Feng et al., 2024; Liu et al., 2025; Qin et al., 2024) or relying on curated expert data (Wang et al., 2023c; Zhu et al., 2023; Wang et al., 2023a).
Another line of work attempts to learn planning knowledge via interaction, by storing the experience of obtaining goal items in an external knowledge memory. However, these approaches are often limited by unrealistic assumptions or lack robust mechanisms to correct the LLM's flawed prior knowledge. For example, ADAM and Optimus-1 artificially simplify the challenge of predicting and learning dependencies via shortcuts like pre-supplied items, while also relying on expert data such as a learning curriculum (Yu and Lu, 2024) or the Minecraft wiki (Li et al., 2024b). They also lack a robust way to correct wrong action choices in a plan: ADAM has none, and Optimus-1 relies on unreliable LLM self-correction. The most similar work to ours, DECKARD (Nottingham et al., 2023), uses an LLM to predict item dependencies but does not revise its predictions for items that repeatedly fail, and when a plan fails, it cannot disambiguate whether the failure is due to incorrect dependencies or incorrect actions. In contrast, our work tackles the more practical challenge of learning planning knowledge and correcting flawed priors from only binary success/failure feedback.
### 2.2 LLM-based self-correction
LLM self-correction, i.e., having an LLM correct its own outputs, is a promising approach to overcome the limitations of flawed parametric knowledge. However, for complex tasks like planning, LLMs struggle to identify and correct their own errors without external feedback (Huang et al., 2024; Tyen et al., 2024). To improve self-correction, prior works fine-tune LLMs (Yang et al., 2025) or prompt LLMs to correct themselves using environmental feedback (Shinn et al., 2023) and tool-execution results (Gou et al., 2024). While we also use binary success/failure feedback, we directly correct the agent's knowledge in external memory by leveraging experience, rather than fine-tuning the LLM or prompting it to self-correct.
## 3 Preliminaries
We aim to develop an agent capable of solving long-horizon goals by learning planning knowledge from experience. As a representative environment that necessitates accurate planning knowledge, we consider Minecraft as our testbed. Minecraft is characterized by strict dependencies among game items (Guss et al., 2019; Fan et al., 2022), which can be formally represented as a directed acyclic graph $\mathcal{G}^{*}=(\mathcal{V}^{*},\mathcal{E}^{*})$ , where $\mathcal{V}^{*}$ is the set of all items and each edge $(u,q,v)\in\mathcal{E}^{*}$ indicates that $q$ units of an item $u$ are required to obtain an item $v$ . In our actual implementation, each edge also stores the resulting item quantity, but we omit it from the notation for presentation simplicity, since most edges have a resulting item quantity of 1 and this multiplicity is not essential for learning item dependencies. A goal is to obtain an item $g\in\mathcal{V}^{*}$ . To obtain $g$ , an agent must possess all of its prerequisites as defined by $\mathcal{G}^{*}$ in its inventory, and perform the valid high-level action in $\mathcal{A}=\{\text{``mine'', ``craft'', ``smelt''}\}$ .
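As a minimal illustration of this formalism (the recipes below are examples, not the game's full rule set), the dependency graph can be held as a mapping from each item to its incoming edges:

```python
from collections import defaultdict

class DependencyGraph:
    """Dependency graph G = (V, E): each edge (u, q, v) means q units of
    item u are consumed to obtain item v."""

    def __init__(self):
        self.items = set()                 # V: known items
        self.requires = defaultdict(dict)  # v -> {u: q}, the incoming edges of v

    def add_edge(self, u, q, v):
        self.items.update((u, v))
        self.requires[v][u] = q

    def requirement_set(self, v):
        """R(v, G): the incoming edges of item v as an {item: quantity} map."""
        return dict(self.requires[v])

# Illustrative edges from the iron-nugget example.
g = DependencyGraph()
g.add_edge("iron_ore", 1, "iron_ingot")
g.add_edge("iron_ingot", 1, "iron_nugget")
```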
Framework: Hierarchical agent with graph-augmented planning
We employ a hierarchical agent with an LLM planner and a low-level controller, adopting a graph-augmented planning strategy (Li et al., 2024b; Nottingham et al., 2023). In this strategy, the agent maintains its learned knowledge graph $\hat{\mathcal{G}}$ and plans with $\hat{\mathcal{G}}$ to decompose a goal $g$ into subgoals in two stages. First, the agent identifies prerequisite items it does not possess by traversing $\hat{\mathcal{G}}$ backward from $g$ to nodes with no incoming edges (i.e., basic items with no known requirements), and aggregates them into a list of (quantity, item) tuples, $((q_{1},u_{1}),...,(q_{L_{g}},u_{L_{g}})=(1,g))$ . Second, the planner LLM converts this list into executable language subgoals $\{(a_{l},q_{l},u_{l})\}_{l=1}^{L_{g}}$ , where it takes each $u_{l}$ as input and outputs a high-level action $a_{l}$ to obtain $u_{l}$ . Then the controller executes each subgoal, i.e., it takes each language subgoal as input and outputs a sequence of low-level actions in the environment to achieve it. After each subgoal execution, the agent receives only binary success/failure feedback.
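The backward traversal in the first stage can be sketched as follows. This is a simplified version (it assumes a plain dict-based graph and does not aggregate quantities across multiple consumers of the same item):

```python
def plan_requirements(graph, goal, inventory):
    """Traverse the learned graph backward from `goal`, listing (quantity,
    item) tuples for prerequisites not yet in the inventory, prerequisites
    before the items that need them, ending with (1, goal).
    `graph` maps item -> {required_item: quantity}; basic items map to {}."""
    order, seen = [], set()

    def visit(item, qty):
        # Skip items already planned or already possessed in sufficient quantity.
        if item in seen or inventory.get(item, 0) >= qty:
            return
        seen.add(item)
        for u, q in graph.get(item, {}).items():
            visit(u, q)
        order.append((qty, item))

    visit(goal, 1)
    return order
```

For example, with an iron-ore item already in the inventory, the traversal skips the ore and lists only the ingot and nugget subgoals.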
Problem formulation: Dependency and action learning
To plan correctly, the agent must acquire knowledge of the true dependency graph $\mathcal{G}^{*}$ . However, $\mathcal{G}^{*}$ is latent, making it necessary for the agent to learn this structure from experience. We model this as revising a learned graph, $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ , where $\hat{\mathcal{V}}$ contains known items and $\hat{\mathcal{E}}$ represents the agent's current belief about item dependencies. Following Nottingham et al. (2023), whenever the agent obtains a new item $v$ , it identifies the experienced requirement set $\mathcal{R}_{\text{exp}}(v)$ , the set of (item, quantity) pairs consumed during this item acquisition. The agent then updates $\hat{\mathcal{G}}$ by replacing all existing incoming edges to $v$ with the newly observed $\mathcal{R}_{\text{exp}}(v)$ . The detailed update procedure is in Appendix C.
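This update rule can be sketched in one step (a dict-based graph is assumed; note that it replaces, rather than merges, the incoming edges of $v$):

```python
def update_on_success(graph, v, consumed):
    """On obtaining item v, replace all of v's incoming edges with the
    experienced requirement set R_exp(v): the {item: quantity} pairs
    actually consumed during acquisition."""
    graph[v] = dict(consumed)
    return graph
```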
We aim to maximize the accuracy of the learned graph $\hat{\mathcal{G}}$ against the true graph $\mathcal{G}^{*}$ . We define this accuracy $N_{\text{true}}(\hat{\mathcal{G}})$ as the number of items whose incoming edges are identical in $\hat{\mathcal{G}}$ and $\mathcal{G}^{*}$ , i.e.,
$$
N_{\text{true}}(\hat{\mathcal{G}})\coloneqq\sum_{v\in\mathcal{V}^{*}}\mathbb{I}\big(\mathcal{R}(v,\hat{\mathcal{G}})=\mathcal{R}(v,\mathcal{G}^{*})\big)\ , \tag{1}
$$
where the dependency set, $\mathcal{R}(v,\mathcal{G})$ , denotes the set of all incoming edges to the item $v$ in the graph $\mathcal{G}$ .
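Eq. (1) can be computed directly from two dict-based graphs; a minimal sketch:

```python
def n_true(learned, true):
    """Eq. (1): count items whose incoming-edge sets match the true graph.
    Both graphs map item -> {required_item: quantity}; the sum runs over the
    true item set V*. Items absent from the learned graph are treated as
    having no incoming edges."""
    return sum(1 for v in true if learned.get(v, {}) == true[v])
```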
## 4 Methods
XENON is an LLM-based agent with two core components: Adaptive Dependency Graph (ADG) and Failure-aware Action Memory (FAM), as shown in Figure 2. ADG manages dependency knowledge, while FAM manages action knowledge. The agent learns this knowledge in a loop that starts by selecting an unobtained item as an exploratory goal (detailed in Appendix G). Once an item goal $g$ is selected, ADG, our learned dependency graph $\hat{\mathcal{G}}$ , is traversed to construct $((q_{1},u_{1}),\dots,(q_{L_{g}},u_{L_{g}})=(1,g))$ . For each $u_{l}$ in this list, FAM either reuses a previously successful action for $u_{l}$ or, if none exists, the planner LLM selects a high-level action $a_{l}\in\mathcal{A}$ given $u_{l}$ and action histories from FAM. The resulting actions form language subgoals $\{(a_{l},q_{l},u_{l})\}_{l=1}^{L_{g}}$ . The controller then takes each subgoal as input, executes a sequence of low-level actions to achieve it, and returns binary success/failure feedback, which is used to update both ADG and FAM. The full procedure is outlined in Algorithm 1 in Appendix D. We next detail each component, beginning with ADG.
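As a concrete illustration of the subgoal-construction step of this loop (the callable `planner_llm` and the dict interfaces are assumptions for illustration, not the paper's exact API):

```python
def choose_subgoals(requirements, successful_actions, action_histories, planner_llm):
    """Turn the (quantity, item) list from the ADG traversal into language
    subgoals (action, quantity, item): reuse a past successful action for an
    item when one exists in FAM; otherwise ask the planner LLM, given the
    item's action history."""
    subgoals = []
    for qty, item in requirements:
        action = successful_actions.get(item)
        if action is None:
            action = planner_llm(item, action_histories.get(item, []))
        subgoals.append((action, qty, item))
    return subgoals
```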
<details>
<summary>x4.png Details</summary>

System diagram: (1) goal and item requirements flow from the Adaptive Dependency Graph to the LLM; (2) action history flows from the Failure-aware Action Memory to a decision node ("if a past successful subgoal exists"); (3) the Controller either reuses a past successful subgoal or the LLM is called; (4) subgoal failures flow from the Environment to the Failure-aware Action Memory; (5) an "all actions are invalid" signal flows from the memory back to the Adaptive Dependency Graph.
</details>
Figure 2: Overview. XENON updates Adaptive Dependency Graph and Failure-aware Action Memory with environmental experiences.
### 4.1 Adaptive Dependency Graph (ADG)
Dependency graph initialization
To make the most of the LLM's prior knowledge, albeit incomplete, we initialize the learned dependency graph $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ using an LLM. We follow the initialization process of DECKARD (Nottingham et al., 2023), which consists of two steps. First, $\hat{\mathcal{V}}$ is assigned $\mathcal{V}_{0}$ , which is the set of goal items whose dependencies must be learned, and $\hat{\mathcal{E}}$ is assigned $\emptyset$ . Second, for each item $v$ in $\hat{\mathcal{V}}$ , the LLM is prompted to predict its requirement set (i.e., the incoming edges of $v$ ), aggregating them to construct the initial graph.
However, those LLM-predicted requirement sets often include items not present in the initial set $\mathcal{V}_{0}$ , a phenomenon overlooked by DECKARD. Since $\mathcal{V}_{0}$ may be an incomplete subset of all possible game items $\mathcal{V}^{*}$ , we cannot determine whether such items are genuine required items or hallucinated items which do not exist in the environment. To address this, we provisionally accept all LLM requirement set predictions. We iteratively expand the graph by adding any newly mentioned item to $\hat{\mathcal{V}}$ and, in turn, querying the LLM for its own requirement set. This expansion continues until a requirement set has been predicted for every item in $\hat{\mathcal{V}}$ . Since we assume that the true graph $\mathcal{G}^{*}$ is a DAG, we algorithmically prevent cycles in $\hat{\mathcal{G}}$ ; see Section E.2 for the cycle-check procedure. The quality of this initial LLM-predicted graph is analyzed in detail in Appendix K.1.
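The iterative expansion can be sketched as follows, with `predict_requirements` standing in for the LLM query (an assumption for illustration); the cycle-prevention check of Section E.2 is omitted here:

```python
def expand_graph(initial_items, predict_requirements):
    """Iteratively expand the learned graph: provisionally accept every
    LLM-predicted requirement set, adding newly mentioned items to the
    frontier and querying their requirements in turn, until a requirement
    set has been predicted for every known item."""
    graph, frontier = {}, list(initial_items)
    while frontier:
        v = frontier.pop()
        if v in graph:
            continue
        reqs = predict_requirements(v)  # may mention items outside V_0
        graph[v] = dict(reqs)
        frontier.extend(u for u in reqs if u not in graph)
    return graph
```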
Dependency graph revision
Correcting the agent's flawed dependency knowledge involves two challenges: (1) detecting and handling hallucinated items from the graph initialization, and (2) proposing a new requirement set. Simply prompting an LLM for corrections is ineffective, as it often predicts a new, flawed requirement set, as shown in Figures 1c and 1d. Therefore, we revise $\hat{\mathcal{G}}$ algorithmically using the agent's experiences, without relying on the LLM.
To implement this, we introduce a dependency revision procedure called RevisionByAnalogy and a revision count $C(v)$ for each item $v\in\hat{\mathcal{V}}$ . This procedure outputs a revised graph by taking the item $v$ whose dependency needs to be revised, its revision count $C(v)$ , and the current graph $\hat{\mathcal{G}}$ as inputs, leveraging the required items of previously obtained items. When a revision for an item $v$ is triggered by FAM (Section 4.2), the procedure first discards $v$ 's existing requirement set (i.e., $\mathcal{R}(v,\hat{\mathcal{G}})\leftarrow\emptyset$ ) and increments the revision count $C(v)$ . Based on whether $C(v)$ exceeds a hyperparameter $c_{0}$ , RevisionByAnalogy proceeds with one of the following two cases:
- Case 1: Handling potentially hallucinated items ( $C(v)>c_{0}$ ). If an item $v$ remains unobtainable after excessive revisions, the procedure flags it as inadmissible to signify that it may be a hallucinated item. This reveals a critical problem: if $v$ is indeed a hallucinated item, any of its descendants in $\hat{\mathcal{G}}$ become permanently unobtainable. To enable XENON to try these descendant items through alternative paths, we recursively call RevisionByAnalogy for all of $v$ 's descendants in $\hat{\mathcal{G}}$ , removing their dependency on the inadmissible item $v$ (Figure 4a, Case 1). Finally, to account for cases where $v$ may be a genuine item that is simply difficult to obtain, its requirement set $\mathcal{R}(v,\hat{\mathcal{G}})$ is reset to a general set of all resource items (i.e., items previously consumed for crafting other items), each with a quantity given by the hyperparameter $\alpha_{i}$ .
- Case 2: Plausible revision for less-tried items ( $C(v)\leq c_{0}$ ). The item $v$ 's requirement set, $\mathcal{R}(v,\hat{\mathcal{G}})$ , is revised to determine both a plausible set of new items and their quantities. First, for plausible required items, we use the idea that similar goals often share similar preconditions (Yoon et al., 2024). Therefore, we set the new required items by referencing the required items of the top-$K$ similar, successfully obtained items (Figure 4a, Case 2). We compute this item similarity as the cosine similarity between the Sentence-BERT (Reimers and Gurevych, 2019) embeddings of item names. Second, to determine their quantities, the agent should address the trade-off between acquiring sufficient amounts to avoid failures and an imperfect controller's difficulty in acquiring them. Therefore, the quantities of those new required items are determined by gradually scaling with the revision count, $\alpha_{s}C(v)$ .
Here, the hyperparameter $c_{0}$ serves as the revision count threshold for flagging an item as inadmissible. $\alpha_{i}$ and $\alpha_{s}$ control the quantity of each required item for inadmissible items (Case 1) and for less-tried items (Case 2), respectively, to maintain robustness when dealing with an imperfect controller. $K$ determines the number of similar, successfully obtained items to reference in Case 2. Detailed pseudocode of RevisionByAnalogy is in Algorithm 3, Section E.3.
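A compact sketch of RevisionByAnalogy under these definitions. The hyperparameter defaults, the `similarity` callable (standing in for Sentence-BERT cosine similarity), and the omission of the recursive descendant revision are simplifications for illustration, not the paper's exact procedure:

```python
def revision_by_analogy(graph, v, counts, obtained, similarity,
                        c0=3, alpha_i=1, alpha_s=1, K=3):
    """Revise item v's requirement set in `graph` (item -> {item: quantity}).
    Returns (graph, inadmissible_flag)."""
    graph[v] = {}                          # discard R(v, G)
    counts[v] = counts.get(v, 0) + 1       # increment revision count C(v)
    if counts[v] > c0:
        # Case 1: potentially hallucinated item. Flag it and fall back to a
        # generic set of all resource items (items consumed by any recipe),
        # each with quantity alpha_i. (Recursive revision of v's descendants
        # is omitted in this sketch.)
        resources = {u for reqs in graph.values() for u in reqs}
        graph[v] = {u: alpha_i for u in resources}
        return graph, True
    # Case 2: borrow requirements from the top-K most similar successfully
    # obtained items, with quantities scaled by the revision count.
    neighbors = sorted(obtained, key=lambda w: similarity(v, w), reverse=True)[:K]
    for w in neighbors:
        for u in graph.get(w, {}):
            graph[v][u] = alpha_s * counts[v]
    return graph, False
```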
<details>
<summary>x6.png Details</summary>

Two-part diagram. (a) Dependency Correction for ADG: Case 1 recursively calls RevisionByAnalogy on the descendants of a hallucinated item; Case 2 searches similar, obtained items and replaces the wrong dependency. (b) Action Correction for FAM: given per-action failure counts (e.g., "mine": 2, "craft": 1, "smelt": 0), invalid actions are determined and removed from the prompt, and an under-explored action is tried.
</details>
Figure 4: XENON's algorithmic knowledge correction. (a) Dependency Correction via RevisionByAnalogy. Case 1: For an inadmissible item (e.g., a hallucinated item), its descendants are recursively revised to remove the flawed dependency. Case 2: A flawed requirement set is revised by referencing similar, obtained items. (b) Action Correction via FAM. FAM prunes invalid actions from the LLM's prompt based on failures, guiding it to select an under-explored action.
### 4.2 Failure-aware Action Memory (FAM)
FAM is designed to address two challenges of learning only from binary success/failure feedback: (1) discovering valid high-level actions for each item, and (2) disambiguating the cause of persistent failures between invalid actions and flawed dependency knowledge. This section first describes FAM's core mechanism, and then details how it addresses each of these challenges in turn.
Core mechanism: empirical action classification
FAM classifies actions as either empirically valid or empirically invalid for each item, based on their history of past subgoal outcomes. Specifically, for each item $v\in\hat{\mathcal{V}}$ and action $a\in\mathcal{A}$ , FAM maintains the number of successful and failed outcomes, denoted as $S(a,v)$ and $F(a,v)$ respectively. Based on these counts, an action $a$ is classified as empirically invalid for $v$ if it has failed repeatedly (i.e., $F(a,v)\geq S(a,v)+x_{0}$ ); otherwise, it is classified as empirically valid if it has succeeded at least once (i.e., $S(a,v)>0$ and $S(a,v)>F(a,v)-x_{0}$ ). The hyperparameter $x_{0}$ controls the tolerance of this classification, accounting for the possibility that an imperfect controller might fail even with a genuinely valid action.
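The classification rule follows directly from the counts; a minimal sketch is below, where the function name and the tolerance value $x_0 = 2$ are illustrative rather than the paper's actual settings.

```python
X0 = 2  # tolerance hyperparameter x_0 (illustrative value)

def classify(S, F, x0=X0):
    """Empirical status of action a for item v from its outcome counts.

    Invalid once F(a, v) >= S(a, v) + x0; valid once it has succeeded
    (S > 0; note that not being invalid already implies S > F - x0);
    otherwise the action's status is still undetermined.
    """
    if F >= S + x0:
        return "invalid"
    if S > 0:
        return "valid"
    return "undetermined"
```

With $x_0 = 2$, for instance, an action that failed once but never succeeded stays undetermined rather than being ruled out, which tolerates controller noise.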
Addressing challenge 1: discovering valid actions
FAM helps XENON discover valid actions by avoiding repeatedly failed actions when constructing a subgoal $sg_{l}=(a_{l},q_{l},u_{l})$ . Only when FAM has no empirically valid action for $u_{l}$ does XENON query the LLM to select an under-explored action for $sg_{l}$ . To accelerate this search, we query the LLM with (i) the current subgoal item $u_{l}$ , (ii) empirically valid actions for the top- $K$ similar items successfully obtained and stored in FAM (using Sentence-BERT similarity as in Section 4.1), and (iii) the candidate actions for $u_{l}$ that remain after removing all empirically invalid actions from $\mathcal{A}$ (Figure 4b). We prune action candidates rather than include the full failure history because LLMs struggle to utilize long prompts effectively (Li et al., 2024a; Liu et al., 2024). If FAM already has an empirically valid action, XENON reuses it to construct $sg_{l}$ without querying the LLM. Detailed procedures and prompts are in Appendix F.
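This selection logic can be sketched as follows; `fam` maps `(action, item)` pairs to `(S, F)` counts and `query_llm` is a stand-in for the actual prompted LLM call — both interfaces, and the tolerance value, are assumptions rather than the paper's real implementation.

```python
X0 = 2  # tolerance x_0 (illustrative value)

def _status(S, F, x0=X0):
    """invalid / valid / undetermined, per the FAM classification rule."""
    if F >= S + x0:
        return "invalid"
    return "valid" if S > 0 else "undetermined"

def propose_action(item, fam, all_actions, similar_items, query_llm):
    """Pick a high-level action for `item` using FAM (sketch)."""
    def status(a, v):
        S, F = fam.get((a, v), (0, 0))
        return _status(S, F)

    # Reuse a known-good action without querying the LLM at all.
    for a in all_actions:
        if status(a, item) == "valid":
            return a

    # Otherwise prune empirically invalid actions and ask the LLM to
    # choose among the remaining, under-explored candidates, hinted by
    # valid actions of the top-K similar obtained items.
    candidates = [a for a in all_actions if status(a, item) != "invalid"]
    hints = {v: [a for a in all_actions if status(a, v) == "valid"]
             for v in similar_items}
    return query_llm(item, candidates, hints)
```

Pruning the candidate list, rather than handing the LLM the full failure history, keeps the prompt short, which is the design rationale stated above.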
Addressing challenge 2: disambiguating failure causes
By ensuring systematic action exploration, FAM allows XENON to determine that persistent subgoal failures stem from flawed dependency knowledge rather than from the actions. Specifically, once FAM classifies all actions in $\mathcal{A}$ for an item as empirically invalid, XENON concludes that the error lies within ADG and triggers its revision. Subsequently, XENON resets the item's history in FAM to allow a fresh exploration of actions under the revised ADG.
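A minimal sketch of this trigger, under the same assumed `fam` count layout as above; `revise_adg` stands in for the ADG revision routine and the names are hypothetical.

```python
def maybe_revise_dependencies(item, fam, all_actions, revise_adg, x0=2):
    """Trigger ADG revision once every action is empirically invalid."""
    def invalid(a):
        S, F = fam.get((a, item), (0, 0))
        return F >= S + x0

    if all(invalid(a) for a in all_actions):
        revise_adg(item)                    # error must lie in the ADG
        for a in all_actions:               # fresh action exploration
            fam[(a, item)] = (0, 0)         # under the revised graph
        return True
    return False
```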
### 4.3 Additional technique: context-aware reprompting (CRe) for controller
In real-world-like environments, an imperfect controller can stall (e.g., in deep water). To address this, XENON employs context-aware reprompting (CRe), where an LLM uses the current image observation and the controller's language subgoal to decide whether to replace the subgoal and propose a new temporary subgoal to escape the stalled state (e.g., "get out of the water"). Our CRe is adapted from Optimus-1 (Li et al., 2024b) to be suitable for smaller LLMs, with two differences: (1) a two-stage reasoning process that captions the observation first and then makes a text-only decision on whether to replace the subgoal, and (2) a conditional trigger that activates only when the subgoal for item acquisition makes no progress, rather than at fixed intervals. See Appendix H for details.
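The two differences can be sketched with stubbed LLM calls; `caption_llm` and `decide_llm` are hypothetical interfaces standing in for the actual prompted models, not the paper's real prompts.

```python
def context_aware_reprompt(obs_image, subgoal, progressed,
                           caption_llm, decide_llm):
    """Two-stage CRe sketch.

    Conditional trigger: only runs when the item-acquisition subgoal
    made no progress. Stage 1 captions the image observation; stage 2
    makes a text-only decision on whether to replace the subgoal with
    a temporary one.
    """
    if progressed:
        return subgoal                            # trigger condition unmet
    caption = caption_llm(obs_image)              # stage 1: vision -> text
    replace, temp = decide_llm(caption, subgoal)  # stage 2: text only
    return temp if replace else subgoal
```

Splitting captioning from the decision keeps the second call purely textual, which is what makes the scheme workable for smaller LLMs.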
## 5 Experiments
### 5.1 Setups
Environments
We conduct experiments in three Minecraft environments, which we separate into two categories based on their controller capacity. First, as realistic, visually-rich embodied AI environments, we use MineRL (Guss et al., 2019) and Mineflayer (PrismarineJS, 2023) with imperfect low-level controllers: STEVE-1 (Lifshitz et al., 2023) in MineRL and hand-crafted code (Yu and Lu, 2024) in Mineflayer. Second, we use MC-TextWorld (Zheng et al., 2025) as a controlled testbed with a perfect controller. Each experiment in this environment is repeated over 15 runs; in our results, we report the mean and standard deviation, omitting the latter when it is negligible. In all environments, the agent starts with an empty inventory. Further details on environments are provided in Appendix J. Additional experiments in a household task planning domain other than Minecraft are reported in Appendix A, where XENON also exhibits robust performance.
Table 1: Comparison of knowledge correction mechanisms across agents. ✓: Our proposed mechanism (XENON), △: LLM self-correction, ✗: No correction, −: Not applicable.
| Agent | Dependency Correction | Action Correction |
| --- | --- | --- |
| XENON | ✓ | ✓ |
| SC | △ | △ |
| DECKARD | ✗ | ✗ |
| ADAM | − | ✗ |
| RAND | ✗ | ✗ |
Evaluation metrics
For both dependency learning and planning evaluations, we utilize the 67 goals from 7 groups proposed in the long-horizon task benchmark (Li et al., 2024b). To evaluate dependency learning with an intuitive performance score between 0 and 1, we report $N_{\text{true}}(\hat{\mathcal{G}})/67$ , where $N_{\text{true}}(\hat{\mathcal{G}})$ is defined in Equation 1. We refer to this normalized score as Experienced Graph Accuracy (EGA). To evaluate planning performance, we follow the benchmark setting (Li et al., 2024b): at the beginning of each episode, a goal item is specified externally for the agent, and we measure the average success rate (SR) of obtaining this goal item in MineRL. See Table 10 for the full list of goals.
Implementation details
For the planner, we use Qwen2.5-VL-7B (Bai et al., 2025). The learned dependency graph is initialized with human-written plans for three goals ("craft an iron sword", "craft a golden sword", "mine a diamond"), providing minimal knowledge; the agent must learn dependencies for over 80% of goal items through experience. We employ CRe only for long-horizon goal planning in MineRL. All hyperparameters are kept consistent across experiments. Further details on hyperparameters and human-written plans are in Appendix I.
Baselines
As no prior work learns dependencies in our exact setting, we adapt four baselines, whose knowledge correction mechanisms are summarized in Table 1. For dependency knowledge, (1) LLM Self-Correction (SC) starts with an LLM-predicted dependency graph and prompts the LLM to revise it upon failures; (2) DECKARD (Nottingham et al., 2023) also relies on an LLM-predicted graph but has no correction mechanism; (3) ADAM (Yu and Lu, 2024) assumes that any goal item requires all previously used resource items, each in a sufficient quantity; and (4) RAND, the simplest baseline, uses a static graph similar to DECKARD. Regarding action knowledge, all baselines except RAND store successful actions. However, only the SC baseline attempts to correct its flawed knowledge upon failures: SC prompts the LLM to revise both its dependency and action knowledge using previous LLM predictions and interaction trajectories, as done in many self-correction methods (Shinn et al., 2023; Stechly et al., 2024). See Appendix B for the prompts of SC and Section J.1 for detailed descriptions of these baselines. To evaluate planning on diverse long-horizon goals, we further compare XENON with recent planning agents that are provided with oracle dependencies: DEPS (Wang et al., 2023b), Jarvis-1 (Wang et al., 2023c), Optimus-1 (Li et al., 2024b), and Optimus-2 (Li et al., 2025b).
### 5.2 Robust dependency learning against flawed prior knowledge
<details>
<summary>x10.png Details</summary>

Line chart of EGA (y-axis, 0 to 1.0) versus episode (x-axis, 0 to 400) in MineRL for XENON, SC, DECKARD, ADAM, and RAND. XENON rises steadily to roughly 0.65 by episode 400; SC and DECKARD climb quickly and plateau around 0.42; ADAM stays near 0.12 and RAND near 0.17 throughout.
</details>
(a) MineRL
<details>
<summary>x11.png Details</summary>

Line chart of EGA (y-axis, 0 to 1.0) versus episode (x-axis, 0 to 400) in Mineflayer. The top line rises rapidly and plateaus around 0.9-0.95 after episode 200; a second line levels off around 0.6-0.7; two further lines reach roughly 0.4-0.45 by episode 400; the lowest line stays near 0.2 throughout.
</details>
(b) Mineflayer
Figure 5: Robustness against flawed prior knowledge. EGA over 400 episodes in (a) MineRL and (b) Mineflayer. XENON consistently outperforms the baselines.
Table 2: Robustness to LLM hallucinations. The number of correctly learned dependencies of items that are descendants of a hallucinated item in the initial LLM-predicted dependency graph (out of 12).
| Agent | Learned descendants of hallucinated items |
| --- | --- |
| XENON | 0.33 |
| SC | 0 |
| ADAM | 0 |
| DECKARD | 0 |
| RAND | 0 |
XENON demonstrates robust dependency learning from flawed prior knowledge, consistently outperforming baselines with an EGA of approximately 0.6 in MineRL and 0.9 in Mineflayer (Figure 5), despite the challenging setting with imperfect controllers. This superior performance is driven by its algorithmic correction mechanism, RevisionByAnalogy, which corrects flawed dependency knowledge while also accommodating imperfect controllers by gradually scaling required-item quantities. The robustness of this algorithmic correction is particularly evident in two key analyses of the learned graph for each agent from the MineRL experiments. First, as shown in Table 2, XENON is uniquely robust to LLM hallucinations, learning dependencies for descendant items of non-existent, hallucinated items in the initial LLM-predicted graph. Second, XENON outperforms the baselines in learning dependencies for items that are unobtainable by the initial graph, as shown in Table 13.
Our results demonstrate the unreliability of relying on LLM self-correction or blindly trusting an LLM's flawed knowledge; in practice, SC achieves the same EGA as DECKARD, with both plateauing around 0.4 in both environments.
We observe that controller capacity strongly impacts dependency learning. This is evident in ADAM, whose EGA differs markedly between MineRL ( $\approx$ 0.1), which has a limited controller, and Mineflayer ( $\approx$ 0.6), which has a more competent controller. While ADAM unrealistically assumes a controller can gather large quantities of all resource items before attempting a new item, MineRL's controller STEVE-1 (Lifshitz et al., 2023) cannot execute this demanding strategy, causing ADAM's EGA to fall below even the simplest baseline, RAND. Controller capacity also accounts for XENON's lower EGA in MineRL. For instance, XENON learns none of the dependencies of the Redstone group items, as STEVE-1 cannot execute XENON's strategy for inadmissible items (Section 4.1). In contrast, the more capable Mineflayer controller executes this strategy successfully, allowing XENON to learn the correct dependencies for 5 of 6 Redstone items. This difference highlights the critical role of controllers for dependency learning, as detailed in our analysis in Section K.3.
### 5.3 Effective planning to solve diverse goals
Table 3: Performance on long-horizon task benchmark. Average success rate of each group on the long-horizon task benchmark (Li et al., 2024b) in MineRL. Oracle indicates that the true dependency graph is known in advance; Learned indicates that the graph is learned via experience across 400 episodes. For fair comparison across LLMs, we include Optimus-1†, our reproduction of Optimus-1 using Qwen2.5-VL-7B. Due to resource limits, results for DEPS, Jarvis-1, Optimus-1, and Optimus-2 are cited directly from Li et al. (2025b). See Section K.12 for the success rate on each goal.
| Method | Dependency | Planner LLM | Overall | Wood | Stone | Iron | Diamond | Gold | Armor | Redstone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DEPS | - | Codex | 0.22 | 0.77 | 0.48 | 0.16 | 0.01 | 0.00 | 0.10 | 0.00 |
| Jarvis-1 | Oracle | GPT-4 | 0.38 | 0.93 | 0.89 | 0.36 | 0.08 | 0.07 | 0.15 | 0.16 |
| Optimus-1 | Oracle | GPT-4V | 0.43 | 0.98 | 0.92 | 0.46 | 0.11 | 0.08 | 0.19 | 0.25 |
| Optimus-2 | Oracle | GPT-4V | 0.45 | 0.99 | 0.93 | 0.53 | 0.13 | 0.09 | 0.21 | 0.28 |
| Optimus-1† | Oracle | Qwen2.5-VL-7B | 0.34 | 0.92 | 0.80 | 0.22 | 0.10 | 0.09 | 0.17 | 0.04 |
| XENON† | Oracle | Qwen2.5-VL-7B | 0.79 | 0.95 | 0.93 | 0.83 | 0.75 | 0.73 | 0.61 | 0.75 |
| XENON | Learned | Qwen2.5-VL-7B | 0.54 | 0.85 | 0.81 | 0.46 | 0.64 | 0.74 | 0.28 | 0.00 |
As shown in Table 3, XENON significantly outperforms baselines in solving diverse long-horizon goals despite using the lightweight Qwen2.5-VL-7B LLM (Bai et al., 2025), while the baselines rely on large proprietary models such as Codex (Chen et al., 2021), GPT-4 (OpenAI, 2024), and GPT-4V (OpenAI, 2023). Remarkably, even with its learned dependency knowledge (Section 5.2), XENON surpasses the baselines with the oracle knowledge on challenging late-game goals, achieving high SRs for item groups like Gold (0.74) and Diamond (0.64).
XENON's superiority stems from two key factors. First, its FAM provides systematic, fine-grained action correction for each goal. Second, it reduces reliance on the LLM for planning in two ways: it shortens prompts and outputs by requiring the LLM to predict one action per subgoal item, and it bypasses the LLM entirely by reusing successful actions from FAM. In contrast, the baselines lack a systematic, fine-grained action correction mechanism and instead make LLMs generate long plans from lengthy prompts, a strategy known to be ineffective for LLMs (Wu et al., 2024; Li et al., 2024a). This challenge is exemplified by Optimus-1†. Despite using a knowledge graph for planning like XENON, its long-context generation strategy causes the LLM to predict incorrect actions or omit items explicitly provided in its prompt, as detailed in Section K.5.
We find that accurate knowledge is critical for long-horizon planning, as its absence can make even a capable agent ineffective. The Redstone group from Table 3 provides an example: while XENON† with oracle knowledge succeeds (0.75 SR), XENON with learned knowledge fails entirely (0.00 SR), because it failed to learn the dependencies for Redstone goals due to the controller's limited capacity in MineRL (Section 5.2). This finding is further supported by our comprehensive ablation study, which confirms that accurate dependency knowledge is most critical for success across all goals (see Table 17 in Section K.7).
### 5.4 Robust dependency learning against knowledge conflicts
<details>
<summary>x19.png Details</summary>

Legend mapping plot markers to agents: XENON (light blue circle), SC (pink diamond), ADAM (orange star), DECKARD (light green square), RAND (dark blue plus sign).
</details>
<details>
<summary>x20.png Details</summary>

Line chart of EGA versus the number of perturbed required items, x-axis markers (0, 0) through (3, 0). The light blue line stays near 0.95-1.0 across all perturbation levels; the pink and green lines drop from about 0.7 and 0.55 at (0, 0) to roughly 0.3-0.4 by (1, 0) and then plateau; the dark blue line remains near 0.2 throughout.
(a) Perturbed True Required Items
<details>
<summary>x21.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. perturbation intensity (0, a) for a = 0-3. The light blue line holds near 0.93-0.95; the peach, pink, and lime green lines fall from roughly 0.45-0.7 down to 0.1-0.25; the dark grey line stays flat near 0.2.]
</details>
(b) Perturbed True Actions
<details>
<summary>x22.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. perturbation intensity (r, a) for (0, 0)-(3, 3). The blue line holds near 0.95-1.0; the pink, green, and brown lines fall from roughly 0.45-0.68 down to 0.10-0.18; the black line stays flat near 0.15-0.2.]
</details>
(c) Perturbed Both Rules
Figure 6: Robustness against knowledge conflicts. EGA after 3,000 environment steps in MC-TextWorld under different perturbations of the ground-truth rules. The plots show performance with increasing intensities of perturbation applied to: (a) requirements only, (b) actions only, and (c) both (see Table 4).
Table 4: Effect of ground-truth perturbations on prior knowledge.
| Perturbation Intensity | Goal items obtainable via prior knowledge (reduction) |
| --- | --- |
| 0 | 16 (no perturbation) |
| 1 | 14 (12 %) |
| 2 | 11 (31 %) |
| 3 | 9 (44 %) |
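The reduction percentages in Table 4 follow directly from the 16-item unperturbed baseline; a quick check:

```python
baseline = 16  # goal items obtainable with unperturbed prior knowledge
for intensity, obtainable in [(1, 14), (2, 11), (3, 9)]:
    reduction = round(100 * (1 - obtainable / baseline))
    print(f"intensity {intensity}: {obtainable} items, {reduction}% fewer")
```

This reproduces the 12%, 31%, and 44% reductions listed in the table.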
To isolate dependency learning from controller capacity, we shift to the MC-TextWorld environment with a perfect controller. In this setting, we test each agent's robustness to conflicts with its prior knowledge (derived from the LLM's initial predictions and human-written plans) by introducing arbitrary perturbations to the ground-truth required items and actions. These perturbations are applied at an intensity level; a higher intensity affects a greater number of items, as shown in Table 4. This intensity is denoted by a tuple (r, a) for required items and actions, respectively; (0, 0) represents the vanilla setting with no perturbations. See Figure 21 for the detailed perturbation process.
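The perturbation scheme can be illustrated with a minimal sketch; the rule format and function name here are hypothetical, and the exact procedure is given in Figure 21. At intensity (r, a), the requirements of r items and the valid actions of a items are overwritten:

```python
import random

def perturb_rules(rules, r, a, seed=0):
    """Perturb ground-truth rules at intensity (r, a): overwrite the
    required items of r items and the valid action of a items.
    `rules` maps item -> {"required_items": [...], "action": str}."""
    rng = random.Random(seed)
    rules = {item: dict(rule) for item, rule in rules.items()}  # copy rules
    items = list(rules)
    for victim in rng.sample(items, min(r, len(items))):
        donor = rng.choice(items)  # borrow another item's requirements
        rules[victim]["required_items"] = list(rules[donor]["required_items"])
    for victim in rng.sample(items, min(a, len(items))):
        alternatives = [ac for ac in ("mine", "craft", "smelt", "fight")
                        if ac != rules[victim]["action"]]
        rules[victim]["action"] = rng.choice(alternatives)  # force a wrong action
    return rules
```

Because each perturbed action is drawn from the set of actions other than the true one, every action perturbation is guaranteed to conflict with the agent's prior knowledge.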
Figure 6 shows XENON's robustness to knowledge conflicts, as it maintains a near-perfect EGA ($\approx$ 0.97). In contrast, the performance of all baselines degrades as perturbation intensity increases across all three perturbation scenarios (required items, actions, or both). We find that prompting an LLM to self-correct is ineffective when the ground truth conflicts with its parametric knowledge: SC shows no significant advantage over DECKARD, which lacks a correction mechanism. ADAM is vulnerable to action perturbations; its strategy of gathering all resource items before attempting a new item fails when the valid actions for those resources are perturbed, effectively halting its learning.
### 5.5 Ablation studies on knowledge correction mechanisms
Table 5: Ablation study of knowledge correction mechanisms. ✓: XENON; $\triangle$: LLM self-correction; ✗: No correction. All entries denote the EGA after 3,000 environment steps. Columns denote the perturbation setting (r, a). For LLM self-correction, we use the same prompt as the SC baseline (see Appendix B).
| Dependency Correction | Action Correction | (0,0) | (3,0) | (0,3) | (3,3) |
| --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | 0.97 | 0.97 | 0.97 | 0.97 |
| ✓ | $\triangle$ | 0.93 | 0.93 | 0.12 | 0.12 |
| ✓ | ✗ | 0.84 | 0.84 | 0.12 | 0.12 |
| $\triangle$ | ✓ | 0.57 | 0.30 | 0.57 | 0.29 |
| ✗ | ✓ | 0.53 | 0.13 | 0.53 | 0.13 |
| ✗ | ✗ | 0.46 | 0.13 | 0.19 | 0.11 |
To analyze XENON's knowledge correction mechanisms for dependencies and actions, we conduct ablation studies in MC-TextWorld, as shown in Table 5. While dependency correction is generally more important for overall performance, action correction becomes vital under action perturbations. In contrast, LLM self-correction is ineffective for complex scenarios: it offers minimal gains for dependency correction even in the vanilla setting and fails entirely for perturbed actions. Its effectiveness is limited to simpler scenarios, such as action correction in the vanilla setting. These results demonstrate that our algorithmic knowledge correction approach enables robust learning from experience, overcoming the limitations of both LLM self-correction and flawed initial knowledge.
### 5.6 Ablation studies on hyperparameters
<details>
<summary>x23.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. environment step (0-3,000) for $c_{0} \in \{2, 3, 4, 5\}$. All curves start near 0.15-0.18 and converge to roughly 0.95-1.0 within about 1,300-1,500 steps.]
</details>
(a) $c_{0}$
<details>
<summary>x24.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. environment step (0-3,000) for $\alpha_{i} \in \{7, 8, 9, 10\}$. All curves start near 0.15-0.18 and converge to roughly 0.97-0.99; higher $\alpha_{i}$ rises slightly faster.]
</details>
(b) $\alpha_{i}$
<details>
<summary>x25.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. environment step (0-3,000) for $\alpha_{s} \in \{1, 2, 3, 4\}$. All curves converge toward roughly 1.0; higher $\alpha_{s}$ rises and plateaus earlier.]
</details>
(c) $\alpha_{s}$
<details>
<summary>x26.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. environment step (0-3,000) for $x_{0} \in \{1, 2, 3, 4\}$, with shaded standard-deviation bands. All curves start near 0.13-0.18 and converge toward roughly 0.95-1.0 by step 2,000.]
</details>
(d) $x_{0}$
Figure 7: Hyperparameter ablation study in MC-TextWorld. EGA over 3,000 environment steps under different hyperparameters. The plots show EGA when varying: (a) $c_{0}$ (revision count threshold for inadmissible items), (b) $\alpha_{i}$ (required item quantities for inadmissible items), (c) $\alpha_{s}$ (required item quantities for less-tried items), and (d) $x_{0}$ (invalid action threshold). Each study varies one hyperparameter while keeping the others fixed to their default values ( $c_{0}=3$ , $\alpha_{i}=8$ , $\alpha_{s}=2$ , $x_{0}=2$ ).
<details>
<summary>x27.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. episode (0-400) for $c_{0} \in \{2, 3, 4, 5\}$. All curves rise from roughly 0.11-0.17 at episode 0 to roughly 0.57-0.61 by episode 400 and converge.]
</details>
(e) $c_{0}$
<details>
<summary>x28.png Details</summary>

[Line chart: EGA (0.0-1.0) vs. episode (0-400) for $\alpha_{i} \in \{7, 8, 9, 10\}$. All curves rise from roughly 0.15-0.18 at episode 0 to roughly 0.58-0.62 by episode 400 and converge.]
</details>
(f) $\alpha_{i}$
<details>
<summary>x29.png Details</summary>
Line chart of EGA over 400 episodes in MineRL for $\alpha_{s} \in \{1, 2, 3, 4\}$. All curves start near 0.18 and rise steeply before episode 100; final EGA at episode 400 is approximately 0.55, 0.61, 0.63, and 0.58 for $\alpha_{s}=1$ through $4$, with $\alpha_{s}=2$ and $\alpha_{s}=3$ performing best.
</details>
(g) $\alpha_{s}$
<details>
<summary>x30.png Details</summary>
Line chart of EGA over 400 episodes in MineRL for $x_{0} \in \{1, 2, 3, 4\}$. All curves rise with diminishing gains; final EGA at episode 400 is approximately 0.61, 0.62, 0.55, and 0.53 for $x_{0}=1$ through $4$.
</details>
(h) $x_{0}$
Figure 8: Hyperparameter ablation study in MineRL. EGA over 400 episodes under different hyperparameters. The plots show EGA when varying: (e) $c_{0}$ (revision count threshold for inadmissible items), (f) $\alpha_{i}$ (required item quantity for inadmissible items), (g) $\alpha_{s}$ (required item quantity for less-tried items), and (h) $x_{0}$ (invalid action threshold). Each study varies one hyperparameter while keeping the others fixed to their default values ($c_{0}=3$, $\alpha_{i}=8$, $\alpha_{s}=2$, $x_{0}=2$).
To validate XENON's stability with respect to its hyperparameters, we conduct comprehensive ablation studies in both MC-TextWorld and MineRL. In these studies, we vary one hyperparameter at a time while keeping the others fixed to their default values ($c_{0}=3$, $\alpha_{i}=8$, $\alpha_{s}=2$, $x_{0}=2$).
Our results (Figures 7 and 8) show that although XENON is generally stable across hyperparameters, an effective learning strategy should account for controller capacity when the controller is imperfect. In MC-TextWorld (Figure 7), XENON maintains near-perfect EGA across a wide range of all tested hyperparameters, confirming its stability when a perfect controller is used. In MineRL (Figure 8), with an imperfect controller, the results demonstrate two findings. First, although hyperparameters influence performance, XENON remains robust: EGA after 400 episodes stays near or above 0.5 for all tested values, outperforming baselines that plateau around or below 0.4 (Figure 5(a)). Second, controller capacity should be considered when designing dependency and action learning strategies. For example, the ablation on $\alpha_{s}$ (Figure 8(g)) shows that while gathering a sufficient quantity of items is necessary ($\alpha_{s}=1$ underperforms), overburdening the controller with excessive items ($\alpha_{s}=4$) also degrades performance. Similarly, the ablation on $x_{0}$ (Figure 8(h)) shows the need to balance tolerating controller failures against wasting time on invalid actions.
We provide additional ablations in the Appendix on dependency and action learning: initializing the dependency graph from an external source mismatched to the environment (Figure 23), scaling to more goals and actions (Figure 24), and using a smaller 4B planner LLM (Figure 26), as well as an ablation of action selection methods for subgoal construction (Figure 25).
## 6 Conclusion
We address the challenge of robust planning via experience-based algorithmic knowledge correction. With XENON, we show that directly revising external knowledge through experience enables an LLM-based agent to overcome flawed priors and sparse feedback, surpassing the limits of LLM self-correction. Experiments across diverse Minecraft benchmarks demonstrate that this approach not only strengthens knowledge acquisition and long-horizon planning, but also enables an agent with a lightweight 7B open-weight LLM to outperform prior methods that rely on much larger proprietary models. Our work delivers a key lesson for building robust LLM-based embodied agents: LLM priors should be treated with skepticism and continuously managed and corrected algorithmically.
#### Limitations
Despite its contributions, XENON has a limitation: its performance depends on the underlying controller. In MineRL, the STEVE-1 (Lifshitz et al., 2023) controller struggles with spatial exploration tasks, resulting in a performance gap compared to more competent controllers such as Mineflayer. Future work could jointly train the planner and controller, potentially using hierarchical reinforcement learning.
#### Acknowledgments
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) and IITP-ITRC (Information Technology Research Center) grant funded by the Korea government (MSIT) (No. RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH); IITP-2026-RS-2024-00437866; RS-2024-00509258, Global AI Frontier Lab), by a grant from the Korea Institute for Advancement of Technology (KIAT), funded by the Ministry of Trade, Industry and Energy (MOTIE), Republic of Korea (RS-2025-00564342), and by Seoul R&BD Program (SP240008) through the Seoul Business Agency (SBA) funded by The Seoul Metropolitan Government.
## References
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025) Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Cited by: Figure 26, §K.1, §K.11, Figure 1, §5.1, §5.3.
- B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune (2022) Video pretraining (vpt): learning to act by watching unlabeled online videos. External Links: 2206.11795, Link Cited by: §J.2.1.
- P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, and P. Molchanov (2025) Small language models are the future of agentic ai. External Links: 2506.02153, Link, Document Cited by: §1.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. External Links: 2107.03374 Cited by: §5.3.
- M. Chen, Y. Li, Y. Yang, S. Yu, B. Lin, and X. He (2024) AutoManual: constructing instruction manuals by llm agents via interactive environmental learning. External Links: 2405.16247 Cited by: §E.1, §1.
- M. Côté, Á. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, R. Y. Tao, M. Hausknecht, L. E. Asri, M. Adada, W. Tay, and A. Trischler (2018) TextWorld: a learning environment for text-based games. CoRR abs/1806.11532. Cited by: Appendix A.
- K. Du, V. Snæbjarnarson, N. Stoehr, J. White, A. Schein, and R. Cotterell (2024) Context versus prior knowledge in language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 13211-13235. External Links: Link, Document Cited by: §1.
- L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anandkumar (2022) MineDojo: building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, External Links: Link Cited by: §1, §3.
- Y. Feng, Y. Wang, J. Liu, S. Zheng, and Z. Lu (2024) LLaMA-rider: spurring large language models to explore the open world. In Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 4705-4724. External Links: Link, Document Cited by: §1, §2.1.
- Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen (2024) CRITIC: large language models can self-correct with tool-interactive critiquing. External Links: 2305.11738, Link Cited by: §2.2.
- W. H. Guss, B. Houghton, N. Topin, P. Wang, C. Codel, M. Veloso, and R. Salakhutdinov (2019) MineRL: a large-scale dataset of minecraft demonstrations. External Links: 1907.13440, Link Cited by: §J.2.2, §J.2.5, §1, §3, §5.1.
- J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou (2024) Large language models cannot self-correct reasoning yet. External Links: 2310.01798, Link Cited by: §2.2.
- J. Li, Q. Wang, Y. Wang, X. Jin, Y. Li, W. Zeng, and X. Yang (2025a) Open-world reinforcement learning over long short-term imagination. In ICLR, Cited by: §J.2.1.
- T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen (2024a) Long-context llms struggle with long in-context learning. External Links: 2404.02060 Cited by: §4.2, §5.3.
- Z. Li, Y. Xie, R. Shao, G. Chen, D. Jiang, and L. Nie (2024b) Optimus-1: hybrid multimodal memory empowered agents excel in long-horizon tasks. Advances in neural information processing systems 37, pp. 49881-49913. Cited by: §J.2.2, §J.2.3, §J.2.5, Appendix H, §1, §2.1, §3, §4.3, §5.1, §5.1, Table 3.
- Z. Li, Y. Xie, R. Shao, G. Chen, D. Jiang, and L. Nie (2025b) Optimus-2: multimodal minecraft agent with goal-observation-action conditioned policy. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1, Table 3.
- S. Lifshitz, K. Paster, H. Chan, J. Ba, and S. McIlraith (2023) STEVE-1: a generative model for text-to-behavior in minecraft. External Links: 2306.00937 Cited by: §5.1, §5.2, §6.
- Z. Lin, J. Li, J. Shi, D. Ye, Q. Fu, and W. Yang (2021) Juewu-mc: playing minecraft with sample-efficient hierarchical reinforcement learning. arXiv preprint arXiv:2112.04907. Cited by: §J.2.1, §1.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157-173. External Links: Link, Document Cited by: §4.2.
- S. Liu, Y. Li, K. Zhang, Z. Cui, W. Fang, Y. Zheng, T. Zheng, and M. Song (2025) Odyssey: empowering minecraft agents with open-world skills. In International Joint Conference on Artificial Intelligence, Cited by: §2.1.
- H. Mao, C. Wang, X. Hao, Y. Mao, Y. Lu, C. Wu, J. Hao, D. Li, and P. Tang (2022) Seihai: a sample-efficient hierarchical ai for the minerl competition. In Distributed Artificial Intelligence: Third International Conference, DAI 2021, Shanghai, China, December 17-18, 2021, Proceedings 3, pp. 38-51. Cited by: §J.2.1, §1.
- Microsoft, :, A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen, D. Chen, D. Chen, J. Chen, W. Chen, Y. Chen, Y. Chen, Q. Dai, X. Dai, R. Fan, M. Gao, M. Gao, A. Garg, A. Goswami, J. Hao, A. Hendy, Y. Hu, X. Jin, M. Khademi, D. Kim, Y. J. Kim, G. Lee, J. Li, Y. Li, C. Liang, X. Lin, Z. Lin, M. Liu, Y. Liu, G. Lopez, C. Luo, P. Madan, V. Mazalov, A. Mitra, A. Mousavi, A. Nguyen, J. Pan, D. Perez-Becker, J. Platin, T. Portet, K. Qiu, B. Ren, L. Ren, S. Roy, N. Shang, Y. Shen, S. Singhal, S. Som, X. Song, T. Sych, P. Vaddamanu, S. Wang, Y. Wang, Z. Wang, H. Wu, H. Xu, W. Xu, Y. Yang, Z. Yang, D. Yu, I. Zabir, J. Zhang, L. L. Zhang, Y. Zhang, and X. Zhou (2025) Phi-4-mini technical report: compact yet powerful multimodal language models via mixture-of-loras. External Links: 2503.01743, Link Cited by: Figure 26, §K.11.
- K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox (2023) Do embodied agents dream of pixelated sheep? embodied decision making using language guided world modelling. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. Cited by: §J.1, Table 8, Appendix C, Appendix G, §2.1, §3, §3, §4.1, §5.1.
- OpenAI (2023) Gpt-4v(ision) system card. External Links: Link Cited by: §5.3.
- OpenAI (2024) GPT-4 technical report. External Links: 2303.08774, Link Cited by: §5.3.
- PrismarineJS (2023) Prismarinejs/mineflayer. Note: https://github.com/PrismarineJS/mineflayer External Links: Link Cited by: §J.3, §5.1.
- Y. Qin, E. Zhou, Q. Liu, Z. Yin, L. Sheng, R. Zhang, Y. Qiao, and J. Shao (2024) Mp5: a multi-modal open-ended embodied system in minecraft via active perception. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16307â16316. Cited by: §2.1.
- N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: Appendix I, 2nd item.
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. External Links: 2303.11366 Cited by: §J.1, §1, §2.2, §5.1.
- K. Stechly, K. Valmeekam, and S. Kambhampati (2024) On the self-verification limitations of large language models on reasoning and planning tasks. External Links: 2402.08115, Link Cited by: §J.1, §1, §5.1.
- A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra (2021) Habitat 2.0: training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- G. Tyen, H. Mansoor, V. Carbune, P. Chen, and T. Mak (2024) LLMs cannot find reasoning errors, but can correct them given the error location. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 13894-13908. External Links: Link, Document Cited by: §2.2.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023a) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv: Arxiv-2305.16291. Cited by: §1, §2.1.
- Z. Wang, S. Cai, G. Chen, A. Liu, X. (. Ma, and Y. Liang (2023b) Describe, explain, plan and select: interactive planning with llms enables open-world multi-task agents. Advances in Neural Information Processing Systems 36, pp. 34153-34189. Cited by: §J.2.5, §1, §5.1.
- Z. Wang, S. Cai, A. Liu, Y. Jin, J. Hou, B. Zhang, H. Lin, Z. He, Z. Zheng, Y. Yang, X. Ma, and Y. Liang (2023c) JARVIS-1: open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv: 2311.05997. Cited by: §1, §2.1, §5.1.
- Y. Wu, M. S. Hee, Z. Hu, and R. K. Lee (2024) LongGenBench: benchmarking long-form generation in long context llms. External Links: 2409.02076, Link Cited by: §5.3.
- L. Yang, Z. Yu, T. Zhang, M. Xu, J. E. Gonzalez, B. Cui, and S. Yan (2025) SuperCorrect: supervising and correcting language models with error-driven insights. In International Conference on Learning Representations, Cited by: §2.2.
- Y. Yoon, G. Lee, S. Ahn, and J. Ok (2024) Breadth-first exploration on adaptive grid for reinforcement learning. In Forty-first International Conference on Machine Learning, Cited by: 2nd item.
- S. Yu and C. Lu (2024) ADAM: an embodied causal agent in open-world environments. arXiv preprint arXiv:2410.22194. Cited by: §J.1, §J.3.1, Table 8, §K.1, §2.1, §5.1, §5.1.
- H. Yuan, C. Zhang, H. Wang, F. Xie, P. Cai, H. Dong, and Z. Lu (2023) Plan4MC: skill reinforcement learning and planning for open-world Minecraft tasks. arXiv preprint arXiv:2303.16563. Cited by: §1.
- Y. Zhang, M. Khalifa, L. Logeswaran, J. Kim, M. Lee, H. Lee, and L. Wang (2024) Small language models need strong verifiers to self-correct reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 15637-15653. External Links: Link, Document Cited by: §1.
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) ExpeL: llm agents are experiential learners. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, M. J. Wooldridge, J. G. Dy, and S. Natarajan (Eds.), pp. 19632-19642. External Links: Link, Document Cited by: §E.1.
- Z. Zhao, W. Chai, X. Wang, L. Boyi, S. Hao, S. Cao, T. Ye, J. Hwang, and G. Wang (2023) See and think: embodied agent in virtual environment. arXiv preprint arXiv:2311.15209. Cited by: §2.1.
- X. Zheng, H. Lin, K. He, Z. Wang, Z. Zheng, and Y. Liang (2025) MCU: an evaluation framework for open-ended game agents. External Links: 2310.08367, Link Cited by: §J.4, §5.1.
- X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, G. Huang, B. Li, L. Lu, X. Wang, Y. Qiao, Z. Zhang, and J. Dai (2023) Ghost in the minecraft: generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144. Cited by: §1, §2.1.
This appendix is organized as follows:
- Appendix A: Experiments in a domain other than Minecraft (Microsoft TextWorld Cooking).
- Appendix B: Prompts and qualitative results of LLM self-correction in our experiments.
- Appendix C: Detailed procedure for experienced requirement set determination and dependency graph updates, as discussed in Section 3.
- Appendix E: Detailed pseudocode and the prompt for ADG in Section 4.1.
- Appendix F: Detailed pseudocode and the prompt for step-by-step planning using FAM in Section 4.2.
- Appendix H: Detailed descriptions and the prompt for CRe in Section 4.3.
- Appendix I: Detailed descriptions of implementation, human-written plans, and hyperparameters.
- Appendix J: Detailed descriptions of the baselines and experimental environments in Section 5.
- Appendix K: Analysis of experimental results and additional experimental results.
- Appendix L: Description of LLM usage.
## Appendix A Additional experiments in another domain
To assess generalization beyond Minecraft, we evaluate XENON on the Microsoft TextWorld Cooking environment (Côté et al., 2018), a text-based household task-planning benchmark. We demonstrate that XENON can correct an LLM's flawed knowledge of preconditions (e.g., required tools) and valid actions in this domain as well, using ADG and FAM. We note that XENON is applied with minimal modification: FAM is used unchanged, while ADG is adapted from its original design, which supports multiple incoming edges (preconditions) per node, to one that allows only a single incoming edge, as this domain requires only a single precondition per node.
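As a minimal sketch of this adaptation (the data structures and example entries below are ours, not the paper's), the difference between the general ADG and the single-incoming-edge variant can be illustrated as:

```python
# Illustrative only: how dependency storage differs between the two ADG variants.
# General ADG (Minecraft): a node may require several items, each with a quantity.
multi_edge_graph = {
    "iron_pickaxe": {"stick": 2, "iron_ingot": 3, "crafting_table": 1},
    "iron_ingot": {"furnace": 1, "iron_ore": 1, "coal": 1},
}

# TextWorld Cooking variant: each cookbook action has exactly one precondition
# (the required tool), so every node keeps a single incoming edge.
single_edge_graph = {
    "fry": "stove",   # hypothetical learned precondition
    "grill": "oven",  # hypothetical
}

# The single-edge restriction simplifies revision: correcting a node means
# replacing one edge rather than revising an entire requirement set.
single_edge_graph["grill"] = "stove"
print(single_edge_graph["grill"])  # -> stove
```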
### A.1 Experiment Setup
Environment Rules
The goal is to prepare and eat a meal by reading a cookbook, which provides a plan as a list of (action, ingredient) pairs, e.g., ("fry", "pepper"). We note that an agent cannot succeed by naively following this plan, because it must solve two key challenges: (1) it must discover the valid tool required for each cookbook action, and (2) it must discover the valid, executable action for each cookbook action, as some cookbook actions are not directly accepted by the environment (i.e., they are not in its action space).
Specifically, to satisfy a cookbook's (action, ingredient) pair, an agent must construct a subgoal of the form (executable action, ingredient, tool), where the executable action and tool must be valid for the cookbook action. For example, the cookbook's ("fry", "pepper") pair requires the agent to construct the subgoal ("cook", "pepper", "stove"). The available executable action space is { "chop", "close", "cook", "dice", "drop", "eat", "examine", "slice", "prepare" }, and the available tools are { "knife", "oven", "stove", "fridge", "table", "counter" }.
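The subgoal construction described above can be sketched as follows; the `learned_map` entries (other than the "fry" example from the text) and the function name are hypothetical, since the real agent predicts and corrects these mappings with an LLM:

```python
# Sketch of subgoal construction in TextWorld Cooking (hypothetical mappings).
# Candidate spaces given in the environment description.
EXECUTABLE_ACTIONS = ["chop", "close", "cook", "dice", "drop",
                      "eat", "examine", "slice", "prepare"]
TOOLS = ["knife", "oven", "stove", "fridge", "table", "counter"]

# Predicted mappings from cookbook actions to (executable action, tool).
# "fry" is not in the executable action space, so it must map to a valid one.
learned_map = {
    "fry": ("cook", "stove"),    # example from the text
    "roast": ("cook", "oven"),   # hypothetical
    "chop": ("chop", "knife"),   # hypothetical
}

def make_subgoal(cookbook_action: str, ingredient: str):
    """Turn a cookbook (action, ingredient) pair into an executable subgoal."""
    executable_action, tool = learned_map[cookbook_action]
    assert executable_action in EXECUTABLE_ACTIONS and tool in TOOLS
    return (executable_action, ingredient, tool)

print(make_subgoal("fry", "pepper"))  # -> ('cook', 'pepper', 'stove')
```

If a subgoal fails, the agent must revise the corresponding `learned_map` entry, which is exactly the kind of correction ADG and FAM perform from experience.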
Baselines and Evaluation
All agents use an LLM (Qwen2.5-VL-7B) to make subgoals. The tool for each cookbook action is predicted by the LLM from the available tools before an episode begins. At each timestep during the episode, given a cookbook action, the LLM predicts an executable action from the executable action space, constructing a subgoal from this predicted executable action, the input ingredient, and the predicted tool.
To isolate the challenge of planning knowledge correction, we assume a competent controller gathers all ingredients and tools; thus, an agent starts each episode with all necessary ingredients and tools. An episode (max 50 timesteps) is successful if the agent completes the plan.
### A.2 Results
Table 6: Success rates in the TextWorld Cooking environment, comparing XENON against the SC (LLM self-correction) and DECKARD baselines from Section 5.1. We report the mean $\pm$ standard deviation over 3 independent runs, where each run consists of 100 episodes.
| | DECKARD | SC | XENON |
| --- | --- | --- | --- |
| Success Rate | $0.09\pm 0.02$ | $0.75\pm 0.04$ | $1.00\pm 0.00$ |
Table 6 shows that XENON achieves a perfect success rate ($1.00\pm 0.00$), significantly outperforming both SC ($0.75\pm 0.04$) and DECKARD ($0.09\pm 0.02$). These results demonstrate that XENON's core mechanisms (ADG and FAM) generalize, effectively correcting flawed planning knowledge in a domain that requires the agent to discover valid symbolic actions and preconditions. Notably, the SC baseline fails to achieve high performance even in the TextWorld Cooking environment, which is simpler than Minecraft. This reinforces our claim that relying on LLM self-correction is less reliable than XENON's experience-based algorithmic knowledge correction.
## Appendix B Prompts and qualitative results of LLM self-correction
### B.1 Dependency correction
Figure 9 shows the prompt used for dependency correction.
```
You are a professional game analyst. For a given <item_name>, you need to make <required_items> to get the item.
If you make <required_items> well, I will give you 1$.

I will give you recent transitions.
% Recent failed trajectories are given
[Failed example]
<item_name>: {item_name}
<hypothesized_required_items>: {original_prediction}
<inventory>: {inventory}
<plan>: {failed_subgoal}
<success>: false

I will give you learned items similar to <item_name>, and their validated required items, just for reference.
% K similar experienced items and their requirements are given
[Success Example]
<item_name>: {experienced_item}
<required_items>: {experienced_requirements}

% Make a new predicted requirement set
[Your turn]
Here is <item_name>, you MUST output <required_items> to obtain the item in JSON format. Remember <required_items> MUST be in JSON format.

<item_name>: {item_name}
<required_items>:
```
Figure 9: Prompt used for LLM self-correction about dependencies.
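A minimal sketch of how such a template might be filled and the LLM's JSON requirement set parsed (function names and the template excerpt are illustrative, not the paper's implementation):

```python
import json
import re

# Excerpt of a Figure-9-style template (illustrative; trimmed to the failed
# example and the final query).
TEMPLATE = (
    "[Failed example]\n"
    "<item_name>: {item_name}\n"
    "<hypothesized_required_items>: {original_prediction}\n"
    "<inventory>: {inventory}\n"
    "<plan>: {failed_subgoal}\n"
    "<success>: false\n\n"
    "[Your turn]\n"
    "<item_name>: {item_name}\n"
    "<required_items>:"
)

def build_prompt(item_name, original_prediction, inventory, failed_subgoal):
    """Fill the template with one failed transition."""
    return TEMPLATE.format(item_name=item_name,
                           original_prediction=json.dumps(original_prediction),
                           inventory=json.dumps(inventory),
                           failed_subgoal=failed_subgoal)

def parse_required_items(llm_output: str):
    """Extract the first JSON object from the LLM's raw output."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if match is None:
        return None  # malformed output; caller keeps the old prediction
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

out = parse_required_items('<required_items>: {"recipe": {"iron_ore": 1}}')
print(out)  # -> {'recipe': {'iron_ore': 1}}
```

Returning `None` on malformed output lets the caller fall back to the previous requirement set rather than crash, which matters when a small LLM occasionally violates the "MUST be in JSON format" instruction.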
We provide examples of actual prompts and LLM outputs in Figures 10 and 11.
âŹ
1 You are a professional game analyst. For a given < item _ name >, you need to make < required _ items > to get the item.
2 If you make < required _ items > well, I will give you 1 $.
3
4 I will give you recent transitions.
5
6 [Failed example]
7 < item _ name >: iron _ nugget
8 < hypothesized _ required _ items >: {â iron _ ore â: 1, â crafting _ table â: 1}
9 < inventory >: {â crafting _ table â: 1, â wooden _ sword â: 1, â wooden _ pickaxe â: 1, â torch â: 4, â furnace â: 1, â stone _ pickaxe â: 1, â iron _ axe â: 1, â iron _ shovel â: 1, â stick â: 2, â iron _ pickaxe â: 1, â diamond â: 3, â iron _ ingot â: 2, â iron _ ore â: 2, â gold _ ore â: 1, â coal â: 1}
10 < plan >: dig down and mine iron _ nugget
11 < success >: false
12
13 I will give you learned items similar to < item _ name >, and their validated required items, just for reference.
14 [Success Example]
15 < item _ name >:
16 iron _ ingot
17 < required _ items >:
18 {'recipe': {'furnace': 1, 'iron _ ore': 1, 'coals': 1}}
19 [Success Example]
20 < item _ name >:
21 iron _ pickaxe
22 < required _ items >:
23 {'recipe': {'stick': 2, 'iron _ ingot': 3, 'crafting _ table': 1}}
24 [Success Example]
25 < item _ name >:
26 iron _ shovel
27 < required _ items >:
28 {'recipe': {'stick': 2, 'iron _ ingot': 1, 'crafting _ table': 1}}
29
30 [Your turn]
31 Here is < item _ name >, you MUST output < required _ items > to obtain the item in JSON format. Remember < required _ items > MUST be in JSON format.
32
33 < item _ name >:
34 iron _ nugget
35 < required _ items >:
36 % LLM output: {'recipe': {'iron_ore': 1, 'crafting_table': 1}}
Figure 10: Example of dependency self-correction for iron_nugget.
1 You are a professional game analyst. For a given < item _ name >, you need to make < required _ items > to get the item.
2 If you make < required _ items > well, I will give you 1 $.
3
4 I will give you recent transitions.
5
6 [Failed example]
7 < item _ name >: charcoal
8 < hypothesized _ required _ items >: {'oak _ log': 8}
9 < inventory >: {'dirt': 1, 'oak _ log': 2, 'crafting _ table': 1, 'wooden _ hoe': 1, 'wooden _ pickaxe': 1, 'torch': 4, 'stone _ axe': 1, 'furnace': 1, 'stone _ pickaxe': 1, 'stick': 2, 'iron _ pickaxe': 1, 'diamond': 1, 'iron _ ingot': 3, 'iron _ ore': 2, 'coal': 2}
10 < action >: craft charcoal
11 < success >: false
12
13 I will give you learned items similar to < item _ name >, and their validated required items, just for reference.
14 [Success Example]
15 < item _ name >:
16 coals
17 < required _ items >:
18 {'recipe': {'wooden _ pickaxe': 1}}
19 [Success Example]
20 < item _ name >:
21 furnace
22 < required _ items >:
23 {'recipe': {'cobblestone': 8, 'crafting _ table': 1}}
24 [Success Example]
25 < item _ name >:
26 diamond
27 < required _ items >:
28 {'recipe': {'iron _ pickaxe': 1}}
29
30 [Your turn]
31 Here is < item _ name >, you MUST output < required _ items > to achieve charcoal in JSON format. Remember < required _ items > MUST be in JSON format.
32
33 < item _ name >:
34 charcoal
35 < required _ items >:
36 % LLM output: {'recipe': {'oak_log': 8}}
Figure 11: Example of dependency self-correction for charcoal.
### B.2 Action correction
Figure 12 shows the prompt used for self-reflection on failed actions.
1 % LLM self-reflection to analyze failure reasons
2 You are a professional game analyst.
3 For a given < item _ name > and < inventory >, you need to analyze why < plan > failed to get the item.
4 I will give you examples of analysis as follow.
5
6 [Example]
7 < item _ name >: wooden _ pickaxe
8 < inventory >: {â stick â: 4, â planks â: 4, â crafting _ table â: 1}
9 < plan >: smelt wooden _ pickaxe
10 < failure _ analysis >
11 {" analysis ": " You failed because you cannot smelt a wooden _ pickaxe. You should craft it instead."}
12
13 [Example]
14 < item _ name >: stone _ pickaxe
15 < inventory >: {â stick â: 4, â planks â: 4, â crafting _ table â: 1}
16 < plan >: craft stone _ pickaxe
17 < failure _ analysis >
18 {" analysis ": " You failed because you do not have enough cobblestones."}
19
20 [Your turn]
21 Here is < item _ name >, < inventory > and < plan >, you MUST output < failure _ analysis > concisely in JSON format.
22
23 < item _ name >: {item _ name}
24 < inventory >: {inventory}
25 < plan >: {plan}
26 < failure _ analysis >
27
28 % Then, using the self-reflection results, LLM self-correct its actions.
29 For an item name, you need to make a plan, by selecting one among provided options.
30 I will give you examples of which plans are needed to achieve an item, just for reference.
31 [Example]
32 < item name >
33 {similar _ item}
34 < task planning >
35 {successful _ plan}
36
37 Here are some analyses on previous failed plans for this item.
38 [Analysis]
39 {'item _ name': {item}, 'inventory': {inventory}, 'plan': '{plan}', 'failure _ analysis': '{self - reflection}'}
40
41 [Your turn]
42 Here is < item name >, you MUST select one from below < options >, to make < task planning >.
43 you MUST select one from below < options >. DO NOT MAKE A PLAN NOT IN < options >.
44
45 < options >:
46 1: {" task ": " dig down and mine {item}", " goal ": [{item}, {quantity}]}
47 2: {" task ": " craft {item}", " goal ": [{item}, {quantity}]}
48 3: {" task ": " smelt {item}", " goal ": [{item}, {quantity}]}
49
50 < item name >
51 {item}
52 < task planning >
Figure 12: Prompts used for LLM self-correction about actions.
We provide examples of actual prompts and LLM outputs in Figures 13 and 14.
1 For an item name, you need to make a plan, by selecting one among provided options.
2 I will give you examples of which plans are needed to achieve an item, just for reference.
3
4 [Example]
5 < item name >
6 iron _ ingot
7 < task planning >
8 {" task ": " smelt iron _ ingot ", " goal ": [" iron _ ingot ", 1]}
9
10 [Example]
11 < item name >
12 iron _ pickaxe
13 < task planning >
14 {" task ": " craft iron _ pickaxe ", " goal ": [" iron _ pickaxe ", 1]}
15
16 [Example]
17 < item name >
18 iron _ shovel
19 < task planning >
20 {" task ": " craft iron _ shovel ", " goal ": [" iron _ shovel ", 1]}
21
22 Here are some analyses on previous failed plans for this item.
23 [Analysis]
24 {'item _ name': 'iron _ nugget',
25 'inventory': {'crafting _ table': 1, 'wooden _ sword': 1, 'wooden _ pickaxe': 1, 'torch': 4, 'furnace': 1, 'stone _ pickaxe': 1, 'iron _ axe': 1, 'iron _ shovel': 1, 'stick': 2, 'iron _ pickaxe': 1, 'diamond': 3, 'iron _ ingot': 2, 'iron _ ore': 2, 'gold _ ore': 1, 'coal': 1},
26 'plan': 'dig down and mine iron _ nugget',
27 'failure _ analysis': 'You failed because you do not have any iron ore or diamond ore to mine for iron nuggets.'}
28
29 [Your turn]
30 Here is < item name >, you MUST select one from below < options >, to make < task planning >.
31 you MUST select one from below < options >. DO NOT MAKE A PLAN NOT IN < options >.
32
33 < options >
34 1. {" task ": " dig down and mine iron _ nugget ", " goal ": [" iron _ nugget ", 1]}
35 2. {" task ": " craft iron _ nugget ", " goal ": [" iron _ nugget ", 1]}
36 3. {" task ": " smelt iron _ nugget ", " goal ": [" iron _ nugget ", 1]}
37
38 < item name >
39 iron _ nugget
40 % LLM output: '{"task": "dig down and mine iron_nugget", "goal": ["iron_nugget", 1]}'
Figure 13: Example of action self-correction for iron_nugget.
1 For an item name, you need to make a plan, by selecting one among provided options.
2 I will give you examples of which plans are needed to achieve an item, just for reference.
3
4 [Example]
5 < item name >
6 coals
7 < task planning >
8 {" task ": " dig down and mine coals ", " goal ": [" coals ", 1]}
9
10 [Example]
11 < item name >
12 furnace
13 < task planning >
14 {" task ": " craft furnace ", " goal ": [" furnace ", 1]}
15
16 [Example]
17 < item name >
18 diamond
19 < task planning >
20 {" task ": " dig down and mine diamond ", " goal ": [" diamond ", 1]}
21
22 Here are some analyses on previous failed plans for this item.
23 [Analysis]
24 {'item _ name': 'charcoal',
25 'inventory': {'dirt': 1, 'oak _ log': 2, 'crafting _ table': 1, 'wooden _ hoe': 1, 'wooden _ pickaxe': 1, 'torch': 4, 'stone _ axe': 1, 'furnace': 1, 'stone _ pickaxe': 1, 'stick': 2, 'iron _ pickaxe': 1, 'diamond': 1, 'iron _ ingot': 3, 'iron _ ore': 2, 'coal': 2},
26 'plan': 'mine iron _ nugget',
27 'failure _ analysis': 'You failed because you already have enough charcoal.'}
28
29
30 [Your turn]
31 Here is < item name >, you MUST select one from below < options >, to make < task planning >.
32 you MUST select one from below < options >. DO NOT MAKE A PLAN NOT IN < options >.
33
34 < options >
35 1. {" task ": " mine iron _ nugget ", " goal ": [" charcoal ", 1]}
36 2. {" task ": " craft charcoal ", " goal ": [" charcoal ", 1]}
37 3. {" task ": " smelt charcoal ", " goal ": [" charcoal ", 1]}
38
39 < item name >
40 charcoal
41 < task planning >
42 % LLM output: '{"task": "craft charcoal", "goal": ["charcoal", 1]}'
Figure 14: Example of action self-correction for charcoal.
## Appendix C Experienced requirement set and dependency graph update
We note that the assumptions explained in this section are largely similar to those in the implementation of DECKARD (Nottingham et al., 2023; https://github.com/DeckardAgent/deckard).
Determining experienced requirement set
When the agent obtains item $v$ while executing a subgoal $(a,q,u)$ , it determines the experienced requirement set $\mathcal{R}_{exp}(v)$ differently depending on whether the high-level action $a$ is 'mine' or falls under 'craft' or 'smelt'. If $a$ is 'mine', the agent determines $\mathcal{R}_{exp}(v)$ based on the pickaxe in its inventory. If no pickaxe is held, $\mathcal{R}_{exp}(v)$ is $\emptyset$ . Otherwise, $\mathcal{R}_{exp}(v)$ becomes $\{(\text{the highest-tier pickaxe the agent has},1)\}$ , where the highest-tier pickaxe is determined following the hierarchy: 'wooden_pickaxe', 'stone_pickaxe', 'iron_pickaxe', 'diamond_pickaxe'. If $a$ is 'craft' or 'smelt', the agent determines the used items and their quantities as $\mathcal{R}_{exp}(v)$ by observing inventory changes when crafting or smelting $v$ .
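This rule can be sketched in Python as below; the function name and inventory layout are illustrative assumptions, not from the released implementation, and bookkeeping for non-consumed tools (e.g., a crafting table) is omitted.

```python
# Pickaxe hierarchy from lowest to highest tier.
PICKAXE_TIERS = ["wooden_pickaxe", "stone_pickaxe", "iron_pickaxe", "diamond_pickaxe"]

def experienced_requirements(action, inventory_before, inventory_after):
    """Infer the experienced requirement set R_exp(v) for a newly obtained item.

    inventory_before/inventory_after map item names to counts.
    Returns a dict {required_item: quantity}.
    """
    if action == "mine":
        # For mining, credit only the highest-tier pickaxe currently held, if any.
        held = [p for p in PICKAXE_TIERS if inventory_before.get(p, 0) > 0]
        return {held[-1]: 1} if held else {}
    # For craft/smelt, the consumed items are the inventory decrease.
    used = {}
    for item, before in inventory_before.items():
        delta = before - inventory_after.get(item, 0)
        if delta > 0:
            used[item] = delta
    return used
```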
Dependency graph update
When the agent obtains an item $v$ and its $\mathcal{R}_{exp}(v)$ for the first time, it updates its dependency graph $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ . Since $\mathcal{R}_{\text{exp}}(v)$ only contains items acquired before $v$ , no cycles can be introduced to the ADG during learning. The update proceeds as follows: the agent adds $v$ to the set of known items $\hat{\mathcal{V}}$ . Then, it updates the edge set $\hat{\mathcal{E}}$ by replacing $v$ 's incoming edges with $\mathcal{R}_{exp}(v)$ : it removes all of $v$ 's incoming edges $(u,\cdot,v)\in\hat{\mathcal{E}}$ and adds new edges $(u_{i},q_{i},v)$ to $\hat{\mathcal{E}}$ for every $(u_{i},q_{i})\in\mathcal{R}_{exp}(v)$ .
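A minimal sketch of this edge-replacement update, representing the graph as a node set and a set of $(u, q, v)$ edge triples (the data layout is an assumption for illustration):

```python
def update_dependency_graph(V_hat, E_hat, v, R_exp):
    """Replace v's incoming edges with its experienced requirement set.

    E_hat is a set of (u, q, v) triples; R_exp maps required items to quantities.
    """
    V_hat.add(v)
    # Drop all previous (possibly LLM-predicted) incoming edges of v ...
    E_hat -= {(u, q, w) for (u, q, w) in E_hat if w == v}
    # ... and add one edge per experienced requirement.
    for u, q in R_exp.items():
        E_hat.add((u, q, v))
        V_hat.add(u)  # assumption: requirement items are also tracked as known
    return V_hat, E_hat
```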
## Appendix D Full procedure of XENON
input : invalid action threshold $x_{0}$ , inadmissible item threshold $c_{0}$ , less-explored item scale $\alpha_{s}$ , inadmissible item scale $\alpha_{i}$
1 Initialize dependency $\hat{\mathcal{G}}\leftarrow(\hat{\mathcal{V}},\hat{\mathcal{E}})$ , revision counts $C[v]\leftarrow 1$ for all $v\in\hat{\mathcal{V}}$
2 Initialize memory $S(a,v)=0,F(a,v)=0$ for all $v\in\hat{\mathcal{V}},a\in\mathcal{A}$
3 while learning do
4 Get an empty inventory $inv$
$v_{g}\leftarrow\texttt{SelectGoalWithDifficulty}(\hat{\mathcal{G}},C[\cdot])$
// DEX, Appendix G
5 while within episode horizon $H_{episode}$ do
6 if $v_{g}\in inv$ then
7 $v_{g}\leftarrow\texttt{SelectGoalWithDifficulty}(\hat{\mathcal{G}},C[\cdot])$
Compute the series of aggregated requirements $((q_{l},u_{l}))_{l=1}^{L_{v_{g}}}$ using $\hat{\mathcal{G}}$ and $inv$
// from Section 3
10 Plan $P\leftarrow((a_{l},q_{l},u_{l}))_{l=1}^{L_{v_{g}}}$ by selecting $a_{l}$ for each $u_{l}$ , using LLM, $S$ , $F$ , $x_{0}$
11 foreach subgoal $(a,q,u)\in P$ do
12 Execute $(a,q,u)$ then get the execution result $success$
Get an updated inventory $inv$ and dependency graph $\hat{\mathcal{G}}$
// from Section 3
14 if success then $S(a,u)\leftarrow S(a,u)+1$
15 else $F(a,u)\leftarrow F(a,u)+1$
17 if not $success$ then
18 if all actions are invalid then
$\hat{\mathcal{G}},C\leftarrow\texttt{RevisionByAnalogy}(\hat{\mathcal{G}},u,C[\cdot],c_{0},\alpha_{s},\alpha_{i})$
// ADG, Section 4.1
19 Reset memory $S(\cdot,u)\leftarrow 0,F(\cdot,u)\leftarrow 0$
20 $v_{g}\leftarrow\texttt{SelectGoalWithDifficulty}(\hat{\mathcal{G}},C[\cdot])$
21 break
Algorithm 1 Pseudocode of XENON
The full procedure of XENON is outlined in Algorithm 1.
## Appendix E Details in Adaptive Dependency Graph (ADG)
### E.1 Rationale for initial knowledge
In real-world applications, a human user may wish for an autonomous agent to accomplish certain goals, yet the user themselves may have limited or no knowledge of how to achieve them within a complex environment. We model this scenario by having a user specify goal items without providing the detailed requirements, and then the agent should autonomously learn how to obtain these goal items. The set of 67 goal item names ( $\mathcal{V}_{0}$ ) provided to the agent represents such user-specified goal items, defining the learning objectives.
To bootstrap learning in complex environments, LLM-based planning literature often utilizes minimal human-written plans for initial knowledge (Zhao et al., 2024; Chen et al., 2024). In our case, we provide the agent with 3 human-written plans (shown in Appendix I). By executing these plans, our agent can experience items and their dependencies, thereby bootstrapping the dependency learning process.
### E.2 Details in dependency graph initialization
Keeping ADG acyclic during initialization
During initialization, XENON algorithmically keeps the ADG a directed acyclic graph: whenever it adds an LLM-predicted requirement set for an item, it discards any set that would create a cycle and instead assigns that item an empty requirement set. Specifically, we identify and prevent cycles in three steps when adding LLM-predicted incoming edges for an item $v$ . First, we tentatively insert the LLM-predicted incoming edges of $v$ into the current ADG. Second, we detect cycles by checking whether any of $v$ 's parents now appears among $v$ 's descendants in the updated graph. Third, if a cycle is detected, we discard the LLM-predicted incoming edges for $v$ and instead assign an empty set of incoming edges to $v$ in the ADG.
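The three-step check can be sketched as follows; the edge representation and function names are illustrative assumptions (edges are $(u, q, v)$ triples meaning "$u$ is required for $v$", so descendants are found by following edges forward from $v$).

```python
def add_predicted_edges(E_hat, v, predicted):
    """Tentatively add LLM-predicted incoming edges (u, q, v); keep the graph acyclic.

    predicted is a list of (required_item, quantity) pairs. If any predicted
    parent of v is reachable FROM v (i.e., is a descendant), the prediction
    would close a cycle, so v keeps an empty requirement set instead.
    """
    def descendants(node, edges):
        # Children of x are the items w that require x, i.e., edges (x, q, w).
        seen, stack = set(), [node]
        while stack:
            x = stack.pop()
            for (u, q, w) in edges:
                if u == x and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    # Step 1: tentative insertion of the predicted incoming edges.
    tentative = E_hat | {(u, q, v) for (u, q) in predicted}
    # Step 2: cycle detection via a parent-among-descendants check.
    if any(u in descendants(v, tentative) for (u, q) in predicted):
        # Step 3: discard the prediction; v gets an empty requirement set.
        return E_hat
    return tentative
```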
Pseudocode is shown in Algorithm 2. The prompt is shown in Figure 15.
input : Goal items $\mathcal{V}_{0}$ , (optional) human written plans $\mathcal{P}_{0}$
output : Initialized dependency graph $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ , experienced items $\mathcal{V}$
3 Initialize a set of known items $\hat{\mathcal{V}}\leftarrow\mathcal{V}_{0}$ , edge set $\hat{\mathcal{E}}\leftarrow\emptyset$
4 Initialize a set of experienced items $\mathcal{V}\leftarrow\emptyset$
6 foreach plan in $\mathcal{P}_{0}$ do
7 Execute the plan and get experienced items and their experienced requirement sets $\bigl\{(v_{n},\mathcal{R}_{exp}(v_{n}))\bigr\}_{n=1}^{N}$
8 foreach $(v,\mathcal{R}_{exp}(v))\in\bigl\{(v_{n},\mathcal{R}_{exp}(v_{n}))\bigr\}_{n=1}^{N}$ do
9 if $v\notin\mathcal{V}$ then
/* graph update from Appendix C */
10 $\mathcal{V}\leftarrow\mathcal{V}\cup\{v\}$ , $\hat{\mathcal{V}}\leftarrow\hat{\mathcal{V}}\cup\{v\}$
11 Add edges to $\hat{\mathcal{E}}$ according to $\mathcal{R}_{exp}(v)$
/* Graph construction using LLM predictions */
14 while $\exists v\in\hat{\mathcal{V}}\setminus\mathcal{V}$ whose requirement set $\mathcal{R}(v)$ has not yet been predicted by the LLM do
15 Select such an item $v\in\hat{\mathcal{V}}\setminus\mathcal{V}$ (i.e., $\mathcal{R}(v)$ has not yet been predicted)
16 Select $\mathcal{V}_{K}\subseteq\mathcal{V}$ based on Top-K semantic similarity to $v$ , $|\mathcal{V}_{K}|=K$
17 Predict $\mathcal{R}(v)\leftarrow LLM(v,\{\big(u,\mathcal{R}(u,\hat{\mathcal{G}})\big)\}_{u\in\mathcal{V}_{K}})$
19 foreach $(u_{j},q_{j})\in\mathcal{R}(v)$ do
20 $\hat{\mathcal{E}}\leftarrow\hat{\mathcal{E}}\cup\{(u_{j},q_{j},v)\}$
21 if $u_{j}\notin\hat{\mathcal{V}}$ then
22 $\hat{\mathcal{V}}\leftarrow\hat{\mathcal{V}}\cup\{u_{j}\}$
Algorithm 2 GraphInitialization
1 You are a professional game analyst. For a given < item _ name >, you need to make < required _ items > to get the item.
2 If you make < required _ items > well, I will give you 1 $.
3
4 I will give you some examples < item _ name > and < required _ items >.
5
6 [Example] % TopK similar experienced items are given as examples
7 < item _ name >: {experienced _ item}
8 < required _ items >: {experienced _ requirement _ set}
9
10 [Your turn]
11 Here is a item name, you MUST output < required _ items > in JSON format. Remember < required _ items > MUST be in JSON format.
12
13 < item _ name >: {item _ name}
14 < required _ items >:
Figure 15: Prompt for requirement set prediction for dependency graph initialization.
### E.3 Pseudocode of RevisionByAnalogy
Pseudocode is shown in Algorithm 3.
input : Dependency graph $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ , an item to revise $v$ , exploration counts $C[\cdot]$ , inadmissible item threshold $c_{0}$ , less-explored item scale $\alpha_{s}$ , inadmissible item scale $\alpha_{i}$
output : Revised dependency graph $\hat{\mathcal{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}})$ , exploration counts $C[\cdot]$
3 Consider cases based on $C[v]$ :
4 if $C[v]>c_{0}$ then
/* $v$ is inadmissible */
/* resource set: items previously consumed for crafting other items */
6 $\mathcal{R}(v)\leftarrow\{(u,\alpha_{i})\mid u\in\text{``resource'' set}\}$
/* Remove all incoming edges to $v$ in $\hat{\mathcal{E}}$ and add new edges */
7 $\hat{\mathcal{E}}\leftarrow\hat{\mathcal{E}}\setminus\{(x,q,v)\mid(x,q,v)\in\hat{\mathcal{E}}\}$
8 foreach $(u,\alpha_{i})\in\mathcal{R}(v)$ do
9 $\hat{\mathcal{E}}\leftarrow\hat{\mathcal{E}}\cup\{(u,\alpha_{i},v)\}$
/* Revise requirement sets of descendants of $v$ */
12 Find the set of all descendants of $v$ in $\hat{\mathcal{G}}$ (excluding $v$ ): $\mathcal{W}\leftarrow\text{FindAllDescendants}(v,\hat{\mathcal{G}})$
14 for each item $w$ in $\mathcal{W}$ do
15 Invoke RevisionByAnalogy for $w$
18 else
/* $v$ is not yet sufficiently explored; revise by analogy */
19 Find similar successfully obtained items $\mathcal{V}_{K}\subseteq\hat{\mathcal{V}}$ based on Top-K semantic similarity to $v$
Candidate items $U_{cand}\leftarrow\{u\mid\exists w\in\mathcal{V}_{K},(u,\cdot,w)\in\hat{\mathcal{E}}\}$ /* all items required to obtain similar successfully obtained items $\mathcal{V}_{K}$ */
22 Start to construct a requirement set, $\mathcal{R}(v)\leftarrow\emptyset$
23 for each item $u$ in $U_{cand}$ do
24 if $u$ is in 'resource' set then
25 Add $(u,\alpha_{s}\times C[v])$ to $\mathcal{R}(v)$
27 else
28 Add $(u,1)$ to $\mathcal{R}(v)$
31 Update $\hat{\mathcal{G}}$ : Remove all incoming edges to $v$ in $\hat{\mathcal{E}}$ , and add new edges $(u,q,v)$ to $\hat{\mathcal{E}}$ for each $(u,q)\in\mathcal{R}(v)$
Algorithm 3 RevisionByAnalogy
## Appendix F Step-by-step planning using FAM
Given a sequence of aggregated requirements $((q_{l},v_{l}))_{l=1}^{L}$ , XENON employs a step-by-step planning approach, iteratively selecting a high-level action $a_{l}$ for each requirement item $v_{l}$ to form a subgoal $(a_{l},q_{l},v_{l})$ . This process considers past attempts to obtain $v_{l}$ using specific actions. Specifically, for a given item $v_{l}$ , if FAM contains an empirically valid action, XENON reuses it without prompting the LLM. Otherwise, XENON prompts the LLM to select an action, leveraging (i) valid actions for items semantically similar to $v_{l}$ and (ii) empirically invalid actions for $v_{l}$ .
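The valid/invalid classification and the reuse-or-prompt decision can be sketched as below, using the FAM success/failure counts $S$ and $F$ and the invalid-action threshold $x_{0}$; the function names and the `llm_choose` callback are illustrative assumptions.

```python
def classify_actions(actions, S, F, x0):
    """Split actions for one item into empirically valid / invalid per FAM counts.

    S[a] and F[a] are success/failure counts for this item; x0 is the
    invalid-action threshold.
    """
    valid = [a for a in actions if S[a] > 0 and S[a] > F[a] - x0]
    invalid = [a for a in actions if F[a] >= S[a] + x0]
    return valid, invalid

def select_action(actions, S, F, x0, llm_choose):
    valid, invalid = classify_actions(actions, S, F, x0)
    if valid:
        return valid[0]  # reuse an empirically valid action without prompting
    # Otherwise prune invalid actions and defer to the LLM over the remainder;
    # if everything was pruned, fall back to the full action set.
    candidates = [a for a in actions if a not in invalid] or list(actions)
    return llm_choose(candidates)
```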
The pseudocode for this action selection process is detailed in Algorithm 4. The prompt is shown in Figure 16.
Input : An item $v$ , Action set $\mathcal{A}$ , Success/Failure counts from FAM $S(\cdot,\cdot)$ and $F(\cdot,\cdot)$ , Invalid action threshold $x_{0}$
Output : Selected action $a_{selected}$
/* 1. Classify actions based on FAM history (S and F counts) */
3 $\mathcal{A}^{valid}_{v}\leftarrow\{a\in\mathcal{A}\mid S(a,v)>0\land S(a,v)>F(a,v)-x_{0}\}$
4 $\mathcal{A}^{invalid}_{v}\leftarrow\{a\in\mathcal{A}\mid F(a,v)\geq S(a,v)+x_{0}\}$
6 if $\mathcal{A}^{valid}_{v}\neq\emptyset$ then
/* Reuse the empirically valid action if it exists */
7 Select $a_{selected}$ from $\mathcal{A}^{valid}_{v}$
8 return $a_{selected}$
10 else
/* Otherwise, query LLM with similar examples and filtered candidates */
/* (i) Retrieve valid actions from other items for examples */
12 $\mathcal{V}_{source}\leftarrow\{u\in\hat{\mathcal{V}}\setminus\{v\}\mid\exists a^{\prime},S(a^{\prime},u)>0\land S(a^{\prime},u)>F(a^{\prime},u)-x_{0}\}$
13 Identify $\mathcal{V}_{topK}\subseteq\mathcal{V}_{source}$ as the $K$ items most similar to $v$ (using S-BERT)
14 $\mathcal{D}_{examples}\leftarrow\{(u,a_{valid})\mid u\in\mathcal{V}_{topK},a_{valid}\in\mathcal{A}^{valid}_{u}\}$
/* (ii) Prune invalid actions to form candidates */
16 $\mathcal{A}^{cand}_{v}\leftarrow\mathcal{A}\setminus\mathcal{A}^{invalid}_{v}$
18 if $\mathcal{A}^{cand}_{v}=\emptyset$ then
19 $\mathcal{A}^{cand}_{v}\leftarrow\mathcal{A}$
21 $a_{selected}\leftarrow\text{LLM}(v,\mathcal{D}_{examples},\mathcal{A}^{cand}_{v})$
22 return $a_{selected}$
Algorithm 4 Step-by-step Planning with FAM
1 For an item name, you need to make a plan, by selecting one among provided options.
2 I will give you examples of which plans are needed to achieve an item, just for reference.
3
4 % Similar items and their successful plans are given
5 [Example]
6 < item name >
7 {similar _ item}
8 < task planning >
9 {successful _ plan}
10
11 [Your turn]
12 Here is < item name >, you MUST select one from below < options >, to make < task planning >.
13 you MUST select one from below < options >. DO NOT MAKE A PLAN NOT IN < options >.
14
15 % Three actions are given, excluding any that were empirically invalid
16 < options >:
17 1: {" task ": " dig down and mine {item}", " goal ": [{item}, {quantity}]}
18 2: {" task ": " craft {item}", " goal ": [{item}, {quantity}]}
19 3: {" task ": " smelt {item}", " goal ": [{item}, {quantity}]}
20
21 < item name >
22 {item}
23 < task planning >
Figure 16: Prompt for action selection.
## Appendix G Difficulty-based Exploration (DEX)
For autonomous dependency learning, we introduce DEX (line 5 in Algorithm 1). DEX strategically selects items that (1) appear easier to obtain, prioritizing those that are (2) under-explored, for diversity, and (3) have fewer immediate prerequisite items according to the learned graph $\hat{\mathcal{G}}$ . First, DEX defines the frontier $\mathcal{F}$ as the set of previously unobtained items whose required items have all been obtained according to the learned dependency graph $\hat{\mathcal{G}}$ . Next, the least-explored frontier set $\mathcal{F}_{min}\coloneqq\{f\in \mathcal{F}\mid C(f)=\min_{f^{\prime}\in \mathcal{F}}C(f^{\prime})\}$ is identified based on revision counts $C(\cdot)$ . For each item $f^{\prime}\in\mathcal{F}_{min}$ , the difficulty $D(f^{\prime})$ is estimated as $L_{f^{\prime}}$ , the number of distinct required items needed to obtain $f^{\prime}$ according to $\hat{\mathcal{G}}$ . The intrinsic goal $g$ is then selected as the item in $\mathcal{F}_{min}$ with the minimum estimated difficulty, $g=\arg\min_{f^{\prime}\in\mathcal{F}_{min}}D(f^{\prime})$ , with ties broken uniformly at random.
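The selection rule can be sketched as follows; the `num_required` callback stands in for computing $L_{f}$ from the learned graph and is an illustrative assumption.

```python
import random

def select_goal_with_difficulty(frontier, C, num_required):
    """DEX goal selection: least-explored frontier items first, then easiest.

    C maps items to revision counts; num_required(f) returns the number of
    distinct required items for f under the learned graph.
    """
    # Restrict to the least-explored frontier items.
    c_min = min(C[f] for f in frontier)
    f_min = [f for f in frontier if C[f] == c_min]
    # Among those, pick the item with minimum estimated difficulty L_f.
    d_min = min(num_required(f) for f in f_min)
    easiest = [f for f in f_min if num_required(f) == d_min]
    return random.choice(easiest)  # ties broken uniformly at random
```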
While our frontier concept is motivated by DECKARD (Nottingham et al., 2023), DEX's selection process differs significantly. DECKARD selects randomly from $\{v\in\mathcal{F}\mid C(v)\leq c_{0}\}$ ; if this set is empty, it selects randomly from the union of the frontier set and the set of previously obtained items, which risks inefficient attempts on already obtained items. In contrast, DEX exclusively selects goals from $\mathcal{F}_{\text{min}}$ , inherently avoiding obtained items. This efficiently guides exploration towards achievable, novel dependencies.
## Appendix H Context-aware Reprompting (CRe)
Minecraft, a real-world-like environment, can lead to situations where the controller stalls (e.g., when stuck in deep water or a cave). To assist the controller, the agent provides temporary prompts to guide it (e.g., "get out of the water and find trees"). XENON proposes a context-aware reprompting scheme. It is inspired by Optimus-1 (Li et al., 2024b) but introduces two key differences:
1. Two-stage reasoning. When invoked, Optimus-1's LLM simultaneously interprets image observations, decides whether to reprompt, and generates new prompts. XENON decomposes this process into two distinct steps:
    1. the LLM generates a caption for the current image observation, and
    2. using text-only input (the generated caption and the current subgoal prompt), the LLM determines whether reprompting is necessary and, if so, produces a temporary prompt.
2. Trigger. Unlike Optimus-1, which invokes the LLM at fixed intervals, XENON calls the LLM only if the current subgoal item has not been obtained within that interval. This approach avoids unnecessary or spurious interventions from a smaller LLM.
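The trigger and the two-stage decomposition can be sketched together; `caption_llm` and `reason_llm` are hypothetical callbacks standing in for the two LLM calls, and the decision format mirrors the JSON in the prompt below.

```python
def maybe_reprompt(steps_since_check, interval, subgoal_item, inventory,
                   caption_llm, reason_llm, image, current_prompt):
    """Context-aware reprompting (sketch; callback names are illustrative).

    The LLM is consulted only when the subgoal item was not obtained within
    the check interval; reasoning is split into caption + text-only steps.
    Returns a temporary prompt string, or None when no intervention is needed.
    """
    if steps_since_check < interval or inventory.get(subgoal_item, 0) > 0:
        return None  # progress is fine; skip the LLM entirely
    caption = caption_llm(image)                    # step (a): describe the scene
    decision = reason_llm(current_prompt, caption)  # step (b): text-only reasoning
    return decision["task"] if decision.get("need_intervention") else None
```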
The prompt is shown in Figure 17.
1 % Prompt for the first step: image captioning
2 Given a Minecraft game image, describe nearby Minecraft objects, like tree, grass, cobblestone, etc.
3 [Example]
4 " There is a large tree with dark green leaves surrounding the area."
5 " The image shows a dark, cave - like environment in Minecraft. The player is digging downwards. There are no visible trees or grass in this particular view."
6 " The image shows a dark, narrow tunnel made of stone blocks. The player is digging downwards."
7 [Your turn]
8 Describe the given image, simply and clearly like the examples.
9
10 % Prompt for the second step: reasoning whether reprompting is needed or not
11 Given < task > and < visual _ description >, determine if the player needs intervention to achieve the goal. If intervention is needed, suggest a task that the player should perform.
12 I will give you examples.
13 [Example]
14 < task >: chop tree
15 < visual _ description >: There is a large tree with dark green leaves surrounding the area.
16 < goal _ item >: logs
17 < reasoning >:
18 {{
19 " need _ intervention ": false,
20 " thoughts ": " The player can see a tree and can chop it down to get logs.",
21 " task ": "",
22}}
23 [Example]
24 < task >: chop tree
25 < visual _ description >: The image shows a dirt block in Minecraft. There is a tree in the image, but it is too far from here.
26 < goal _ item >: logs
27 < reasoning >:
28 {{
29 " need _ intervention ": true,
30 " thoughts ": " The player is far from trees. The player needs to move to the trees.",
31 " task ": " explore to find trees ",
32}}
33 [Example]
34 < task >: dig down to mine iron _ ore
35 < visual _ description >: The image shows a dark, narrow tunnel made of stone blocks. The player is digging downwards.
36 < goal _ item >: iron _ ore
37 < reasoning >:
38 {{
39 " need _ intervention ": false,
40 " thoughts ": " The player is already digging down and is likely to find iron ore.",
41 " task ": "",
42}}
43 [Your turn]
44 Here is the < task >, < visual _ description >, and < goal _ item >.
45 You MUST output the < reasoning > in JSON format.
46 < task >: {task} % current prompt for the controller
47 < visual _ description >: {visual _ description} % caption from the step 1
48 < goal _ item >: {goal _ item} % current subgoal item
49 < reasoning >:
Figure 17: Prompt for context-aware reprompting.
## Appendix I Implementation details
To identify similar items, semantic similarity between two items is computed as the cosine similarity of their Sentence-BERT (all-MiniLM-L6-v2 model) embeddings (Reimers and Gurevych, 2019). This metric is utilized whenever item similarity comparisons are needed, such as in Algorithms 2, 3, and 4.
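A minimal sketch of the metric and the top-K retrieval it supports; the toy two-dimensional vectors below only illustrate the computation, whereas the paper uses 384-dimensional Sentence-BERT embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_similar(query_vec, item_vecs, k):
    # Rank experienced items by similarity to the query item's embedding
    # and keep the K most similar ones.
    ranked = sorted(item_vecs,
                    key=lambda v: cosine_similarity(query_vec, item_vecs[v]),
                    reverse=True)
    return ranked[:k]
```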
### I.1 Hyperparameters
Table 7: Hyperparameters used in our experiments.
| Hyperparameter | Notation | Value |
| --- | --- | --- |
| Failure threshold for invalid action | $x_{0}$ | $2$ |
| Revision count threshold for inadmissible items | $c_{0}$ | $3$ |
| Required items quantity scale for less explored items | $\alpha_{s}$ | $2$ |
| Required items quantity scale for inadmissible items | $\alpha_{i}$ | $8$ |
| Number of top-K similar experienced items used | $K$ | $3$ |
For all experiments, we use consistent hyperparameters across environments. Their values were chosen mainly for robustness against imperfect controllers. All hyperparameters are listed in Table 7. The implications of increasing each hyperparameter's value are detailed below:
- $x_{0}$ (failure threshold for empirically invalid actions): Prevents valid actions from being misclassified as invalid due to accidental failures from an imperfect controller or environmental stochasticity. Values that are too small or too large hinder dependency learning and planning by hampering the discovery of valid actions.
- $c_{0}$ (exploration count threshold for inadmissible items): Ensures an item is attempted sufficiently often before being deemed 'inadmissible' and triggering a revision for its descendants. Values that are too small or too large cause inefficiency: small values prematurely abandon potentially correct LLM predictions for descendants, while large values delay attempts on descendant items.
- $\alpha_{s}$ (required items quantity scale for less explored items): Controls the gradual increase of required quantities for revised required items. Small values make learning inefficient by hindering item obtaining due to insufficient required items, yet large values lower robustness by overburdening controllers with excessive quantity demands.
- $\alpha_{i}$ (required items quantity scale for inadmissible items): Ensures sufficient acquisition of potential required items before retrying inadmissible items, increasing the chance of success. Improper values reduce robustness: values that are too small lead to failures on items that require many resources, while values that are too large burden controllers with excessive quantity demands.
- $K$ (number of similar items to retrieve): Determines how many similar, previously successful experiences are retrieved to inform dependency revision (Algorithm 3) and action selection (Algorithm 4).
### I.2 Human-written plans
We utilize three human-written plans (for an iron sword, a golden sword, and a diamond, shown in Figures 18, 19, and 20, respectively), whose format is borrowed from the human-written plan examples in the publicly released Optimus-1 repository https://github.com/JiuTian-VL/Optimus-1/blob/main/src/optimus1/example.py. We leverage the experiences gained from executing these plans to initialize XENON's knowledge.
1 iron_sword: str = """
2 <goal>: craft an iron sword.
3 <requirements>:
4 1. log: need 7
5 2. planks: need 21
6 3. stick: need 5
7 4. crafting_table: need 1
8 5. wooden_pickaxe: need 1
9 6. cobblestone: need 11
10 7. furnace: need 1
11 8. stone_pickaxe: need 1
12 9. iron_ore: need 2
13 10. iron_ingot: need 2
14 11. iron_sword: need 1
15 <plan>
16 {
17 " step 1 ": {" prompt ": " mine logs ", " item ": [" logs ", 7]},
18 " step 2 ": {" prompt ": " craft planks ", " item ": [" planks ", 21]},
19 " step 3 ": {" prompt ": " craft stick ", " item ": [" stick ", 5]},
20 " step 4 ": {" prompt ": " craft crafting_table ", " item ": [" crafting_table ", 1]},
21 " step 5 ": {" prompt ": " craft wooden_pickaxe ", " item ": [" wooden_pickaxe ", 1]},
22 " step 6 ": {" prompt ": " mine cobblestone ", " item ": [" cobblestone ", 11]},
23 " step 7 ": {" prompt ": " craft furnace ", " item ": [" furnace ", 1]},
24 " step 8 ": {" prompt ": " craft stone_pickaxe ", " item ": [" stone_pickaxe ", 1]},
25 " step 9 ": {" prompt ": " mine iron_ore ", " item ": [" iron_ore ", 2]},
26 " step 10 ": {" prompt ": " smelt iron_ingot ", " item ": [" iron_ingot ", 2]},
27 " step 11 ": {" prompt ": " craft iron_sword ", " item ": [" iron_sword ", 1]}
28}
29 """
Figure 18: Human-written plan for crafting an iron sword.
1 golden_sword: str = """
2 <goal>: craft a golden sword.
3 <requirements>:
4 1. log: need 9
5 2. planks: need 27
6 3. stick: need 7
7 4. crafting_table: need 1
8 5. wooden_pickaxe: need 1
9 6. cobblestone: need 11
10 7. furnace: need 1
11 8. stone_pickaxe: need 1
12 9. iron_ore: need 3
13 10. iron_ingot: need 3
14 11. iron_pickaxe: need 1
15 12. gold_ore: need 2
16 13. gold_ingot: need 2
17 14. golden_sword: need 1
18 <plan>
19 {
20 " step 1 ": {" prompt ": " mine logs ", " item ": [" logs ", 7]},
21 " step 2 ": {" prompt ": " craft planks ", " item ": [" planks ", 21]},
22 " step 3 ": {" prompt ": " craft stick ", " item ": [" stick ", 5]},
23 " step 4 ": {" prompt ": " craft crafting_table ", " item ": [" crafting_table ", 1]},
24 " step 5 ": {" prompt ": " craft wooden_pickaxe ", " item ": [" wooden_pickaxe ", 1]},
25 " step 6 ": {" prompt ": " mine cobblestone ", " item ": [" cobblestone ", 11]},
26 " step 7 ": {" prompt ": " craft furnace ", " item ": [" furnace ", 1]},
27 " step 8 ": {" prompt ": " craft stone_pickaxe ", " item ": [" stone_pickaxe ", 1]},
28 " step 9 ": {" prompt ": " mine iron_ore ", " item ": [" iron_ore ", 3]},
29 " step 10 ": {" prompt ": " smelt iron_ingot ", " item ": [" iron_ingot ", 3]},
30 " step 11 ": {" task ": " craft iron_pickaxe ", " goal ": [" iron_pickaxe ", 1]},
31 " step 12 ": {" prompt ": " mine gold_ore ", " item ": [" gold_ore ", 2]},
32 " step 13 ": {" prompt ": " smelt gold_ingot ", " item ": [" gold_ingot ", 2]},
33 " step 14 ": {" task ": " craft golden_sword ", " goal ": [" golden_sword ", 1]}
34}
35 " ""
Figure 19: Human-written plan for crafting a golden sword.
```python
diamond: str = """
<goal>: mine a diamond.
<requirements>:
1. log: need 7
2. planks: need 21
3. stick: need 6
4. crafting_table: need 1
5. wooden_pickaxe: need 1
6. cobblestone: need 11
7. furnace: need 1
8. stone_pickaxe: need 1
9. iron_ore: need 3
10. iron_ingot: need 3
11. iron_pickaxe: need 1
12. diamond: need 1
<plan>
{
"step 1": {"prompt": "mine logs", "item": ["logs", 7]},
"step 2": {"prompt": "craft planks", "item": ["planks", 21]},
"step 3": {"prompt": "craft stick", "item": ["stick", 5]},
"step 4": {"prompt": "craft crafting_table", "item": ["crafting_table", 1]},
"step 5": {"prompt": "craft wooden_pickaxe", "item": ["wooden_pickaxe", 1]},
"step 6": {"prompt": "mine cobblestone", "item": ["cobblestone", 11]},
"step 7": {"prompt": "craft furnace", "item": ["furnace", 1]},
"step 8": {"prompt": "craft stone_pickaxe", "item": ["stone_pickaxe", 1]},
"step 9": {"prompt": "mine iron_ore", "item": ["iron_ore", 2]},
"step 10": {"prompt": "smelt iron_ingot", "item": ["iron_ingot", 2]},
"step 11": {"prompt": "craft iron_pickaxe", "item": ["iron_pickaxe", 1]},
"step 12": {"prompt": "mine diamond", "item": ["diamond", 1]}
}
"""
```
Figure 20: Human-written plan for mining a diamond.
## Appendix J Details for experimental setup
### J.1 Compared baselines for dependency learning
We compare our proposed method, XENON, against four baselines: LLM self-correction (SC), DECKARD (Nottingham et al., 2023), ADAM (Yu and Lu, 2024), and RAND (the simplest baseline). As no prior baselines were evaluated under our specific experimental setup (i.e., empty initial inventory, pre-trained low-level controller), we adapted their implementations to align with our environment. SC is implemented following common methods that prompt the LLM to correct its own knowledge upon plan failures (Shinn et al., 2023; Stechly et al., 2024). A summary of all methods compared in our experiments is provided in Table 8. All methods share the following common experimental setting: each episode starts with initial experienced requirements for some items, derived from human-written plans (details in Appendix I). Additionally, all agents begin each episode with an empty inventory.
Table 8: Summary of methods compared in our experiments.
LLM self-correction (SC)
While no prior work specifically uses LLM self-correction to learn Minecraft item dependencies in our setting, we include this baseline to demonstrate the unreliability of this approach. Like XENON, SC initializes its dependency graph with LLM-predicted requirements for each item. When a plan for an item fails repeatedly, SC prompts the LLM to revise that item's requirements, providing recent trajectories and the validated requirements of similar, previously obtained items in the input prompt. SC's action memory stores both successful and failed actions for each item. Upon a plan failure, the LLM is prompted to self-reflect on the recent trajectory to determine the cause of failure. When the agent later plans to obtain an item on which it previously failed, this reflection is included in the LLM's prompt to guide its action selection. Intrinsic goals are selected randomly from the set of previously unobtained items. The specific prompts used for LLM self-correction and self-reflection in this baseline are provided in Appendix B.
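The SC correction step can be sketched as follows; `query_llm`, the helper name `self_correct`, and the prompt wording are hypothetical stand-ins for the actual LLM call and prompts used in our experiments, not the exact implementation.

```python
def self_correct(item, dep_graph, recent_trajectories, validated_reqs, query_llm):
    """Sketch: ask the LLM to revise the requirement set of a repeatedly
    failing item, given recent trajectories and validated requirements of
    similar items. `query_llm` returns the LLM's corrected requirement set."""
    prompt = (
        f"The plan for obtaining '{item}' failed repeatedly.\n"
        f"Current predicted requirements: {dep_graph[item]}\n"
        f"Recent trajectories: {recent_trajectories}\n"
        f"Validated requirements of similar items: {validated_reqs}\n"
        "Output a corrected requirement set."
    )
    # Replace the flawed prediction in place with the LLM's revision.
    dep_graph[item] = query_llm(prompt)
    return dep_graph
```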
DECKARD
The original DECKARD utilizes LLM-predicted requirements for each item but does not revise these initial predictions. It has no explicit action memory for the planner; instead, it trains and maintains specialized policies for each obtained item. It selects an intrinsic goal randomly from less explored frontier items (i.e., $\{v\in\mathcal{F}\mid C(v)\leq c_{0}\}$ ). If no such items are available, it selects randomly from the union of experienced items and all frontier items.
In our experiments, the DECKARD baseline is implemented to largely mirror the original version, with the exception of its memory system. Its memory is implemented to store only successful actions without recording failures. This design choice aligns with the original DECKARD's approach, which, by only learning policies for successfully obtained items, lacks policies for unobtained items.
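The goal-selection rule described above can be sketched as follows; the function and variable names are ours, and the visit count $C(v)$ is represented as a dictionary.

```python
import random

def select_intrinsic_goal(frontier, experienced, visit_count, c0, rng=random):
    """DECKARD-style intrinsic goal selection (sketch; helper names are ours).

    Prefer less-explored frontier items {v in F | C(v) <= c0}; if none
    remain, fall back to the union of experienced and frontier items."""
    under_explored = [v for v in frontier if visit_count.get(v, 0) <= c0]
    if under_explored:
        return rng.choice(under_explored)
    # Sorted for determinism before the random draw over the fallback set.
    return rng.choice(sorted(set(experienced) | set(frontier)))
```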
ADAM
The original ADAM started with an initial inventory containing 32 units of each experienced resource item (i.e., items used for crafting other items) and 1 unit of each tool item (e.g., pickaxes, crafting table), implicitly treating those items as a predicted requirement set for each item. Its memory recorded which actions were used for each subgoal item without noting success or failure, and its intrinsic goal selection was guided by an expert-defined exploration curriculum.
In our experiments, ADAM starts with an empty initial inventory. The predicted requirements for each goal item in our ADAM implementation assume a fixed quantity of 8 for all resource items. This quantity was chosen to align with $\alpha_{i}$, the hyperparameter for the quantity scale of requirement items for inadmissible items, thereby ensuring a fair comparison with XENON. The memory stores successful actions for each item but does not record failures. This modification aligns the memory mechanism with the SC and DECKARD baselines, enabling a more consistent comparison across baselines in our experimental setup. Intrinsic goal selection is random, as we do not assume an expert-defined exploration curriculum.
RAND
RAND is a simple baseline specifically designed for our experimental setup. It starts with an empty initial inventory and an LLM-predicted requirement set for each item. RAND does not incorporate any action memory. Its intrinsic goals are selected randomly from unexperienced items.
### J.2 MineRL environment
#### J.2.1 Basic rules
Minecraft has been adopted as a suitable testbed for validating the performance of AI agents on long-horizon tasks (Mao et al., 2022; Lin et al., 2021; Baker et al., 2022; Li et al., 2025a), largely because of the inherent dependencies in item acquisition: agents must obtain prerequisite items before more advanced ones. Specifically, Minecraft features multiple technology levels (wood, stone, iron, gold, diamond, etc.) that dictate item and tool dependencies. For instance, an agent must first craft a lower-level tool like a wooden pickaxe to mine materials such as stone. A stone pickaxe is then required to mine higher-level materials like iron, and an iron pickaxe is required to mine materials like gold and diamond. Respecting these dependencies is crucial for achieving complex goals, such as crafting an iron sword or mining a diamond.
#### J.2.2 Observation and action space
We employ MineRL (Guss et al., 2019) with Minecraft version 1.16.5.
Observation
When making a plan, our agent receives inventory information (i.e., items with their quantities) as text. When executing the plan, our agent receives an RGB image with dimensions of $640\times 360$, including the hotbar, health indicators, food saturation, and animations of the player's hands.
Action space
Following Optimus-1 (Li et al., 2024b), our low-level action space consists of keyboard and mouse controls, augmented with high-level craft and smelt actions. These high-level actions automatically succeed in producing an item if the agent possesses all the required items and a valid action for that item is chosen; otherwise, they fail. This abstraction removes the need for complex, precise low-level mouse control for these specific actions. For low-level controls, keyboard presses control agent movement (e.g., jumping, moving forward, backward), and mouse movements control the agent's perspective. The mouse's left and right buttons are used for attacking, using, or placing items. The detailed action space is described in Table 9.
Table 9: Action space in MineRL environment
| Index | Action | Human Action | Description |
| --- | --- | --- | --- |
| 1 | Forward | key W | Move forward. |
| 2 | Back | key S | Move back. |
| 3 | Left | key A | Move left. |
| 4 | Right | key D | Move right. |
| 5 | Jump | key Space | Jump. When swimming, keeps the player afloat. |
| 6 | Sneak | key left Shift | Slowly move in the current direction of movement. |
| 7 | Sprint | key left Ctrl | Move quickly in the direction of current movement. |
| 8 | Attack | left Button | Destroy blocks (hold down); Attack entity (click once). |
| 9 | Use | right Button | Place blocks or entities, open items, or perform other interactions defined by the game. |
| 10 | hotbar [1-9] | keys 1-9 | Selects the corresponding hotbar item. |
| 11 | Open/Close Inventory | key E | Opens the inventory; closes any open GUI. |
| 12 | Yaw | move Mouse X | Turning; aiming; camera movement. Ranges from -180 to +180. |
| 13 | Pitch | move Mouse Y | Turning; aiming; camera movement. Ranges from -180 to +180. |
| 14 | Craft | - | Execute crafting to obtain a new item. |
| 15 | Smelt | - | Execute smelting to obtain a new item. |
#### J.2.3 Goals
We consider 67 goals from the long-horizon task benchmark suggested in (Li et al., 2024b). These goals are categorized into 7 groups based on Minecraft's item categories: Wood, Stone, Iron, Gold, Diamond, Redstone, and Armor. All goal items within each group are listed in Table 10.
Table 10: Setting of 7 groups encompassing 67 Minecraft long-horizon goals.
| Group | Goal Num. | All goal items |
| --- | --- | --- |
| Wood | 10 | bowl, crafting_table, chest, ladder, stick, wooden_axe, wooden_hoe, wooden_pickaxe, wooden_shovel, wooden_sword |
| Stone | 9 | charcoal, furnace, smoker, stone_axe, stone_hoe, stone_pickaxe, stone_shovel, stone_sword, torch |
| Iron | 16 | blast_furnace, bucket, chain, hopper, iron_axe, iron_bars, iron_hoe, iron_nugget, iron_pickaxe, iron_shovel, iron_sword, rail, shears, smithing_table, stonecutter, tripwire_hook |
| Gold | 6 | gold_ingot, golden_axe, golden_hoe, golden_pickaxe, golden_shovel, golden_sword |
| Redstone | 6 | activator_rail, compass, dropper, note_block, piston, redstone_torch |
| Diamond | 7 | diamond, diamond_axe, diamond_hoe, diamond_pickaxe, diamond_shovel, diamond_sword, jukebox |
| Armor | 13 | diamond_boots, diamond_chestplate, diamond_helmet, diamond_leggings, golden_boots, golden_chestplate, golden_helmet, golden_leggings, iron_boots, iron_chestplate, iron_helmet, iron_leggings, shield |
Additional goals for scalability experiments.
To evaluate the scalability of XENON with respect to the number of goals (Section K.9), we extend the above 67-goal set (Table 10) by adding goal items to construct two larger settings with 100 and 120 goals; the added goals are listed in Table 11.
Specifically, in the setting with 100 goals, we add 33 goals in total by introducing new "leather", "paper", and "flint" groups and by adding more items to the existing "wood" and "stone" groups. In the setting with 120 goals, we further add 20 goals in the "iron", "gold", "redstone", and "diamond" groups.
Table 11: Additional goals used for the scalability experiments. The setting with 100 goals extends the 67-goal set in Table 10 by adding all items in the top block; the setting with 120 goals further includes both the top and bottom blocks.
| Group | Goal Num. | Added goal items |
| --- | --- | --- |
| Additional items in the setting with 100 goals (33 items) | | |
| leather | 7 | leather, leather_boots, leather_chestplate, leather_helmet, leather_leggings, leather_horse_armor, item_frame |
| paper | 5 | map, book, cartography_table, bookshelf, lectern |
| flint | 4 | flint, flint_and_steel, fletching_table, arrow |
| wood | 8 | bow, boat, wooden_slab, wooden_stairs, wooden_door, wooden_sign, wooden_fence, wooden_fence_gate |
| stone | 9 | cobblestone_slab, cobblestone_stairs, cobblestone_wall, lever, stone_slab, stone_button, stone_pressure_plate, stone_bricks, grindstone |
| Additional items only in the setting with 120 goals (20 more items) | | |
| iron | 7 | iron_trapdoor, heavy_weighted_pressure_plate, iron_door, crossbow, minecart, cauldron, lantern |
| gold | 4 | gold_nugget, light_weighted_pressure_plate, golden_apple, golden_carrot |
| redstone | 7 | redstone, powered_rail, target, dispenser, clock, repeater, detector_rail |
| diamond | 2 | obsidian, enchanting_table |
#### J.2.4 Episode horizon
The episode horizon varies depending on the experiment phase: dependency learning or long-horizon goal planning. During the dependency learning phase, each episode has a fixed horizon of 36,000 steps. In this phase, if the agent successfully achieves an intrinsic goal within an episode, it is allowed to select another intrinsic goal and continue exploration without the episode ending. After dependency learning, when measuring the success rate of goals from the long-horizon task benchmark, the episode horizon differs based on the goal's category group, and the episode terminates immediately upon success of a goal. The specific episode horizons for each group are as follows: Wood: 3,600 steps; Stone: 7,200 steps; Iron: 12,000 steps; and Gold, Diamond, Redstone, and Armor: 36,000 steps each.
#### J.2.5 Item spawn probability details
Following Optimus-1's public implementation, we modify the environment configuration relative to the original MineRL environment (Guss et al., 2019). In Minecraft, obtaining essential resources such as iron, gold, and diamond requires mining their respective ores. However, these ores are naturally rare, making them challenging to obtain. This inherent difficulty can significantly hinder an agent's goal completion, even with an accurate plan. This challenge in resource gathering due to an imperfect controller is a common bottleneck, leading many prior works to employ environmental modifications to focus on planning. For example, DEPS (Wang et al., 2023b) restricts the controller's actions based on the goal items (https://github.com/CraftJarvis/MC-Planner/blob/main/controller.py). Optimus-1 (Li et al., 2024b) also made resource items easier to obtain by increasing ore spawn probabilities. To focus on our primary goal of robust planning and isolate this challenge, we follow Optimus-1 and adopt its ore spawn procedure directly from the publicly released Optimus-1 repository, without any modifications to its source code (https://github.com/JiuTian-VL/Optimus-1/blob/main/src/optimus1/env/wrapper.py).
The ore spawn procedure probabilistically spawns ore blocks in the vicinity of the agent's current coordinates $(x,y,z)$. Specifically, at each timestep, the procedure has a 10% chance of activating. When activated, it spawns a specific type of ore block based on the agent's y-coordinate. Furthermore, for any given episode, the procedure does not activate more than once at the same y-coordinate. The types of ore blocks spawned at different y-levels are as follows:
- Coal Ore: between y=45 and y=50.
- Iron Ore: between y=26 and y=43.
- Gold Ore: between y=15 and y=26.
- Redstone Ore: between y=15 and y=26.
- Diamond Ore: below y=14.
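The rules above can be sketched as follows; this is our reconstruction of the described behavior rather than the wrapper's actual code, and breaking the tie between the overlapping gold and redstone bands by a random choice is a simplifying assumption.

```python
import random

# Ore bands by the agent's y-coordinate, as described above: (ore, low, high).
ORE_BANDS = [
    ("coal_ore", 45, 50),
    ("iron_ore", 26, 43),
    ("gold_ore", 15, 26),
    ("redstone_ore", 15, 26),
    ("diamond_ore", float("-inf"), 14),
]

def maybe_spawn_ore(y, activated_ys, rng=random):
    """Sketch of the ore-spawn procedure: 10% activation chance per timestep,
    at most one activation per y-coordinate per episode. Returns the spawned
    ore type, or None when the procedure does not fire."""
    yi = int(y)
    if yi in activated_ys or rng.random() >= 0.1:
        return None
    candidates = [ore for ore, lo, hi in ORE_BANDS if lo <= yi <= hi]
    if not candidates:
        return None  # y-level outside every band (e.g., y=44)
    activated_ys.add(yi)
    return rng.choice(candidates)
```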
### J.3 Mineflayer Environment
We use the Mineflayer (PrismarineJS, 2023) environment with Minecraft version 1.19. In Mineflayer, resource item spawn probabilities do not need to be adjusted, unlike in MineRL (Section J.2.5), because the controller (the JavaScript APIs provided by Mineflayer) can reliably gather many resource items.
#### J.3.1 Observation and Action Space
The agent's observation space is multimodal. For planning, the agent receives its current inventory (i.e., item names and their quantities) as text. For plan execution, it receives a first-person RGB image that includes the hotbar, health and food indicators, and player hand animations. For the action space, following ADAM (Yu and Lu, 2024), we use the JavaScript APIs provided by Mineflayer for low-level control. Specifically, our high-level actions, such as "craft", "smelt", and "mine", are mapped to corresponding Mineflayer APIs: craftItem, smeltItem, and mineBlock.
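This correspondence amounts to a simple lookup; the helper below is an illustrative sketch (the function name and the rendered call string are ours, not Mineflayer's actual API surface).

```python
# Mapping from our high-level actions to the Mineflayer API names named above.
HIGH_LEVEL_TO_MINEFLAYER = {
    "craft": "craftItem",
    "smelt": "smeltItem",
    "mine": "mineBlock",
}

def to_api_call(action, item, quantity=1):
    """Render a high-level subgoal as a Mineflayer-style call string (sketch)."""
    return f"{HIGH_LEVEL_TO_MINEFLAYER[action]}('{item}', {quantity})"
```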
#### J.3.2 Episode Horizon
For dependency learning, each episode has a fixed horizon of 30 minutes, which is equivalent to 36,000 steps in the MineRL environment. If the agent successfully achieves a goal within this horizon, it selects another exploratory goal and continues within the same episode.
### J.4 MC-TextWorld
MC-TextWorld is a text-based environment based on Minecraft game rules (Zheng et al., 2025). We employ Minecraft version 1.16.5. In this environment, the basic rules and goals are the same as those in the MineRL environment (Section J.2). Furthermore, resource item spawn probabilities do not need to be adjusted, unlike in MineRL (Section J.2.5), because an agent succeeds in mining an item immediately, without spatial exploration, if it has the required tool and "mine" is a valid action for that item.
In the following subsections, we detail the remaining aspects of the experimental setup in this environment: the observation and action space, and the episode horizon.
#### J.4.1 Observation and action space
The agent receives a text-based observation consisting of inventory information (i.e., currently possessed items and their quantities). Actions are also text-based, where each action is represented as a high-level action followed by an item name (e.g., "mine diamond"). Thus, to execute a subgoal specified as $(a,q,v)$ (high-level action $a$, quantity $q$, item $v$), the agent repeatedly performs the action $(a,v)$ until $q$ units of $v$ are obtained.
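The repeat-until-obtained loop can be sketched as follows; `step_fn` is a hypothetical environment step (not MC-TextWorld's actual interface) that updates the inventory and returns it.

```python
def execute_subgoal(action, quantity, item, step_fn, inventory):
    """Sketch of MC-TextWorld subgoal execution for (a, q, v): repeat the
    text action (action, item) until `quantity` units of `item` are held."""
    while inventory.get(item, 0) < quantity:
        inventory = step_fn(action, item, inventory)
    return inventory
```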
#### J.4.2 Episode horizon
In this environment, we conduct experiments for dependency learning only. Each episode has a fixed horizon of 3,000 steps. If the agent successfully achieves an intrinsic goal within an episode, it is then allowed to select another intrinsic goal and continue exploration, without termination of the episode.
#### J.4.3 Perturbation on ground truth rules
Figure 21: Illustration of the ground-truth rule perturbation settings. (a) In the vanilla setting, goal items (black boxes) have standard required items (incoming edges) and "craft" is the valid action; (b) in the Perturbed Requirements setting, one required item (red dashed circle) is replaced by a new one, chosen randomly from a candidate pool (blue dashed box); (c) in the Perturbed Actions setting, the valid action is changed to either "mine" or "smelt".
To evaluate each agent's robustness to conflicts with its prior knowledge, we perturb the ground-truth rules (required items and actions) for a subset of goal items, as shown in Figure 21. The perturbation is applied at different intensity levels (from 1 to 3), where higher levels affect a greater number of items. These levels are cumulative, meaning a Level 2 perturbation includes all perturbations from Level 1 plus additional ones.
- Vanilla Setting: In the setting with no perturbation (Figure 21a), the ground-truth rules are unmodified. In the figure, items in the black solid boxes are the goal items, and those with arrows pointing to them are their true required items. Each goal item has "craft" as a valid action.
- Perturbed True Required Items: In this setting (Figure 21b), one of the true required items (indicated by a red dashed circle) for a goal is replaced. The new required item is chosen uniformly at random from a candidate pool (blue dashed box). The valid action remains "craft".
- Perturbed True Actions: In this setting (Figure 21c), the valid action for a goal is randomly changed from "craft" to either "mine" or "smelt". The required items are not modified.
- Perturbed Both Rules: In this setting, both the required items and the valid actions are modified according to the rules described above.
## Appendix K Additional experimental results
### K.1 LLM-predicted initial dependency graph analysis
Table 12: Performance analysis of the initial LLM-predicted requirement sets over 75 Minecraft items, used to build the initial dependency graph. Note that while we began the prediction process with 67 goal items, the total number of predicted items expanded to 75. This expansion occurred because, as the LLM predicted requirement sets for items in the dependency graph (initially the goal items), any newly mentioned items that were not yet part of the graph were also included. This iterative process is detailed in Section 4.1 (Dependency graph initialization) of our method.
| Metric | Value |
| --- | --- |
| Requirement Set Prediction Accuracy | |
| Correct items (ignoring quantities) | 23% |
| Exact items & quantities | 8% |
| Non-existent Item Rates | |
| Non-existent items | 8% |
| Descendants of non-existent items | 23% |
| Required Items Errors | |
| Unnecessary items included | 57% |
| Required items omitted | 57% |
| Required Item Quantity Prediction Errors | |
| Standard deviation of quantity error | 2.74 |
| Mean absolute quantity error | 2.05 |
| Mean signed quantity error | -0.55 |
The initial dependency graph, constructed from predictions by Qwen2.5-VL-7B (Bai et al., 2025), forms the initial planning knowledge for XENON (Section 4.1). This section analyzes its quality, highlighting limitations that necessitate our adaptive dependency learning.
As shown in Table 12, the 7B LLM's initial requirement sets exhibit significant inaccuracies. Accuracy for correct item types was 23%, dropping to 8% for exact items and quantities. Errors in dependencies among items are also prevalent: 57% of predicted requirement sets included unnecessary items, and 57% omitted required items. Furthermore, 8% of predicted items were non-existent (hallucinated), making 23% of descendant items unattainable. Quantity predictions also showed substantial errors, with a mean absolute error of 2.05.
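The quantity-error statistics in Table 12 follow the standard definitions of mean absolute and mean signed error; the sketch below illustrates them on hypothetical predicted and ground-truth requirement sets, comparing only items present in both (a simplifying assumption of this sketch).

```python
from statistics import pstdev

def quantity_errors(predicted, true):
    """Quantity-error statistics over items present in both requirement sets.

    Returns (mean absolute error, mean signed error, std of signed error);
    a negative signed error means quantities are under-predicted on average."""
    diffs = [predicted[v] - true[v] for v in predicted if v in true]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    signed = sum(diffs) / len(diffs)
    return mae, signed, pstdev(diffs)
```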
These results clearly demonstrate that the LLM-generated initial dependency graph is imperfect. Its low accuracy and high error rates underscore the unreliability of raw LLM knowledge for precise planning, particularly for smaller models like the 7B LLM, which are known to have limited prior knowledge of Minecraft (see ADAM, Yu and Lu (2024), Appendix A, "LLMs' Prior Knowledge on Minecraft"). This analysis therefore highlights the importance of the adaptive dependency learning within XENON, which is designed to refine this initial, imperfect knowledge for robust planning.
Table 13: Ratio of dependencies learned for items that are unobtainable under the flawed initial dependency graph (out of 51). Analysis is based on the final learned graphs from the MineRL experiments.
| Agent | Learned ratio (initially unobtainable items) |
| --- | --- |
| XENON | 0.51 |
| SC | 0.25 |
| DECKARD | 0.25 |
| ADAM | 0.00 |
| RAND | 0.02 |
### K.2 Additional analysis of learned dependency graph
As shown in Table 13, XENON demonstrates significantly greater robustness to the LLM's flawed prior knowledge than all baselines. It successfully learned the correct dependencies for over half (0.51) of the 51 items that were initially unobtainable under the flawed graph. In contrast, both DECKARD (with no correction) and the SC baseline (with LLM self-correction) learned only a quarter of these items (0.25). This result strongly indicates that relying on the LLM to correct its own errors is as ineffective as having no correction mechanism at all in this setting. The other baselines, ADAM and RAND, failed almost completely, highlighting the difficulty of this challenge.
### K.3 Impact of controller capacity on dependency learning
We observe that controller capacity significantly impacts an agent's ability to learn dependencies from interaction. Specifically, in our MineRL experiments, we find that ADAM fails to learn any new dependencies due to the inherent incompatibility between its strategy and the controller's limitations. In our realistic setting with empty initial inventories, ADAM's strategy requires gathering a sufficient quantity (fixed at 8, matching our hyperparameter $\alpha_{i}$, the scaling factor for required-item quantities of inadmissible items) of all previously used resources before attempting a new item. This list of required resource items includes gold ingot, because of an initially provided human-written plan for the golden sword; however, the controller STEVE-1 never managed to collect more than seven units of gold in a single episode across all our experiments. Consequently, this controller bottleneck prevents ADAM from ever attempting to learn new items, causing its dependency learning to stall completely.
Although XENON fails to learn dependencies for the Redstone group items in MineRL, our analysis shows this stems from controller limitations rather than algorithmic ones. Specifically, in MineRL, STEVE-1 cannot execute XENON's exploration strategy for inadmissible items, which involves gathering a sufficient quantity of all previously used resources before a retry (Section 4.1). The Redstone group items become inadmissible because the LLM's initial predictions for them are entirely incorrect. This lack of a valid starting point prevents XENON from ever experiencing the core item, redstone, being used as a requirement for any other item. Consequently, our RevisionByAnalogy mechanism has no analogous experience from which to propose redstone as a potential required item for other items during its revision process.
In contrast, with more competent controllers, XENON successfully overcomes even such severely flawed prior knowledge to learn the challenging Redstone group dependencies, as demonstrated in Mineflayer and MC-TextWorld. First, in Mineflayer, XENON learns the correct dependencies for 5 out of 6 Redstone items. This success is possible because its more competent controller can execute the exploration strategy for inadmissible items, which increases the chance of possessing the core required item (redstone) during resource gathering. Second, with a perfect controller in MC-TextWorld, XENON successfully learns the dependencies for all 6 Redstone group items in every single episode.
### K.4 Impact of Controller Capacity in Long-horizon Goal Planning
Table 14: Long-horizon task success rate (SR) comparison between the Modified MineRL (a setting where resource items are easier to obtain) and Standard MineRL environments. All methods are provided with the correct dependency graph. DEPS $\dagger$ and Optimus-1 $\dagger$ are our reproductions of the respective methods using Qwen2.5-VL-7B as a planner. OracleActionPlanner, which generates the correct plan for all goals, represents the performance upper bound. SR for Optimus-1 $\dagger$ and XENON in the Modified MineRL columns are taken from Table 3 in Section 5.3.
| Method | Dependency | Modified: Iron | Modified: Diamond | Modified: Gold | Standard: Iron | Standard: Diamond | Standard: Gold |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DEPS $\dagger$ | - | 0.02 | 0.00 | 0.01 | 0.01 | 0.00 | 0.00 |
| Optimus-1 $\dagger$ | Oracle | 0.23 | 0.10 | 0.11 | 0.13 | 0.00 | 0.00 |
| XENON | Oracle | 0.83 | 0.75 | 0.73 | 0.24 | 0.00 | 0.00 |
| OracleActionPlanner | Oracle | - | - | - | 0.27 | 0.00 | 0.00 |
Because our work focuses on building a robust planner, to isolate planning from the significant difficulty of item gathering (a task assigned to the controller), our main experiments for long-horizon tasks (Section 5.3) use a modified MineRL environment following the official implementation of Optimus-1. This modification makes essential resource items like iron, gold, and diamond easier for the controller to find, allowing for a clearer evaluation of planning algorithms (modifications are detailed in Section J.2.5). However, to provide a more comprehensive analysis, we also evaluated our agent and baselines in the unmodified, standard MineRL environment. In this setting, items like iron, gold, and diamond are naturally rare, making item gathering a major bottleneck.
The results are shown in Table 14. Most importantly, XENON consistently outperforms the baselines in both the modified and standard MineRL. Notably, in the standard environment, XENON's performance on the Iron group (0.24 SR) is comparable to that of the OracleActionPlanner (0.27 SR), which always generates correct plans for all goals. This comparison highlights the severity of the controller bottleneck: even the OracleActionPlanner achieves a 0.00 success rate for the Diamond and Gold groups in the standard MineRL. This shows that the failures are due to the controller's inability to gather rare resources in the standard environment.
### K.5 Long-horizon task benchmark experiments analysis
This section provides a detailed analysis of the performance differences observed in Table 3 between Optimus-1 and XENON on long-horizon tasks, even when both have access to a true dependency graph and increased item spawn probabilities (Section J.2.5). We specifically examine various plan errors encountered when reproducing Optimus-1 using Qwen2.5-VL-7B as the planner, and explain how XENON robustly constructs plans through step-by-step planning with FAM.
Table 15: Analysis of primary plan errors observed in Optimus-1 and XENON during the long-horizon task benchmark experiments. This table presents the ratio of the specified plan error among the failed episodes for Optimus-1 and XENON, respectively. Invalid Action indicates errors where an invalid action is used for an item in a subgoal. Subgoal Omission refers to errors where a necessary subgoal for a required item is omitted from the plan. Note that these plan error values are not exclusive; one episode can exhibit multiple types of plan errors.
| Plan Error Type | Optimus-1 Error Rate (%) | XENON Error Rate (%) |
| --- | --- | --- |
| Invalid Action | 37 | 2 |
| Subgoal Omission | 44 | 0 |
Optimus-1 has no fine-grained action knowledge correction mechanism. Furthermore, Optimus-1's LLM planner generates a long plan at once from a long input prompt that includes a sequence of aggregated requirements $((q_{1},u_{1}),\dots,(q_{L_{v}},u_{L_{v}})=(1,v))$ for the goal item $v$. Consequently, as shown in Table 15, Optimus-1 generates plans with invalid actions for required items, denoted as Invalid Action. Furthermore, Optimus-1 omits necessary subgoals for required items, even when they are in the input prompts, denoted as Subgoal Omission.
In contrast, XENON discovers valid actions by leveraging FAM, which records the outcome of each action for every item, thereby enabling it to avoid empirically failed actions and reuse successful ones. Furthermore, XENON mitigates the problem of subgoal omission by constructing a plan that creates a subgoal for each required item one by one.
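As a rough sketch of this mechanism (class and function names are our own illustration, not the paper's code), FAM can be viewed as a per-item record of action outcomes that the planner consults when building each subgoal:

```python
from collections import defaultdict

class FailureAwareActionMemory:
    """Sketch of FAM: record the outcome of each action per item, then reuse
    successes and avoid empirically failed actions."""

    def __init__(self):
        self.succeeded = defaultdict(set)  # item -> actions that worked
        self.failed = defaultdict(set)     # item -> actions that failed

    def record(self, item, action, success):
        (self.succeeded if success else self.failed)[item].add(action)

    def choose_action(self, item, candidates):
        # Prefer a known-successful action; otherwise try one not yet failed.
        for a in candidates:
            if a in self.succeeded[item]:
                return a
        untried = [a for a in candidates if a not in self.failed[item]]
        return untried[0] if untried else None

def build_plan(required_items, fam, candidate_actions):
    """One subgoal per required item, mitigating subgoal omission."""
    return [(item, fam.choose_action(item, candidate_actions[item]))
            for item in required_items]
```

For example, after recording that `mine` failed for iron ingot but `smelt` succeeded, `choose_action` returns `smelt`, and `build_plan` emits one subgoal per required item rather than one monolithic plan.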
### K.6 Robust Dependency learning under dynamic true knowledge
<details>
<summary>x52.png Details</summary>

### Visual Description
Legend for Figure 22, identifying the five agents by marker: XENON (light blue circle), SC (pink diamond), ADAM (orange star), DECKARD (light green square), and RAND (dark blue plus sign).
</details>
<details>
<summary>x53.png Details</summary>

### Visual Description
Line chart of EGA (y-axis, 0 to 1.0) versus environment step (x-axis, 0 to 3,000) for the five agents, with shaded bands indicating variability. A vertical dashed line marks the point where the true required items are changed. The circle-marked line rises rapidly toward 1.0 and remains near 1.0 after the change; the remaining lines plateau between roughly 0.2 and 0.65.
</details>
(a) Dynamic True Required Items
<details>
<summary>x54.png Details</summary>

### Visual Description
Line chart of EGA (y-axis, 0 to 1.0) versus environment step (x-axis, 0 to 3,000) for the five agents. A vertical dashed line marks the point where the true actions are changed. The circle-marked line climbs to nearly 1.0, dips slightly at the change, and quickly recovers; the other lines remain between roughly 0.15 and 0.6, with two staying nearly flat around 0.2.
</details>
(b) Dynamic True Actions
<details>
<summary>x55.png Details</summary>

### Visual Description
Line chart of EGA (y-axis, 0 to 1.0) versus environment step (x-axis, 0 to 3,000) for the five agents. A vertical dashed line marks the point where both true rules are changed. The top line rises rapidly to near 1.0 with only a brief dip at the change; the gray line stays near 0.2 throughout, and the remaining lines plateau between roughly 0.5 and 0.7.
</details>
(c) Dynamic Both Rules
Figure 22: Robustness against dynamic true knowledge. EGA over 3,000 environment steps in settings where the true item acquisition rules are changed during the learning process.
Table 16: The ratio of correctly learned dependencies, among the 7 items whose rules are dynamically changed, for each agent. Columns correspond to the type of ground-truth rules changed during learning: requirements only (3,0), actions only (0,3), or both (3,3).
| Agent | (3,0) | (0,3) | (3,3) |
| --- | --- | --- | --- |
| XENON | 1.0 | 1.0 | 1.0 |
| SC | 0.80 | 0.0 | 0.0 |
| ADAM | 0.83 | 0.0 | 0.0 |
| DECKARD | 0.49 | 0.0 | 0.0 |
| RAND | 0.29 | 0.0 | 0.0 |
Additionally, we show that XENON is also applicable to scenarios where the latent true knowledge changes dynamically. We design three dynamic scenarios where the environment begins with the vanilla setting, (0,0), for the first 1,500 steps, then transitions to a level-3 perturbation setting for the subsequent 1,500 steps: either required-items-only (3,0), action-only (0,3), or both (3,3). Upon this change, the agent is informed of which items' rules are modified but not what the new rules are, forcing it to relearn from experience. As shown in Figure 22, XENON rapidly adapts by re-learning the new dependencies and recovering its near-perfect EGA in all three scenarios. In contrast, all baselines fail to adapt effectively, with their performance remaining significantly degraded after the change. Specifically, for the 7 items whose rules are altered, Table 16 shows that XENON achieves a perfect re-learning ratio of 1.0 in all scenarios, while all baselines score 0.0 whenever actions are modified.
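The relearning trigger can be sketched as follows; this is a hypothetical illustration of the interface (function name and data structures are ours, not the paper's), showing only that knowledge is invalidated for the announced items while everything else is retained:

```python
def on_rules_changed(changed_items, dependency_graph, action_memory):
    """When told which items' rules changed (but not how), discard the stale
    knowledge for exactly those items so it is relearned from experience."""
    for item in changed_items:
        dependency_graph.pop(item, None)  # forget learned requirements
        action_memory.pop(item, None)     # forget recorded action outcomes

# Only the announced item is reset; unrelated knowledge is kept.
deps = {"piston": ["iron_ingot", "redstone"], "stick": ["planks"]}
actions = {"piston": {"craft"}, "stick": {"craft"}}
on_rules_changed(["piston"], deps, actions)
print(deps)  # {'stick': ['planks']}
```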
### K.7 Ablation studies for long-horizon goal planning
Table 17: Ablation experiment results for long-horizon goal planning in MineRL. Without Learned Dependency, XENON employs a dependency graph initialized with LLM predictions and human-written examples. Without Action Correction, XENON saves and reuses successful actions in FAM, but it does not utilize the information of failed actions.
| Learned Dependency | Action Correction | CRe | Wood | Stone | Iron | Diamond | Gold | Armor | Redstone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 0.54 | 0.39 | 0.10 | 0.26 | 0.45 | 0.0 | 0.0 |
|  | ✓ |  | 0.54 | 0.38 | 0.09 | 0.29 | 0.45 | 0.0 | 0.0 |
| ✓ |  |  | 0.82 | 0.69 | 0.36 | 0.59 | 0.69 | 0.22 | 0.0 |
| ✓ | ✓ |  | 0.82 | 0.79 | 0.45 | 0.59 | 0.68 | 0.21 | 0.0 |
| ✓ | ✓ | ✓ | 0.85 | 0.81 | 0.46 | 0.64 | 0.74 | 0.28 | 0.0 |
To analyze how each of XENON's components contributes to its long-horizon planning, we conducted an ablation study in MineRL, with results shown in Table 17. The findings first indicate that without accurate dependency knowledge, our action correction using FAM provides no significant benefit on its own (row 1 vs. row 2). The most critical component is the learned dependency graph, which dramatically improves success rates across all item groups (row 3). Building on this, adding FAM's action correction further boosts performance, particularly for the Stone and Iron groups, where it helps overcome the LLM's flawed action priors (row 4). Finally, Context-aware Reprompting (CRe, Section 4.3) provides an additional performance gain on more challenging late-game items, such as Iron, Gold, and Armor. This is likely because their longer episode horizons offer more opportunities for CRe to rescue a stalled controller.
### K.8 The Necessity of Knowledge Correction even with External Sources
<details>
<summary>x63.png Details</summary>

### Visual Description
Line chart of Experienced Items Ratio (y-axis, 0 to 1.0) versus environment step (x-axis, 0 to 3,000), with a legend for XENON, SC, ADAM, DECKARD, and RAND and shaded bands indicating variability. XENON rises steeply between steps 1,000 and 2,000 and plateaus near 0.98; RAND climbs more gradually toward roughly 0.95; SC, DECKARD, and ADAM remain nearly flat at about 0.15, 0.12, and 0.08, respectively.
</details>
Figure 23: Ratio of goal items obtained in one MC-TextWorld episode when each agent's dependency graph is initialized from an oracle graph while the environment's ground-truth dependency graph is perturbed. Solid lines denote the mean over 15 runs; shaded areas denote the standard deviation.
Even when an external source is available to initialize an agent's knowledge, correcting that knowledge from interaction remains essential for dependency and action learning, because such sources can be flawed or outdated. To support this, we evaluate XENON and the baselines in the MC-TextWorld environment, where each agent's dependency graph is initialized from an oracle graph while the environment's ground-truth dependency graph is perturbed (perturbation level 3 in Table 4). We measure performance as the ratio of the 67 goal items obtained within a single episode. All agents use the same intrinsic exploratory item selection method to decide which item to try to obtain next: among the items not yet obtained in the current episode, each agent chooses the one with the fewest attempts so far.
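This selection rule can be sketched as follows (the function name and signature are illustrative, not the experiment code):

```python
def select_next_goal(goal_items, obtained, attempts):
    """Among goal items not yet obtained in the current episode, choose the
    one attempted the fewest times so far."""
    remaining = [g for g in goal_items if g not in obtained]
    if not remaining:
        return None
    return min(remaining, key=lambda g: attempts.get(g, 0))

# Example: "stone_pickaxe" is already obtained; "furnace" has fewer attempts.
print(select_next_goal(["stone_pickaxe", "furnace", "rail"],
                       obtained={"stone_pickaxe"},
                       attempts={"furnace": 1, "rail": 3}))  # furnace
```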
As shown in Figure 23, this experiment demonstrates that, even when an external source is available, (1) interaction experience-based knowledge correction remains crucial when the external source is mismatched with the environment, and (2) XENON is also applicable and robust in this scenario. By continually revising its dependency knowledge, XENON achieves a much higher ratio of goal items obtained in an episode than all baselines. In contrast, the baselines either rely on unreliable LLM self-correction (e.g., SC) or do not correct flawed knowledge at all (e.g., DECKARD, ADAM, RAND), and therefore fail to obtain many goal items. Their performance is especially poor because there are dependencies between goals: for example, when the true required items for the stone pickaxe and iron pickaxe are perturbed, the baselines cannot obtain these items and thus cannot obtain the other goal items that depend on them.
### K.9 Scalability of Dependency and Action Learning with More Goals and Actions
<details>
<summary>x64.png Details</summary>

Line chart: EGA (0.0 to 1.0) versus environment step (0 to 3,000) in MC-TextWorld, one curve per goal count (67, 100, and 120 goal items).
</details>
(a) Effect of increasing the number of goals
<details>
<summary>x65.png Details</summary>

Line chart: EGA (0.0 to 1.0) versus environment step (0 to 11,000) in MC-TextWorld, one curve per action count (3, 15, 30, and 45 actions), with shaded bands around each curve.
</details>
(b) Effect of increasing the number of actions
Figure 24: Scalability of XENON with more goals and actions. EGA over environment steps in MC-TextWorld when (a) increasing the number of goal items and (b) increasing the number of available actions. In (a), we keep the three actions ("mine", "craft", "smelt") fixed, while in (b) we keep the 67 goal items fixed. Solid lines denote the mean over 15 runs; shaded areas denote the standard deviation.
To evaluate the scalability of XENON's dependency and action learning, we vary the number of goal items and available actions in the MC-TextWorld environment. For the goal-scaling experiment, we increase the number of goals from 67 to 100 and 120 by adding new goal items (see Table 11 for the added goals), while keeping the original three actions "mine", "craft", and "smelt" fixed. For the action-scaling experiment, we increase the available actions from 3 to 15, 30, and 45 (e.g., "harvest", "hunt", "place"), while keeping the original 67 goals fixed.
The results in Figure 24 show that XENON maintains high EGA as both the number of goals and the number of actions grow, although the number of environment steps required for convergence naturally increases. As seen in Figure 24(a), increasing the number of goals from 67 to 100 and 120 only moderately delays convergence (from around 1,400 to about 2,100 and 2,600 steps). In contrast, Figure 24(b) shows a larger slowdown when increasing the number of actions (from about 1,400 steps with 3 actions to roughly 4,000, 7,000, and 10,000 steps with 15, 30, and 45 actions), which is expected because XENON only revises an item's dependency after all available actions for that item have been classified as empirically invalid by FAM. We believe this convergence speed could be improved with minimal changes, such as lowering $x_{0}$, the failure count threshold for classifying an action as invalid, or triggering dependency revision once the agent has failed to obtain an item a fixed number of times, regardless of which actions were tried in subgoals.
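The revision trigger described above can be sketched as follows (an illustrative sketch; the function names and the default threshold value are our assumptions, with $x_{0}$ written as `x0`):

```python
def record_failure(failure_counts, item, action):
    """Increment the failure count for the (item, action) pair."""
    failure_counts[(item, action)] = failure_counts.get((item, action), 0) + 1

def is_invalid(failure_counts, item, action, x0=3):
    """An action is classified empirically invalid for an item once it
    has failed at least x0 times (default value assumed here)."""
    return failure_counts.get((item, action), 0) >= x0

def should_revise_dependency(failure_counts, item, actions, x0=3):
    """Dependency revision fires only after *every* available action
    for the item has been classified empirically invalid."""
    return all(is_invalid(failure_counts, item, a, x0) for a in actions)
```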
### K.10 Ablation on Action Selection Methods for Subgoal Construction
<details>
<summary>x66.png Details</summary>

Legend for Figure 25: Random+FAM, UCB, LLM, SC, and XENON, each with a distinct colored marker.
</details>
<details>
<summary>x67.png Details</summary>

Line chart: success rate (0.0 to 1.0) versus number of actions (3, 15, 30, 45).
</details>
(a) Success rate
<details>
<summary>x68.png Details</summary>

Line chart: environment steps to success (0 to 300) versus number of actions (3, 15, 30, 45).
</details>
(b) Steps to success (lower is better)
Figure 25: Ablation on action selection methods for subgoal construction. We evaluate different action selection methods for solving long-horizon goals given an oracle dependency graph, as the size of the available action set increases. (a) Success rate and (b) number of environment steps per successful episode. Note that in (a), the curves for LLM and SC overlap at 0.0 because they fail on all episodes, and in (b), they are omitted since they never succeed.
We find that, while LLMs can in principle accelerate the search for valid actions, they do so effectively only when their flawed knowledge is corrected algorithmically. To support this, we study how different action selection methods for subgoal construction affect performance on long-horizon goals. In this ablation, the agent is given an oracle dependency graph and a long-horizon goal, and only needs to output one valid action from the available actions for each subgoal item to achieve that goal. Each episode specifies a single goal item, and it is counted as successful if the agent obtains this item within 300 environment steps in MC-TextWorld. To study scalability with respect to the size of the available action set, we vary the number of actions as 3, 15, 30, and 45 by gradually adding actions such as "harvest" and "hunt" to the original three actions ("mine", "craft", "smelt").
**Methods and metrics.**
We compare five action selection methods: Random+FAM (which randomly samples from available actions that have not yet repeatedly failed and reuses past successful actions), UCB, LLM without memory, LLM self-correction (SC), and XENON, which combines an LLM with FAM. We report the average success rate and the average number of environment steps to success over 20 runs per goal item, where goal items are drawn from the Redstone group.
As shown in Figure 25, among the three LLM-based methods (LLM, SC, XENON), only XENON, which corrects the LLM's action knowledge by removing repeatedly failed actions from the set of candidate actions the LLM is allowed to select, solves long-horizon goals reliably, maintaining a success rate of 1.0 and requiring roughly 50 environment steps across all sizes of the available action set. In contrast, LLM and SC never succeed in any episode, because they keep selecting incorrect actions for subgoal items (e.g., redstone), and therefore perform worse than the non-LLM baselines, Random+FAM and UCB. Random+FAM and UCB perform well when the number of available actions is small, but become increasingly slow and unreliable as the number of actions grows, often failing to reach the goal within the episode horizon.
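For reference, the Random+FAM baseline described above can be sketched as follows (our illustration; the memory structures and the failure threshold default are assumptions):

```python
import random

def random_fam_action(item, actions, failure_counts, success_memory, x0=3):
    """Reuse a remembered successful action for `item` if one exists;
    otherwise sample uniformly among actions that have not yet failed
    x0 times for this item (falling back to all actions if none remain)."""
    if item in success_memory:
        return success_memory[item]
    viable = [a for a in actions if failure_counts.get((item, a), 0) < x0]
    return random.choice(viable if viable else actions)
```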
### K.11 Robustness to Smaller Planner LLMs and Limited Initial Knowledge
<details>
<summary>x69.png Details</summary>

Legend for Figure 26: XENON, SC, ADAM, DECKARD, and RAND, each with a distinct colored marker.
</details>
<details>
<summary>x70.png Details</summary>

Line chart: EGA (0.0 to 1.0) versus number of provided human-written plans (1 or 3), with the 4B planner LLM.
</details>
(a) Planner LLM size: 4B
<details>
<summary>x71.png Details</summary>

Line chart: EGA (0.0 to 1.0) versus number of provided human-written plans (1 or 3), with the 7B planner LLM.
</details>
(b) Planner LLM size: 7B
Figure 26: Effect of planner LLM size and initial dependency graph quality on dependency and action learning. The plots show EGA after 3,000 environment steps of dependency and action learning in MC-TextWorld, obtained by varying the planner LLM size and the amount of correct knowledge in the initial dependency graph (controlled by the number of provided human-written plans). In (a), the planner is Phi-4-mini (4B) (Microsoft et al., 2025); in (b), the planner is Qwen2.5-VL-7B (7B) (Bai et al., 2025).
We further evaluate the robustness of XENON and the baselines to limited prior knowledge by measuring dependency and action learning in MC-TextWorld while (i) varying the planner LLM size and (ii) degrading the quality of the initial dependency graph. For the planner LLM, we compare a 7B model (Qwen2.5-VL-7B (Bai et al., 2025)) against a 4B model (Phi-4-mini (Microsoft et al., 2025)); for the initial graph quality, we vary the number of provided human-written plans used to initialize the graph from three ("craft iron_sword", "mine diamond", "craft golden_sword") to one ("craft iron_sword").
As shown in Figure 26, XENON remains robust across all these settings: its EGA stays near-perfect even with the smaller 4B planner and the weakest initial graph, indicating that leveraging experience can quickly compensate for weak priors. In contrast, baselines that rely on LLM self-correction (SC) or that depend strongly on the LLM or the initial graph (ADAM, DECKARD) suffer substantial drops in EGA as the planner LLM becomes smaller and the initial graph contains less correct prior knowledge. This suggests that, in our setting, algorithmic knowledge correction matters more than scaling up the planner LLM or providing richer initial human knowledge.
### K.12 Full Results on the Long-Horizon Tasks Benchmark
In this section, we report XENON's performance on each goal within the long-horizon tasks benchmark, detailing metrics such as the goal item, number of sub-goals, success rate (SR), and number of evaluation episodes.
Tables 18 and 19 present XENON's results when utilizing the dependency graph learned through 400 episodes of exploration. Conversely, Tables 20 and 21 display the performance of XENON when it leverages an oracle dependency graph.
Table 18: The results of XENON (with dependency graph learned via exploration across 400 episodes) on the Wood group, Stone group, and Iron group. SR denotes success rate.
| Group | Goal | Sub-Goal Num. | SR (%) | Eval Episodes |
| --- | --- | --- | --- | --- |
| Wood | bowl | 4 | 92.68 | 41 |
| | chest | 4 | 95.24 | 42 |
| | crafting_table | 3 | 95.83 | 48 |
| | ladder | 5 | 0.00 | 31 |
| | stick | 3 | 95.45 | 44 |
| | wooden_axe | 5 | 90.91 | 44 |
| | wooden_hoe | 5 | 95.35 | 43 |
| | wooden_pickaxe | 5 | 93.02 | 43 |
| | wooden_shovel | 5 | 93.75 | 48 |
| | wooden_sword | 5 | 95.35 | 43 |
| Stone | charcoal | 8 | 87.50 | 40 |
| | furnace | 7 | 88.10 | 42 |
| | smoker | 8 | 0.00 | 47 |
| | stone_axe | 7 | 97.78 | 45 |
| | stone_hoe | 7 | 90.70 | 43 |
| | stone_pickaxe | 7 | 95.45 | 44 |
| | stone_shovel | 7 | 89.58 | 48 |
| | stone_sword | 7 | 89.80 | 49 |
| | torch | 7 | 93.02 | 43 |
| Iron | blast_furnace | 13 | 0.00 | 42 |
| | bucket | 11 | 0.00 | 47 |
| | chain | 12 | 0.00 | 42 |
| | hopper | 12 | 0.00 | 47 |
| | iron_axe | 11 | 75.56 | 45 |
| | iron_bars | 11 | 80.43 | 46 |
| | iron_hoe | 11 | 89.13 | 46 |
| | iron_nugget | 11 | 79.55 | 44 |
| | iron_pickaxe | 11 | 77.08 | 48 |
| | iron_shovel | 11 | 75.56 | 45 |
| | iron_sword | 11 | 84.78 | 46 |
| | rail | 11 | 0.00 | 44 |
| | shears | 11 | 0.00 | 43 |
| | smithing_table | 11 | 93.75 | 48 |
| | stonecutter | 12 | 0.00 | 43 |
| | tripwire_hook | 11 | 78.43 | 51 |
Table 19: The results of XENON (with dependency graph learned via exploration across 400 episodes) on the Gold group, Diamond group, Redstone group, and Armor group. SR denotes success rate.
| Group | Goal Item | Sub-Goal Num. | SR (%) | Eval Episodes |
| --- | --- | --- | --- | --- |
| Gold | gold_ingot | 13 | 76.92 | 52 |
| | golden_axe | 14 | 72.00 | 50 |
| | golden_hoe | 14 | 66.67 | 48 |
| | golden_pickaxe | 14 | 76.00 | 50 |
| | golden_shovel | 14 | 71.74 | 46 |
| | golden_sword | 14 | 78.26 | 46 |
| Diamond | diamond | 12 | 87.76 | 49 |
| | diamond_axe | 13 | 72.55 | 51 |
| | diamond_hoe | 13 | 63.79 | 58 |
| | diamond_pickaxe | 13 | 60.71 | 56 |
| | diamond_shovel | 13 | 84.31 | 51 |
| | diamond_sword | 13 | 76.79 | 56 |
| | jukebox | 13 | 0.00 | 48 |
| Redstone | activator_rail | 14 | 0.00 | 3 |
| | compass | 13 | 0.00 | 3 |
| | dropper | 13 | 0.00 | 3 |
| | note_block | 13 | 0.00 | 4 |
| | piston | 13 | 0.00 | 12 |
| | redstone_torch | 13 | 0.00 | 19 |
| Armor | diamond_boots | 13 | 64.29 | 42 |
| | diamond_chestplate | 13 | 0.00 | 44 |
| | diamond_helmet | 13 | 67.50 | 40 |
| | diamond_leggings | 13 | 0.00 | 37 |
| | golden_boots | 14 | 69.23 | 39 |
| | golden_chestplate | 14 | 0.00 | 39 |
| | golden_helmet | 14 | 60.53 | 38 |
| | golden_leggings | 14 | 0.00 | 38 |
| | iron_boots | 11 | 94.44 | 54 |
| | iron_chestplate | 11 | 0.00 | 42 |
| | iron_helmet | 11 | 4.26 | 47 |
| | iron_leggings | 11 | 0.00 | 41 |
| | shield | 11 | 0.00 | 46 |
Table 20: The results of XENON† (with oracle dependency graph) on the Wood group, Stone group, and Iron group. SR denotes success rate.
| Group | Goal Item | Sub-Goal Num. | SR (%) | Eval Episodes |
| --- | --- | --- | --- | --- |
| Wood | bowl | 4 | 94.55 | 55 |
| | chest | 4 | 94.74 | 57 |
| | crafting_table | 3 | 94.83 | 58 |
| | ladder | 5 | 94.74 | 57 |
| | stick | 3 | 95.08 | 61 |
| | wooden_axe | 5 | 94.64 | 56 |
| | wooden_hoe | 5 | 94.83 | 58 |
| | wooden_pickaxe | 5 | 98.33 | 60 |
| | wooden_shovel | 5 | 96.49 | 57 |
| | wooden_sword | 5 | 94.83 | 58 |
| Stone | charcoal | 8 | 92.68 | 41 |
| | furnace | 7 | 90.00 | 40 |
| | smoker | 8 | 87.50 | 40 |
| | stone_axe | 7 | 95.12 | 41 |
| | stone_hoe | 7 | 94.87 | 39 |
| | stone_pickaxe | 7 | 94.87 | 39 |
| | stone_shovel | 7 | 94.87 | 39 |
| | stone_sword | 7 | 92.11 | 38 |
| | torch | 7 | 92.50 | 40 |
| Iron | blast_furnace | 13 | 82.22 | 45 |
| | bucket | 11 | 89.47 | 38 |
| | chain | 12 | 83.33 | 36 |
| | hopper | 12 | 77.78 | 36 |
| | iron_axe | 11 | 82.50 | 40 |
| | iron_bars | 11 | 85.29 | 34 |
| | iron_hoe | 11 | 75.68 | 37 |
| | iron_nugget | 11 | 84.78 | 46 |
| | iron_pickaxe | 11 | 83.33 | 42 |
| | iron_shovel | 11 | 78.38 | 37 |
| | iron_sword | 11 | 85.42 | 48 |
| | rail | 11 | 80.56 | 36 |
| | shears | 11 | 82.05 | 39 |
| | smithing_table | 11 | 83.78 | 37 |
| | stonecutter | 12 | 86.84 | 38 |
| | tripwire_hook | 11 | 91.18 | 34 |
Table 21: The results of XENON† (with oracle dependency graph) on the Gold group, Diamond group, Redstone group, and Armor group. SR denotes success rate.
| Group | Goal Item | Sub-Goal Num. | SR (%) | Eval Episodes |
| --- | --- | --- | --- | --- |
| Gold | gold_ingot | 13 | 78.38 | 37 |
| | golden_axe | 14 | 65.12 | 43 |
| | golden_hoe | 14 | 70.27 | 37 |
| | golden_pickaxe | 14 | 75.00 | 36 |
| | golden_shovel | 14 | 78.38 | 37 |
| Diamond | diamond | 12 | 71.79 | 39 |
| | diamond_axe | 13 | 70.00 | 40 |
| | diamond_hoe | 13 | 85.29 | 34 |
| | diamond_pickaxe | 13 | 72.09 | 43 |
| | diamond_shovel | 13 | 76.19 | 42 |
| | diamond_sword | 13 | 80.56 | 36 |
| | jukebox | 13 | 69.77 | 43 |
| Redstone | activator_rail | 14 | 67.39 | 46 |
| | compass | 13 | 70.00 | 40 |
| | dropper | 13 | 75.00 | 40 |
| | note_block | 13 | 89.19 | 37 |
| | piston | 13 | 65.79 | 38 |
| | redstone_torch | 13 | 84.85 | 33 |
| Armor | diamond_boots | 13 | 60.78 | 51 |
| | diamond_chestplate | 13 | 20.00 | 50 |
| | diamond_helmet | 13 | 71.79 | 39 |
| | diamond_leggings | 13 | 33.33 | 39 |
| | golden_boots | 14 | 75.00 | 40 |
| | golden_chestplate | 14 | 0.00 | 36 |
| | golden_helmet | 14 | 54.05 | 37 |
| | golden_leggings | 14 | 0.00 | 38 |
| | iron_boots | 11 | 93.62 | 47 |
| | iron_chestplate | 11 | 97.50 | 40 |
| | iron_helmet | 11 | 86.36 | 44 |
| | iron_leggings | 11 | 97.50 | 40 |
| | shield | 11 | 97.62 | 42 |
### K.13 Experiments compute resources
All experiments were conducted on an internal computing cluster equipped with RTX 3090, A5000, and A6000 GPUs. We report the total compute time aggregated over multiple parallel experiments. Dependency learning, i.e., exploration across 400 episodes in the MineRL environment, took a total of 24 days of compute; evaluation on the long-horizon tasks benchmark in the MineRL environment required 34 days; and the dependency-learning experiments in the MC-TextWorld environment used 3 days. We note that these values represent aggregated compute time, and the actual wall-clock time of individual experiments was significantly shorter due to parallelization.
## Appendix L The Use of Large Language Models (LLMs)
In preparing this manuscript, we used an LLM as a writing assistant to improve the text. Its role included refining grammar and phrasing, suggesting clearer sentence structures, and maintaining a consistent academic tone. All technical contributions, experimental designs, and final claims were developed by the human authors, who thoroughly reviewed and take full responsibility for the paper's content.