# Experiential Reinforcement Learning
> Work done during internship at Microsoft's Office of Applied Research
## Abstract
Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Learning Paradigms - Direct, Reinforcement, and Experiential
### Overview
The image is a diagram illustrating three different learning paradigms: Direct Learning (SFT), Reinforcement Learning (RLVR), and Experiential Learning (ERL). It visually represents the flow of information and interaction within each paradigm, highlighting the differences in how learning occurs. The diagram uses boxes to represent components and arrows to indicate the direction of influence or data flow. A horizontal arrow at the bottom indicates a transition from "Learning from Feedback" to "Learning from Experience".
### Components/Axes
The diagram is divided into three main sections, each representing a learning paradigm. Each section contains the following components:
* **Policy:** Represented by a purple rectangle in all three paradigms.
* **Environment:** Represented by a brown rectangle in Reinforcement and Experiential Learning.
* **Action:** Represented by a green rectangle in Reinforcement and Experiential Learning.
* **Example:** Represented by a light purple rectangle in Direct Learning.
* **Scalar Reward:** Represented by a light green rectangle in Reinforcement Learning.
* **Experience Internalization & Self-Reflection:** Represented by a light blue and light orange rectangle in Experiential Learning.
The diagram also includes mathematical notations within the boxes, representing the underlying functions or processes. These include:
* π<sub>θ</sub>(x) → y<sup>*</sup> (Direct Learning)
* π<sub>θ</sub>(x) → r (Scalar Reward)
* y ∼ π<sub>θ</sub>(x) (Action)
* π<sub>θ</sub>(x) (Experience Internalization)
* Δ ∼ π<sub>θ</sub>(x, y, r) (Self-Reflection)
A horizontal arrow at the bottom is labeled "Learning from Feedback" on the left and "Learning from Experience" on the right.
### Detailed Analysis / Content Details
**Direct Learning (SFT):**
* The "Policy" box is connected to "Supervised Learning" which is connected to the "Example" box.
* The equation within the "Supervised Learning" box is: π<sub>θ</sub>(x) → y<sup>*</sup>.
* This paradigm appears to be a direct mapping from input (x) to output (y<sup>*</sup>) guided by a policy.
**Reinforcement Learning (RLVR):**
* The "Policy" box is connected to "Scalar Reward" and "Action".
* The "Scalar Reward" box has the equation: π<sub>θ</sub>(x) → r.
* The "Action" box has the equation: y ∼ π<sub>θ</sub>(x).
* The "Action" box is connected to the "Environment" box, which then loops back to the "Policy" box.
* This paradigm involves learning through trial and error, receiving rewards (r) based on actions (y) taken in the environment (x).
**Experiential Learning (ERL):**
* The "Policy" box is connected to "Experience Internalization" and "Action".
* The "Experience Internalization" box has the equation: π<sub>θ</sub>(x).
* The "Self-Reflection" box has the equation: Δ ∼ π<sub>θ</sub>(x, y, r).
* The "Action" box has the equation: y ∼ π<sub>θ</sub>(x).
* The "Action" box is connected to the "Environment" box, which then loops back to the "Policy" box.
* This paradigm incorporates both experience internalization and self-reflection (Δ) in addition to action and environment interaction.
### Key Observations
* The diagram clearly illustrates a progression from direct, supervised learning to more complex learning paradigms that involve interaction with an environment.
* The inclusion of mathematical notations suggests a formal, algorithmic approach to each learning paradigm.
* The "Experiential Learning" paradigm appears to be the most complex, incorporating elements of both reinforcement learning and self-reflection.
* The transition from "Learning from Feedback" to "Learning from Experience" highlights a shift in the source of information used for learning.
### Interpretation
The diagram demonstrates a conceptual hierarchy of learning approaches. Direct Learning represents the simplest form, relying on explicit examples. Reinforcement Learning introduces the concept of learning through interaction and reward, while Experiential Learning builds upon this by adding a layer of self-reflection and internal experience processing. The diagram suggests that as learning becomes more complex, it moves away from relying solely on external feedback (examples or rewards) and incorporates internal processes for understanding and adapting to the environment. The mathematical notations indicate that these paradigms are not merely conceptual but can be formalized and implemented as algorithms. The diagram is a high-level overview and does not delve into the specifics of each algorithm or the underlying mechanisms of learning. It serves as a useful visual aid for understanding the fundamental differences between these three learning paradigms.
</details>
Figure 1: In Experiential Reinforcement Learning (ERL), instead of learning from feedback or outcome directly, an agent learns to (1) verbally reflect on its experience and observed outcome, and (2) internalize the reflections to induce behavioral changes in future iterations.
## 1 Introduction
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: RLVR vs. ERL - Acting in an Unknown Environment
### Overview
This diagram illustrates a comparison between two reinforcement learning approaches, RLVR (reinforcement learning with verifiable rewards) and ERL (Experiential Reinforcement Learning), when tasked with navigating an agent in an unknown environment without prior knowledge. The diagram depicts the agent's progression through a maze-like environment, highlighting the differences in how each approach handles trial and error, memory, and learning.
### Components/Axes
The diagram is divided into two main sections, one for RLVR and one for ERL, separated by a dashed line. Each section shows a sequence of four maze snapshots, connected by arrows indicating the progression of the agent's actions. Labels are present above each section ("RLVR", "ERL") and descriptive text accompanies each step. The overall task is stated at the top: "Task: Act in an unknown environment ❓ with no prior knowledge". The connecting arrows are labeled "Back & Forth" (above RLVR) and "Experience Internalization" (above ERL).
### Detailed Analysis or Content Details
**RLVR Section:**
* **Snapshot 1:** An agent (blue figure) is positioned in a maze with several boxes (brown squares) and goal locations (yellow squares).
* Label: "Trial & Error"
* **Snapshot 2:** The agent attempts to move through a wall (grey blocks), resulting in no reward. A red 'X' marks the attempted invalid move.
* Label: "No reward"
* **Snapshot 3:** The agent's previous attempt is "forgotten". The maze is reset to the initial state.
* Label: "Forget"
* **Snapshot 4:** The agent again attempts to navigate the maze and repeats an invalid move, marked by a red 'X'.
* Label: "Trial & Error"
* Label: "No reward"
**ERL Section:**
* **Snapshot 1:** The agent is positioned in the same maze as in the RLVR section.
* Label: "Trial & Error"
* **Snapshot 2:** The agent attempts to move through a wall, resulting in no reward. A red 'X' marks the attempted invalid move.
* Label: "No reward"
* **Snapshot 3:** A "Self-Reflection" box appears, containing the following text:
* "I guess…"
* "🧱 is wall."
* "I can control"
* "Push 📦 into 🟡"
* **Snapshot 4:** The agent successfully pushes a box onto a goal location.
* No label.
**Maze Details (Common to both sections):**
The maze consists of walls (grey blocks), empty spaces (beige), boxes (brown squares), and goal locations (yellow squares). The agent is represented by a blue figure.
### Key Observations
The key difference between RLVR and ERL is how they handle unsuccessful attempts. RLVR simply "forgets" the failed attempt and restarts, while ERL engages in "self-reflection" to learn from the experience, explicitly identifying the wall as an obstacle and the ability to manipulate boxes. Both approaches start with "Trial & Error", but diverge in their subsequent processing of that experience.
### Interpretation
The diagram demonstrates a conceptual difference in how reinforcement learning agents can learn. RLVR represents a more basic approach where learning is purely through repeated trial and error without explicit memory or reasoning. ERL, on the other hand, incorporates a form of internal representation and reasoning ("self-reflection") to improve learning efficiency. The "self-reflection" step suggests that ERL is capable of abstracting knowledge from its experiences (e.g., recognizing "🧱 is wall") and using that knowledge to guide future actions. This highlights the potential benefits of incorporating cognitive mechanisms, such as reasoning and memory, into reinforcement learning algorithms. The diagram suggests that ERL is more likely to succeed in complex environments because it doesn't simply repeat mistakes but learns from them. The use of the "❓" symbol next to "no prior knowledge" emphasizes the importance of learning from scratch in an unknown environment.
</details>
Figure 2: Conceptual comparison of learning dynamics in RLVR and Experiential Reinforcement Learning (ERL). RLVR relies on repeated trial-and-error driven by scalar rewards, leading to back-and-forth exploration without durable correction. ERL augments this process with an experience-reflection-consolidation loop that generates a revised attempt and internalizes successful corrections, enabling persistent behavioral improvement.
Large language models are increasingly deployed as decision-making agents that must act, observe feedback, and adapt their behavior in environments with delayed rewards and partial information (Singh et al., 2025; Yang et al., 2025; Song et al., 2025a; Bai et al., 2026). Reinforcement learning offers a natural framework for improving such agents. The task environments typically provide feedback in the form of outcome reward after an agent generates the entire trajectory. In practice, training agents against such sparse and delayed outcome signals remains difficult, as models must implicitly infer how to translate observed failures into corrective behavior, a process that is often unstable and sample-inefficient (Zhang et al., 2025; Shi et al., 2026). These challenges become more pronounced in agentic reasoning tasks, where multi-step decisions could amplify small errors and obscure credit assignment.
Humans address similar challenges through a process often described as experiential learning, in which effective adaptation arises from a cycle of experience, reflection, conceptualization, and experimentation (Kolb, 2014). After observing an outcome, a learner reflects on what occurred, forms revised internal models, and applies those revisions in subsequent attempts. This cycle transforms raw feedback into actionable behavioral corrections before those corrections are consolidated into future behavior. While language models have demonstrated reflection-like capabilities at inference time, standard reinforcement learning pipelines largely reduce feedback to scalar optimization signals, requiring policies to implicitly discover corrective structure through undirected exploration rather than explicit experiential revision.
This perspective highlights a progression in how language models learn from supervision and interaction, illustrated in Figures 1 and 2. In supervised fine-tuning (SFT), policies imitate fixed examples, enabling strong pattern reproduction but offering no mechanism for revising behavior once deployed. Reinforcement learning with verifiable rewards (RLVR) extends learning into interactive settings by optimizing scalar feedback, allowing agents to improve through trial-and-error; however, corrective structure must still be inferred implicitly from sparse or delayed rewards. As visualized in Figure 2, this can lead to repeated exploration without durable behavioral correction. A natural next step is to structure learning around experience itself, transforming feedback into intermediate reasoning that supports explicit revision and consolidation within each episode. Figure 1 conceptualizes this shift as moving from learning purely from feedback toward deliberate learning from experience.
In this work, we introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop inside reinforcement learning. Instead of learning solely from outcome rewards, the model first produces an initial attempt, receives environment feedback, and generates a structured reflection describing how the attempt should be improved. This reflection conditions a refined second attempt, whose outcome is reinforced and internalized into the base policy. By converting feedback into intermediate reasoning signals, ERL enables the model to perform targeted behavioral correction before policy consolidation. Over time, these corrections become part of the policy itself, allowing improved behavior to persist even when reflection is absent at deployment. An overview of the algorithm is shown in Figure 3.
We evaluate ERL across sparse-reward control environments and agentic reasoning benchmarks spanning two model scales. ERL consistently outperforms RLVR in all six evaluated settings, achieving gains of up to +81% in Sokoban, +27% in FrozenLake, and up to +11% in HotpotQA. These results demonstrate that embedding structured experiential revision into training improves learning efficiency and produces stronger final policies across both control and reasoning tasks.
#### Contributions.
Our main contributions are:
- We introduce Experiential Reinforcement Learning (ERL), a reinforcement learning paradigm that incorporates an explicit experience-reflection-consolidation loop, enabling models to transform environment feedback into structured behavioral corrections.
- We propose an internalization mechanism that consolidates reflection-driven improvements into the base policy, preserving gains without requiring reflection at inference time.
- We demonstrate that experiential reinforcement learning improves training efficiency and final performance across agentic reasoning tasks.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: ERL - Experiential Reinforcement Learning
### Overview
The image is a diagram illustrating the process of Experiential Reinforcement Learning (ERL). It depicts a three-stage process: First Attempt (RL), Self-reflection (RL), and Second Attempt (RL), with an additional Internalization (SFT) stage. The diagram uses boxes representing "Policy" components, arrows indicating flow, and symbols representing task input and output. A flame symbol is used to indicate a failure state.
### Components/Axes
The diagram is divided into three vertical columns labeled:
1. "First Attempt (RL)"
2. "Self-reflection (RL)"
3. "Second Attempt (RL)"
Below these columns is a fourth section labeled "Internalization (SFT)".
Key elements and labels:
* **Task:** Labeled as "x" at the input of the "First Attempt (RL)" stage.
* **Policy:** Represented by a hexagonal shape within a rounded rectangle in each stage.
* **Output:** Labeled as "y<sup>(1)</sup>" after the "First Attempt (RL)" and "y<sup>(2)</sup>" after the "Second Attempt (RL)".
* **Env. Feedback:** Labeled as "f" and connected to the output of the "First Attempt (RL)".
* **Self-Reflection:** Labeled as "Δ" and connected to the output of the "Self-reflection (RL)".
* **Cross Episode Memory:** A label pointing to the connection between the "Self-reflection (RL)" and the "Internalization (SFT)" stages.
* **Flame Symbol:** Indicates a failure state in the "First Attempt (RL)", "Self-reflection (RL)", and "Second Attempt (RL)" stages.
### Detailed Analysis / Content Details
The diagram illustrates a feedback loop.
1. **First Attempt (RL):** The "Task" (x) is input into a "Policy" component, resulting in an output "y<sup>(1)</sup>". This output is then fed into "Env. Feedback" (f).
2. **Self-reflection (RL):** The "Env. Feedback" (f) and the "Policy" component are combined (using a circle with a minus sign inside, indicating subtraction or difference) to generate "Self-Reflection" (Δ). This "Self-Reflection" is also stored in "Cross Episode Memory". A flame symbol is present, indicating a potential failure.
3. **Second Attempt (RL):** The "Self-Reflection" (Δ) and the "Policy" component are combined, resulting in an output "y<sup>(2)</sup>". A flame symbol is present, indicating a potential failure.
4. **Internalization (SFT):** The "Cross Episode Memory" feeds into another "Policy" component, representing the "Internalization" stage.
The diagram shows a cyclical process where the initial attempt is followed by self-reflection and a second attempt, with the knowledge gained from the first attempt being used to improve the second. The "Internalization" stage suggests a long-term learning or adaptation process.
### Key Observations
* The presence of the flame symbol in multiple stages suggests that failures are expected and are part of the learning process.
* The "Cross Episode Memory" indicates that information is retained across attempts, enabling learning and improvement.
* The "Internalization" stage suggests a consolidation of learning over time.
* The use of mathematical notation (y<sup>(1)</sup>, y<sup>(2)</sup>, Δ) implies a quantitative aspect to the process.
### Interpretation
This diagram represents a reinforcement learning framework where an agent learns through trial and error. The "First Attempt" represents the initial exploration of a task. The "Self-reflection" stage allows the agent to analyze its performance and identify areas for improvement. The "Second Attempt" utilizes this feedback to refine its strategy. The "Internalization" stage suggests that the learned knowledge is integrated into the agent's long-term memory or policy.
The diagram highlights the importance of feedback and memory in reinforcement learning. The agent doesn't simply try random actions; it learns from its mistakes and uses that knowledge to improve its future performance. The flame symbol suggests that failure is not necessarily a negative outcome, but rather an opportunity for learning.
The use of "RL" and "SFT" suggests that the framework combines Reinforcement Learning with Supervised Fine-Tuning (SFT), potentially leveraging the strengths of both approaches. The diagram provides a high-level overview of the process and doesn't specify the details of the algorithms or techniques used in each stage.
</details>
Figure 3: Overview of Experiential Reinforcement Learning (ERL). Given an input task $x$ , the language model first produces an initial attempt and receives environment feedback. The same model then generates a self-reflection conditioned on this attempt, which is used to guide a second attempt. Both attempts and reflections are optimized with reinforcement learning, while successful second attempts are internalized via self-distillation, so the model learns to reproduce improved behavior directly from the original input without self-reflection.
## 2 Experiential Reinforcement Learning (ERL)
Algorithm 1 Experiential Reinforcement Learning
1: Inputs: Language model $\pi_{\theta}$ ; dataset of questions $x$ ; reward threshold $\tau$ ; environment returning feedback $f$ and reward $r$ .
2: Initialize: reflection memory $m\leftarrow\emptyset$ .
3: repeat
4: Sample question $x$ from the dataset.
5: // First attempt
6: Sample an answer $y^{(1)}\sim\pi_{\theta}(\cdot\mid x)$ .
7: Obtain environment feedback and reward $(f^{(1)},r^{(1)})$ .
8: // Self-reflection
9: Sample a reflection $\Delta\sim\pi_{\theta}(\cdot\mid x,y^{(1)},f^{(1)},r^{(1)},m)$ .
10: // Second attempt
11: Sample a refined answer $y^{(2)}\sim\pi_{\theta}(\cdot\mid x,\Delta)$ .
12: Obtain environment feedback and reward $(f^{(2)},r^{(2)})$ .
13: Set reflection reward $\tilde{r}\leftarrow r^{(2)}$ .
14: Store reflection $m\leftarrow\Delta\;\;$ if $\;\;r^{(2)}>\tau$ .
15: // RL update
16: Update $\theta$ via $\mathcal{L}_{\text{policy}}(\theta)$ over the first attempt, reflection, and second attempt.
17: // Internalization
18: Update $\theta$ via $\mathcal{L}_{\text{distill}}(\theta)$ to internalize reflection, training $\pi_{\theta}$ to produce $y^{(2)}$ from $x$ only.
19: until converged
We introduce Experiential Reinforcement Learning (ERL), a training framework that enables a language model to iteratively improve its behavior through self-generated feedback and internalization. The key idea is to treat reflection as an intermediate reasoning signal that guides a refined second attempt, while reinforcement learning aligns both attempts with reward, and supervised distillation consolidates successful behaviors into the base policy. An overview is shown in Figure 3, and the core training loop appears in Algorithm 1. A detailed implementation, including memory persistence and gating logic, is provided in Appendix A.
Given an input task $x$ , the model $\pi_{\theta}$ first produces an initial response
$$
y^{(1)}\sim\pi_{\theta}(\cdot\mid x), \tag{1}
$$
which is evaluated by the environment to produce textual feedback $f^{(1)}$ and reward $r^{(1)}$ . Rather than immediately updating the policy, ERL optionally triggers a reflection-and-retry phase when the first attempt underperforms relative to a reward threshold $\tau$ . This selective retry mechanism focuses compute on trajectories that are most likely to benefit from revision while avoiding unnecessary refinement when performance is already sufficient. When triggered, the model generates a reflection
$$
\Delta\sim\pi_{\theta}(\cdot\mid x,y^{(1)},f^{(1)},r^{(1)},m), \tag{2}
$$
which serves as self-guidance describing how the initial attempt can be improved. Here, $m$ denotes a cross-episode reflection memory that persists successful corrective patterns discovered during training. This memory provides contextual priors that help stabilize reflection generation and encourage reuse of previously effective strategies. The model then produces a refined response
$$
y^{(2)}\sim\pi_{\theta}(\cdot\mid x,\Delta), \tag{3}
$$
and receives new feedback $(f^{(2)},r^{(2)})$ . Reflections that lead to sufficiently improved outcomes are stored back into memory,
$$
m\leftarrow\Delta\quad\text{if}\quad r^{(2)}>\tau, \tag{4}
$$
allowing corrective knowledge to accumulate across training episodes. The reflection is assigned reward $\tilde{r}=r^{(2)}$ , encouraging reflections that lead to improved downstream performance.
Both attempts and reflections are optimized using a reinforcement learning objective
$$
\mathcal{L}_{\text{policy}}(\theta)=-\mathbb{E}\!\left[A\,\log\pi_{\theta}(y\mid x,\cdot)\right],
$$
where $y$ denotes model outputs arising from the first attempt, reflection, or second attempt, and the conditioning context corresponds to the inputs specified in Algorithm 1. The advantage estimate $A$ is computed from the associated rewards.
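As a concrete illustration, the objective above with a GRPO-style group-normalized advantage can be sketched as follows. The function name, the use of the sample standard deviation, and the summed trajectory log-probabilities are our assumptions, not the authors' exact implementation:

```python
import math

def policy_loss(log_probs, rewards, eps=1e-6):
    """Sketch of L_policy = -E[A * log pi(y | x, .)].

    `log_probs[i]` is log pi_theta(y_i | x, .) for rollout i in a group sampled
    from the same task; `rewards[i]` is its scalar reward. The advantage A is
    the group-normalized reward, in the spirit of GRPO (normalization details
    are an assumption).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / max(n - 1, 1))
    advantages = [(r - mean) / (std + eps) for r in rewards]
    # Negative sign: minimizing this loss performs gradient ascent on the
    # advantage-weighted log-likelihood of sampled outputs
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / n
```

Note that when all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason the selective retry mechanism concentrates compute on tasks where outcomes still vary.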
While reflection and environment feedback provide strong training signals, such supervision is typically unavailable at deployment time, where the model must operate in a zero-shot setting. We therefore introduce an internalization step that converts reflection-guided improvements into persistent policy behavior. The goal is to make the model remember corrections discovered during training and avoid repeating the same mistakes when feedback is absent. We implement internalization via selective distillation: we supervise the model to imitate only successful second attempts while removing reflection context from the input. Concretely, given a training example $x$ , we generate a refined response $y^{(2)}$ and reward $r^{(2)}$ , and optimize
$$
\mathcal{L}_{\text{distill}}(\theta)=-\mathbb{E}\Big[\mathbb{I}\!\left(r^{(2)}>0\right)\,\log\pi_{\theta}\!\left(y^{(2)}\mid x\right)\Big], \tag{5}
$$
where $\mathbb{I}(\cdot)$ is the indicator function. This trains $\pi_{\theta}$ to reproduce improved behavior from the original input $x$ alone (no reflection), ensuring that lessons learned through feedback and self-reflection persist at test time.
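A minimal sketch of this selective distillation loss, assuming per-token log-probabilities of $y^{(2)}$ conditioned on $x$ alone are already available (the batch layout and field names are illustrative, not the authors' implementation):

```python
def distill_loss(batch):
    """Sketch of L_distill = -E[ 1(r2 > 0) * log pi(y2 | x) ].

    Each example holds the second-attempt reward `r2` and the per-token
    log-probs of y2 under pi_theta given the bare input x (no reflection
    in the conditioning context).
    """
    total = 0.0
    for example in batch:
        if example["r2"] > 0:  # indicator: only successful second attempts
            # log pi(y2 | x) is the sum of per-token log-probs
            total -= sum(example["token_log_probs"])
    return total / max(len(batch), 1)
```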
By alternating between reinforcement learning, selective reflection, and distillation, ERL bootstraps self-improvement: reflections guide higher-quality retries, memory preserves effective corrective structure, reinforcement learning aligns behavior with reward, and distillation internalizes gains into the core model. Over time, this interaction stabilizes training, concentrates exploration on failure cases, and reduces dependence on explicit reflection at inference.
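The per-episode logic of Algorithm 1 can be sketched as a single function. Here `sample` and `evaluate` stand in for the language model and environment, and the prompt formats, threshold default, and return structure are our assumptions:

```python
def erl_episode(sample, evaluate, x, memory, tau=0.5):
    """One ERL episode: first attempt, optional reflection-and-retry, memory update.

    sample(prompt) -> model output; evaluate(x, y) -> (feedback, reward).
    Returns the trajectories to reinforce and the updated reflection memory.
    """
    # First attempt: y1 ~ pi_theta(. | x)
    y1 = sample(x)
    f1, r1 = evaluate(x, y1)

    # Skip the retry phase when the first attempt already clears the threshold
    if r1 > tau:
        return {"trajectories": [(x, y1, r1)], "memory": memory}

    # Self-reflection: Delta ~ pi_theta(. | x, y1, f1, r1, m)
    reflect_prompt = f"{x}\nAttempt: {y1}\nFeedback: {f1} (reward {r1})\nMemory: {memory}"
    delta = sample(reflect_prompt)

    # Second attempt conditioned on the reflection: y2 ~ pi_theta(. | x, Delta)
    retry_prompt = f"{x}\nReflection: {delta}"
    y2 = sample(retry_prompt)
    f2, r2 = evaluate(x, y2)

    # m <- Delta if r2 > tau; the reflection itself receives reward r_tilde = r2
    if r2 > tau:
        memory = delta
    return {
        "trajectories": [(x, y1, r1), (reflect_prompt, delta, r2), (retry_prompt, y2, r2)],
        "memory": memory,
    }
```

The returned trajectories feed the RL update, while the successful second attempt (paired with the bare input `x`) would additionally feed the distillation step.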
### 2.1 Comparison to Standard RLVR
Standard reinforcement learning with verifiable rewards (RLVR) optimizes a policy directly from scalar outcome signals. Given an input $x$ , the model samples a response $y\sim\pi_{\theta}(\cdot\mid x)$ and receives a reward $r$ , with policy updates derived from trajectory-level credit assignment. In this formulation, feedback influences learning only through reward-driven optimization, requiring the model to implicitly discover how failures should translate into behavioral change. Corrective structure therefore emerges slowly through repeated exploration, with no explicit mechanism for revising behavior within the same learning episode. This learning dynamic corresponds to trial-and-error optimization, as illustrated in Figure 2.
Experiential Reinforcement Learning (ERL) augments this loop with an explicit experience-reflection-consolidation stage embedded inside each trajectory. Instead of optimizing solely from outcome reward, the model converts environment feedback into a reflection that conditions a refined attempt. This intermediate revision produces a locally improved trajectory that is reinforced and later internalized through selective distillation, allowing the base policy to reproduce corrected behavior without reflection at inference. A cross-episode reflection memory further stabilizes this process by preserving corrective patterns that proved effective, allowing subsequent reflections to reuse prior improvements. Importantly, ERL preserves the underlying RLVR objective: policy gradients remain reward-driven, but operate over a richer trajectory structure that includes explicit behavioral correction. This reframing shifts feedback from a scalar endpoint signal to a catalyst for immediate revision, reducing reliance on undirected exploration while maintaining compatibility with standard reinforcement learning pipelines. This contrast between blind trial-and-error learning and reflection-guided revision is visualized in Figure 1 and Figure 2.
## 3 Experiment
We evaluate Experiential Reinforcement Learning (ERL) against standard RLVR on a set of agentic reasoning tasks.
### 3.1 Task
We evaluate ERL on three agentic reasoning tasks: Frozen Lake, Sokoban, and HotpotQA (Yang et al., 2018). Detailed environment descriptions are provided in Appendix B.
For Frozen Lake and Sokoban, we configure the environments with sparse terminal rewards following Wang et al. (2025) and Guertler et al. (2025). The agent receives reward only at episode completion: a reward of +1 is assigned for successfully achieving the objective and 0 otherwise. Crucially, we do not provide explicit game rules or environment dynamics. The model must infer task structure purely through interaction, with access limited to the available action set. This evaluation design is inspired by prior work on learning from experience, where the goal is to measure an agent's ability to acquire task understanding through trial-and-error rather than relying on human-authored priors embedded in pretraining. The combination of sparse rewards and unknown dynamics therefore creates a challenging setting that emphasizes reasoning, planning, and experiential learning.
HotpotQA is adapted into an agentic multi-hop question-answering task following Search-R1 (Jin et al., 2025). Given a question, the model performs iterative tool-assisted retrieval before producing a final answer. To maintain consistency with the experiential learning setup, we provide only a default system prompt describing available tools, without additional task-specific guidance. Correctness is evaluated using token-level F1 against ground-truth answers. The reward function assigns 1.0 for exact matches, a proportional reward for partial matches with F1 score $\geq 0.3$ , and 0 otherwise.
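The reward described above can be sketched as follows; using the F1 value itself as the "proportional" partial reward is our assumption, as is the simple whitespace tokenization:

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer (whitespace tokens)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def answer_reward(pred, gold):
    """1.0 for an exact match, F1 for partial matches with F1 >= 0.3, else 0."""
    if pred.strip().lower() == gold.strip().lower():
        return 1.0
    f1 = token_f1(pred, gold)
    return f1 if f1 >= 0.3 else 0.0
```

For example, predicting "Obama" against the gold answer "Barack Obama" yields precision 1.0 and recall 0.5, so the partial reward is the F1 of 2/3, while an answer sharing no tokens with the gold answer receives 0.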
### 3.2 Models and Baselines
In our experiments, we train Olmo-3-7B-Instruct (Olmo et al., 2025) and Qwen3-4B-Instruct-2507 (Yang et al., 2025) using both standard RLVR and our proposed ERL paradigm, with GRPO (Shao et al., 2024) serving as the underlying policy-gradient optimizer in all cases. To ensure stable training, we adopt common reinforcement learning techniques such as clipping, KL regularization, and importance sampling. Notably, the internalization stage in ERL naturally involves off-policy data, which can introduce additional instability. We therefore apply the same stabilization techniques during this phase to maintain consistent optimization behavior. Additionally, because ERL requires two attempts per task along with an additional reflection step, we allocate 10 rollouts per task for RLVR and half as many per task per attempt for ERL to equalize the training compute per task across methods. Full hyperparameters and implementation details are provided in Appendix C.
## 4 Result and Discussion
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Reinforcement Learning Reward Curves
### Overview
The image presents six line charts displaying reward curves for two reinforcement learning algorithms, ERL (green) and RLVR (blue), across three different environments: FrozenLake, HotpotQA, and Sokoban. Each environment is evaluated using two language models: Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct. The x-axis represents training wall-clock time in hours, and the y-axis represents the reward.
### Components/Axes
* **X-axis Label (all charts):** "Training wall-clock time (hours)"
* **Y-axis Label (top row):** "Reward" (for Qwen3-4B-Instruct-2507)
* **Y-axis Label (bottom row):** "Reward" (for Olmo-3-7B-Instruct)
* **Legend (top-left of each chart):**
* Green Line: "ERL"
* Blue Line: "RLVR"
* **Chart Titles (top of each chart):**
* "FROZENLAKE"
* "HOTPOTQA"
* "SOKOBAN"
* **Model Labels (left side of each 2x3 grid):**
* "Qwen3-4B-Instruct-2507"
* "Olmo-3-7B-Instruct"
* **Y-axis Scale (approximate):**
* FrozenLake (Qwen): 0.20 to 0.80
* HotpotQA (Qwen): 0.32 to 0.48
* Sokoban (Qwen): 0.00 to 0.80
* FrozenLake (Olmo): 0.20 to 0.50
* HotpotQA (Olmo): 0.24 to 0.48
* Sokoban (Olmo): 0.04 to 0.16
### Detailed Analysis or Content Details
**1. FrozenLake (Qwen3-4B-Instruct-2507):**
* ERL (Green): Starts at approximately 0.22, increases rapidly to around 0.75 by 6 hours, and plateaus around 0.80.
* RLVR (Blue): Starts at approximately 0.25, increases to around 0.45 by 4 hours, then fluctuates between 0.40 and 0.50, ending around 0.48.
**2. HotpotQA (Qwen3-4B-Instruct-2507):**
* ERL (Green): Starts at approximately 0.35, increases steadily to around 0.47 by 3 hours, and remains relatively stable around 0.48.
* RLVR (Blue): Starts at approximately 0.40, initially decreases slightly, then increases to around 0.45 by 2 hours, and fluctuates between 0.40 and 0.44, ending around 0.42.
**3. Sokoban (Qwen3-4B-Instruct-2507):**
* ERL (Green): Starts at approximately 0.20, increases steadily to around 0.60 by 16 hours, and fluctuates between 0.50 and 0.80.
* RLVR (Blue): Starts at approximately 0.20, increases slowly to around 0.30 by 16 hours, and fluctuates between 0.20 and 0.40, ending around 0.30.
**4. FrozenLake (Olmo-3-7B-Instruct):**
* ERL (Green): Starts at approximately 0.22, increases rapidly to around 0.45 by 6 hours, and plateaus around 0.50.
* RLVR (Blue): Starts at approximately 0.25, increases to around 0.35 by 4 hours, and remains relatively stable around 0.30.
**5. HotpotQA (Olmo-3-7B-Instruct):**
* ERL (Green): Starts at approximately 0.25, increases steadily to around 0.45 by 3 hours, and remains relatively stable around 0.46.
* RLVR (Blue): Starts at approximately 0.28, increases to around 0.35 by 2 hours, and fluctuates between 0.30 and 0.35, ending around 0.32.
**6. Sokoban (Olmo-3-7B-Instruct):**
* ERL (Green): Starts at approximately 0.04, increases slowly to around 0.12 by 32 hours, and fluctuates between 0.08 and 0.16.
* RLVR (Blue): Starts at approximately 0.04, increases slowly to around 0.08 by 32 hours, and remains relatively stable around 0.06.
### Key Observations
* ERL consistently outperforms RLVR across all environments and language models.
* The reward curves for FrozenLake and HotpotQA reach a plateau relatively quickly, while Sokoban shows more prolonged learning and fluctuation.
* The Olmo-3-7B-Instruct model generally yields lower rewards compared to Qwen3-4B-Instruct-2507, particularly in the Sokoban environment.
* Sokoban consistently has the lowest reward values across all configurations.
### Interpretation
The data suggests that ERL is a more effective reinforcement learning algorithm than RLVR for the tested environments and language models. The differences in performance are particularly pronounced in the Sokoban environment, indicating that ERL may be better suited for more complex tasks. The lower rewards obtained with the Olmo-3-7B-Instruct model could be due to its different architecture or training data, potentially impacting its ability to learn optimal policies. The plateauing of reward curves in FrozenLake and HotpotQA suggests that the algorithms converge relatively quickly in these environments, while the continued fluctuation in Sokoban indicates that further training or algorithm adjustments may be needed to achieve optimal performance. The Sokoban environment appears to be the most challenging, requiring significantly more training time to achieve even modest rewards.
</details>
Figure 4: Validation reward trajectories versus training wall-clock time on FrozenLake, HotpotQA, and Sokoban for Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct. ERL consistently achieves higher reward and faster improvement than RLVR across tasks and models.
We evaluate Experiential Reinforcement Learning (ERL) against standard RLVR across three environments spanning sparse-reward control (FrozenLake, Sokoban) and agentic reasoning (HotpotQA). Table 1 summarizes the final performance, while Figures 5–6 visualize the performance and learning dynamics. All curves are smoothed with a trailing moving average over 5 points. The same smoothing procedure is applied to all figures unless otherwise noted.
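For reference, the trailing moving average used to smooth the curves can be written as a few lines of plain Python (a sketch; the paper's plotting code is not shown):

```python
def trailing_moving_average(values, window=5):
    """Trailing moving average: each point is the mean of itself and up to
    window-1 preceding points, so early points average over shorter spans."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Because the window only looks backward, the smoothed curve never uses future points, which avoids shifting apparent improvements earlier in time.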
### 4.1 Performance Across Tasks
ERL consistently improves final evaluation performance over RLVR across all tasks and both model backbones. As shown in Table 1 and Figure 5, experiential training yields gains ranging from moderate improvements on HotpotQA to substantial improvements on Sokoban and FrozenLake.
The largest effect occurs in Sokoban, where Qwen3-4B-Instruct-2507 improves from 0.06 to 0.87 and Olmo-3-7B-Instruct from 0.04 to 0.20. Sokoban requires long-horizon planning and recovery from compounding errors, making performance sensitive to how well the agent reasons about environment dynamics. Similarly, FrozenLake demands that the agent infer symbol semantics, action consequences, and terminal conditions purely through interaction under sparse rewards. Importantly, as described in Section 3, unlike many prior evaluation setups that provide explicit rules or environment structure, our environments expose only observations and action interfaces; the agent must infer task dynamics through trial and error. This design emphasizes learning from experience rather than relying on pre-specified priors, making structured revision particularly valuable. In these settings, the experience–reflection–consolidation loop enables the model to analyze failures, revise strategies, and internalize corrective behavior within each episode, producing large improvements in exploration efficiency and policy quality.
HotpotQA shows smaller but reliable gains. A likely explanation lies in differences in task structure. Compared to the grid-based control environments, HotpotQA presents a more homogeneous interaction pattern centered on repeated tool invocation and answer synthesis, with denser evaluation feedback and fewer latent dynamics to infer. Because RLVR already receives relatively informative gradients in this regime, the additional benefit of structured experiential revision is reduced. This contrast suggests that ERL yields the greatest advantage in environments where learning requires substantial reasoning about unknown dynamics and long-horizon consequences, rather than primarily optimizing over a stable interaction loop.
Importantly, improvements are observed across both models, indicating that the benefits of ERL arise from enhanced learning dynamics rather than architecture-specific effects.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Reward Comparison Across Environments
### Overview
The image presents a bar chart comparing the reward achieved by four different algorithms (ERL, RLVR, Qwen, and Olmo) across three distinct environments: FrozenLake, HotpotQA, and Sokoban. The reward is represented on the y-axis, while the x-axis represents the different algorithms within each environment.
### Components/Axes
* **Y-axis:** Labeled "Reward", with a scale ranging from 0.00 to 1.00, incrementing by 0.20.
* **X-axis:** Represents the algorithms being compared within each environment.
* **Environments:** Three environments are displayed: FrozenLake, HotpotQA, and Sokoban, each with its own set of bars.
* **Legend:** Located at the top-left corner, identifying the algorithms by color:
* ERL (Green)
* RLVR (Blue)
* Qwen (Light Green)
* Olmo (Purple)
### Detailed Analysis
**FrozenLake:**
* ERL: The green bar for ERL in FrozenLake reaches approximately 0.94.
* RLVR: The blue bar for RLVR in FrozenLake reaches approximately 0.86.
* Qwen: The light green bar for Qwen in FrozenLake reaches approximately 0.39.
* Olmo: The purple bar for Olmo in FrozenLake reaches approximately 0.66.
**HotpotQA:**
* ERL: The green bar for ERL in HotpotQA reaches approximately 0.56.
* RLVR: The blue bar for RLVR in HotpotQA reaches approximately 0.45.
* Qwen: The light green bar for Qwen in HotpotQA reaches approximately 0.47.
* Olmo: The purple bar for Olmo in HotpotQA reaches approximately 0.50.
**Sokoban:**
* ERL: The green bar for ERL in Sokoban reaches approximately 0.87.
* RLVR: The blue bar for RLVR in Sokoban reaches approximately 0.06.
* Qwen: The light green bar for Qwen in Sokoban reaches approximately 0.20.
* Olmo: The purple bar for Olmo in Sokoban reaches approximately 0.04.
### Key Observations
* ERL consistently achieves high rewards across all three environments, often the highest.
* RLVR performs well in FrozenLake and HotpotQA, but significantly underperforms in Sokoban.
* Qwen's performance is moderate across all environments.
* Olmo's performance is variable, with moderate performance in FrozenLake and HotpotQA, but very low performance in Sokoban.
* Sokoban appears to be the most challenging environment, as the reward values are generally lower compared to FrozenLake and HotpotQA.
### Interpretation
The data suggests that the ERL algorithm is the most robust and effective across the tested environments. It consistently achieves the highest rewards, indicating its ability to learn and perform well in diverse problem settings. RLVR demonstrates good performance in simpler environments like FrozenLake and HotpotQA, but struggles with the complexity of Sokoban. Qwen provides a moderate level of performance, while Olmo's performance is inconsistent.
The significant difference in reward values across environments highlights the varying difficulty levels of the tasks. Sokoban, with its lower reward values, likely presents a more complex challenge for the algorithms, requiring more sophisticated learning strategies. The comparison of these algorithms provides valuable insights into their strengths and weaknesses, guiding future research and development efforts in reinforcement learning. The data suggests that algorithm performance is highly environment-dependent, and a single algorithm may not be optimal for all tasks.
</details>
Figure 5: Final evaluation reward on FrozenLake, HotpotQA, and Sokoban. ERL consistently outperforms RLVR for both Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct.
### 4.2 Learning Efficiency and Optimization Dynamics
Figure 4 compares validation reward against wall-clock training time. Across tasks and models, ERL reaches higher reward earlier and maintains a persistent margin over RLVR. This acceleration is especially pronounced in FrozenLake and Sokoban, where RLVR progresses gradually while ERL rapidly approaches high-reward behavior.
These dynamics suggest that reflection introduces an intermediate corrective signal that reshapes exploration. Instead of relying solely on terminal reward propagation, the model conditions on feedback and self-generated critique to revise its behavior. This concentrates training updates on trajectories that are already partially aligned with the objective, reducing inefficient exploration.
Even in HotpotQA, where rewards are denser and the environment is comparatively simpler, ERL maintains a consistent performance advantage over RLVR. Across environments, these results indicate that ERL achieves higher final reward while improving learning efficiency, demonstrating that structured experiential revision leads to faster and more effective policy improvement.
### 4.3 Mechanistic Role of Reflection
Figure 6 shows training reward trajectories for ERL before and after the reflection step, alongside RLVR. Across environments and models, post-reflection trajectories consistently achieve higher training reward than pre-reflection trajectories and also exceed RLVR.
This comparison highlights the immediate within-episode effect of reflection. After observing feedback from the first attempt, the model generates a structured revision that guides a second attempt with improved actions. The resulting gain in training reward indicates that reflection produces actionable corrections within the same episode, rather than only shaping behavior over long horizons. The sustained separation between pre- and post-reflection curves throughout training suggests that reflection serves as a systematic revision mechanism. By converting observed outcomes into targeted adjustments, it improves the quality of second attempts, which are subsequently reinforced and contribute to longer-term policy improvement.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Charts: Reinforcement Learning Reward Curves
### Overview
The image presents six line charts, arranged in a 2x3 grid, displaying reward curves for different reinforcement learning algorithms across three environments: FrozenLake, HotpotQA, and Sokoban. Two models are compared: Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct. The algorithms are ERL (with pre- and post-reflection) and RLVR. The charts show how the reward changes with the number of training steps.
### Components/Axes
* **Y-axis (Vertical):** Reward. Scales vary per chart, ranging from approximately 0.00 to 1.00.
* **X-axis (Horizontal):** Training steps. Scales vary per chart, ranging from 0 to 200.
* **Lines:**
* ERL: Post-refl. (Light Green)
* ERL: Pre-refl. (Medium Turquoise)
* RLVR (Dark Blue)
* **Chart Titles (Top Row):** FROZENLAKE, HOTPOTQA, SOKOBAN
* **Y-axis Labels (Left Column):** Qwen3-4B-Instruct-2507 Reward, Olmo-3-7B-Instruct Reward
* **X-axis Labels (Bottom Row):** Training steps, Training steps, Training steps
### Detailed Analysis or Content Details
**1. FrozenLake (Qwen3-4B-Instruct-2507)**
* **Trend:** All three lines generally slope upwards, indicating increasing reward with training. ERL Post-refl. shows the steepest initial increase.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.20, reaches ~0.95 at 100 steps.
* ERL Pre-refl.: Starts at ~0.20, reaches ~0.75 at 100 steps.
* RLVR: Starts at ~0.25, reaches ~0.50 at 100 steps.
**2. HotpotQA (Qwen3-4B-Instruct-2507)**
* **Trend:** All lines fluctuate, but generally show an upward trend. ERL Post-refl. consistently achieves the highest reward.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.45, reaches ~0.75 at 100 steps.
* ERL Pre-refl.: Starts at ~0.40, reaches ~0.60 at 100 steps.
* RLVR: Starts at ~0.40, reaches ~0.50 at 100 steps.
**3. Sokoban (Qwen3-4B-Instruct-2507)**
* **Trend:** All lines show an upward trend, but with more significant fluctuations. ERL Post-refl. shows the most rapid increase towards the end of the training.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.05, reaches ~0.80 at 200 steps.
* ERL Pre-refl.: Starts at ~0.05, reaches ~0.40 at 200 steps.
* RLVR: Starts at ~0.02, reaches ~0.20 at 200 steps.
**4. FrozenLake (Olmo-3-7B-Instruct)**
* **Trend:** Similar to the Qwen model, all lines increase with training. ERL Post-refl. shows the fastest initial growth.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.15, reaches ~0.48 at 100 steps.
* ERL Pre-refl.: Starts at ~0.15, reaches ~0.35 at 100 steps.
* RLVR: Starts at ~0.20, reaches ~0.30 at 100 steps.
**5. HotpotQA (Olmo-3-7B-Instruct)**
* **Trend:** All lines fluctuate and generally increase. ERL Post-refl. consistently has the highest reward.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.30, reaches ~0.60 at 100 steps.
* ERL Pre-refl.: Starts at ~0.30, reaches ~0.50 at 100 steps.
* RLVR: Starts at ~0.30, reaches ~0.40 at 100 steps.
**6. Sokoban (Olmo-3-7B-Instruct)**
* **Trend:** All lines show an upward trend with fluctuations. ERL Post-refl. demonstrates the most significant increase towards the end of training.
* **Data Points (approximate):**
* ERL Post-refl.: Starts at ~0.02, reaches ~0.12 at 200 steps.
* ERL Pre-refl.: Starts at ~0.02, reaches ~0.08 at 200 steps.
* RLVR: Starts at ~0.01, reaches ~0.06 at 200 steps.
### Key Observations
* ERL Post-refl. consistently outperforms ERL Pre-refl. and RLVR across all environments and model sizes.
* The performance gap between the algorithms is most pronounced in the Sokoban environment.
* The Olmo-3-7B-Instruct model generally achieves lower rewards than the Qwen3-4B-Instruct-2507 model, especially in the Sokoban environment.
* All algorithms show diminishing returns as training progresses, with the rate of reward increase slowing down.
### Interpretation
The data suggests that the post-reflection technique significantly improves the performance of the ERL algorithm in reinforcement learning tasks. The consistent outperformance of ERL Post-refl. across different environments indicates that this technique is robust and effective. The lower rewards achieved by the Olmo-3-7B-Instruct model, particularly in Sokoban, could be due to several factors, including differences in model architecture, training data, or hyperparameter settings. The diminishing returns observed in all algorithms suggest that further training may not yield substantial improvements in performance. The fluctuations in reward curves, especially in HotpotQA, may indicate the stochastic nature of the environment or the learning process. The Sokoban environment appears to be the most challenging, as evidenced by the lower overall reward values and the greater variability in performance. This could be due to the complexity of the task or the difficulty of exploring the state space.
</details>
Figure 6: Training reward trajectories for Qwen3-4B-Instruct-2507 and Olmo-3-7B-Instruct comparing RLVR with ERL before and after reflection. Post-reflection trajectories consistently achieve higher reward than both RLVR and pre-reflection trajectories.
### 4.4 Ablation Study: Memory and Reflection Mechanisms
| Task | RLVR | ERL | ERL w/o Mem. | ERL w/o Refl. |
| --- | --- | --- | --- | --- |
| **Qwen3-4B-Instruct-2507** | | | | |
| FrozenLake | 0.86 | 0.94 | 0.86 (-0.08) | 0.60 (-0.34) |
| HotpotQA | 0.45 | 0.56 | 0.56 (-0.00) | 0.48 (-0.08) |
| Sokoban | 0.06 | 0.87 | 0.87 (-0.00) | 0.59 (-0.28) |
| **Olmo-3-7B-Instruct** | | | | |
| FrozenLake | 0.39 | 0.66 | 0.64 (-0.02) | 0.54 (-0.12) |
| HotpotQA | 0.47 | 0.50 | 0.47 (-0.03) | 0.46 (-0.04) |
| Sokoban | 0.04 | 0.20 | 0.24 (+0.04) | 0.06 (-0.14) |
Table 1: Final evaluation reward on FrozenLake, HotpotQA, and Sokoban. ERL performance is compared against ablation variants, with highlighted drops showing the performance degradation relative to ERL when removing memory reuse (w/o Mem.) or structured reflection (w/o Refl.).
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Reward vs. Training Time for Reinforcement Learning Algorithms
### Overview
This line chart depicts the relationship between reward and training wall-clock time for four different reinforcement learning algorithms: ERL w/o Mem., ERL, ERL w/o Refl., and RLVR. The chart shows how the reward obtained by each algorithm changes as the training time increases.
### Components/Axes
* **X-axis:** Training wall-clock time (hours), ranging from 0 to 8 hours.
* **Y-axis:** Reward, ranging from 0.20 to 0.90.
* **Legend:** Located at the top-center of the chart, identifying each line with a specific color and label:
* ERL w/o Mem. (dashed light blue)
* ERL (solid teal)
* ERL w/o Refl. (dotted green)
* RLVR (solid dark blue)
### Detailed Analysis
* **ERL w/o Mem. (dashed light blue):** This line starts at approximately 0.22 at 0 hours, increases rapidly to around 0.85 by 4 hours, plateaus around 0.82-0.85 between 4 and 7 hours, and then decreases slightly to approximately 0.78 at 8 hours.
* **ERL (solid teal):** This line begins at approximately 0.21 at 0 hours, increases steadily to around 0.75 by 4 hours, continues to increase to approximately 0.78 by 7 hours, and then remains relatively stable at around 0.77-0.78 at 8 hours.
* **ERL w/o Refl. (dotted green):** This line starts at approximately 0.23 at 0 hours, increases gradually to around 0.50 by 4 hours, continues to increase to approximately 0.58 by 8 hours.
* **RLVR (solid dark blue):** This line begins at approximately 0.20 at 0 hours, increases steadily to around 0.55 by 4 hours, continues to increase to approximately 0.68 by 8 hours.
### Key Observations
* ERL w/o Mem. achieves the highest reward values, particularly between 4 and 7 hours, but experiences a slight decline at 8 hours.
* ERL consistently outperforms RLVR and ERL w/o Refl. throughout the entire training period.
* ERL w/o Refl. exhibits the slowest reward increase and remains the lowest performing algorithm.
* RLVR shows a steady increase in reward, but lags behind ERL and ERL w/o Mem.
### Interpretation
The data suggests that incorporating memory into the ERL algorithm (ERL w/o Mem.) significantly improves performance, as evidenced by the higher reward values achieved. However, the slight decline in reward at 8 hours for ERL w/o Mem. could indicate overfitting or a need for further refinement. The ERL algorithm, while not as high-performing as ERL w/o Mem., still demonstrates a consistent and positive trend. Removing reflection (ERL w/o Refl.) appears to hinder the learning process, resulting in the lowest reward values. RLVR shows a reasonable learning curve, but is consistently outperformed by the ERL variants. The chart highlights the importance of memory and reflection in the ERL algorithm for achieving optimal performance in this reinforcement learning task. The plateauing of ERL w/o Mem. suggests diminishing returns from continued training, while the continued increase of other algorithms indicates they may benefit from longer training times.
</details>
Figure 7: Ablation study on Qwen3-4B-Instruct-2507 in FrozenLake. We compare full ERL with two variants: (1) no memory, which disables cross-episode reflection reuse, and (2) no reflection, which replaces structured self-reflection with raw first-attempt context and a generic retry instruction.
To understand how structured reflection and cross-episode memory contribute to performance, we conduct ablation studies across tasks and models. The quantitative results are reported in Table 1, and representative learning dynamics for FrozenLake with Qwen3-4B-Instruct-2507 are shown in Figure 7. These experiments isolate individual components of ERL while keeping the overall training setup fixed.
The no-memory variant disables cross-episode reflection storage. Reflections are still generated and used to guide the second attempt within each episode, but they are not retained for reuse in future episodes. As a result, corrective signals remain local to individual trajectories rather than accumulating into persistent behavioral priors.
The no-reflection variant preserves the two-attempt interaction structure but removes explicit structured reflection. Instead, the model receives the full first-attempt interaction history together with a generic instruction encouraging improvement. This design tests whether contextual reuse alone can replicate the benefits of structured reflective reasoning. The prompt template used in this setting is shown in Table 9 (Appendix).
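Putting the two ablations in context, the episode-level control flow of ERL and its variants can be sketched as below. `policy.act`, `policy.reflect`, `env.run`, and the memory interface are hypothetical names chosen for exposition, not the actual implementation:

```python
def erl_episode(task, policy, env, memory, use_memory=True, use_reflection=True):
    """One ERL episode with the two ablation switches: use_memory controls
    cross-episode reflection reuse, and use_reflection swaps structured
    self-reflection for raw first-attempt context plus a generic retry
    instruction (the no-reflection variant)."""
    # Retrieve stored reflections from past episodes (full ERL only).
    prior = memory.retrieve(task) if use_memory else []

    # First attempt, conditioned on any retrieved experience.
    traj1, reward1 = env.run(task, policy.act, context=prior)

    if use_reflection:
        # Structured reflection on the first attempt and its feedback.
        note = policy.reflect(task, traj1, reward1)
        retry_context = prior + [note]
    else:
        # No-reflection ablation: reuse the raw interaction history with
        # only a generic instruction encouraging improvement.
        note = None
        retry_context = prior + [traj1, "Previous attempt shown above; try again and improve."]

    # Second attempt guided by the reflection (or raw context).
    traj2, reward2 = env.run(task, policy.act, context=retry_context)

    # Persist the reflection for future episodes (full ERL only).
    if use_memory and note is not None:
        memory.store(task, note)

    # Both attempts feed the policy-gradient update; successful second
    # attempts are what gets internalized into the base policy.
    return (traj1, reward1), (traj2, reward2)
```

Seen this way, the no-memory variant only drops the `retrieve`/`store` calls, while the no-reflection variant removes the intermediate reasoning step entirely, which is consistent with the larger degradation it shows in Table 1.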
The results in Table 1 show a consistent ordering across most tasks and models: full ERL achieves the strongest performance, followed by the no-memory variant, while the no-reflection variant exhibits the largest degradation in most settings. Figure 7 further illustrates that removing memory slows convergence, whereas removing reflection substantially reduces both learning speed and final reward. These findings support the core design intuition of ERL: reflection generates actionable behavioral corrections, and memory propagates those corrections across episodes to enable cumulative refinement.
At the same time, Table 1 reveals an important caveat. In the Olmo-3-7B-Instruct Sokoban setting, the no-memory variant slightly outperforms full ERL. This suggests that when a model's self-reflective ability is limited, or when the environment is complex and stochastic, persistent memory may propagate early inaccurate reflections, making recovery more difficult. In such cases, disabling cross-episode memory can mitigate the accumulation of erroneous priors. Nevertheless, across the broad set of tasks and models evaluated, ERL consistently delivers the strongest overall performance, demonstrating that structured reflection combined with persistent memory is highly effective in most practical settings.
## 5 Related Work
#### Reinforcement Learning for LLMs.
Reinforcement learning has become a central approach for improving large language models. Early work focused on reinforcement learning from human feedback (RLHF) to align model behavior with human preferences and conversational objectives (Ouyang et al., 2022; Christiano et al., 2023; Shi et al., 2024; 2025). More recent efforts extend RL to enhance mathematical reasoning, where verifiable or programmatic rewards derived from executable checks or formal answer verification provide structured supervision for reasoning and solution construction (OpenAI et al., 2024; Guo et al., 2025; Song et al., 2025b; Shi et al., 2026). In parallel, research on tool-using and agentic LLMs treats the model as a policy that interacts with external environments, alternating between actions and observations under task-dependent rewards to solve multi-step problems (Yao et al., 2023; Jin et al., 2025; Bai et al., 2026; Jiang et al., 2026). Despite their different goals, these approaches primarily treat environment feedback as a scalar optimization signal propagated through policy gradients, requiring the model to implicitly infer corrective structure through exploration. In contrast, our ERL paradigm introduces an explicit experience-reflection-consolidation loop that transforms environment feedback into structured behavioral revision before internalizing improvements into the base policy.
#### Learning from Experience.
A growing body of work argues that the next scaling regime for AI will come not from more static human text, but from agents generating ever-richer data through interaction, i.e., learning predominantly from experience. Silver and Sutton (2025) emphasize continual, agent-generated data streams and long-horizon decision-making as the route beyond imitation of human corpora. This motivates algorithmic mechanisms that convert failures into a usable learning signal rather than relying on rare successes. In classic reinforcement learning, Andrychowicz et al. (2018) address sparse rewards by relabeling goals so that failed trajectories can still provide informative updates, substantially improving sample efficiency in goal-conditioned tasks. In the LLM-agent setting, Zhang et al. (2025) similarly target the gap between imitation and full RL by training agents on their own interaction traces even when explicit rewards are unavailable, using the agent's generated future states as supervision and including self-reflection as a way to learn from suboptimal actions. Meanwhile, inference-time reflection methods demonstrate that LLMs can critique and revise their own outputs to improve success (Zelikman et al., 2022; Madaan et al., 2023; Shinn et al., 2023), but typically require reflection or memory at deployment. Concurrent research explores integrating feedback-conditioned improvement directly into training: Hübotter et al. (2026) and Song et al. (2026) formalize RL with textual feedback by distilling a feedback-conditioned teacher policy into a student policy. ERL is aligned with this direction but emphasizes explicit self-reflection as an intermediate reasoning step embedded inside the RL trajectory, where an initial attempt is followed by reflection and a refined retry.
Coupled with selective internalization and cross-episode memory, this design treats reflection as a structured credit-assignment mechanism that transforms raw experience into durable behavioral improvement without requiring reflection at inference time.
## 6 Conclusion
In this work, we presented Experiential Reinforcement Learning (ERL), a training paradigm that incorporates an explicit experience–reflection–consolidation stage into the reinforcement learning loop to convert environment feedback into structured behavioral correction. By pairing reflection-guided revision with selective internalization, ERL enables models to learn corrective strategies during training and consolidate them into a deployable policy that operates without reflection at inference time. Across sparse-reward control and agentic reasoning tasks, ERL improves learning efficiency, stabilizes optimization, and produces stronger final policies relative to standard reinforcement learning baselines. These results demonstrate that embedding structured experiential revision directly into the training process provides an effective mechanism for translating feedback into durable behavioral improvement. Looking forward, this work suggests a path toward reinforcement learning systems that are fundamentally grounded in experience, where explicit reflection and consolidation become core primitives for building agents that continually learn, adapt, and improve from their own interactions.
## Acknowledgements
The authors thank the members of the LIME Lab and Microsoft Office of Applied Research for their helpful discussions, feedback, and resources.
## References
- R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In International Conference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024, pp. 21246–21263. External Links: Link Cited by: Appendix A.
- M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2018) Hindsight experience replay. External Links: 1707.01495, Link Cited by: §5.
- Y. Bai, Y. Bao, Y. Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, J. Cui, H. Ding, M. Dong, A. Du, C. Du, D. Du, Y. Du, Y. Fan, Y. Feng, K. Fu, B. Gao, C. Gao, H. Gao, P. Gao, T. Gao, Y. Ge, S. Geng, Q. Gu, X. Gu, L. Guan, H. Guo, J. Guo, X. Hao, T. He, W. He, W. He, Y. He, C. Hong, H. Hu, Y. Hu, Z. Hu, W. Huang, Z. Huang, Z. Huang, T. Jiang, Z. Jiang, X. Jin, Y. Kang, G. Lai, C. Li, F. Li, H. Li, M. Li, W. Li, Y. Li, Y. Li, Y. Li, Z. Li, Z. Li, H. Lin, X. Lin, Z. Lin, C. Liu, C. Liu, H. Liu, J. Liu, J. Liu, L. Liu, S. Liu, T. Y. Liu, T. Liu, W. Liu, Y. Liu, Y. Liu, Y. Liu, Y. Liu, Z. Liu, E. Lu, H. Lu, L. Lu, Y. Luo, S. Ma, X. Ma, Y. Ma, S. Mao, J. Mei, X. Men, Y. Miao, S. Pan, Y. Peng, R. Qin, Z. Qin, B. Qu, Z. Shang, L. Shi, S. Shi, F. Song, J. Su, Z. Su, L. Sui, X. Sun, F. Sung, Y. Tai, H. Tang, J. Tao, Q. Teng, C. Tian, C. Wang, D. Wang, F. Wang, H. Wang, H. Wang, J. Wang, J. Wang, J. Wang, S. Wang, S. Wang, S. Wang, X. Wang, Y. Wang, Y. Wang, Y. Wang, Y. Wang, Y. Wang, Z. Wang, Z. Wang, Z. Wang, Z. Wang, C. Wei, Q. Wei, H. Wu, W. Wu, X. Wu, Y. Wu, C. Xiao, J. Xie, X. Xie, W. Xiong, B. Xu, J. Xu, L. H. Xu, L. Xu, S. Xu, W. Xu, X. Xu, Y. Xu, Z. Xu, J. Xu, J. Xu, J. Yan, Y. Yan, H. Yang, X. Yang, Y. Yang, Y. Yang, Z. Yang, Z. Yang, Z. Yang, H. Yao, X. Yao, W. Ye, Z. Ye, B. Yin, L. Yu, E. Yuan, H. Yuan, M. Yuan, S. Yuan, H. Zhan, D. Zhang, H. Zhang, W. Zhang, X. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Z. Zhang, H. Zhao, Y. Zhao, Z. Zhao, H. Zheng, S. Zheng, L. Zhong, J. Zhou, X. Zhou, Z. Zhou, J. Zhu, Z. Zhu, W. Zhuang, and X. Zu (2026) Kimi k2: open agentic intelligence. External Links: 2507.20534, Link Cited by: §1, §5.
- P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2023) Deep reinforcement learning from human preferences. External Links: 1706.03741, Link Cited by: §5.
- T. Dao (2024) FlashAttention-2: faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), Cited by: Appendix C.
- M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou (2024) The faiss library. External Links: 2401.08281 Cited by: §B.3.
- L. Guertler, B. Cheng, S. Yu, B. Liu, L. Choshen, and C. Tan (2025) TextArena. External Links: 2504.11442, Link Cited by: §B.1, §B.2, §3.1.
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. Guo, J. Li, J. Chen, J. Yuan, J. Tu, J. Qiu, J. Li, J. L. Cai, J. Ni, J. Liang, J. Chen, K. Dong, K. Hu, K. You, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Zhao, L. Wang, L. Zhang, L. Xu, L. Xia, M. Zhang, M. Zhang, M. Tang, M. Zhou, M. Li, M. Wang, M. Li, N. Tian, P. Huang, P. Zhang, Q. Wang, Q. Chen, Q. Du, R. Ge, R. Zhang, R. Pan, R. Wang, R. J. Chen, R. L. Jin, R. Chen, S. Lu, S. Zhou, S. Chen, S. Ye, S. Wang, S. Yu, S. Zhou, S. Pan, S. S. Li, S. Zhou, S. Wu, T. Yun, T. Pei, T. Sun, T. Wang, W. Zeng, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, W. L. Xiao, W. An, X. Liu, X. Wang, X. Chen, X. Nie, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yang, X. Li, X. Su, X. Lin, X. Q. Li, X. Jin, X. Shen, X. Chen, X. Sun, X. Wang, X. Song, X. Zhou, X. Wang, X. Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. Zhang, Y. Xu, Y. Li, Y. Zhao, Y. Sun, Y. Wang, Y. Yu, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Ou, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Xiong, Y. Luo, Y. You, Y. Liu, Y. Zhou, Y. X. Zhu, Y. Huang, Y. Li, Y. Zheng, Y. Zhu, Y. Ma, Y. Tang, Y. Zha, Y. Yan, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Xie, Z. Zhang, Z. Hao, Z. Ma, Z. Yan, Z. Wu, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Pan, Z. Huang, Z. Xu, Z. Zhang, and Z. Zhang (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645 (8081), pp. 633â638. External Links: ISSN 1476-4687, Link, Document Cited by: §5.
- B. Jiang, T. Shi, R. Kamoi, Y. Yuan, C. J. Taylor, L. Yang, P. Zhou, and S. Chen (2026) One model, all roles: multi-turn, multi-agent self-play reinforcement learning for conversational social intelligence. External Links: 2602.03109, Link Cited by: §5.
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. O. Arik, D. Wang, H. Zamani, and J. Han (2025) Search-r1: training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, External Links: Link Cited by: §B.3, §3.1, §5.
- D. A. Kolb (2014) Experiential learning: experience as the source of learning and development. 2nd edition, FT Press, Upper Saddle River, NJ. External Links: ISBN 9780133892505 Cited by: §1.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, Cited by: Appendix C.
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. External Links: 2303.17651, Link Cited by: §5.
- T. Olmo, :, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, J. Morrison, J. Poznanski, K. Lo, L. Soldaini, M. Jordan, M. Chen, M. Noukhovitch, N. Lambert, P. Walsh, P. Dasigi, R. Berry, S. Malik, S. Shah, S. Geng, S. Arora, S. Gupta, T. Anderson, T. Xiao, T. Murray, T. Romero, V. Graf, A. Asai, A. Bhagia, A. Wettig, A. Liu, A. Rangapur, C. Anastasiades, C. Huang, D. Schwenk, H. Trivedi, I. Magnusson, J. Lochner, J. Liu, L. J. V. Miranda, M. Sap, M. Morgan, M. Schmitz, M. Guerquin, M. Wilson, R. Huff, R. L. Bras, R. Xin, R. Shao, S. Skjonsberg, S. Z. Shen, S. S. Li, T. Wilde, V. Pyatkin, W. Merrill, Y. Chang, Y. Gu, Z. Zeng, A. Sabharwal, L. Zettlemoyer, P. W. Koh, A. Farhadi, N. A. Smith, and H. Hajishirzi (2025) Olmo 3. External Links: 2512.13961, Link Cited by: §3.2.
- OpenAI, :, A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, A. Iftimie, A. Karpenko, A. T. Passos, A. Neitz, A. Prokofiev, A. Wei, A. Tam, A. Bennett, A. Kumar, A. Saraiva, A. Vallone, A. Duberstein, A. Kondrich, A. Mishchenko, A. Applebaum, A. Jiang, A. Nair, B. Zoph, B. Ghorbani, B. Rossen, B. Sokolowsky, B. Barak, B. McGrew, B. Minaiev, B. Hao, B. Baker, B. Houghton, B. McKinzie, B. Eastman, C. Lugaresi, C. Bassin, C. Hudson, C. M. Li, C. de Bourcy, C. Voss, C. Shen, C. Zhang, C. Koch, C. Orsinger, C. Hesse, C. Fischer, C. Chan, D. Roberts, D. Kappler, D. Levy, D. Selsam, D. Dohan, D. Farhi, D. Mely, D. Robinson, D. Tsipras, D. Li, D. Oprica, E. Freeman, E. Zhang, E. Wong, E. Proehl, E. Cheung, E. Mitchell, E. Wallace, E. Ritter, E. Mays, F. Wang, F. P. Such, F. Raso, F. Leoni, F. Tsimpourlas, F. Song, F. von Lohmann, F. Sulit, G. Salmon, G. Parascandolo, G. Chabot, G. Zhao, G. Brockman, G. Leclerc, H. Salman, H. Bao, H. Sheng, H. Andrin, H. Bagherinezhad, H. Ren, H. Lightman, H. W. Chung, I. Kivlichan, I. OâConnell, I. Osband, I. C. Gilaberte, I. Akkaya, I. Kostrikov, I. Sutskever, I. Kofman, J. Pachocki, J. Lennon, J. Wei, J. Harb, J. Twore, J. Feng, J. Yu, J. Weng, J. Tang, J. Yu, J. Q. Candela, J. Palermo, J. Parish, J. Heidecke, J. Hallman, J. Rizzo, J. Gordon, J. Uesato, J. Ward, J. Huizinga, J. Wang, K. Chen, K. Xiao, K. Singhal, K. Nguyen, K. Cobbe, K. Shi, K. Wood, K. Rimbach, K. Gu-Lemberg, K. Liu, K. Lu, K. Stone, K. Yu, L. Ahmad, L. Yang, L. Liu, L. Maksin, L. Ho, L. Fedus, L. Weng, L. Li, L. McCallum, L. Held, L. Kuhn, L. Kondraciuk, L. Kaiser, L. Metz, M. Boyd, M. Trebacz, M. Joglekar, M. Chen, M. Tintor, M. Meyer, M. Jones, M. Kaufer, M. Schwarzer, M. Shah, M. Yatbaz, M. Y. Guan, M. Xu, M. Yan, M. Glaese, M. Chen, M. Lampe, M. Malek, M. Wang, M. Fradin, M. McClay, M. Pavlov, M. Wang, M. Wang, M. Murati, M. Bavarian, M. Rohaninejad, N. McAleese, N. Chowdhury, N. Chowdhury, N. Ryder, N. 
Tezak, N. Brown, O. Nachum, O. Boiko, O. Murk, O. Watkins, P. Chao, P. Ashbourne, P. Izmailov, P. Zhokhov, R. Dias, R. Arora, R. Lin, R. G. Lopes, R. Gaon, R. Miyara, R. Leike, R. Hwang, R. Garg, R. Brown, R. James, R. Shu, R. Cheu, R. Greene, S. Jain, S. Altman, S. Toizer, S. Toyer, S. Miserendino, S. Agarwal, S. Hernandez, S. Baker, S. McKinney, S. Yan, S. Zhao, S. Hu, S. Santurkar, S. R. Chaudhuri, S. Zhang, S. Fu, S. Papay, S. Lin, S. Balaji, S. Sanjeev, S. Sidor, T. Broda, A. Clark, T. Wang, T. Gordon, T. Sanders, T. Patwardhan, T. Sottiaux, T. Degry, T. Dimson, T. Zheng, T. Garipov, T. Stasi, T. Bansal, T. Creech, T. Peterson, T. Eloundou, V. Qi, V. Kosaraju, V. Monaco, V. Pong, V. Fomenko, W. Zheng, W. Zhou, W. McCabe, W. Zaremba, Y. Dubois, Y. Lu, Y. Chen, Y. Cha, Y. Bai, Y. He, Y. Zhang, Y. Wang, Z. Shao, and Z. Li (2024) OpenAI o1 system card. External Links: 2412.16720, Link Cited by: §5.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. External Links: 2203.02155, Link Cited by: §5.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: 2402.03300, Link Cited by: Appendix C, §3.2.
- T. Shi, K. Chen, and J. Zhao (2024) Safer-instruct: aligning language models with automated preference data. External Links: 2311.08685, Link Cited by: §5.
- T. Shi, Z. Wang, L. Yang, Y. Lin, Z. He, M. Wan, P. Zhou, S. Jauhar, S. Chen, S. Xia, H. Zhang, J. Zhao, X. Xu, X. Song, and J. Neville (2025) WildFeedback: aligning llms with in-situ user interactions and feedback. External Links: 2408.15549, Link Cited by: §5.
- T. Shi, Y. Wu, L. Song, T. Zhou, and J. Zhao (2026) Efficient reinforcement finetuning via adaptive curriculum learning. External Links: 2504.05520, Link Cited by: §1, §5.
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. External Links: 2303.11366, Link Cited by: §5.
- D. Silver and R. S. Sutton (2025) Welcome to the era of experience. External Links: Link Cited by: §5.
- A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, A. Nathan, A. Luo, A. Helyar, A. Madry, A. Efremov, A. Spyra, A. Baker-Whitcomb, A. Beutel, A. Karpenko, A. Makelov, A. Neitz, A. Wei, A. Barr, A. Kirchmeyer, A. Ivanov, A. Christakis, A. Gillespie, A. Tam, A. Bennett, A. Wan, A. Huang, A. M. Sandjideh, A. Yang, A. Kumar, A. Saraiva, A. Vallone, A. Gheorghe, A. G. Garcia, A. Braunstein, A. Liu, A. Schmidt, A. Mereskin, A. Mishchenko, A. Applebaum, A. Rogerson, A. Rajan, A. Wei, A. Kotha, A. Srivastava, A. Agrawal, A. Vijayvergiya, A. Tyra, A. Nair, A. Nayak, B. Eggers, B. Ji, B. Hoover, B. Chen, B. Chen, B. Barak, B. Minaiev, B. Hao, B. Baker, B. Lightcap, B. McKinzie, B. Wang, B. Quinn, B. Fioca, B. Hsu, B. Yang, B. Yu, B. Zhang, B. Brenner, C. R. Zetino, C. Raymond, C. Lugaresi, C. Paz, C. Hudson, C. Whitney, C. Li, C. Chen, C. Cole, C. Voss, C. Ding, C. Shen, C. Huang, C. Colby, C. Hallacy, C. Koch, C. Lu, C. Kaplan, C. Kim, C. Minott-Henriques, C. Frey, C. Yu, C. Czarnecki, C. Reid, C. Wei, C. Decareaux, C. Scheau, C. Zhang, C. Forbes, D. Tang, D. Goldberg, D. Roberts, D. Palmie, D. Kappler, D. Levine, D. Wright, D. Leo, D. Lin, D. Robinson, D. Grabb, D. Chen, D. Lim, D. Salama, D. Bhattacharjee, D. Tsipras, D. Li, D. Yu, D. Strouse, D. Williams, D. Hunn, E. Bayes, E. Arbus, E. Akyurek, E. Y. Le, E. Widmann, E. Yani, E. Proehl, E. Sert, E. Cheung, E. Schwartz, E. Han, E. Jiang, E. Mitchell, E. Sigler, E. Wallace, E. Ritter, E. Kavanaugh, E. Mays, E. Nikishin, F. Li, F. P. Such, F. de Avila Belbute Peres, F. Raso, F. Bekerman, F. Tsimpourlas, F. Chantzis, F. Song, F. Zhang, G. Raila, G. McGrath, G. Briggs, G. Yang, G. Parascandolo, G. Chabot, G. Kim, G. Zhao, G. Valiant, G. Leclerc, H. Salman, H. Wang, H. Sheng, H. Jiang, H. Wang, H. Jin, H. Sikchi, H. Schmidt, H. Aspegren, H. Chen, H. Qiu, H. Lightman, I. Covert, I. Kivlichan, I. Silber, I. Sohl, I. Hammoud, I. Clavera, I. Lan, I. Akkaya, I. 
Kostrikov, I. Kofman, I. Etinger, I. Singal, J. Hehir, J. Huh, J. Pan, J. Wilczynski, J. Pachocki, J. Lee, J. Quinn, J. Kiros, J. Kalra, J. Samaroo, J. Wang, J. Wolfe, J. Chen, J. Wang, J. Harb, J. Han, J. Wang, J. Zhao, J. Chen, J. Yang, J. Tworek, J. Chand, J. Landon, J. Liang, J. Lin, J. Liu, J. Wang, J. Tang, J. Yin, J. Jang, J. Morris, J. Flynn, J. Ferstad, J. Heidecke, J. Fishbein, J. Hallman, J. Grant, J. Chien, J. Gordon, J. Park, J. Liss, J. Kraaijeveld, J. Guay, J. Mo, J. Lawson, J. McGrath, J. Vendrow, J. Jiao, J. Lee, J. Steele, J. Wang, J. Mao, K. Chen, K. Hayashi, K. Xiao, K. Salahi, K. Wu, K. Sekhri, K. Sharma, K. Singhal, K. Li, K. Nguyen, K. Gu-Lemberg, K. King, K. Liu, K. Stone, K. Yu, K. Ying, K. Georgiev, K. Lim, K. Tirumala, K. Miller, L. Ahmad, L. Lv, L. Clare, L. Fauconnet, L. Itow, L. Yang, L. Romaniuk, L. Anise, L. Byron, L. Pathak, L. Maksin, L. Lo, L. Ho, L. Jing, L. Wu, L. Xiong, L. Mamitsuka, L. Yang, L. McCallum, L. Held, L. Bourgeois, L. Engstrom, L. Kuhn, L. Feuvrier, L. Zhang, L. Switzer, L. Kondraciuk, L. Kaiser, M. Joglekar, M. Singh, M. Shah, M. Stratta, M. Williams, M. Chen, M. Sun, M. Cayton, M. Li, M. Zhang, M. Aljubeh, M. Nichols, M. Haines, M. Schwarzer, M. Gupta, M. Shah, M. Huang, M. Dong, M. Wang, M. Glaese, M. Carroll, M. Lampe, M. Malek, M. Sharman, M. Zhang, M. Wang, M. Pokrass, M. Florian, M. Pavlov, M. Wang, M. Chen, M. Wang, M. Feng, M. Bavarian, M. Lin, M. Abdool, M. Rohaninejad, N. Soto, N. Staudacher, N. LaFontaine, N. Marwell, N. Liu, N. Preston, N. Turley, N. Ansman, N. Blades, N. Pancha, N. Mikhaylin, N. Felix, N. Handa, N. Rai, N. Keskar, N. Brown, O. Nachum, O. Boiko, O. Murk, O. Watkins, O. Gleeson, P. Mishkin, P. Lesiewicz, P. Baltescu, P. Belov, P. Zhokhov, P. Pronin, P. Guo, P. Thacker, Q. Liu, Q. Yuan, Q. Liu, R. Dias, R. Puckett, R. Arora, R. T. Mullapudi, R. Gaon, R. Miyara, R. Song, R. Aggarwal, R. Marsan, R. Yemiru, R. Xiong, R. Kshirsagar, R. Nuttall, R. Tsiupa, R. Eldan, R. Wang, R. James, R. 
Ziv, R. Shu, R. Nigmatullin, S. Jain, S. Talaie, S. Altman, S. Arnesen, S. Toizer, S. Toyer, S. Miserendino, S. Agarwal, S. Yoo, S. Heon, S. Ethersmith, S. Grove, S. Taylor, S. Bubeck, S. Banesiu, S. Amdo, S. Zhao, S. Wu, S. Santurkar, S. Zhao, S. R. Chaudhuri, S. Krishnaswamy, Shuaiqi, Xia, S. Cheng, S. Anadkat, S. P. Fishman, S. Tobin, S. Fu, S. Jain, S. Mei, S. Egoian, S. Kim, S. Golden, S. Mah, S. Lin, S. Imm, S. Sharpe, S. Yadlowsky, S. Choudhry, S. Eum, S. Sanjeev, T. Khan, T. Stramer, T. Wang, T. Xin, T. Gogineni, T. Christianson, T. Sanders, T. Patwardhan, T. Degry, T. Shadwell, T. Fu, T. Gao, T. Garipov, T. Sriskandarajah, T. Sherbakov, T. Kaftan, T. Hiratsuka, T. Wang, T. Song, T. Zhao, T. Peterson, V. Kharitonov, V. Chernova, V. Kosaraju, V. Kuo, V. Pong, V. Verma, V. Petrov, W. Jiang, W. Zhang, W. Zhou, W. Xie, W. Zhan, W. McCabe, W. DePue, W. Ellsworth, W. Bain, W. Thompson, X. Chen, X. Qi, X. Xiang, X. Shi, Y. Dubois, Y. Yu, Y. Khakbaz, Y. Wu, Y. Qian, Y. T. Lee, Y. Chen, Y. Zhang, Y. Xiong, Y. Tian, Y. Cha, Y. Bai, Y. Yang, Y. Yuan, Y. Li, Y. Zhang, Y. Yang, Y. Jin, Y. Jiang, Y. Wang, Y. Wang, Y. Liu, Z. Stubenvoll, Z. Dou, Z. Wu, and Z. Wang (2025) OpenAI gpt-5 system card. External Links: 2601.03267, Link Cited by: §1.
- L. Song, Y. Dai, V. Prabhu, J. Zhang, T. Shi, L. Li, J. Li, S. Savarese, Z. Chen, J. Zhao, R. Xu, and C. Xiong (2025a) CoAct-1: computer-using agents with coding as actions. External Links: 2508.03923, Link Cited by: §1.
- L. Song, T. Shi, and J. Zhao (2025b) The hallucination tax of reinforcement finetuning. External Links: 2505.13988, Link Cited by: §5.
- Y. Song, L. Chen, F. Tajwar, R. Munos, D. Pathak, J. A. Bagnell, A. Singh, and A. Zanette (2026) Expanding the capabilities of reinforcement learning via text feedback. External Links: 2602.02482, Link Cited by: Appendix A, §5.
- S. Tan, M. Luo, C. Cai, T. Venkat, K. Montgomery, A. Hao, T. Wu, A. Balyan, M. Roongta, C. Wang, L. E. Li, R. A. Popa, and I. Stoica (2025) RLLM: a framework for post-training language agents. Note: https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31 Notion Blog Cited by: Appendix C.
- S. Wang, Y. Wu, and Z. Xu (2025) Cogito, ergo ludo: an agent that learns to play by reasoning and planning. External Links: 2509.25052, Link Cited by: §B.1, §B.2, §3.1.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. External Links: 2505.09388, Link Cited by: §1, §3.2.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), Brussels, Belgium, pp. 2369–2380. External Links: Link, Document Cited by: §3.1.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. External Links: 2210.03629, Link Cited by: §5.
- E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) STar: bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.), External Links: Link Cited by: §5.
- K. Zhang, X. Chen, B. Liu, T. Xue, Z. Liao, Z. Liu, X. Wang, Y. Ning, Z. Chen, X. Fu, J. Xie, Y. Sun, B. Gou, Q. Qi, Z. Meng, J. Yang, N. Zhang, X. Li, A. Shah, D. Huynh, H. Li, Z. Yang, S. Cao, L. Jang, S. Zhou, J. Zhu, H. Sun, J. Weston, Y. Su, and Y. Wu (2025) Agent learning via early experience. External Links: 2510.08558, Link Cited by: §1, §5.
## Appendix A Full Algorithm and Gated Reflection
#### Gated Reflection.
Algorithm 2 presents the full version of ERL used in our experiments. Compared to the simplified version in Algorithm 1, the key difference is a gating mechanism on the second attempt: reflection and refinement are triggered only when the first-attempt reward satisfies $r^{(1)}<\tau$ , where $\tau=1$ . In other words, reflection is applied only to failed or suboptimal trajectories. In early experiments, we applied reflection to all trajectories, including successful ones, but this led to unstable training and reduced generalization. First, reflecting on already successful attempts encouraged reward hacking: the model sometimes generated instance-specific shortcuts that guaranteed success for the current sample but did not generalize to future episodes. Second, early in training when first-attempt rewards are typically low, the optimization signal became dominated by the second attempt and reflection, which are inherently off-policy relative to the base policy. This imbalance weakened the on-policy learning signal and destabilized the policy. The gating mechanism mitigates these issues by ensuring that successful trajectories remain purely on-policy, while reflection is reserved for corrective revision on failed attempts. This design also aligns training with deployment: at inference time, the model must generate $y\sim\pi_{\theta}(\cdot\mid x)$ without access to reflection $\Delta$ or feedback signals. By restricting reflection to corrective cases and preserving sufficient on-policy updates in every batch, the gating mechanism improves stability in training.
#### Memory Extensions.
Algorithm 2 also maintains a simple reflection memory that stores successful reflections as a plain-text system prompt. A natural extension is to replace this mechanism with a more sophisticated agentic memory system. For example, before the reflection step (Alg. 2, Line 12), the model may retrieve relevant past reflections from a memory base conditioned on the current input $x$, and after a successful refinement, update the memory using a structured agentic memory update rule rather than direct overwrite. Such retrieval-and-update schemes would allow ERL to scale to more diverse and long-horizon tasks by enabling selective reuse and continual refinement of past corrective knowledge.
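As an illustration of such a retrieval-and-update scheme, the sketch below replaces direct overwrite with a small append-and-retrieve store. The class, its capacity eviction, and the bag-of-words scoring are our own hypothetical stand-ins for a dense agentic memory; they are not part of the paper's implementation.

```python
from collections import Counter

class ReflectionMemory:
    """Minimal retrieval-and-update reflection memory (illustrative sketch)."""

    def __init__(self, capacity=100):
        self.entries = []          # list of (task_text, reflection_text)
        self.capacity = capacity

    @staticmethod
    def _score(task, query):
        # Bag-of-words overlap as a stand-in for dense-retrieval similarity.
        wa, wb = Counter(task.lower().split()), Counter(query.lower().split())
        return sum((wa & wb).values())

    def retrieve(self, x, top_k=3):
        # Rank stored reflections by similarity of their task to the input x.
        ranked = sorted(self.entries, key=lambda e: self._score(e[0], x), reverse=True)
        return [refl for _, refl in ranked[:top_k]]

    def update(self, x, reflection):
        # Append rather than overwrite; evict the oldest entry when full.
        self.entries.append((x, reflection))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)
```

Retrieval would feed the top-ranked reflections into the reflection prompt, while `update` is called only after a successful refinement, mirroring the storage gate in Algorithm 2.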
#### On-Policy Distillation.
The internalization step in Algorithm 2 can also be generalized beyond supervised distillation. Instead of training $\pi_{\theta}$ to reproduce $y^{(2)}$ from $x$ using a standard distillation loss, one may adopt an on-policy reverse KL objective. Let the contextual policy with access to reflection and memory be $\pi_{\theta}(\cdot\mid x,\Delta)$ , and the deployment policy be $\pi_{\theta}(\cdot\mid x)$ . An on-policy distillation objective can be written as
$$
\mathcal{L}_{\text{OD}}(\theta):=\mathbb{E}_{x\sim\mathcal{D}}\Big[\mathbb{I}\!\left(r^{(2)}>0\right)\,\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\big[\mathrm{KL}\!\left(\pi_{\theta}(\cdot\mid x,\Delta)\;\|\;\pi_{\theta}(\cdot\mid x)\right)\big]\Big], \tag{2}
$$
which encourages the deployment policy to match the richer contextual policy while remaining on-policy with respect to $\pi_{\theta}(\cdot\mid x)$. This connects ERL to recent reverse-KL and on-policy distillation approaches (Agarwal et al., 2024; Hübotter et al., 2026; Song et al., 2026) and provides a principled alternative to supervised internalization.
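A per-step Monte Carlo version of this objective can be sketched as follows. The success gate and KL direction follow Eq. (2), evaluated at prefixes drawn on-policy; the function name and the plain-Python softmax are illustrative assumptions, and a real implementation would operate on sampled trajectories inside an autodiff framework.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single logit vector.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl_distill_loss(deploy_logits, context_logits, r2):
    """Gated objective of Eq. (2), summed over decoding steps.

    deploy_logits / context_logits: per-step logit vectors for the deployment
    policy pi_theta(.|x) and the contextual policy pi_theta(.|x, Delta),
    evaluated at the same on-policy prefixes (hypothetical interface).
    """
    if r2 <= 0:          # indicator I(r2 > 0): distill only on success
        return 0.0
    loss = 0.0
    for dl, cl in zip(deploy_logits, context_logits):
        p = softmax(dl)  # deployment policy pi_theta(.|x)
        q = softmax(cl)  # contextual policy pi_theta(.|x, Delta)
        # KL(contextual || deployment) at this step, as in Eq. (2).
        loss += sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return loss
```

When the two policies agree at every step the loss vanishes, so gradient updates only move the deployment policy toward behavior the contextual policy actually changed.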
Algorithm 2 Reinforcement Learning from Self-Reflection (Full Version)
1: Inputs: Language model $\pi_{\theta}$ ; dataset of questions $x$ ; environment returning feedback $f$ and reward $r$ , reward threshold $\tau$ .
2: Initialize: reflection memory $m\leftarrow\emptyset$ .
3: repeat
4: Sample question $x$ from dataset.
5: // First attempt
6: Sample answer $y^{(1)}\sim\pi_{\theta}(\cdot\mid x)$ .
7: Obtain feedback and reward $(f^{(1)},r^{(1)})$ .
8: // RL update on the first attempt
9: Update $\theta$ via $\mathcal{L}_{\text{policy}}(\theta)$ over the first attempt.
10: // Gated second attempt
11: if $r^{(1)}<\tau$ then
12: // Reflection with cross-episode memory
13: Sample reflection $\Delta\sim\pi_{\theta}(\cdot\mid x,y^{(1)},f^{(1)},r^{(1)},m).$
14: Sample refined answer $y^{(2)}\sim\pi_{\theta}(\cdot\mid x,\Delta).$
15: Obtain feedback and reward $(f^{(2)},r^{(2)})$ .
16: Set reflection reward $\tilde{r}\leftarrow r^{(2)}$ .
17: // Store reflection only if the refinement reaches the success threshold
18: if $r^{(2)}\geq\tau$ then
19: Store reflection: $m\leftarrow\Delta$ .
20: end if
21: // RL update on the second attempt
22: Update $\theta$ via $\mathcal{L}_{\text{policy}}(\theta)$ over reflection and second attempt.
23: // Internalization
24: Update $\theta$ via $\mathcal{L}_{\text{distill}}(\theta)$ to internalize reflection, training $\pi_{\theta}$ to produce $y^{(2)}$ from $x$ only.
25: end if
26: until converged
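The gated loop above can be condensed into a short Python sketch. Here `policy` and `env` are hypothetical interfaces standing in for the model and environment calls in the pseudocode (`sample`, `feedback`, `rl_update`, `distill_update` are our names, not the authors' implementation).

```python
TAU = 1.0  # reward threshold: reflect only when the first attempt falls short

def erl_step(policy, env, memory, tau=TAU):
    """One iteration of the gated ERL loop (Algorithm 2), illustrative only."""
    x = env.sample_task()

    # First attempt: always on-policy, always used for the RL update.
    y1 = policy.sample(x)
    f1, r1 = env.feedback(x, y1)
    policy.rl_update(x, y1, r1)

    # Gated second attempt: reflect only on failed or suboptimal attempts.
    if r1 < tau:
        delta = policy.reflect(x, y1, f1, r1, memory)
        y2 = policy.sample(x, reflection=delta)
        f2, r2 = env.feedback(x, y2)
        if r2 >= tau:            # store reflection only if refinement succeeded
            memory = delta       # direct overwrite, as in Algorithm 2
        policy.rl_update(x, y2, r2, reflection=delta)
        policy.distill_update(x, y2)  # internalize: train to emit y2 from x alone
    return memory
```

Successful first attempts skip the entire second block, which is exactly how the gate keeps part of every batch purely on-policy.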
## Appendix B Environment Configuration Details
### B.1 Frozen Lake
Frozen Lake is a grid-based navigation environment in which an agent must move from a start location to a goal location on an $n\times n$ grid. We configure our FrozenLake environment following a setup similar to those used in TextArena (Guertler et al., 2025) and Wang et al. (2025). The grid size $n$ is sampled uniformly from $[2,9]$. For each instance, the start and goal tiles are randomly selected as distinct positions. The grid layout is generated procedurally to ensure that at least one valid path exists between the start and goal.
Each non-goal tile is assigned as either a safe frozen tile or a hole according to a frozen-tile probability parameter $p$, sampled uniformly from $[0.6,0.85)$. Holes represent terminal failure states, while frozen tiles are traversable. Transitions are deterministic: the agent's chosen action directly determines its next grid position, subject to boundary constraints.
At every step, the agent observes a full textual representation of the grid. To reduce the influence of pretrained symbolic priors, we employ abstract symbols rather than semantically meaningful markers. The default encoding is:
$$
\texttt{A}=\text{agent position},\quad\texttt{B}=\text{goal tile},\quad\texttt{C}=\text{hole},\quad\texttt{D}=\text{safe frozen tile}.
$$
This representation encourages the model to infer environment dynamics through interaction rather than relying on prior associations.
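The procedural generation described above (random distinct start/goal tiles, holes assigned with probability $1-p$, and resampling until a path exists) can be sketched as follows. Function names are ours, and any detail beyond the text, such as the resample-until-solvable loop structure, is an assumption.

```python
import random
from collections import deque

def _reachable(grid, start, goal):
    # BFS over traversable tiles (everything except holes 'C').
    n = len(grid)
    seen, q = {start}, deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and (nr, nc) not in seen and grid[nr][nc] != 'C':
                seen.add((nr, nc))
                q.append((nr, nc))
    return False

def gen_frozenlake(n=None, seed=None):
    """One FrozenLake instance in the spirit of Appendix B.1 (illustrative)."""
    rng = random.Random(seed)
    n = n or rng.randint(2, 9)             # grid size n x n
    p = rng.uniform(0.6, 0.85)             # frozen-tile probability
    while True:
        cells = [(r, c) for r in range(n) for c in range(n)]
        start, goal = rng.sample(cells, 2)  # distinct start and goal tiles
        grid = [['D' if rng.random() < p else 'C' for _ in range(n)]
                for _ in range(n)]
        grid[start[0]][start[1]] = 'A'      # agent position
        grid[goal[0]][goal[1]] = 'B'        # goal tile
        if _reachable(grid, start, goal):   # resample until a path exists
            return grid
```

Rejection sampling keeps the instance distribution simple while guaranteeing solvability, matching the "at least one valid path" requirement in the text.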
In addition to the textual presentation of the grid, the environment also appends structured textual feedback to the end of the interaction history after each action. This feedback communicates the outcome of the most recent transition and serves as the only explicit signal describing terminal or invalid events. The feedback messages are defined as follows:
- "The agent reached the goal" – issued when the agent successfully enters the goal tile. The episode terminates with reward $1.0$.
- "The agent fell into the hole" – issued when the agent enters a hole tile. The episode terminates with reward $0.0$.
- "Hit the max step limit" – issued when the agent exhausts the fixed step budget. The episode terminates with reward $0.0$.
- "No valid actions were recorded." – issued when the agent produces an invalid action or when the attempted action results in no state change, such as moving into a boundary. The episode continues unless the step budget is exhausted.
The default system prompt, self-reflection prompt, and example task are shown in Tables 2, 4, and 6.
The reward function is sparse. The agent receives a reward of $1.0$ if it reaches the goal tile and $0.0$ otherwise. Episodes terminate upon reaching the goal, entering a hole, or exhausting a fixed step budget of 8 actions.
For training, we generate 10,000 procedurally sampled instances. Evaluation is conducted on a disjoint set of 100 instances constructed using the same generation process.
### B.2 Sokoban
Sokoban is a grid-based box-pushing environment in which an agent must place all boxes onto designated goal tiles. We configure our Sokoban environment following a setup similar to those used in TextArena (Guertler et al., 2025) and Wang et al. (2025). Each instance is represented as an $n\times n$ grid, where $n$ is sampled uniformly from $[6,8]$ in our procedural generator. We construct single-box, single-goal layouts with border walls, and randomly sample interior positions for the goal, box, and player, subject to non-overlap constraints.
To control difficulty, each generated layout is accepted only if its shortest valid solution is at most 8 moves (computed by BFS over player–box states). This guarantees solvability while keeping episodes short-horizon. Train and test splits are disjoint at the layout level.
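The acceptance check can be sketched as a BFS over joint (player, box) states. This is an illustrative single-box version with hypothetical names, not the authors' code; `walls` is a set of blocked cells and positions are (row, col) tuples.

```python
from collections import deque

MOVES = {'Up': (-1, 0), 'Down': (1, 0), 'Left': (0, -1), 'Right': (0, 1)}

def shortest_solution(walls, player, box, goal, max_moves=8):
    """Minimum number of moves to push the box onto the goal, or None if it
    cannot be done within max_moves. BFS over (player, box) states."""
    start = (player, box)
    seen, q = {start}, deque([(start, 0)])
    while q:
        (pl, bx), d = q.popleft()
        if bx == goal:
            return d
        if d == max_moves:          # depth cutoff: acceptance threshold
            continue
        for dr, dc in MOVES.values():
            np_ = (pl[0] + dr, pl[1] + dc)
            if np_ in walls:
                continue
            nb = bx
            if np_ == bx:           # stepping into the box pushes it
                nb = (bx[0] + dr, bx[1] + dc)
                if nb in walls:     # blocked push: no state change
                    continue
            state = (np_, nb)
            if state not in seen:
                seen.add(state)
                q.append((state, d + 1))
    return None
```

A layout is accepted when `shortest_solution(...)` returns a value (i.e., the puzzle is solvable within the 8-move budget).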
At every step, the agent observes the full textual grid. As in FrozenLake, we use abstract symbols to reduce direct reliance on pretrained semantic priors. The default encoding is:
$$
\texttt{A}=\text{agent position},\quad\texttt{a}=\text{agent on goal},\quad\texttt{B}=\text{box},\quad\texttt{b}=\text{box on goal}.
$$
The action space is {Up, Down, Left, Right}. Moves are deterministic. The agent may push exactly one adjacent box only when the cell behind the box is free; it cannot pull boxes, move through walls, or move through boxes. Invalid moves produce no state change.
In addition to the grid observation, the interaction trace includes structured textual transition feedback after each action. The feedback messages are:
- "The agent solved the puzzle (all boxes on goals)." – issued when all boxes are on goal tiles. The episode terminates with reward $1.0$.
- "The agent moved or pushed a box; puzzle not solved yet." – issued when the action changes the state but the puzzle remains unsolved.
- "The agent did not move (likely hit a wall or tried to push into a blocked space)." – issued when the chosen move is ineffective (no state change).
- "Hit the max step limit" – issued when the fixed step budget is exhausted before solving. The episode terminates with reward $0.0$.
The default system prompt, self-reflection prompt, and example task are shown in Tables 2, 4, and 7.
The reward is sparse: $1.0$ if and only if all boxes are on goals, and $0.0$ otherwise. Episodes terminate on success or when the step budget is exhausted. In our generated Sokoban dataset, the per-instance step budget is 8.
For training, we generate 10,000 procedurally sampled instances. Evaluation is conducted on a disjoint set of 100 instances built with the same generation process.
### B.3 HotpotQA
HotpotQA is a multi-hop open-domain question answering task in which an agent must answer compositional questions by retrieving and synthesizing evidence across multiple documents. Each instance consists of a natural-language question and a reference answer.
Unlike grid-based control environments such as FrozenLake or Sokoban, HotpotQA does not expose an explicit environment state. Instead, the agent operates through a tool-augmented interaction loop in which it alternates between reasoning, retrieval, and answer generation. The agent may invoke an external retrieval tool and ultimately produce a final textual answer. The solver instruction requires that the final answer be formatted inside \boxed{} to enable reliable extraction.
The retrieval interface is defined as:
$$
\texttt{local\_search(query, top\_k)},
$$
which queries a local dense-retrieval server built over an indexed Wikipedia corpus and returns ranked text snippets relevant to the query. We use a Wikipedia corpus organized by PeterJinGo/wiki-18-corpus, with prebuilt dense indices from PeterJinGo/wiki-18-e5-index. Embeddings are generated using intfloat/e5-base-v2. Retrieval is powered by FAISS (Douze et al., 2024) with multi-GPU support. During each episode, the agent is allowed up to 5 interaction turns, which may include reasoning steps, tool calls, and final answer submission.
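A toy stand-in for this interface is sketched below. It mimics the `local_search(query, top_k)` signature but substitutes bag-of-words cosine similarity for the actual e5 embeddings and FAISS index, purely for illustration; the class name and in-memory corpus are our own assumptions.

```python
import math
from collections import Counter

class LocalSearch:
    """Toy stand-in for the FAISS-backed local_search(query, top_k) server."""

    def __init__(self, corpus):
        self.corpus = corpus
        # Sparse term-count vectors stand in for dense e5 embeddings.
        self.vecs = [Counter(doc.lower().split()) for doc in corpus]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def local_search(self, query, top_k=5):
        # Rank all documents by similarity to the query and return the top_k.
        q = Counter(query.lower().split())
        ranked = sorted(range(len(self.corpus)),
                        key=lambda i: self._cosine(q, self.vecs[i]),
                        reverse=True)
        return [self.corpus[i] for i in ranked[:top_k]]
```

In the actual setup the same ranked-snippet contract is served by a dense-retrieval server over the indexed Wikipedia corpus.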
Following the evaluation protocol of Search-R1 (Jin et al., 2025), the answer extracted from \boxed{} is normalized prior to scoring by lowercasing and whitespace canonicalization. Correctness is measured using token-level F1 against the ground-truth answer. The reward function assigns a score of 1.0 for exact matches, a proportional reward equal to the F1 score when the F1 is at least 0.3, and 0 otherwise.
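This reward scheme can be sketched directly; the normalization and the 0.3 threshold follow the text, while the function name is ours.

```python
from collections import Counter

def f1_reward(pred, gold, threshold=0.3):
    """1.0 for an exact normalized match, the token-level F1 when it is at
    least `threshold`, and 0.0 otherwise (illustrative sketch)."""
    norm = lambda s: " ".join(s.lower().split())  # lowercase + whitespace canon.
    pred, gold = norm(pred), norm(gold)
    if pred == gold:
        return 1.0
    pt, gt = pred.split(), gold.split()
    common = sum((Counter(pt) & Counter(gt)).values())  # shared token count
    if common == 0:
        return 0.0
    precision, recall = common / len(pt), common / len(gt)
    f1 = 2 * precision * recall / (precision + recall)
    return f1 if f1 >= threshold else 0.0
```

The threshold zeroes out weak partial matches, so the policy is rewarded only for answers that are exactly right or substantially overlapping.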
The default system prompt, self-reflection prompt, and example task are shown in Tables 3, 5, and 8.
Initial System Prompt for FrozenLake and Sokoban
You are an agent playing a game on a grid, acting as a reasoning engine. Your decisions are based on your current game rules (your best guess of how the game works) and your strategic playbook (your learned strategies). These may be incomplete or incorrect. Your only way to interact with the environment is by choosing your NEXT ACTION. ## Instructions 1. Analyze State: Summarize the current state. 2. Predict Long-term Value of Outcomes: Evaluate the strategic value and potential of the current state for the future. 3. Predict Immediate Consequences: For the top two candidate actions, predict their consequences using a "result-because" structure. 4. Select the Best Action: Choose the action leading to the most advantageous future state. ## Required response structure <reason> **1. Analysis of the Current State:** [Summary of the board state.] **2. Prediction of the Value of Current States:** [Assessment] - Value: High / Medium / Low **3. Prediction of Immediate Consequences:** [Top 2 candidate actions] </reason> Then output the NEXT ACTION inside triple backticks, e.g., ```Up```. Always remember: - Valid actions: Up, Down, Left, Right. - Think step by step, but make the final line only the next action wrapped in triple backticks.
Table 2: Initial system prompt used for FrozenLake and Sokoban.
Initial System Prompt for HotpotQA
You are a helpful assistant who answers questions directly and efficiently. Provide your final answer in \boxed{} format. ## Available tool [ { "type": "function", "function": { "name": "local_search", "description": "Search for information using a dense retrieval server with Wikipedia corpus", "parameters": { "type": "object", "properties": { "query": { "type": "string", "description": "Search query to retrieve relevant documents" }, "top_k": { "type": "integer", "description": "Number of results to return (default: 5)", "minimum": 1, "maximum": 50 } }, "required": ["query"] } } } ]
Table 3: Initial system prompt used for HotpotQA.
Self-reflection Prompt for Frozen Lake and Sokoban
You are a chief scientific strategist and master tactician. Your mission is to analyze extensive field data from numerous operations to distill and refine the Master Rulebook of a complex game. You will be presented with a large collection of highly successful trajectories and critical failure trajectories, collected over a long period. Your primary task is to perform a deep, comparative analysis to understand the fundamental principles of victory and defeat. Act as a grand strategist, identifying universal patterns and high-level causal relationships. Your goal is to synthesize these insights to produce the next generation's Master Rulebook, making it more robust, accurate, and effective. Core Principles: - Think Long-Term: focus on universal, strategic truths that hold across diverse scenarios. - Learn from Contrast: extract insights by comparing winners and losers. - Synthesize and Consolidate: produce a single unified theory. - Be Authoritative and Concise: state rules as definitive principles. Your output MUST be a single consolidated <prompt> block representing the new Master Rulebook: <prompt> <game_rules> **1. Symbol Meanings:** [...] **2. Information & Interpretation:** [...] **3. Gameplay & Actions:** [...] **4. Action Effects:** [...] **5. Game Objective & Termination:** [...] </game_rules> <strategy> **1. Core Strategies:** [...] **2. Tactical Tips:** [...] </strategy> </prompt>
Table 4: Self-reflection prompt used for Frozen Lake and Sokoban.
Self-reflection Prompt for HotpotQA
You are an expert prompt updater. You will analyze recent trajectories, tool calls, and rewards to improve the solver's system prompt. When failures occur, explicitly add rules that prevent repeating them (e.g., missing tool calls, hallucinated facts, or unboxed final answers). Keep the prompt short, actionable, and reusable. Output ONLY the improved system prompt wrapped in <prompt>...</prompt> tags.
Table 5: Self-reflection prompt used for HotpotQA.
Example Task for Frozen Lake
## {System Prompt} Current Observation (0): D D C D A D D C D C D D D D B D You have not achieved the goal yet. Please give the next action. ## Action space Up | Down | Left | Right ## Output requirement Return reasoning in <reason>...</reason> and final action in triple backticks, e.g., ```Right```.
Table 6: Example Frozen Lake task instance.
Example Task for Sokoban
## {System Prompt} Current Board (0): E E E E E E E A D B C E E D D D D E E E E E E E Puzzle not solved yet. Provide the next move. ## Action space Up | Down | Left | Right ## Output requirement Return reasoning in <reason>...</reason> and final action in triple backticks, e.g., ```Right```.
Table 7: Example Sokoban task instance.
Example Task for HotpotQA
## {System Prompt} Question: Which university did the author of "The Hobbit" attend?
Table 8: Example HotpotQA task instance.
Second-Attempt Prompt Template for the No-Reflection Variant
## {System Prompt} You are also provided with the model's past attempt data, including observations, actions, rewards, and feedback. Use this information as context to make a better next-attempt decision policy. Follow the action/output format exactly. {First Attempt's Trajectory}
Table 9: Generic second-attempt system prompt used in the no-reflection ablation. The model is provided with the full first-attempt trajectory (observations, actions, rewards, and feedback) together with a generic instruction encouraging improvement, without any structured reflection signal.
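The no-reflection ablation in Table 9 simply concatenates the system prompt, a fixed generic instruction, and the serialized first-attempt trajectory. A minimal sketch of that assembly is below; the function name `build_no_reflection_prompt` and the plain-string trajectory format are assumptions for illustration, not the paper's implementation.

```python
def build_no_reflection_prompt(system_prompt: str, first_trajectory: str) -> str:
    """Assemble the generic second-attempt prompt of the no-reflection
    variant: system prompt, then the fixed instruction from Table 9,
    then the serialized first-attempt trajectory."""
    instruction = (
        "You are also provided with the model's past attempt data, "
        "including observations, actions, rewards, and feedback. "
        "Use this information as context to make a better "
        "next-attempt decision policy. "
        "Follow the action/output format exactly."
    )
    return f"## {system_prompt}\n{instruction}\n{first_trajectory}"
```

Because the instruction is fixed, this variant gives the policy the raw experience but no structured reflection signal, isolating the contribution of the reflection step.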
## Appendix C Training Configuration Details
We train all models with the rLLM agent training stack (Tan et al., 2025) using GRPO (Shao et al., 2024). Training runs on a single node with 8 H100s and uses vLLM (Kwon et al., 2023) with FlashAttention (Dao, 2024).
We enable hybrid-engine training, gradient checkpointing, and remove-padding. The optimizer learning rate is 1e-6. Actor updates use a mini-batch size of 64, dynamic batch sizing, and a maximum token length per GPU of 24,000. FSDP parameter/optimizer offload is enabled for the actor, and parameter offload is enabled for the reference model.
We set the training batch size to 64, with a maximum prompt length of 8,196 tokens and a maximum response length of 8,196 tokens. Rollouts are generated asynchronously using vLLM in async mode with a tensor model parallel size of 1. We use a sampling temperature of 0.7 and a GPU memory utilization of 0.85. For validation rollouts, we generate 4 samples per prompt with temperature 0.7, top-p sampling set to 0.8, and top-k sampling set to 20.
KL regularization is enabled using a low-variance KL loss with coefficient 0.001, and we use a fixed KL control coefficient of 0.001. The actor clipping ratio upper bound is set to 0.28, and the entropy coefficient is set to 0. Rejection sampling and stepwise advantage estimation are disabled.
For RLVR training, we generate 10 samples per prompt. For ERL training, we generate only 4 samples per prompt for each attempt to match the compute budget of RLVR. Evaluation is performed every 5 iterations, and training is stopped manually once performance converges.
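For reference, the hyperparameters listed above can be collected in one place. This is a hedged sketch: the dictionary keys below are illustrative names chosen for readability, not the actual configuration keys used by the rLLM training stack.

```python
# Appendix C hyperparameters gathered into a single illustrative config.
# Key names are assumptions; values are taken directly from the text.
TRAIN_CONFIG = {
    "algorithm": "grpo",
    "learning_rate": 1e-6,
    "train_batch_size": 64,
    "actor_mini_batch_size": 64,
    "max_prompt_length": 8196,
    "max_response_length": 8196,
    "max_token_len_per_gpu": 24000,
    "rollout": {
        "backend": "vllm",
        "mode": "async",
        "tensor_parallel_size": 1,
        "temperature": 0.7,
        "gpu_memory_utilization": 0.85,
    },
    "val_rollout": {"n": 4, "temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "kl": {"loss_coef": 0.001, "ctrl_coef": 0.001, "loss_type": "low_var_kl"},
    "clip_ratio_high": 0.28,
    "entropy_coef": 0.0,
    # Samples per prompt: 10 for RLVR; 4 per attempt (two attempts) for ERL,
    # matching the RLVR compute budget.
    "samples_per_prompt": {"rlvr": 10, "erl_per_attempt": 4},
}
```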
The design and implementation details of the ERL algorithm can be found in Appendix A.