## Diagram: Comparison of Three Learning Paradigms
### Overview
The image is a conceptual diagram comparing three machine learning paradigms: Direct Learning (Supervised Fine-Tuning, SFT), Reinforcement Learning (RLVR, i.e. Reinforcement Learning with Verifiable Rewards), and Experiential Learning (ERL). The diagram is divided into three vertical sections, each illustrating the core feedback loop of one paradigm. A horizontal arrow at the bottom indicates a conceptual progression from "Learning from Feedback" to "Learning from Experience."
### Components/Axes
The diagram has no traditional axes. Its components are labeled boxes, arrows, and text annotations arranged in three distinct columns.
**1. Left Column: Direct Learning (SFT)**
* **Title:** "Direct Learning (SFT)"
* **Components:**
* A box labeled "Policy" (top).
* A box labeled "Example" (bottom).
* A gray box labeled "Supervised Learning" positioned on the arrow from "Example" to "Policy".
* **Flow/Text:** An arrow points from "Example" to "Policy". The text on the arrow reads: `π_θ(·|x) ← y'`, denoting that the policy `π_θ` is updated directly toward the target label `y'` for a given input `x`.
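The SFT update `π_θ(·|x) ← y'` can be sketched with a toy softmax policy over a small discrete output space. Everything here (the vocabulary size, `sft_step`, the learning rate) is an illustrative assumption, not part of the diagram:

```python
# Minimal sketch of the SFT update pi_theta(.|x) <- y', assuming a toy
# softmax policy over VOCAB outputs for a single fixed input x.
import numpy as np

VOCAB = 4                           # hypothetical output space
rng = np.random.default_rng(0)
theta = rng.normal(size=(VOCAB,))   # logits of pi_theta(.|x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sft_step(theta, y_target, lr=0.5):
    """One supervised step: push pi_theta(.|x) toward the label y'."""
    p = softmax(theta)
    grad = p.copy()
    grad[y_target] -= 1.0           # cross-entropy gradient w.r.t. logits
    return theta - lr * grad

y_prime = 2                         # the curated example y'
for _ in range(50):
    theta = sft_step(theta, y_prime)

print(softmax(theta)[y_prime])      # probability mass on y' after training
```

Note that the feedback here is the example itself: the update needs no environment and no reward, only the pair `(x, y')`.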
**2. Middle Column: Reinforcement Learning (RLVR)**
* **Title:** "Reinforcement Learning (RLVR)"
* **Components:**
* A box labeled "Policy" (top).
* A box labeled "Environment" (bottom).
* A green box labeled "Action" on the arrow from "Policy" to "Environment".
* A peach-colored box labeled "Scalar Reward" on the arrow from "Environment" to "Policy".
* **Flow/Text:**
* Arrow from "Policy" to "Environment": Labeled with `y ~ π_θ(·|x)`, indicating an action `y` is sampled from the policy.
* Arrow from "Environment" to "Policy": Labeled with `π_θ(·|x) ← r`, indicating the policy is updated based on a scalar reward `r`.
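One minimal way to realize the RLVR loop above is a REINFORCE-style update on the same toy softmax policy. The environment here (reward 1 iff `y == 2`) and all names are illustrative assumptions, not taken from the diagram:

```python
# Hedged sketch of the RLVR loop: sample y ~ pi_theta(.|x), receive a
# scalar verifiable reward r, update the policy (pi_theta(.|x) <- r)
# with a REINFORCE-style gradient step.
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)
theta = np.zeros(VOCAB)             # logits of pi_theta(.|x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def environment(y):
    """Toy verifiable reward: 1.0 if the action is correct, else 0.0."""
    return 1.0 if y == 2 else 0.0

def rlvr_step(theta, lr=0.5):
    p = softmax(theta)
    y = rng.choice(VOCAB, p=p)      # Action: y ~ pi_theta(.|x)
    r = environment(y)              # Scalar Reward from the environment
    grad_logp = -p
    grad_logp[y] += 1.0             # grad of log pi_theta(y|x) in logits
    return theta + lr * r * grad_logp

for _ in range(300):
    theta = rlvr_step(theta)

print(softmax(theta)[2])            # mass on the rewarded action
```

Unlike the SFT sketch, the policy never sees which output was correct; it only observes the number `r` attached to its own sampled action.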
**3. Right Column: Experiential Learning (ERL)**
* **Title:** "Experiential Learning (ERL)"
* **Components:**
* A box labeled "Policy" (top).
* A box labeled "Environment" (bottom).
* A green box labeled "Action" on the arrow from "Policy" to "Environment".
* A blue box labeled "Experience Internalization" positioned between "Environment" and "Policy".
* A peach-colored box labeled "Self-Reflection" positioned below "Experience Internalization".
* **Flow/Text:**
* Arrow from "Policy" to "Environment": Labeled with `y ~ π_θ(·|x)`.
* Arrow from "Environment" to "Policy": This path is more complex. It flows through the "Experience Internalization" box.
* Text in the "Experience Internalization" box: `π_θ(·|x) ← π_θ(·|x, Δ)`. This suggests the policy is updated toward its own output distribution conditioned on an internalized experience `Δ`.
* Text in the "Self-Reflection" box: `Δ ~ π_θ(·|x, y, r)`. This indicates that the internal representation `Δ` is itself sampled from the policy, generated by reflecting on the input `x`, action `y`, and reward `r`.
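The two boxes above can be sketched as a reflect-then-distill loop on the same toy policy. Nothing here comes from a real ERL implementation: `reflect()` and the distillation step are assumptions about what "Self-Reflection" and "Experience Internalization" could mean for a softmax toy policy:

```python
# Toy sketch of the ERL loop: act, reflect on (x, y, r) to get Delta,
# then internalize by distilling pi_theta(.|x) toward pi_theta(.|x, Delta).
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)
theta = np.zeros(VOCAB)              # logits of pi_theta(.|x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reflect(y, r):
    """Self-Reflection: Delta ~ pi_theta(.|x, y, r).
    Here Delta is just a logit offset boosting the verified-correct
    action after a failure -- a stand-in for a generated reflection."""
    delta = np.zeros(VOCAB)
    if r == 0.0:
        delta[2] += 2.0              # hypothetical insight: "2 was correct"
    return delta

def erl_step(theta, lr=0.5):
    p = softmax(theta)
    y = rng.choice(VOCAB, p=p)       # Action: y ~ pi_theta(.|x)
    r = 1.0 if y == 2 else 0.0       # outcome from the environment
    delta = reflect(y, r)            # Self-Reflection produces Delta
    target = softmax(theta + delta)  # the conditioned policy pi_theta(.|x, Delta)
    # Experience Internalization: pi_theta(.|x) <- pi_theta(.|x, Delta),
    # realized as one distillation step toward the conditioned policy.
    return theta + lr * (target - p)

for _ in range(200):
    theta = erl_step(theta)

print(softmax(theta)[2])
```

The contrast with the RLVR sketch is that the reward `r` never enters the update directly; it only shapes `Δ`, and the policy learns from the `Δ`-conditioned version of itself.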
**Bottom Annotation:**
* A long, gray horizontal arrow spans the width of the diagram beneath the three columns.
* Left end label: "Learning from Feedback" (aligned under SFT and RLVR).
* Right end label: "Learning from Experience" (aligned under ERL).
### Detailed Analysis
The diagram illustrates an evolution in learning complexity:
1. **Direct Learning (SFT):** A simple, one-step supervised loop. The policy is directly corrected towards a known good example (`y'`). The feedback is the example itself.
2. **Reinforcement Learning (RLVR):** An interactive loop with an environment. The policy takes an action, receives a scalar reward signal from the environment, and updates accordingly. Feedback is indirect (a reward number).
3. **Experiential Learning (ERL):** An enhanced interactive loop that adds an internal cognitive layer. Instead of updating the policy directly from the reward `r`, the system first performs "Self-Reflection" to generate an internal representation `Δ` from the tuple `(x, y, r)`. This `Δ` is then used for "Experience Internalization," updating the policy in a more nuanced way (`π_θ(·|x, Δ)`). This suggests learning from a processed, internalized form of experience rather than the raw reward signal.
### Key Observations
* **Increasing Complexity:** The columns grow more complex from left to right, adding components ("Environment", "Action", "Self-Reflection", "Experience Internalization") and more sophisticated update rules.
* **Shift in Feedback Source:** The source of learning signal evolves: from a static `y'` (SFT), to an external scalar `r` (RLVR), to an internally generated `Δ` (ERL).
* **Spatial Grounding:** The "Self-Reflection" and "Experience Internalization" boxes in the ERL diagram are centrally located between the Environment and Policy, visually emphasizing their role as a mediating, internal process.
* **Color Coding:** Green is consistently used for the "Action" output. Peach/light red is used for reward-related signals (`r` in RLVR, `Self-Reflection` in ERL). Blue is introduced in ERL for the new "Experience Internalization" process.
### Interpretation
This diagram argues for a progression in machine learning paradigms towards more autonomous and introspective systems.
* **SFT** represents foundational learning from curated data, but it's limited to mimicking provided examples.
* **RLVR** introduces learning through trial-and-error interaction with an environment, a significant step towards autonomous decision-making. However, it relies on an external reward function, which can be sparse or poorly defined.
* **ERL** proposes a next step where the agent doesn't just react to rewards but actively *reflects* on its experiences (`x, y, r`) to form an internal understanding (`Δ`). This internalized experience then guides learning. The key innovation is the **Self-Reflection** module, which transforms raw experience into a learning signal richer than a bare scalar reward. This mimics aspects of biological learning, where experience is consolidated and interpreted internally, potentially leading to more robust, generalizable, and sample-efficient learning. The bottom arrow frames this as a shift from learning based on external feedback to learning based on internalized experience.