# Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks the mechanisms of error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and the formation of reusable knowledge. We refer to such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model’s parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM’s self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. This meta-experience is then internalized into the LLM’s parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%–4.73% Pass@1 gains across varying model sizes.
Shiting Huang¹, Zecheng Li¹, Yu Zeng¹, Qingnan Ren¹, Zhen Fang¹, Qisheng Su¹, Kou Shi¹, Lin Chen¹, Zehui Chen¹, Feng Zhao¹🖂
¹University of Science and Technology of China  🖂: Corresponding Author
1 Introduction
Reinforcement Learning (RL) has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs) on complex tasks, such as mathematics, programming, and logic reasoning (Shao et al., 2024; Chen et al., 2025; Zeng et al., 2025a; Wang et al., 2025; Zeng et al., 2025b, 2026; Huang et al., 2026). By leveraging feedback signals obtained from interaction with the task environment, RL enables LLMs to move beyond passive imitation learning toward goal-directed reasoning and action (Schulman et al., 2017; Ouyang et al., 2022; Wulfmeier et al., 2024). Furthermore, by replacing learned reward models with programmatically verifiable signals, Reinforcement Learning with Verifiable Rewards (RLVR) eliminates the need for expensive human annotations and mitigates reward hacking, thereby enabling models to explore problem-solving strategies more effectively, which has contributed to its growing attention (Lambert et al., 2024).
However, RLVR still faces a fundamental bottleneck regarding the granularity and utilization of learning signals. From a meta-learning perspective, the human learning cycle involves three critical components: practice and verification, error attribution, and experience internalization. While RLVR primarily drives policy updates through practice and verification, it overlooks the critical stages of error attribution and experience internalization, both of which are essential for fine-grained credit assignment and the formation of reusable knowledge (Wu et al., 2025; Zhang et al., 2025a). In other words, RLVR is largely limited to assessing the overall quality of entire trajectories, while struggling to reason about fine-grained knowledge at the level of intermediate steps (Xie et al., 2025). Although RL approaches (Lightman et al., 2023; Khalifa et al., 2025) employing Process Reward Models (PRMs) to provide dense learning signals attempt to mitigate this limitation, their reliance on trained proxies makes them inherently susceptible to reward hacking (Cheng et al., 2025; Guo et al., 2025), and poses a fundamental tension with the RLVR paradigm, which is centered on programmatically verifiable rewards.
Figure 1: Paradigm comparison between standard RLVR and MEL, where MEL extends RLVR with an explicit knowledge-level learning loop.
Recently, a growing number of studies have explored integrating experience learning within the RLVR framework to address the above challenge. Early attempts such as StepHint (Zhang et al., 2025c) utilize experience as hints to elicit superior reasoning paths from the original problems, treating these trajectories as off-policy migration signals. However, the resulting off-policy deviation in the response distribution can compromise optimization stability, undermining the theoretical benefits of on-policy reinforcement learning. To alleviate this instability, Scaf-GRPO (Zhang et al., 2025d) leverages superior models to generate multi-level, knowledge-intensive experience and injects it as on-policy prefixes for policy updates. Yet, while effective in teaching models to reason within specific experience-augmented distributions, such prefixes are unavailable during inference, inducing a severe distributional mismatch that limits performance gains. Critically, despite their differences, these approaches use retrieved experience primarily as external hints. While such strategies effectively elicit better reasoning paths during training, the resulting learning signals remain predominantly at the trajectory level, yielding superficial corrections rather than intrinsic cognitive enhancements.
Building on this insight, we introduce the concept of meta-experience, elevating experience learning from trajectory-level instances to knowledge-level representations. Through contrastive analysis on paired correct and incorrect trajectories, we pinpoint the bifurcation points underlying reasoning failures and abstract them into reusable meta-experiences. Accordingly, we propose Meta-Experience Learning (MEL), a framework explicitly designed to enable knowledge-level internalization and reuse of meta-experiences. During the training phase, MEL leverages meta-experiences to inject generalizable insights via a self-distillation mechanism and internalizes them into the model’s parametric memory by minimizing the negative log-likelihood. As shown in Figure 1, MEL differs from standard RLVR, which relies on coarse-grained outcome rewards and treats correct and incorrect trajectories independently, by explicitly connecting them via meta-experiences. Hence, this process can be viewed as a language-modeled process-level reward signal, providing continuous and fine-grained guidance for improving reasoning capability. To further enhance stability and effectiveness during RLVR training, we propose empirical validation via replay, which uses meta-experiences as auxiliary in-context hints to assess their contribution to output accuracy. Meta-experiences that pass validation are integrated via negative log-likelihood minimization, while those that fail are excluded. In summary, our main contributions are as follows:
- We propose MEL, a novel framework that integrates self-distilled meta-experience with reinforcement learning, addressing the limitations of standard RLVR in error attribution and experience internalization by embedding these meta-experiences directly into the parametric memory of LLMs.
- We validate the effectiveness of MEL through extensive experiments on five challenging mathematical reasoning benchmarks across multiple LLM scales (4B, 8B, and 14B). Compared with both the vanilla GRPO baseline and the corresponding base models, MEL consistently improves performance across Pass@1, Avg@8, and Pass@8 metrics.
- Empirical results confirm that MEL seamlessly integrates with diverse paradigms (e.g., RFT, GRPO, REINFORCE++) to reshape reasoning patterns and elevate performance ceilings. Notably, these benefits exhibit strong scalability, becoming increasingly pronounced as model size expands.
Figure 2: Overview of Meta-Experience Learning (MEL), which constructs meta-experiences from contrastive pairs via abstraction and validation, thereby introducing an explicit knowledge-level learning loop on top of standard RLVR.
2 Related Work
Reinforcement Learning with Verifiable Rewards.
Reinforcement Learning with Verifiable Rewards (RLVR) leverages rule-based validators to provide deterministic feedback on models’ self-generated solutions (Lambert et al., 2024). Extensive research has systematically investigated RLVR, exploring how this paradigm improves the performance of complex reasoning (Jaech et al., 2024; Guo et al., 2025; Liu et al., 2025; Zhang et al., 2025b). The pioneering framework Group Relative Policy Optimization (GRPO) (Shao et al., 2024) estimates advantages via group-wise relative comparisons, eliminating the need for a separate value model. Building on this base method, recent studies have introduced a range of algorithmic variants to improve training stability and efficiency. For instance, REINFORCE++ (Hu, 2025) enhances stability through global advantage normalization; DAPO (Yu et al., 2025) mitigates entropy collapse and improves reward utilization via relaxed clipping and dynamic sampling; and GSPO (Zheng et al., 2025) reduces gradient estimation variance with sequence-level clipping. Despite these algorithmic advancements, a fundamental limitation persists: current RLVR methods predominantly rely on outcome-level rewards. This failure to assign fine-grained credit to specific knowledge points prevents the formation of reusable knowledge, fundamentally constraining the development of systematic and generalizable reasoning capabilities.
Experience Learning.
Recent studies have increasingly recognized that leveraging various forms of experience can substantially enhance LLM reasoning capabilities. One prominent line of research lies in test-time scaling methods, which store experience in external memory pools. For example, SpeedupLLM (Pan and Zhao, 2025) appends memories of previous reasoning traces as experience to accelerate inference, while Training Free GRPO (Cai et al., 2025) and ReasoningBank (Ouyang et al., 2025) distill accumulated experience into structured memory entries for retrieval-based augmentation. However, these approaches rely on ever-growing external memory, preventing the experience from being truly internalized and thus failing to substantively enhance intrinsic reasoning capabilities. Complementarily, another stream of research integrates experience directly into RL training as guiding signals. Methods such as Scaf-GRPO (Zhang et al., 2025d) and StepHint (Zhang et al., 2025c) employ external models to generate experiential hints, which are injected as prefixes or migration signals, to guide the policy toward higher-quality trajectories. Similarly, approaches like LUFFY (Yan et al., 2025) and SRFT (Fu et al., 2025) incorporate expert solution traces as additional experience. Despite improving exploration efficiency, these methods primarily induce trajectory-level imitation. Consequently, models become proficient at following specific patterns but fail to develop the meta-cognitive understanding required for establishing reusable knowledge structures.
3 Meta-Experience Learning
Human learning follows a recurrent cognitive cycle consisting of practice and verification, error attribution, and experience internalization, which in turn informs subsequent practice. Motivated by this cognitive process, we define meta-experience for LLMs as generalizable and reusable knowledge derived from accumulated reasoning trials, capturing both underlying knowledge concepts and common failure modes. Building on this notion, we propose Meta-Experience Learning (MEL), a framework operating within the RLVR paradigm and expressly designed to internalize such self-distilled, knowledge-level insights into the model’s parametric memory. As illustrated in Figure 2, we first formalize the model exploration stage in RLVR (§ 3.1), then present the details of the Meta-Experience construction (§ 3.2). Finally, we describe the internalization mechanism (§ 3.3) for consolidating these insights into parametric memory, followed by the joint training objective for policy optimization (§ 3.4).
3.1 Explorative Rollout and Verifiable Feedback
Mirroring the “practice and check” phase in human learning, the RLVR framework engages the model in exploring potential solutions for reasoning tasks, while the environment serves as a deterministic verifier that provides verifiable feedback on the final answers. As mastering complex logic typically requires traversing the solution space through multiple attempts, we simulate this stochastic exploration by adopting the group rollout formulation from Group Relative Policy Optimization (GRPO) (Shao et al., 2024).
Formally, given a query $x$ sampled from the dataset $\mathcal{D}$ , the policy model $\pi_{\theta}$ performs stochastic exploration over the solution space and generates a group of $G$ independent reasoning trajectories $\mathcal{Y}=\{y_{1},y_{2},...,y_{G}\}.$ A rule-based verifier then evaluates each trajectory using a verification function $V(\cdot)$ , which compares the extracted final answer from $y_{i}$ against the ground-truth answer $y^{*}$ and assigns a binary outcome reward:
$$
r_{i}=\mathbb{I}\big[V(y_{i},y^{*})\big]\in\{0,1\}. \tag{1}
$$
This process partitions the generated group $\mathcal{Y}$ into two distinct subsets: the set of correct trajectories $\mathcal{Y}^{+}=\{y_{i}\mid r_{i}=1\}$ and the set of incorrect trajectories $\mathcal{Y}^{-}=\{y_{i}\mid r_{i}=0\}$ .
The coexistence of $\mathcal{Y}^{+}$ and $\mathcal{Y}^{-}$ under the same prompt distribution suggests that the model is capable of solving the task, while producing diverse reasoning trajectories. For our method, such diversity constitutes a beneficial property and serves a dual role. On the one hand, it supplies the variance necessary for effective policy updates in standard RLVR. On the other hand, it enables the extraction of knowledge-level meta-experience through systematic contrast between correct and incorrect reasoning outcomes.
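The rollout-and-partition stage above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_policy` and the answer-matching rule inside `verify` are placeholder assumptions.

```python
import random

def verify(trajectory, ground_truth):
    """Rule-based verifier of Eq. 1: binary outcome reward in {0, 1}."""
    return 1 if trajectory["answer"] == ground_truth else 0

def rollout_and_partition(policy, query, ground_truth, G=8):
    """Sample G trajectories for one query and split them into Y+ and Y-."""
    group = [policy(query) for _ in range(G)]
    y_pos = [y for y in group if verify(y, ground_truth) == 1]  # Y+
    y_neg = [y for y in group if verify(y, ground_truth) == 0]  # Y-
    return y_pos, y_neg

# Toy stand-in policy that answers correctly about half the time.
toy_policy = lambda q: {"answer": random.choice(["42", "41"])}
y_pos, y_neg = rollout_and_partition(toy_policy, "a query", "42", G=8)
```

Queries whose groups are entirely correct or entirely incorrect carry no contrastive signal; the construction in § 3.2 only proceeds when both subsets are non-empty.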
3.2 Meta-Experience Construction
Prior studies (Xie et al., 2025; Khalifa et al., 2025; Huang et al., 2025) have shown that effective learning does not arise from merely knowing that a final answer is incorrect, but rather from identifying the specific bifurcation point at which the reasoning process deviates from the correct trajectory, a critical cognitive process known as error attribution. Building on this insight, we leverage pairs of correct and incorrect trajectories to localize reasoning errors and distill such bifurcation points into explicit meta-experiences.
Locating the Bifurcation Point.
To extract knowledge-level learning signals from the exploration results, we focus on identifying the bifurcation points where the reasoning logic diverges into an erroneous path. With the exploration results partitioned into $\mathcal{Y}^{+}$ and $\mathcal{Y}^{-}$ by the verifier, we construct a set of contrastive pairs $\mathcal{P}_{x}=\{(y^{+},y^{-})\mid y^{+}\in\mathcal{Y}^{+},\,y^{-}\in\mathcal{Y}^{-}\}$ for each query $x$ , whose contrast naturally exposes the specific errors in the reasoning process. Such contrastive analysis requires the presence of both positive and negative trajectories; accordingly, we only consider gradient-informative samples with non-empty $\mathcal{Y}^{+}$ and $\mathcal{Y}^{-}$ .
For fine-grained comparison within each pair, each trajectory $y$ can be formatted as a reasoning chain $y=(s_{1},s_{2},...,s_{L},a)$ , where each $s_{t}$ denotes an atomic reasoning step and $a$ indicates the final answer. Since both trajectories originate from the same context, they typically share a correct reasoning path until a critical divergence step $s^{*}$ occurs.
Given deterministic verification signals and full access to the reasoning chains, identifying the bifurcation point can be viewed as a discriminative task that is easier than reasoning from scratch (Saunders et al., 2022; Swamy et al., 2025). Motivated by this observation, we task the policy model with analyzing each contrastive pair to identify the reasoning bifurcation point $s^{*}$ :
$$
s^{*}\sim\pi_{\theta}\big(\cdot\mid I,x,y^{+},y^{-}\big), \tag{2}
$$
where $I$ denotes a structured instruction guiding introspective analysis.
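As a concrete sketch, the pair construction $\mathcal{P}_{x}$ and the bifurcation query of Eq. 2 might look as follows. Here `llm` is a hypothetical text-generation callable and the prompt wording is illustrative, not the paper's actual instruction $I$.

```python
from itertools import product

def contrastive_pairs(y_pos, y_neg):
    """Construct P_x: every (correct, incorrect) trajectory pair for one query."""
    return list(product(y_pos, y_neg))

def locate_bifurcation(llm, instruction, x, y_plus, y_minus):
    """Ask the policy itself to identify the first divergent step s* (Eq. 2)."""
    prompt = (f"{instruction}\nQuestion: {x}\n"
              f"Correct solution:\n{y_plus}\n"
              f"Incorrect solution:\n{y_minus}\n"
              "Identify the first step where the incorrect solution diverges.")
    return llm(prompt)

pairs = contrastive_pairs(["proof A", "proof B"], ["flawed proof"])
```

Because verification outcomes are known and both chains are fully visible, this is a discriminative localization task rather than open-ended generation, which is why the policy itself can be trusted to perform it.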
Deep Diagnosis and Abstraction.
Identifying the bifurcation point $s^{*}$ localizes where the reasoning fails, serving as the raw material for subsequent learning. Anchored at $s^{*}$ , the policy model conducts a deep diagnostic analysis to contrast the strategic choices underlying the successful and failed trajectories. Specifically, the model examines the local reasoning context around $s^{*}$ to pinpoint the root cause of failure, such as violated assumptions, erroneous sub-goals, overlooked constraints, or the misuse of specific principles. Complementarily, it inspects the successful trajectory to uncover the mechanisms that prevented such pitfalls, including precise knowledge application, explicit constraint verification, coherent knowledge representations, or emergent self-correction behaviors. By jointly synthesizing these perspectives, the model distills the structural divergence between the correct and incorrect logic, crystallizing it into explicit knowledge. Formally, we model this diagnostic process as generating a critique $\mathcal{C}$ that encapsulates the error attribution, the comparative strategic gap, and the corresponding corrective principle:
$$
\mathcal{C}\sim\pi_{\theta}\big(\cdot\mid I,x,y^{+},y^{-},s^{*}\big). \tag{3}
$$
To ensure generalization, it is imperative for the model to distill instance-specific critiques into abstract heuristics capable of guiding future reasoning. This abstraction mechanism systematically strips away context-dependent variables, mapping the concrete logic of success and failure onto a generalized space of preconditions and responses. Structurally, such heuristics synthesize abstract problem categorization with the corresponding reasoning principles, encompassing the essential knowledge points, theoretical theorems, and decision criteria. Crucially, they also demarcate error-prone boundaries, explicitly highlighting potential pitfalls or latent constraints associated with the specific problem class. We define the extraction of this heuristic knowledge $\mathcal{H}$ as a generation process conditioned on the full critique context:
$$
\mathcal{H}\sim\pi_{\theta}\big(\cdot\mid I,x,y^{+},y^{-},s^{*},\mathcal{C}\big). \tag{4}
$$
Finally, we consolidate these components into a unified Meta-Experience tuple $\mathcal{M}$ , which elevates experience learning from trajectory-level instances to knowledge-level representations.
$$
\mathcal{M}=\big(s^{*},\mathcal{C},\mathcal{H}\big). \tag{5}
$$
This formulation enables meta-experiences to be reused across tasks that share analogous reasoning structures, serving as a fine-grained learning signal. By applying the extraction process across distinct contrastive pairs for a query $x$ , we construct a candidate pool of meta-experiences $\mathcal{D}_{\mathcal{M}}=\{(x,y^{+}_{i},y^{-}_{i},\mathcal{M}_{i})\}_{i=1}^{N}$ , where $N$ denotes the total number of meta-experiences derived from $x$ , and $(y^{+}_{i},y^{-}_{i})$ represents the specific contrastive pair used to derive $\mathcal{M}_{i}$ .
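Assuming a generic `llm` sampling interface, the full construction of Eqs. 2–5 can be sketched as a three-stage pipeline. The dataclass layout and prompt strings below are our illustration; the paper does not prescribe a concrete schema.

```python
from dataclasses import dataclass

@dataclass
class MetaExperience:
    """The unified tuple M = (s*, C, H) of Eq. 5."""
    bifurcation: str  # s*: first step where reasoning diverges
    critique: str     # C: error attribution and corrective principle
    heuristic: str    # H: abstract, reusable reasoning rule

def build_meta_experience(llm, I, x, y_plus, y_minus):
    """Sequentially condition each stage on all previous outputs (Eqs. 2-4)."""
    s_star = llm(f"{I}\n{x}\n{y_plus}\n{y_minus}\n"
                 "Locate the bifurcation step.")
    critique = llm(f"{I}\n{x}\n{y_plus}\n{y_minus}\n{s_star}\n"
                   "Diagnose the root cause of the failure.")
    heuristic = llm(f"{I}\n{x}\n{y_plus}\n{y_minus}\n{s_star}\n{critique}\n"
                    "Abstract a general, context-free heuristic.")
    return MetaExperience(s_star, critique, heuristic)
```

Note the strictly widening conditioning context: each stage sees everything the previous stage saw plus its output, mirroring the factorization in Eqs. 2–4.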
Empirical Validation via Replay.
Closing the cognitive loop requires re-instantiating theoretical insights derived from past failures in future problem-solving to assess their validity. We recognize that the raw meta-experience $\mathcal{M}$ may still suffer from intrinsic hallucinations or causal misalignment. To mitigate this, we conduct empirical verification by incorporating the extracted tuple $\mathcal{M}$ as short-term working memory into the prompt, thereby guiding the model to re-attempt the original query $x$ . This procedure tests whether the injected meta-experience can effectively steer the model away from the previously identified bifurcation point $s^{*}$ and toward a correct reasoning trajectory.
We retain a meta-experience only if the corresponding replay trajectory $y_{\text{val}}\sim\pi_{\theta}(\cdot\mid x,\mathcal{M})$ satisfies the verifier by producing the correct answer:
$$
\mathcal{D}_{\mathcal{M}^{*}}=\left\{(x,y^{+},y^{-},\mathcal{M})\in\mathcal{D}_{\mathcal{M}}\;\middle|\;V\!\left(y_{\text{val}},y^{*}\right)=1\right\}. \tag{6}
$$
Consequently, this empirical validation preserves only high-quality meta-experiences for integration into parametric long-term memory, guaranteeing the reliability of the supervision signals used in the subsequent optimization phase.
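A minimal sketch of this replay filter (Eq. 6), assuming a `policy_with_memory` callable that conditions generation on the meta-experience, and the same rule-based `verify` interface as before; both are placeholder assumptions.

```python
def validate_by_replay(policy_with_memory, verify, candidates, ground_truths):
    """Keep only (x, y+, y-, M) tuples whose replay passes the verifier."""
    kept = []
    for x, y_plus, y_minus, meta in candidates:
        y_val = policy_with_memory(x, meta)       # y_val ~ pi(. | x, M)
        if verify(y_val, ground_truths[x]) == 1:  # Eq. 6 membership test
            kept.append((x, y_plus, y_minus, meta))
    return kept
```

This is a causal check rather than a plausibility check: a meta-experience survives only if injecting it actually steers the model to a correct answer on the original query.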
3.3 Internalization Mechanism
The verified meta-experiences $\mathcal{D}_{\mathcal{M}^{*}}$ constitute a high-quality reservoir of reasoning guidance. However, treating these insights solely as retrieval-augmented memory imposes a substantial computational burden during the inference forward pass, as it necessitates processing elongated contexts for every query. To overcome this limitation, we propose to transfer these insights from the transient context window to the model’s parametric memory. Unlike the finite context buffer, the model parameters offer a virtually unlimited capacity for accumulating diverse meta-experiences, allowing the policy to internalize vast amounts of reasoning patterns without incurring inference-time latency.
We establish this internalization process as a self-distillation paradigm, where the model learns from its own verified experiences. Specifically, we employ fine-tuning based on the token-averaged negative log-likelihood (NLL) objective to compile the meta-experiences into the policy’s weights. Formally, given the retrospective context $C_{\text{retro}}=[I,x,y^{+},y^{-}]$ , the internalization loss is defined as:
$$
\begin{aligned}
\mathcal{L}_{\text{NLL}}(\theta)&=-\,\mathbb{E}_{(x,y^{+},y^{-},\mathcal{M}^{*})\sim\mathcal{D}_{\mathcal{M}^{*}}}\Big[\frac{1}{|\mathcal{M}^{*}|}\sum_{t=1}^{|\mathcal{M}^{*}|}\log\pi_{\theta}(\mathcal{M}^{*}_{t}\mid C_{\text{retro}},\mathcal{M}^{*}_{<t})\Big]\\
&=-\,\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\Bigg[\mathbb{E}_{(y^{+},y^{-},\mathcal{M}^{*})\sim\mathcal{T}(x,\{y_{i}\}_{i=1}^{G})}\Big[\frac{1}{|\mathcal{M}^{*}|}\sum_{t=1}^{|\mathcal{M}^{*}|}\log\pi_{\theta}(\mathcal{M}^{*}_{t}\mid C_{\text{retro}},\mathcal{M}^{*}_{<t})\Big]\Bigg]
\end{aligned} \tag{7}
$$
where $\mathcal{T}(\cdot)$ denotes the stochastic construction process detailed in § 3.2.
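Concretely, Eq. 7 is a standard masked language-modeling loss: the retrospective context $C_{\text{retro}}$ only conditions the prediction, and the NLL is averaged over the meta-experience tokens alone. A minimal sketch in plain Python, with illustrative function and variable names (not from the released implementation):

```python
import math

def internalization_loss(token_logprobs, loss_mask):
    """Token-averaged NLL over the meta-experience span (Eq. 7).

    token_logprobs: log pi_theta(token_t | prefix) for each position in the
        concatenated sequence [C_retro, M*].
    loss_mask: 1 for meta-experience tokens M*_t, 0 for the retrospective
        context C_retro = [I, x, y+, y-], which conditions but is not scored.
    """
    picked = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(picked) / len(picked)

# Toy example: 3 context tokens (masked out) followed by 2 meta-experience tokens.
logprobs = [-0.1, -0.2, -0.3, math.log(0.5), math.log(0.25)]
mask     = [0,    0,    0,    1,             1]
loss = internalization_loss(logprobs, mask)
# loss = -(log 0.5 + log 0.25) / 2 = 1.5 * log 2
```

In a real training loop the log-probabilities would come from a causal LM forward pass, with the mask zeroing out the context positions exactly as above.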
Based on this formulation, the internalization process can be viewed as a specialized form of sampling within the RLVR framework. By negating the loss, we define the Meta-Experience Return $\mathcal{R}_{\text{MEL}}$ as the expected log-likelihood over the stochastically constructed verification set:
$$
\mathcal{R}_{\text{MEL}}=\mathbb{E}_{(y^{+},y^{-},\mathcal{M}^{*})\sim\mathcal{T}(x,\{y_{i}\}_{i=1}^{G})}\Bigg[\frac{1}{|\mathcal{M}^{*}|}\sum_{t=1}^{|\mathcal{M}^{*}|}\log\pi_{\theta}(\mathcal{M}^{*}_{t}\mid C_{\text{retro}},\mathcal{M}^{*}_{<t})\Bigg]. \tag{8}
$$
3.4 Joint Training Objective
Table 1: Main results. Comparison of Pass@1, Avg@8, and Pass@8 accuracy (%) across different model scales. The best results within each model scale are marked in bold.
| Method | AIME 2024 Pass@1 | AIME 2024 Avg@8 | AIME 2024 Pass@8 | AIME 2025 Pass@1 | AIME 2025 Avg@8 | AIME 2025 Pass@8 | AMC 2023 Pass@1 | AMC 2023 Avg@8 | AMC 2023 Pass@8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Qwen3-4B-Base** | | | | | | | | | |
| Baseline | 13.33 | 9.90 | 30.00 | 10.00 | 6.56 | 23.33 | 45.00 | 42.73 | 72.50 |
| GRPO | 13.33 | 18.33 | 30.00 | 6.67 | 17.50 | 30.00 | 57.50 | 58.13 | 85.00 |
| MEL | **20.00** | **20.83** | **33.00** | **16.67** | **18.33** | **33.00** | **60.00** | **60.31** | **87.50** |
| **Qwen3-8B-Base** | | | | | | | | | |
| Baseline | 6.67 | 10.00 | 26.67 | 13.33 | 15.00 | 33.33 | 65.00 | 52.50 | 87.50 |
| GRPO | 16.67 | 24.58 | 43.33 | 20.00 | 20.83 | **36.67** | 67.50 | 69.06 | 87.50 |
| MEL | **30.00** | **25.42** | **60.00** | **23.33** | **23.33** | **36.67** | **70.00** | **70.31** | **90.00** |
| **Qwen3-14B-Base** | | | | | | | | | |
| Baseline | 13.33 | 10.83 | 36.67 | 6.66 | 9.58 | 33.33 | 60.00 | 51.25 | 82.50 |
| GRPO | 30.00 | 35.41 | 56.67 | 33.33 | 24.17 | 43.33 | 75.00 | 75.94 | **95.00** |
| MEL | **33.33** | **35.83** | **60.00** | **36.67** | **30.00** | **46.67** | **82.50** | **82.81** | **95.00** |
| Method | MATH 500 Pass@1 | MATH 500 Avg@8 | MATH 500 Pass@8 | OlympiadBench Pass@1 | OlympiadBench Avg@8 | OlympiadBench Pass@8 | Average Pass@1 | Average Avg@8 | Average Pass@8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Qwen3-4B-Base** | | | | | | | | | |
| Baseline | 74.20 | 65.74 | 89.60 | 39.17 | 35.37 | 60.38 | 36.34 | 32.06 | 55.16 |
| GRPO | 81.80 | 82.20 | 93.00 | **48.51** | 48.46 | 67.21 | 41.56 | 44.92 | 61.04 |
| MEL | **82.20** | **82.30** | **93.80** | **48.51** | **49.48** | **69.73** | **45.48** | **46.25** | **63.41** |
| **Qwen3-8B-Base** | | | | | | | | | |
| Baseline | 77.00 | 73.40 | 91.40 | 44.51 | 39.41 | 64.09 | 41.30 | 38.06 | 60.60 |
| GRPO | 84.40 | 86.28 | 95.40 | 53.56 | 54.60 | **73.74** | 48.43 | 51.07 | 67.33 |
| MEL | **86.60** | **86.70** | **96.20** | **54.01** | **55.60** | 73.00 | **52.79** | **52.27** | **71.17** |
| **Qwen3-14B-Base** | | | | | | | | | |
| Baseline | 80.80 | 74.15 | 93.60 | 45.25 | 40.50 | 65.58 | 41.21 | 37.26 | 62.34 |
| GRPO | 85.00 | 88.35 | 96.40 | 58.16 | 58.46 | 74.78 | 56.30 | 56.47 | 73.24 |
| MEL | **90.80** | **90.80** | **97.20** | **61.87** | **60.90** | **75.82** | **61.03** | **60.07** | **74.94** |
To simultaneously encourage solution exploration and consolidate the internalized meta-experiences, achieving dual optimization across trajectory-level behaviors and knowledge-level representations, we train the policy model $\pi_{\theta}$ with a joint optimization objective. This objective synergizes the RLVR signal derived from diverse explorative rollouts with the supervised signal distilled from high-quality meta-experiences:
$$
\mathcal{J}(\theta)=\mathcal{J}_{\text{RLVR}}(\theta)+\mathcal{J}_{\text{MEL}}(\theta). \tag{9}
$$
We adopt GRPO (Shao et al., 2024) as the RLVR component, computing group-normalized advantages by standardizing rewards within the sampled group and broadcasting them to each token. Let $y_{i,t}$ denote the $t$-th token in trajectory $y_{i}$ and $y_{i,<t}$ the corresponding prefix. Substituting the definition of $\mathcal{R}_{\text{MEL}}$ from Eq. 8, the joint objective in Eq. 9 expands to:
$$
\mathcal{J}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\Big[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\min\Big(\rho_{i,t}(\theta)\hat{A}_{i,t},\;\mathrm{clip}\big(\rho_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_{i,t}\Big)+\mathcal{R}_{\text{MEL}}\Big]. \tag{10}
$$
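The joint objective above can be sketched in plain Python, assuming one verifiable outcome reward per trajectory and a scalar importance ratio per token; `grpo_advantages`, `clipped_term`, and `joint_objective` are illustrative names, not the VERL implementation:

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages: standardize outcome rewards within the
    G sampled rollouts; each trajectory's scalar is broadcast to its tokens."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g) + 1e-8
    return [(r - mean) / std for r in rewards]

def clipped_term(ratio, adv, eps=0.2):
    """Per-token PPO-style clipped surrogate used by the RLVR component."""
    return min(ratio * adv, max(min(ratio, 1 + eps), 1 - eps) * adv)

def joint_objective(ratios, rewards, mel_return, eps=0.2):
    """Eq. 10: token-averaged clipped surrogate per trajectory, averaged over
    the group, plus the meta-experience return R_MEL (Eq. 8)."""
    advs = grpo_advantages(rewards)
    surrogate = 0.0
    for traj_ratios, adv in zip(ratios, advs):
        per_tok = [clipped_term(rho, adv, eps) for rho in traj_ratios]
        surrogate += sum(per_tok) / len(per_tok)
    return surrogate / len(ratios) + mel_return

# Toy group of G=2 rollouts: one verified correct (reward 1), one incorrect (0).
ratios = [[1.0, 1.1], [0.9, 1.0]]   # rho_{i,t} under the current policy
J = joint_objective(ratios, [1.0, 0.0], mel_return=-1.04)
```

In practice both terms are differentiated jointly, so a single optimizer step updates the policy from the exploration signal and the internalization signal at once.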
Although $\mathcal{R}_{\text{MEL}}$ is derived from a log-likelihood objective, its gradient is mathematically equivalent to a policy-gradient update in which the reward signal is a constant positive scalar. Consequently, the total objective $\mathcal{J}(\theta)$ can be unified as maximizing the expected cumulative return of a hybrid reward function. In this unified view, the meta-experiences function as a dense process reward.
Unlike the sparse outcome rewards in standard RLVR that only evaluate the final correctness, $\mathcal{R}_{\text{MEL}}$ provides explicit, step-by-step reinforcement for the reasoning process itself. This ensures that the model not only pursues correct outcomes via broad exploration but is also continuously shaped by the dense supervision of its own successful cognitive patterns, effectively bridging the gap between trajectory-level search and token-level knowledge encoding.
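This equivalence can be made explicit. Differentiating $\mathcal{R}_{\text{MEL}}$ in Eq. 8 and inserting a constant token reward $r_{t}\equiv 1$ recovers a REINFORCE-style update over the meta-experience tokens:

$$
\nabla_{\theta}\mathcal{R}_{\text{MEL}}
=\mathbb{E}\Big[\frac{1}{|\mathcal{M}^{*}|}\sum_{t=1}^{|\mathcal{M}^{*}|}\nabla_{\theta}\log\pi_{\theta}(\mathcal{M}^{*}_{t}\mid C_{\text{retro}},\mathcal{M}^{*}_{<t})\Big]
=\mathbb{E}\Big[\frac{1}{|\mathcal{M}^{*}|}\sum_{t=1}^{|\mathcal{M}^{*}|}r_{t}\,\nabla_{\theta}\log\pi_{\theta}(\mathcal{M}^{*}_{t}\mid C_{\text{retro}},\mathcal{M}^{*}_{<t})\Big]\Big|_{r_{t}=1},
$$

which is exactly the policy-gradient form $\mathbb{E}[r\,\nabla_{\theta}\log\pi_{\theta}]$ with every per-token reward fixed at one.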
4 Experiments
Datasets.
We train our model on the DAPO-Math-17k dataset (Yu et al., 2025) and evaluate it on five challenging mathematical reasoning benchmarks: AIME24, AIME25, AMC23 (Li et al., 2024), MATH500 (Hendrycks et al., 2021), and OlympiadBench (He et al., 2024).
Setups.
All reinforcement learning training is conducted using the VERL framework (Sheng et al., 2024) on 8×H20 GPUs, with Math-Verify providing rule-based outcome verification. During training, we sample 8 responses per prompt at a temperature of 1.0 with a batch size of 128. Optimization uses a learning rate of $1\times 10^{-6}$ and a mini-batch size of 128. For evaluation, we report Pass@1 at temperature 0, and Avg@8 and Pass@8 at temperature 0.6.
Models and Baselines.
To demonstrate the general applicability of MEL, we conduct experiments across a diverse range of model scales, including Qwen3-4B-Base, Qwen3-8B-Base, and Qwen3-14B-Base (Yang et al., 2025). We adopt GRPO (Shao et al., 2024) as the base reinforcement learning algorithm for MEL, and thus perform a direct and controlled comparison between the vanilla GRPO and our meta-experience learning approach.
4.1 Experimental Results
As shown in Table 1, MEL achieves consistent and significant improvements over vanilla GRPO and the base model across multiple benchmarks and model scales. We report three complementary metrics: Pass@1 reflects one-shot reliability, Avg@8 measures average performance over 8 samples, and Pass@8 reports the best-of-8 success rate.
First, the gains in Pass@1 demonstrate that MEL substantially enhances the model’s confidence in following correct reasoning paths. Across all model scales, it achieves a consistent improvement of 3.92%–4.73% over the strong GRPO baseline. This indicates that MEL effectively internalizes the explored insights into the model’s parametric memory. By consolidating these successful reasoning patterns, the model generates high-confidence solutions, markedly reducing the need for extensive test-time sampling. This reliability is further corroborated by the gains in Avg@8, which reveal that MEL significantly enhances reasoning consistency and output stability. High performance on this metric supports our hypothesis that internalized meta-experiences function as intrinsic process-level guidance, continuously steering the generation toward valid logic and effectively reducing variance across sampled outputs. Finally, the sustained gains in Pass@8 suggest that learning from meta-experience does not harm exploration; instead, it expands the reachable solution space and raises the upper bound of best-of-$k$ performance.
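For reference, the three metrics can be computed as follows; this is the simple empirical form (not the unbiased pass@k estimator), with illustrative function names:

```python
def pass_at_1(greedy_correct):
    """Pass@1: fraction of problems whose single greedy (temperature-0) answer is correct."""
    return sum(greedy_correct) / len(greedy_correct)

def avg_at_k(sample_correct):
    """Avg@8-style metric: mean accuracy over the k sampled answers per problem."""
    return sum(sum(s) / len(s) for s in sample_correct) / len(sample_correct)

def pass_at_k(sample_correct):
    """Pass@8-style metric: a problem counts if any of its k samples is correct."""
    return sum(1.0 if any(s) else 0.0 for s in sample_correct) / len(sample_correct)

# Two problems, k = 4 samples each (1 = verified correct, 0 = incorrect).
samples = [[1, 0, 1, 0], [0, 0, 0, 1]]
# avg_at_k(samples) -> (0.5 + 0.25) / 2 = 0.375
# pass_at_k(samples) -> 1.0
```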
4.2 Training Dynamics and Convergence Analysis
(Figure content: three line charts of training reward versus training steps (0–120), one per model scale (4B, 8B, 14B), each overlaying the GRPO and MEL reward curves.)
Figure 3: Training curves comparing GRPO and MEL.
To understand the mechanisms driving the performance gains under MEL, we monitored the training dynamics and validation performance in Figures 3 and 6–8.
Vanilla GRPO methods often struggle to obtain positive reinforcement in the early stages, particularly when initial performance is low, due to the sparsity of outcome-based rewards. As illustrated in the training curve, vanilla GRPO exhibits a relatively slow ascent during the initial phase. In contrast, MEL demonstrates a sharp, rapid trajectory growth immediately from the onset of training. This acceleration is attributed to the internalized meta-experience return, $\mathcal{R}_{\text{MEL}}$ . By functioning as a dense, language-modeling process reward, $\mathcal{R}_{\text{MEL}}$ continuously provides informative gradient signals for every reasoning step, even when successful trajectories yielding positive reinforcement are scarce.
Beyond sample efficiency, MEL achieves a consistently higher performance upper bound. The training curves show that the average reward of MEL consistently surpasses that of vanilla GRPO throughout the entire training process. Crucially, the downstream validation trajectories reveal that even as performance growth begins to plateau in the later stages, MEL maintains a distinct and sustained advantage over the baseline. This phenomenon demonstrates that the internalization of meta-experiences empowers the model to effectively navigate and explore more complex, long-horizon solutions that remain inaccessible to the baseline.
(Figure content: a case study on a tangential-triangle problem, where triangle ABC has AB = 4, BC = 5, CA = 6 and the task is to find B′C′. The GRPO trajectory misapplies the circumcircle chord formula, using 2R sin(∠BAC / 2) where 2R sin ∠BAC is required. The extracted meta-experience records this failure point, attributes it to conceptual confusion in formula usage, and distills heuristics such as an angle-verification rule, ensuring θ in a 2R sin θ formula spans the full chord, and a formula–geometry consistency rule.)
Figure 4: Case study comparing GRPO and MEL, with visualization of meta-experience in early stage.
4.3 How Meta-Experience Shapes Reasoning Patterns
To investigate how MEL shapes the model’s cognitive processes beyond numerical metrics, we conduct a qualitative analysis comparing the reasoning trajectories of MEL and the baseline GRPO model, as visualized in Figure 4.
A distinct behavioral divergence is observed from the onset of the solution. While the GRPO baseline tends to prioritize immediate execution through direct numerical operations, MEL adopts a structured preparatory strategy by explicitly outlining relevant theorems and formulas. Although the direct approach may appear efficient for simple queries, it increases the susceptibility to errors in complex tasks due to the lack of a holistic view of problem constraints.
Notably, MEL exhibits an emergent cognitive behavior. When applying specific theorems, it spontaneously activates internalized “bitter lessons” as endogenous safeguards to regulate its actions. These active signals effectively reduce reasoning drift by encouraging earlier constraint checking and consistent self-correction when the model enters uncertain regions.
4.4 Generality Across Learning Paradigms
(Figure content: a radar chart over MATH500, AIME24, AIME25, AMC23, and OlympiadBench comparing the Qwen3 base model, RFT, and RFT w. ME; adding meta-experience lifts RFT on most axes.)
(Figure content: a radar chart over MATH500, AIME24, AIME25, AMC23, and OlympiadBench comparing the Qwen3 base model, REINFORCE++, and REINFORCE++ w. ME; REINFORCE++ w. ME attains the highest scores on most axes.)
Figure 5: Impact of meta-experience across different training methods, including Rejection Sampling Fine-Tuning (RFT) and REINFORCE++. ME denotes Meta-Experience.
To demonstrate the versatility of meta-experience, we integrated it into RFT and REINFORCE++ using the Qwen3-8B-Base model as the backbone and the same training set as in our main experiments. As shown in Figure 5, while vanilla RFT often suffers from rote memorization and tends to overfit to specific samples in the training set, the incorporation of meta-experiences introduces robust reasoning heuristics. This allows the model to internalize the underlying logic rather than merely imitating specific answers, thereby effectively mitigating overfitting and enhancing generalization to unseen test sets. Similarly, applying meta-experiences to REINFORCE++ significantly raises the performance ceiling on benchmarks. This confirms that the benefit of internalized meta-experiences is a universal enhancement, not limited to the GRPO framework.
4.5 Scalability Analysis
As indicated by the training curves in Figure 3, the method exhibits a distinct positive scaling law: the performance margin between MEL and the baseline widens significantly as the model size increases. This phenomenon consistently extends to downstream validation benchmarks.
We attribute this effect to the quality of self-generated supervision, which is inherently bounded by the model’s intrinsic capability. As shown in Figure 9, the 14B model achieves a significantly higher yield rate of valid meta-experiences than its smaller counterparts. While limited-capacity models introduce noise due to imprecise error attribution, larger models benefit from stronger self-verification, enabling the distillation of high-quality heuristics that provide more accurate gradient signals and fully realize the potential of our framework.
5 Conclusion
In this paper, we introduced MEL, a novel framework designed to overcome the meta-learning bottleneck in standard RLVR by transforming instance-specific failure patterns into reusable cognitive assets. Unlike traditional methods that rely solely on outcome-oriented rewards, MEL empowers models to perform granular error attribution, distilling specific failure modes into natural language heuristics—termed Meta-Experiences. By internalizing these experiences into parametric memory, our approach bridges the gap between verifying a solution and understanding the underlying reasoning logic. Extensive empirical evaluations confirm that MEL consistently boosts mathematical reasoning across diverse model scales.
Impact Statement
This paper presents research aimed at advancing the field of reinforcement learning. While our work may have broader societal implications, we do not identify any specific impacts that require particular attention at this stage.
References
- Y. Cai, S. Cai, Y. Shi, Z. Xu, L. Chen, Y. Qin, X. Tan, G. Li, Z. Li, H. Lin, et al. (2025) Training-free group relative policy optimization. arXiv preprint arXiv:2510.08191.
- J. Chen, Q. He, S. Yuan, A. Chen, Z. Cai, W. Dai, H. Yu, Q. Yu, X. Li, J. Chen, et al. (2025) Enigmata: scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914.
- J. Cheng, G. Xiong, R. Qiao, L. Li, C. Guo, J. Wang, Y. Lv, and F. Wang (2025) Stop summation: min-form credit assignment is all process reward model needs for reasoning. arXiv preprint arXiv:2504.15275.
- Y. Fu, T. Chen, J. Chai, X. Wang, S. Tu, G. Yin, W. Lin, Q. Zhang, Y. Zhu, and D. Zhao (2025) SRFT: a single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. (2024) Olympiadbench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the Association for Computational Linguistics, pp. 3828–3850.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
- J. Hu (2025) Reinforce++: a simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262.
- S. Huang, Z. Fang, Z. Chen, S. Yuan, J. Ye, Y. Zeng, L. Chen, Q. Mao, and F. Zhao (2025) CRITICTOOL: evaluating self-critique capabilities of large language models in tool-calling error scenarios. arXiv preprint arXiv:2506.13977.
- W. Huang, Y. Zeng, Q. Wang, Z. Fang, S. Cao, Z. Chu, Q. Yin, S. Chen, Z. Yin, L. Chen, et al. (2026) Vision-deepresearch: incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060.
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720.
- M. Khalifa, R. Agarwal, L. Logeswaran, J. Kim, H. Peng, M. Lee, H. Lee, and L. Wang (2025) Process reward models that think. arXiv preprint arXiv:2504.16828.
- N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2024) Tulu 3: pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
- J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen, et al. (2024) Numinamath: the largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository 13 (9), pp. 9.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In Proceedings of the International Conference on Learning Representations.
- K. Liu, D. Yang, Z. Qian, W. Yin, Y. Wang, H. Li, J. Liu, P. Zhai, Y. Liu, and L. Zhang (2025) Reinforcement learning meets large language models: a survey of advancements and applications across the llm lifecycle. arXiv preprint arXiv:2509.16679.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al. (2025) Reasoningbank: scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140.
- B. Pan and L. Zhao (2025) Can past experience accelerate llm reasoning? arXiv preprint arXiv:2505.20643.
- W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike (2022) Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024) HybridFlow: a flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.
- G. Swamy, S. Choudhury, W. Sun, Z. S. Wu, and J. A. Bagnell (2025) All roads lead to likelihood: the value of reinforcement learning in fine-tuning. arXiv preprint arXiv:2503.01067.
- Q. Wang, R. Ding, Y. Zeng, Z. Chen, L. Chen, S. Wang, P. Xie, F. Huang, and F. Zhao (2025) VRAG-rl: empower vision-perception-based rag for visually rich information understanding via iterative reasoning with reinforcement learning. arXiv preprint arXiv:2505.22019.
- R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025) Evolver: self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079.
- M. Wulfmeier, M. Bloesch, N. Vieillard, A. Ahuja, J. Bornschein, S. Huang, A. Sokolov, M. Barnes, G. Desjardins, A. Bewley, et al. (2024) Imitating language via scalable inverse reinforcement learning. Advances in Neural Information Processing Systems 37, pp. 90714–90735.
- G. Xie, Y. Shi, H. Tian, T. Yao, and X. Zhang (2025) Capo: towards enhancing llm reasoning through verifiable generative credit assignment. arXiv e-prints, pp. arXiv–2508.
- J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025) Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- Y. Zeng, W. Huang, Z. Fang, S. Chen, Y. Shen, Y. Cai, X. Wang, Z. Yin, L. Chen, Z. Chen, et al. (2026) Vision-deepresearch benchmark: rethinking visual and textual search for multimodal large language models. arXiv preprint arXiv:2602.02185.
- Y. Zeng, W. Huang, S. Huang, X. Bao, Y. Qi, Y. Zhao, Q. Wang, L. Chen, Z. Chen, H. Chen, et al. (2025a) Agentic jigsaw interaction learning for enhancing visual perception and reasoning in vision-language models. arXiv preprint arXiv:2510.01304.
- Y. Zeng, Y. Qi, Y. Zhao, X. Bao, L. Chen, Z. Chen, S. Huang, J. Zhao, and F. Zhao (2025b) Enhancing large vision-language models with ultra-detailed image caption generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 26703–26729.
- K. Zhang, X. Chen, B. Liu, T. Xue, Z. Liao, Z. Liu, X. Wang, Y. Ning, Z. Chen, X. Fu, et al. (2025a) Agent learning via early experience. arXiv preprint arXiv:2510.08558.
- K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P. Li, et al. (2025b) A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827.
- K. Zhang, A. Lv, J. Li, Y. Wang, F. Wang, H. Hu, and R. Yan (2025c) StepHint: multi-level stepwise hints enhance reinforcement learning to reason. arXiv preprint arXiv:2507.02841.
- X. Zhang, S. Wu, Y. Zhu, H. Tan, S. Yu, Z. He, and J. Jia (2025d) Scaf-grpo: scaffolded group relative policy optimization for enhancing llm reasoning. arXiv preprint arXiv:2510.19807.
- C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
Appendix A Results of Performance Evolution
As illustrated in Figures 6, 7, and 8, we visualize the performance evolution of models at different scales (Qwen3-4B-Base, Qwen3-8B-Base, and Qwen3-14B-Base) across multiple benchmarks throughout training. MEL consistently outperforms standard GRPO in average performance across all benchmarks.
[Figure 6 panels: validation score vs. training step (0–140) for GRPO and MEL on AIME24, AIME25, AMC23, MATH500, OlympiadBench, and the benchmark average. Across panels, MEL starts near GRPO's level, fluctuates during training, and surpasses GRPO in the later stages, reaching higher peak scores and a higher final benchmark average.]
Figure 6: Performance evolution of GRPO and MEL on Qwen3-4B-Base across training steps on multiple benchmarks.
[Figure panel: validation score vs. training step (0–140) for GRPO and MEL on AIME24. Both models improve to roughly 0.27 by step 40; MEL remains ahead of GRPO over steps 100–140, ending near 0.28.]
[Additional chart panels: GRPO vs. MEL validation score (y-axis) across training steps (x-axis) on AIME25, AMC23, MATH500, OlympiadBench, and the benchmark average.]
Figure 7: Performance evolution of GRPO and MEL on Qwen3-8B-Base across training steps on multiple benchmarks.
[Chart panels: GRPO vs. MEL validation score (y-axis) across training steps (x-axis) on AIME24, AIME25, AMC23, MATH500, and OlympiadBench.]
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: Benchmark Average
### Overview
The image presents a line chart comparing the validation score of two methods, GRPO and MEL, across 140 training steps. The chart visualizes the performance of each method during the training process, showing how the validation score changes with each step.
### Components/Axes
* **Title:** "Benchmark: Average" - positioned at the top-center of the chart.
* **X-axis:** "Training Step" - ranging from 0 to 140, with tick marks at intervals of 20.
* **Y-axis:** "Validation Score" - ranging from approximately 0.42 to 0.62, with tick marks at intervals of 0.025.
* **Legend:** Located in the top-right corner of the chart.
* GRPO - represented by a light blue line with circular markers.
* MEL - represented by a light red/pink line with triangular markers.
* **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values.
### Detailed Analysis
**GRPO (Light Blue Line):**
The GRPO line generally slopes upward, indicating an increasing validation score over the training steps.
* At Training Step 0, the Validation Score is approximately 0.43.
* At Training Step 20, the Validation Score drops to approximately 0.38.
* At Training Step 40, the Validation Score rises to approximately 0.48.
* At Training Step 60, the Validation Score is approximately 0.53.
* At Training Step 80, the Validation Score dips to approximately 0.52.
* At Training Step 100, the Validation Score reaches approximately 0.56.
* At Training Step 120, the Validation Score is approximately 0.57.
* At Training Step 140, the Validation Score decreases to approximately 0.55.
**MEL (Light Red/Pink Line):**
The MEL line also generally slopes upward, but with more fluctuations.
* At Training Step 0, the Validation Score is approximately 0.32.
* At Training Step 20, the Validation Score rises to approximately 0.52.
* At Training Step 40, the Validation Score is approximately 0.55.
* At Training Step 60, the Validation Score is approximately 0.57.
* At Training Step 80, the Validation Score dips to approximately 0.53.
* At Training Step 100, the Validation Score rises to approximately 0.56.
* At Training Step 120, the Validation Score is approximately 0.59.
* At Training Step 140, the Validation Score reaches approximately 0.61.
### Key Observations
* The MEL model consistently achieves higher validation scores than the GRPO model throughout the training process.
* Both models exhibit some degree of fluctuation in their validation scores, suggesting that the training process is not perfectly smooth.
* The GRPO model experiences a significant drop in validation score at Training Step 20, followed by a recovery.
* The MEL model shows a more consistent upward trend, with smaller fluctuations.
* Both models appear to be converging towards a stable validation score as the training progresses, but the MEL model is still improving at the final step.
### Interpretation
The chart demonstrates the learning curves of two methods, GRPO and MEL, during a training process. The validation score serves as a metric for evaluating the model's performance on unseen data. The consistently higher validation scores of MEL suggest that it is the better-performing method, at least for this benchmark. The fluctuations in the validation scores indicate that the training process is sensitive to the specific training data and that further optimization may be possible. The convergence of the learning curves towards the end of the training process suggests that both methods are approaching their maximum performance potential. The initial dip in GRPO's performance could be due to a challenging initial batch of data or a learning rate adjustment. The continued improvement of MEL at the final step suggests that it may benefit from further training.
</details>
Figure 8: Performance evolution of GRPO and MEL on Qwen3-14B-Base across training steps on multiple benchmarks.
Appendix B Retention Ratio of Meta-Experience
Through empirical validation via replay, MEL is able to collect high-quality meta-experiences. To examine the utilization of collected meta-experiences, Figure 9 reports the retention ratio of meta-experiences after empirical validation throughout training. We observe that the retention ratio consistently increases with model scale, indicating that larger models are more effective at abstracting high-quality knowledge into meta-experiences, thereby achieving higher retention.
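As a concrete sketch of how this statistic could be computed (the function and variable names below are illustrative assumptions, not the paper's released code), the retention ratio is the fraction of candidate meta-experiences that survive empirical validation, i.e., that still yield a correct answer when the problem is replayed with the experience in context:

```python
# Hypothetical sketch of the retention-ratio statistic in Figure 9.
# `solves_with_experience` stands in for the replay check: it returns True
# if replaying the problem with the candidate experience in the prompt
# produces a verifiably correct answer.

def retention_ratio(candidates, solves_with_experience):
    """Fraction of candidate meta-experiences retained after
    empirical validation via replay."""
    if not candidates:
        return 0.0
    retained = [c for c in candidates if solves_with_experience(c)]
    return len(retained) / len(candidates)

# Toy usage: 3 of 4 candidate experiences pass the replay check.
candidates = ["exp_a", "exp_b", "exp_c", "exp_d"]
passing = {"exp_a", "exp_b", "exp_d"}
print(retention_ratio(candidates, lambda c: c in passing))  # 0.75
```

Under this reading, the upward trend for larger models in Figure 9 means a growing share of their self-distilled experiences pass the replay check.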
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Retention Ratio vs. Training Steps for Qwen Models
### Overview
This line chart depicts the retention ratio of three Qwen language models (Qwen-4B-Base, Qwen-8B-Base, and Qwen-14B-Base) over 120 training steps. The chart visualizes how well each model retains information during training, with the retention ratio ranging from approximately 0.0 to 0.8. Each line also includes a shaded region representing the standard deviation around the mean retention ratio.
### Components/Axes
* **X-axis:** Training Steps (ranging from 0 to 120, with markers at intervals of 20)
* **Y-axis:** Retention Ratio (ranging from 0.0 to 0.8, with markers at intervals of 0.2)
* **Legend:** Located in the top-left corner, identifying the three data series:
* Qwen-4B-Base (represented by a dotted blue line)
* Qwen-8B-Base (represented by a dashed red line)
* Qwen-14B-Base (represented by a solid red line)
### Detailed Analysis
* **Qwen-4B-Base (Blue, Dotted):** This line starts at approximately 0.25 at step 0 and generally trends downward, reaching a minimum of around 0.18 at step 40. It then fluctuates between approximately 0.2 and 0.3, ending at around 0.28 at step 120. The shaded region around the line indicates a relatively small standard deviation.
* **Qwen-8B-Base (Red, Dashed):** This line begins at approximately 0.38 at step 0 and exhibits an upward trend until around step 60, reaching a peak of approximately 0.52. After step 60, it fluctuates, generally decreasing to around 0.45 at step 120. The shaded region is wider than that of the 4B model, indicating a larger standard deviation.
* **Qwen-14B-Base (Red, Solid):** This line shows the most significant upward trend. Starting at approximately 0.42 at step 0, it consistently increases, reaching approximately 0.75 at step 100. It then plateaus and slightly decreases to around 0.72 at step 120. The shaded region is relatively wide, indicating a substantial standard deviation, particularly in the earlier training steps.
### Key Observations
* The Qwen-14B-Base model consistently demonstrates the highest retention ratio throughout the training process.
* The Qwen-4B-Base model exhibits the lowest retention ratio and a generally decreasing trend.
* The Qwen-8B-Base model shows an initial increase in retention ratio, followed by stabilization and slight decline.
* The standard deviation is largest for the Qwen-14B-Base model, suggesting greater variability in its retention performance.
### Interpretation
The data suggests a strong correlation between model size and retention ratio. Larger models (14B parameters) exhibit significantly better retention capabilities than smaller models (4B and 8B parameters). The initial increase in retention for the 8B model could indicate a period of rapid learning, followed by saturation. The consistently high retention of the 14B model suggests it is better equipped to capture and retain information during training. The standard deviations indicate that the 14B model's performance is more variable, potentially due to its increased complexity and capacity. This variability could be a result of the model being more sensitive to the specific training data or hyperparameters. The downward trend of the 4B model suggests it may be struggling to learn effectively or is prone to overfitting. The chart demonstrates the importance of model capacity in achieving high retention rates during language model training.
</details>
Figure 9: Dynamics of the retention ratio of MEL across different model scales.
Appendix C Prompt Template
We use the same prompt template for all models. Details of the prompts used for meta-experience construction and for empirical validation via replay are shown below.
Meta-Experience Prompt
You are a Meta-Cognitive Reasoning Analyst specializing in self-reflection, error root-cause analysis, and the extraction of generalizable heuristics. You are provided with multiple solution trajectories for the same problem. Note that the labels Correct or Incorrect apply to the final answer, but the reasoning process itself may contain twists and turns. Your task is to conduct a deep comparative autopsy of the thinking processes. You must identify the structural differences in cognition that led to success or failure, and synthesize these into abstract principles for future use.
Core Analysis Requirements:
1. Deep Dive into Correct Trajectories (Resilience & Robustness Analysis):
   * Scenario A (Self-Correction): If you find the reasoning contains initial errors or uncertainties, look for moments of self-correction. What triggered the realization? What structural insight allowed the reasoning to pivot back to the right track?
   * Scenario B (Flawless Execution): When every step of the reasoning is correct from the start, identify the Foundational "Immunity". What specific definition, clear knowledge representation, or disciplined step-by-step verification prevented this Agent from falling into the traps that the Incorrect Agent fell into?
   * Goal: Extract the specific logic validation technique or robust mental representation that saved the solution.
2. Deep Dive into Incorrect Trajectories (Vulnerability Analysis):
   * You must identify not only where the math/logic went wrong, but also why the reasoning drifted.
   * Identify: The "Bifurcation Point" where a correct start turned into a hallucination or logic gap.
   * Analyze: The latent cognitive defect (e.g., concept conflation, rigid mindset, overlooking edge cases, intuitive bias) that caused the error.
   * Identify: What specific knowledge point or constraint was violated?
3. Comparative Synthesis:
   * Contrast the Solutions and Decision Boundaries. Why did the successful trajectory avoid the trap that the failed one fell into?
   * What structural insight did the winner have that the loser missed? (e.g., The winner treated the problem as a geometric issue, while the loser got stuck in algebra.)
4. Strict Generalization Constraint:
   * Forbidden: Do NOT mention the specific numbers, variables, or exact answer of the current problem in your "Heuristics" or "Reflective Summary".
   * Required: Convert specific lessons into abstract heuristics (e.g., instead of "The integral of $x^{2}$ is…", use "When integrating polynomial functions…"). Formulate them as conditionally triggered rules ("If…Then…", "When dealing with [Concept X]…I should…").
Output Format (Strict Adherence Required)
1. Failure Resolution Path & Error Pattern Recognition (Mandatory for incorrect samples)
   * Failure Point: Identify the exact step where logic diverged. Did it start correctly? Where did the drift happen?
   * Latent Cognitive Pattern: Reveal the deep-seated reasoning defect. Was it a bias? A missing prerequisite? A misunderstanding of the prompt's intent? Do not list surface-level calculation errors.
2. Analysis of Success Factors (Mandatory for correct samples)
   * Reasoning Pivot (If applicable): If the path involved self-correction, describe the moment of realization and the strategy used to fix it.
   * Robustness Factor (If flawless): If the path was perfect, explain the fundamental concept or structural approach that effectively "immunized" the reasoning against common errors.
   * Reason for Effectiveness: Why did this perspective work? What fundamental logic did it satisfy?
3. First-Person Reflective Summary (Mandatory)
   Write a meta-cognitive reflection from the first-person perspective ("I").
   * Review: Briefly review the thinking process differences.
   * Insight: Discuss the specific knowledge point or cognitive habit that was critical.
   * Action: Explain how you will restructure your approach to avoid the identified "Internal Reasoning Defects" in the future. Focus on the "How" of thinking, not the "What" of the answer.
4. Subject Heuristics (Internalized Experience) (Mandatory)
   * [Pattern Name]: If [abstract condition] occurs, then [abstract action]…
   * [Pattern Name]: When dealing with [concept type], I must strictly verify [constraint]…
   (Note: These must be applicable to *future* problems of a similar class, completely stripped of this problem's specifics.)
Here are the question and the corresponding solutions.
`<question>` {question} `</question>`
Solution 1: `<answer>` {error_ans} `</answer>` `<judge>` Incorrect `</judge>`
Solution 2: `<answer>` {correct_ans} `</answer>` `<judge>` Correct `</judge>`
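The template above can be instantiated with ordinary string substitution. The sketch below is illustrative only: the abbreviated template text and the function name are assumptions, not the paper's actual code, and the system instructions are elided.

```python
# Hypothetical sketch: filling the meta-experience prompt with one paired
# incorrect/correct trajectory via Python str.format substitution.

META_EXPERIENCE_TEMPLATE = (
    "You are a Meta-Cognitive Reasoning Analyst...\n"  # system text abbreviated
    "<question> {question} </question>\n"
    "Solution 1: <answer> {error_ans} </answer> <judge> Incorrect </judge>\n"
    "Solution 2: <answer> {correct_ans} </answer> <judge> Correct </judge>\n"
)

def build_meta_experience_prompt(question: str, error_ans: str,
                                 correct_ans: str) -> str:
    # The incorrect trajectory goes in slot 1, the correct one in slot 2,
    # matching the Solution 1 / Solution 2 ordering of the template.
    return META_EXPERIENCE_TEMPLATE.format(
        question=question, error_ans=error_ans, correct_ans=correct_ans)

prompt = build_meta_experience_prompt("Compute 2+2.", "2+2=5", "2+2=4")
print("<judge> Incorrect </judge>" in prompt)  # True
```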
Empirical Validation Prompt
Prior study has provided some internal reference information relevant to this question, including the key approaches, steps, and reasoning needed for a correct solution; the typical reasoning biases, logical flaws, or pitfalls that appear in incorrect solutions; and various heuristic insights on how to complete this problem more effectively.
{experience}
Now, please fully internalize this information as your own experience, then independently think through the problem in detail and produce a complete answer.
Note:
* You must perform full, in-depth reasoning internally and arrive at the final answer while making full use of the information above.
Answer the following question: {question}
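The replay prompt is instantiated the same way: a candidate meta-experience fills the {experience} slot and the original problem fills {question}. The sketch below is an illustrative assumption (abbreviated template, hypothetical function name), not the paper's code.

```python
# Hypothetical sketch: assembling the empirical-validation (replay) prompt,
# which presents a candidate meta-experience as context before the question.

VALIDATION_TEMPLATE = (
    "Prior study has provided some internal reference information relevant "
    "to this question... {experience} Now, please fully internalize this "
    "information as your own experience, then independently think through "
    "the problem in detail and produce a complete answer. "
    "Answer the following question: {question}"
)

def build_validation_prompt(experience: str, question: str) -> str:
    return VALIDATION_TEMPLATE.format(experience=experience, question=question)

prompt = build_validation_prompt(
    "When integrating polynomial functions, verify the exponent rule first.",
    "Compute the integral of x^2 from 0 to 1.")
```

The model's answer to this prompt is then checked by the verifier; only experiences that lead to a correct replayed answer are retained, which is the quantity tracked in Figure 9.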