## Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels
## Micah Rentschler 1 Jesse Roberts 2
## Abstract
Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using reward derived from an evaluator's answers to natural-language meta-questions (e.g., 'Is the answer correct?' or 'Is the reasoning logically consistent?'). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs may be trained with RL.
## 1. Introduction
Reinforcement learning (RL) is widely used to align large language models (LLMs) with human preferences or verifiable task outcomes, as in Reinforcement Learning from Human Feedback (RLHF) (Kaufmann et al., 2024) and Reinforcement Learning from Verified Rewards (RLVR) (Wen et al., 2025; Yue et al., 2025). These methods work well when high-quality rewards exist, but such signals are costly: human feedback does not scale, and automatic verifiers are typically narrow and domain-specific. In many realistic
1 Department of Computer Science, Vanderbilt University, Nashville TN, USA 2 Department of Computer Science, Tennessee Technological University, Cookeville TN, USA. Correspondence to: Micah Rentschler <micah.d.rentschler@vanderbilt.edu>.
Figure 1. Overview of RLME. After generating an answer, one or more evaluators (which may be the same model) assign probabilities to natural-language meta-questions about the output. These probabilities are aggregated into a scalar reward, which is then used to update the generative policy via reinforcement learning. This allows models to be tuned even when ground-truth answers are unavailable.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Reinforcement Learning Loop with GRPO Trainer
### Overview
This diagram illustrates a reinforcement learning loop involving a GRPO (Group Relative Policy Optimization) Trainer, a model, and a user interaction. The loop demonstrates a question-answer process where the model receives feedback and updates its parameters.
### Components/Axes
The diagram consists of the following components:
* **User:** Represented by a grey head icon, posing the question.
* **Model:** Represented by a robot icon, providing the answer.
* **GRPO Trainer:** A rectangular box labeled "GRPO Trainer" in red text.
* **Arrows:** Indicate the flow of information and feedback.
* **Text Boxes:** Contain the question, answer, and feedback.
* **Syringe Icon:** Represents the question of answer correctness.
* **Probability Indicator:** "Probability = 91%" associated with the "Yes" response.
### Detailed Analysis or Content Details
The diagram depicts the following sequence:
1. **Question:** A user asks, "What is the sixth number in the Fibonacci sequence?". (Top of the image)
2. **Model Response:** The model responds with the text: "The Fibonacci sequence begins with 0 and 1, and each subsequent number is the sum of the two preceding numbers: 0, 1, 1, 2, 3, 5... The answer is 5." (Top-right)
3. **Correctness Check:** A syringe icon asks, "Is the answer correct?". (Center)
4. **Positive Feedback:** The model responds "Yes" with a probability of 91%. (Bottom-right)
5. **Reward & Update:** A "Reward" signal is sent from the model to the "GRPO Trainer" (bottom-left). The GRPO Trainer then sends an "Update Model" signal back to the model (left side).
The arrows indicate the following flow:
* User -> Model (Question)
* Model -> User (Answer)
* Model -> GRPO Trainer (Answer & Feedback)
* GRPO Trainer -> Model (Update Model)
* Model -> GRPO Trainer (Reward)
### Key Observations
* The diagram highlights a closed-loop system where the model learns from feedback.
* The GRPO Trainer plays a central role in updating the model based on the reward signal.
* The probability of 91% suggests a high degree of confidence in the model's answer.
* The diagram does not provide any numerical data beyond the probability value.
### Interpretation
This diagram illustrates a simplified reinforcement learning process. The GRPO Trainer acts as the learning algorithm, adjusting the model's parameters based on the reward received for providing correct answers. The 91% probability indicates that the model is performing well on this particular task. The diagram demonstrates how a model can improve its performance through iterative feedback and updates. The use of a Fibonacci sequence question suggests the model is capable of mathematical reasoning. The diagram is a conceptual illustration of the process rather than a presentation of specific data or results. It focuses on the *flow* of information and the *roles* of the different components.
</details>
settings, ground truth may be uncertain or unavailable.
A promising alternative is to have the model itself, or another model, evaluate the response. Prior work leverages model likelihoods of known correct answers as a proxy reward (Zhou et al., 2025; Yu et al., 2025), but still requires ground-truth labels during training.
We instead explore whether models can learn from evaluations provided by an LLM acting as evaluator, without access to ground-truth labels. To steer the evaluations, we use natural-language prompts, applicable across an entire dataset, that assess high-level properties of an output; we refer to these as *meta-questions*. For example, 'Is the answer 5?' targets a particular problem, whereas 'Is the answer correct?' is a broadly applicable meta-question. Meta-questions are cheap to write, transferable across domains, and empower LLMs to embody heuristics that are difficult to hard-code. This shifts the problem from engineering a reward function or hand-labeling a large dataset to designing meta-questions that elicit the desired behavior.
We introduce Reinforcement Learning from Meta-Evaluation (RLME), illustrated in Figure 1, and show that it provides results similar to an RLVR baseline without ground-truth labels. However, meta-evaluation introduces new risks. The model being trained, referred to as the generator, may produce outputs that satisfy the evaluator without genuinely solving the task. The central challenge is therefore to determine when meta-evaluation provides a reliable signal and how to mitigate its failure modes. To this end, we contribute the following:
- RLME, a scalable framework that guides modern GRPO-style policy-gradient updates with rewards based on the aggregate probability of target answers to evaluation meta-questions;
- Empirical evidence that meta-evaluation is competitive with explicit verifiers in reasoning-heavy domains;
- A broad analysis of generator and evaluator choice, self-evaluation, and reward hacking, clarifying both the strengths and failure modes of meta-evaluation;
- Examples of multi-objective language-driven control;
- ⋆ Evidence that RLME training encourages contextual faithfulness, generalizing the improved ability to an out-of-distribution dataset.
## 2. Related Work
Our work connects to several research directions in alignment and reinforcement learning for language models.
RLHF and preference-based optimization. RLHF optimizes models using human preference data with PPO-style updates (Kaufmann et al., 2024; Ouyang et al., 2022; Schulman et al., 2017). While this early work was successful and influential, human preference data is expensive and introduces biases such as sycophancy (Sharma et al., 2025).
RL from verifiable or probabilistic correctness signals. RLVR-style methods optimize rewards derived from correctness verifiers when ground-truth is available but precise human preference is not (Wen et al., 2025; Shao et al., 2024; Guo et al., 2025). VeriFree (Zhou et al., 2025) and RLPR (Yu et al., 2025) further this by using the model's own likelihood of the correct answer as a proxy reward, but critically, they still require access to ground truth labels.
LLM-as-judge and AI feedback. To address the cost of human annotation entirely, RL with AI feedback (RLAIF) methods leverage LLMs as preference evaluators, attempting to replace the preferences that human evaluators would assign with those from an LLM (Zheng et al., 2023; Gu et al., 2024; Lee et al., 2024; Yuan et al., 2024). All of these attempt to predict preference over a number of candidate responses. Such approaches can inherit biases from human raters when preferences are directly modeled, and their applicability is limited in domains where preference is ill-defined. In contrast to preference-based methods, Zhao et al. (2025) use an internal measure of certainty as a reward; however, this limits the approach to maximizing self-certainty.
Flexible evaluation. Prior work has applied reinforcement learning to refine LLM behavior using a variety of feedback signals, but these approaches typically require substantial supervision or are limited to fixed objectives. Reinforcement Learning Contemplation (RLC) (Pang et al., 2024) introduces a flexible evaluation paradigm in which a frozen copy of the model provides self-critique over its own generations using Likert-style judgments, optimized with a PPO objective. While RLC demonstrates the promise of flexible, model-based evaluation, its performance relative to explicit reward supervision (e.g., RLVR) has not been systematically studied, nor have the robustness and failure modes of such self-evaluated reward signals.
Situating RLME in the literature. RLME removes the dependence on ground-truth labels and avoids modeling human preferences directly by improving on and generalizing the RLC evaluation approach.
In place of the Likert evaluation, RLME employs an evaluation approach previously used to study LLM actions in formal games, referred to as counterfactual prompting (Roberts et al., 2025). The RLME evaluator model predicts whether the generator's response agrees with one or more stated criteria, which we refer to as *meta-evaluations*. The evaluator's probability of producing a target response sequence is directly incorporated as a reward signal into the GRPO update in place of RLC's PPO objective.
RLME generalizes RLC in that RLME optimizes the target model using a frozen copy of itself, a continually updated self, a frozen other model, or an ensemble as the evaluator. It is compared against the powerful RLVR method, which benefits from labeled data, as a baseline. Most importantly, our work on RLME extends the understanding of flexible evaluation by studying multi-objective optimization, the propensity to reward hack, and out-of-distribution generalization.
Finally, our work was developed concurrently with a recent preprint from DeepSeek (Shao et al., 2025). The two efforts are entirely independent, though the described approaches have similarities.
## 3. Methodology
After generating a response, one or more evaluators predict the probability of a target answer to natural-language meta-questions. Their probabilities are aggregated into a scalar reward, which is used to update the generator via a group-relative policy-gradient objective.
## 3.1. Assessment Prompting
Given a prompt $x \sim \mathcal{D}$, where $\mathcal{D}$ is a dataset of prompts containing problems for the generator to solve, the generator produces a response

$$
y \sim \pi_\theta(\cdot \mid x), \tag{1}
$$

where $\pi_\theta$ is the generator's policy.
Evaluators $\{\pi_{\phi_j}\}_{j=1}^{J}$ are then queried with meta-questions $Q = \{q_1, \ldots, q_K\}$, developed by humans to target desired behavior, such as 'Is the answer correct?'. Each meta-question $q_k$ has a target answer sequence $a_k$, and evaluator $j$ assigns probability

$$
p_{j,k}(x, y) = \pi_{\phi_j}\!\left(a_k \mid x, y, q_k\right). \tag{2}
$$
Rewards are computed by first weighting meta-questions with $\{w_k\}$, then weighting evaluators with $\{v_j\}$:

$$
r(x, y) = \sum_{j=1}^{J} v_j \sum_{k=1}^{K} w_k \log p_{j,k}(x, y). \tag{3}
$$
Like the meta-questions themselves, $\{w_k\}$ and $\{v_j\}$ are fixed hyperparameters defined by an expert with domain knowledge to steer the algorithm toward desired outcomes.
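As a concrete sketch, the aggregation in Eq. (3) is a weighted sum of log-probabilities over meta-questions and evaluators. The function name and example weights below are illustrative; the log form follows the reward description in Section 4.1.

```python
import math

def rlme_reward(eval_probs, question_weights, evaluator_weights):
    """Aggregate evaluator probabilities of the target answers into a
    scalar reward: meta-questions are weighted by w_k and evaluators
    by v_j, and each probability enters through its logarithm."""
    reward = 0.0
    for v_j, probs_j in zip(evaluator_weights, eval_probs):
        for w_k, p_jk in zip(question_weights, probs_j):
            reward += v_j * w_k * math.log(p_jk)
    return reward

# One evaluator, one meta-question ("Is the answer correct?") whose
# target answer "Yes" received probability 0.91:
r = rlme_reward([[0.91]], question_weights=[1.0], evaluator_weights=[1.0])
```

With a single evaluator and meta-question, the reward reduces to the log-probability of the target answer, matching the setup used in the first experiment.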
## 3.2. Reinforcement Learning
RLME maximizes the expected meta-evaluation reward:
$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]. \tag{4}
$$
We adopt a Group Relative Policy Optimization (GRPO)style update (Shao et al., 2024).
$$
\hat{A}_i = r(x, y_i) - \bar{r}, \tag{5}
$$

where $\bar{r}$ is the mean reward over the sampled group. Unlike GRPO, we do not scale by the standard deviation because it introduces a question-level difficulty bias (Liu et al., 2025).
For off-policy updates (where the policy being updated is in transition and may no longer precisely match the behavioral policy that generated the response), trajectories are sampled from the behavioral policy $\pi_b$. The ratio of the current policy to the behavioral policy is the importance ratio:

$$
\rho_i(\theta) = \frac{\pi_\theta(y_i \mid x)}{\pi_b(y_i \mid x)}. \tag{6}
$$
As suggested by Zheng et al., we use a sequence-level importance ratio to reduce high-variance noise in training.

We use Clipped IS-weight Policy Optimization (CISPO) (MiniMax et al., 2025), a variant of GRPO. For CISPO, the importance-sampling ratio is clipped:
$$
\hat{\rho}_i(\theta) = \operatorname{clip}\!\left(\rho_i(\theta),\; 1 - \varepsilon_{\mathrm{low}},\; 1 + \varepsilon_{\mathrm{high}}\right). \tag{7}
$$
This ratio is then used in the final loss, where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator:

$$
\mathcal{L}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \operatorname{sg}\!\left(\hat{\rho}_i(\theta)\right) \hat{A}_i \log \pi_\theta(y_i \mid x). \tag{8}
$$
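The update above can be sketched numerically. The helper below computes the per-rollout coefficient that multiplies each sequence log-probability in the loss: the mean-centred group advantage times the clipped sequence-level importance ratio (clipped values are treated as constants by a stop-gradient in an autodiff framework). The clipping thresholds here are illustrative assumptions, not the paper's reported values.

```python
import numpy as np

def cispo_coefficients(rewards, log_ratios, eps_low=0.2, eps_high=0.2):
    """Per-rollout coefficients of log pi_theta(y_i | x) in a CISPO-style
    loss: group-relative advantages (mean-centred, with no std scaling)
    multiplied by the clipped sequence-level importance ratio."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()        # A_i = r_i - r_bar
    ratios = np.exp(np.asarray(log_ratios, dtype=float))  # pi_theta / pi_b
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high)
    return clipped * advantages

# On-policy group of 4 rollouts (all log-ratios zero, so ratios = 1):
coef = cispo_coefficients([1.0, 0.0, 1.0, 0.0], [0.0] * 4)
```

On-policy, the coefficients reduce to the mean-centred rewards themselves; off-policy, the clip bounds how far stale samples can push the update.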
## 3.3. Algorithm Summary
Each RLME step consists of:
1. Generate responses with π θ (Eq. 1).
2. Evaluate responses using meta-questions to obtain r ( x, y ) (Eq. 3).
3. Update π θ using the CISPO loss (Eq. 8).
By selecting different meta-questions and weights, the evaluating model helps RLME align the generating model without requiring ground-truth labels.
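The three steps above can be sketched as a single training-step function. `generator.sample`, `evaluator.prob`, and `trainer.update` are hypothetical interfaces standing in for the actual rollout and CISPO machinery, and the single target answer 'Yes' is an illustrative simplification.

```python
import math

def rlme_step(prompt, generator, evaluator, meta_questions, trainer, group_size=8):
    """One RLME update over a single prompt, using hypothetical helpers:
    generator.sample(x, n) -> n sampled responses;
    evaluator.prob(x, y, q, a) -> probability of target answer a;
    trainer.update(x, ys, rs) -> policy-gradient step on pi_theta."""
    responses = generator.sample(prompt, group_size)              # 1. generate (Eq. 1)
    rewards = [                                                   # 2. meta-evaluate (Eq. 3)
        sum(math.log(evaluator.prob(prompt, y, q, "Yes")) for q in meta_questions)
        for y in responses
    ]
    trainer.update(prompt, responses, rewards)                    # 3. CISPO update (Eq. 8)
    return rewards
```

The loop never consults a label: the only supervision is the evaluator's probability of the target answers to the meta-questions.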
## 4. Experiments
We empirically evaluate RLME and compare it against an RLVR baseline. We begin with grade-school mathematics, where correctness is fully verifiable, and then move to more open-ended domains. Through a series of experiments, we investigate and assess core questions about our approach.
Complete experimental details for reproduction (optimization hyperparameters, learning rates, batch sizes, and training schedules) are reported in Appendix A. Appendix B contains the exact prompts for the generator and evaluator. Finally, qualitative examples of various responses may be found in Appendix C.
## 4.1. Can we improve accuracy via meta-questions?
Question. Our first experiment tests whether a single, general meta-question can provide a reward signal strong enough to improve mathematical accuracy without access to ground truth.
Method. We initialize the generator from Qwen3-4B-Base (Yang et al., 2025) and train on GSM8K (Cobbe et al., 2021), prompting the model to produce a solution and to place its final answer inside \boxed{} so that the answer can be extracted with a fixed regex.
To compute accuracy (and the RLVR reward), we parse each completion using a fixed regex (Appendix A) that selects the
Figure 2. Comparison of RLME to an RLVR baseline that has access to ground-truth answers. Both methods rapidly exceed 90% accuracy on GSM8K, and RLME closely tracks RLVR despite never observing the true answer.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Step
### Overview
This image presents a line chart comparing the accuracy of two methods, RLVR and RLME, over a series of steps. The chart displays the accuracy on the y-axis and the step number on the x-axis. A shaded region around each line indicates the variance or confidence interval.
### Components/Axes
* **X-axis:** "Step" ranging from approximately 0 to 120.
* **Y-axis:** "Accuracy" ranging from approximately 0.2 to 1.0.
* **Legend:** Located in the top-right corner.
* "RLVR" - Represented by a dotted gray line.
* "RLME" - Represented by a solid blue line.
### Detailed Analysis
* **RLVR (dotted gray line):** The line starts at approximately 0.38 accuracy at Step 0. It increases rapidly, reaching approximately 0.85 accuracy around Step 20. It continues to increase, peaking at approximately 0.92 accuracy around Step 60. After Step 60, the line fluctuates between approximately 0.88 and 0.92 accuracy until Step 120. The shaded region around the line indicates a variance of approximately +/- 0.04.
* **RLME (solid blue line):** The line starts at approximately 0.38 accuracy at Step 0. It increases rapidly, reaching approximately 0.83 accuracy around Step 20. It continues to increase, peaking at approximately 0.91 accuracy around Step 50. After Step 50, the line fluctuates between approximately 0.86 and 0.91 accuracy until Step 120. The shaded region around the line indicates a variance of approximately +/- 0.03.
### Key Observations
* Both RLVR and RLME show a similar trend of increasing accuracy with increasing steps.
* RLVR appears to achieve a slightly higher peak accuracy than RLME, but the difference is minimal.
* Both methods exhibit some fluctuation in accuracy after reaching their peak performance.
* The confidence intervals (shaded regions) are relatively small, suggesting consistent performance for both methods.
### Interpretation
The chart demonstrates that both RLVR and RLME methods are effective in improving accuracy as the number of steps increases. The slight advantage of RLVR in peak accuracy might suggest it is marginally more efficient, but the overall performance of both methods is comparable. The fluctuations in accuracy after the peak could be due to the inherent stochasticity of the learning process or the complexity of the task. The small confidence intervals indicate that the observed trends are statistically reliable. This data suggests that both methods are viable options, and the choice between them might depend on other factors such as computational cost or implementation complexity. The initial rapid increase in accuracy suggests a fast learning rate in the early stages of the process.
</details>
last boxed expression and extracts the predicted answer. We then compare this extracted answer to the GSM8K reference answer, after cleaning the reference to remove non-numeric characters such as commas, currency symbols, and units.
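A minimal sketch of this extraction-and-cleaning step follows. The paper's exact regex is given in its Appendix A, so the pattern below is an illustrative stand-in (it does not handle nested braces inside \boxed{}), and `clean_reference` is a hypothetical helper named here for illustration.

```python
import re

def extract_boxed(completion):
    """Select the last \\boxed{...} expression in a completion and return
    its contents, or None if no boxed expression is present."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def clean_reference(reference):
    """Strip non-numeric characters (commas, currency symbols, units)
    from a GSM8K reference answer before comparison, keeping digits,
    decimal points, and minus signs."""
    return re.sub(r"[^0-9.\-]", "", reference)
```

Accuracy (and the RLVR reward) then reduces to an exact-match test between `extract_boxed(completion)` and `clean_reference(reference)`.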
As described in the methodology, to construct the RLME reward we build an auxiliary evaluation prompt that includes the original problem, the full generated solution, and the regex-extracted answer string (taken from the generated solution, not the ground-truth label). Prompted with this information and the meta-question 'Is the answer correct?', the evaluator estimates the probability of 'Yes' being the completion. The log-probability of this response serves as the scalar reward for RLME training. In this experiment, we use *live self-evaluation*, where the generator serves as the evaluator using its current parameters; thus, the evaluator co-evolves as the generator is updated.
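This reward construction can be sketched as follows. The prompt template and the `yes_logprob` callable are hypothetical stand-ins for the exact evaluator prompt given in the paper's Appendix B and for the model's scoring interface.

```python
import math

def meta_eval_reward(problem, solution, extracted_answer, yes_logprob):
    """Build an auxiliary evaluation prompt and score it. `yes_logprob`
    is a hypothetical callable returning the evaluator's log-probability
    that the next completion is 'Yes'."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Proposed solution:\n{solution}\n\n"
        f"Extracted answer: {extracted_answer}\n\n"
        "Is the answer correct?"
    )
    return yes_logprob(prompt)  # log P('Yes' | prompt), used directly as the reward
```

Note that the extracted answer shown to the evaluator comes from the generator's own output, so no label ever enters the reward.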
Because training is single-pass (no prompt reuse), we do not require a dedicated validation set to prevent overfitting to the training prompts.
We compare RLME to an RLVR baseline that is identical in optimization, rollout settings, and number of updates, differing only in the reward signal: RLVR uses ground-truth verification (reward 1 if the regex-extracted answer exactly matches the ground-truth label, and 0 otherwise), while RLME uses meta-evaluation only.
Results. As shown in Figure 2, the base model begins at roughly 30% accuracy and rapidly improves under RLME, surpassing 90% after a short training run and closely tracking RLVR throughout training across 6 trials, with ±1 std shown by the shaded region. For readability, all learning curves are plotted using an exponential moving average with
Figure 3. RLME performance using different generators with a fixed evaluator (frozen Qwen3-4B-Base). Generator models have a large effect on accuracy.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Line Chart: Model Accuracy vs. Training Step
### Overview
This image presents a line chart illustrating the accuracy of five different language models (Qwen3, Llama3.2, SmolLM3, Gemma3, and Qwen2.5) as a function of training step. The chart visually tracks the learning progress of each model, showing how their accuracy changes over the course of approximately 120 training steps.
### Components/Axes
* **X-axis:** Labeled "Step", ranging from 0 to 120. Represents the progression of training.
* **Y-axis:** Labeled "Accuracy", ranging from 0 to 1. Represents the performance of the models.
* **Legend:** Located in the top-right corner of the chart. It maps colors to the following models:
* Blue: Qwen3
* Green: Llama3.2
* Purple: SmolLM3
* Cyan: Gemma3
* Gray: Qwen2.5
### Detailed Analysis
Here's a breakdown of each model's accuracy trend and approximate data points:
* **Qwen3 (Blue):** The line slopes upward rapidly from step 0, reaching approximately 0.5 accuracy at step 10. It continues to increase, plateauing around 0.9 accuracy between steps 40 and 100. There's a slight dip around step 60, falling to approximately 0.85, before recovering.
* Step 0: ~0.15
* Step 10: ~0.5
* Step 20: ~0.7
* Step 40: ~0.85
* Step 60: ~0.85
* Step 80: ~0.9
* Step 100: ~0.9
* Step 120: ~0.9
* **Llama3.2 (Green):** This line starts at approximately 0 accuracy and increases slowly until step 40, reaching around 0.2 accuracy. It then shows a more rapid increase, reaching approximately 0.3 accuracy at step 60. The line plateaus around 0.3 accuracy after step 60.
* Step 0: ~0
* Step 10: ~0.02
* Step 20: ~0.1
* Step 40: ~0.2
* Step 60: ~0.3
* Step 80: ~0.3
* Step 100: ~0.3
* Step 120: ~0.3
* **SmolLM3 (Purple):** The line increases rapidly from step 0, reaching approximately 0.5 accuracy at step 10. It continues to increase, reaching approximately 0.8 accuracy at step 20 and plateauing around 0.85-0.9 accuracy from step 40 onwards.
* Step 0: ~0.1
* Step 10: ~0.5
* Step 20: ~0.8
* Step 40: ~0.85
* Step 60: ~0.88
* Step 80: ~0.9
* Step 100: ~0.9
* Step 120: ~0.9
* **Gemma3 (Cyan):** The line starts with a rapid increase from step 0, reaching approximately 0.5 accuracy at step 10. It continues to increase, reaching approximately 0.75 accuracy at step 20. Around step 50, the line peaks at approximately 0.85 accuracy, then declines sharply to around 0.6 accuracy at step 60, and then plateaus around 0.6-0.7.
* Step 0: ~0.1
* Step 10: ~0.5
* Step 20: ~0.75
* Step 40: ~0.8
* Step 50: ~0.85
* Step 60: ~0.6
* Step 80: ~0.65
* Step 100: ~0.65
* Step 120: ~0.65
* **Qwen2.5 (Gray):** The line increases slowly from step 0, reaching approximately 0.2 accuracy at step 20. It continues to increase, reaching approximately 0.6 accuracy at step 60, and then plateaus around 0.65-0.7 accuracy.
* Step 0: ~0.15
* Step 10: ~0.2
* Step 20: ~0.3
* Step 40: ~0.5
* Step 60: ~0.6
* Step 80: ~0.65
* Step 100: ~0.65
* Step 120: ~0.7
### Key Observations
* Qwen3 and SmolLM3 achieve the highest accuracy, both reaching approximately 0.9.
* Gemma3 exhibits a significant drop in accuracy after reaching its peak around step 50, suggesting potential overfitting or instability.
* Llama3.2 demonstrates the slowest learning rate and lowest overall accuracy.
* Qwen2.5 shows a steady but moderate improvement in accuracy.
### Interpretation
The chart demonstrates the learning curves of five different language models during training. The varying slopes and final accuracy levels indicate differences in model capacity, training efficiency, and potential for overfitting. Qwen3 and SmolLM3 appear to be the most effective models in this comparison, achieving high accuracy relatively quickly. Gemma3's initial success followed by a decline suggests that it may require further regularization or adjustments to its training process. Llama3.2's slow progress indicates that it may benefit from a larger model size, different architecture, or a longer training duration. The data suggests that the choice of model significantly impacts performance, and careful consideration should be given to the specific requirements of the task when selecting a language model. The anomaly of Gemma3's accuracy drop warrants further investigation to understand the underlying cause and potential mitigation strategies.
</details>
decay β = 0.9. The similarity of the learning curves suggests that, at least in this controlled, verifiable domain, the evaluator's response to a single correctness meta-question provides a reward signal that is both informative and sample-efficient, even without access to ground truth.
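The smoothing used for plotting can be sketched as below; initializing the average with the first value is our assumption, as the paper does not state the initialization.

```python
def ema_smooth(values, beta=0.9):
    """Exponential moving average used to smooth plotted learning curves:
    s_t = beta * s_{t-1} + (1 - beta) * x_t, with s_0 = x_0 (assumed)."""
    smoothed, s = [], None
    for x in values:
        s = x if s is None else beta * s + (1 - beta) * x
        smoothed.append(s)
    return smoothed
```

With β = 0.9, each plotted point is dominated by roughly the last ten raw evaluations, which damps step-to-step noise without hiding trends.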
## 4.2. Does generator choice matter?
Question. We assess whether different generator models adapt differently to meta-evaluation.
Method. To isolate the effect of the generator, we fix the evaluator to a frozen Qwen3-4B-Base (Yang et al., 2025) and vary the generator among Qwen3-4B-Base, Llama-3.2-3B (Meta AI, 2024), SmolLM3-3B (Hugging Face, 2025), Gemma-3-4B-pt (Mesnard et al., 2024), and Qwen2.5-1.5B (Yang et al., 2024).
Results. Figure 3 substantiates previous work showing that flexible evaluation generalizes across models, but also shows that generator choice substantially impacts accuracy.
## 4.3. Does evaluator choice matter?
Question. A key design decision in RLME is whether the evaluator is *live* or *frozen*. In live evaluation, the generator also serves as the evaluator using its current parameters, such that the evaluator co-evolves with the generator during training. In frozen evaluation, the evaluator is a separate model (or a fixed snapshot of the generator taken at initialization) whose parameters remain unchanged. In this experiment, we investigate the effect of model choice and configuration on evaluation.
Method. We explicitly chose not to evaluate the pairwise performance of every generator with every evaluator due to
Figure 4. RLME performance using different evaluators with a fixed generator (Qwen3-4B-Base). For the Qwen3 evaluator, we compare a live self-evaluator (co-evolving with the generator) to a frozen evaluator (fixed snapshot at initialization). For other evaluators, we only use frozen weights.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Step for Various Models
### Overview
This image presents a line chart illustrating the accuracy of several language models (Live Qwen3 and various Frozen models) over a series of steps, likely representing training iterations. The chart aims to compare the learning curves of these models.
### Components/Axes
* **X-axis:** "Step" - Ranging from approximately 0 to 120.
* **Y-axis:** "Accuracy" - Ranging from approximately 0.2 to 1.0.
* **Legend:** Located in the top-right corner, listing the following models with corresponding colors:
* Live Qwen3 (Orange)
* Frozen Qwen3 (Blue)
* Frozen Llama3.2 (Green)
* Frozen Mistral (Magenta/Pink)
* Frozen SmolLM3 (Purple)
* Frozen Gemma3 (Cyan/Light Blue)
* Frozen Qwen2.5 (Gray)
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
The chart displays seven distinct lines, each representing the accuracy of a different model as the "Step" increases.
* **Live Qwen3 (Orange):** Starts at approximately 0.38 at Step 0, rapidly increases to around 0.85 by Step 20, plateaus around 0.92-0.94 between Steps 40 and 100, and then slightly decreases to approximately 0.91 at Step 120.
* **Frozen Qwen3 (Blue):** Begins at approximately 0.38 at Step 0, increases quickly to around 0.84 by Step 20, reaches a plateau around 0.92-0.93 between Steps 40 and 100, and then slightly declines to approximately 0.91 at Step 120.
* **Frozen Llama3.2 (Green):** Starts at approximately 0.38 at Step 0, rises to around 0.83 by Step 20, plateaus around 0.90-0.92 between Steps 40 and 100, and then decreases to approximately 0.89 at Step 120.
* **Frozen Mistral (Magenta/Pink):** Begins at approximately 0.35 at Step 0, increases to around 0.82 by Step 20, reaches a plateau around 0.89-0.91 between Steps 40 and 100, and then decreases to approximately 0.87 at Step 120.
* **Frozen SmolLM3 (Purple):** Starts at approximately 0.38 at Step 0, increases to around 0.83 by Step 20, plateaus around 0.90-0.92 between Steps 40 and 100, and then decreases to approximately 0.88 at Step 120.
* **Frozen Gemma3 (Cyan/Light Blue):** Begins at approximately 0.38 at Step 0, increases to around 0.84 by Step 20, plateaus around 0.92-0.93 between Steps 40 and 100, and then slightly declines to approximately 0.91 at Step 120.
* **Frozen Qwen2.5 (Gray):** Starts at approximately 0.38 at Step 0, increases to around 0.83 by Step 20, plateaus around 0.90-0.92 between Steps 40 and 100, and then decreases to approximately 0.88 at Step 120.
All lines exhibit a similar initial steep increase in accuracy, followed by a plateauing phase.
### Key Observations
* **Similar Performance:** The "Live Qwen3" and "Frozen Qwen3" models show nearly identical performance curves.
* **Plateau:** All models reach a plateau in accuracy after approximately 20-40 steps.
* **Slight Decline:** Most models experience a slight decrease in accuracy after Step 100.
* **Frozen Llama3.2 and Frozen Mistral** consistently show the lowest accuracy among the models.
### Interpretation
The data suggests that all the models demonstrate effective learning up to a certain point (around 40 steps), after which further training yields diminishing returns. The close proximity of the "Live Qwen3" and "Frozen Qwen3" curves indicates that freezing the weights doesn't significantly impact performance in this scenario. The slight decline in accuracy after Step 100 could be due to overfitting or the model reaching its capacity. The differences in peak accuracy between the models suggest varying levels of inherent capability or sensitivity to the training process. The fact that all models converge to a similar accuracy range suggests a common underlying learning dynamic. The models "Frozen Llama3.2" and "Frozen Mistral" may require different training parameters or architectures to achieve comparable performance to the other models.
</details>
the computational resources this would demand. Based on the previous experiment, we fix the generator to Qwen3-4B-Base (Yang et al., 2025) and vary the evaluator. For Qwen3, we include both live self-evaluation and a frozen snapshot. For all other models (Llama-3.2 (Meta AI, 2024), Mistral-Nemo-Base-2407 (Mistral AI & NVIDIA, 2024), SmolLM3-3B (Hugging Face, 2025), Gemma-3-4B-pt (Mesnard et al., 2024), and Qwen2.5-1.5B (Yang et al., 2024)), the evaluator remains frozen.
Results. Compared to generator choice, evaluator choice has a smaller effect on accuracy (Figure 4), consistent with the hypothesis that verifying correctness is easier than generating correct solutions (Pang et al., 2024). Notably, we observe little difference between live and frozen Qwen3 evaluation, suggesting that RL fine-tuning has limited impact on evaluation quality.
Finally, we observe that accuracy under the SmolLM3 and Gemma3 evaluators begins to decline after reaching a peak (Figure 4). This suggests that these evaluators eventually provide misleading reward signals to the generator, a failure mode commonly referred to as reward hacking .
## 4.4. Does the generator reward hack the evaluator?
Question. While RLME initially yields strong gains in reasoning accuracy, we observe a late-stage collapse in Figure 4: accuracy drops sharply even as the reward continues to increase. This phenomenon, known as *reward hacking*, arises when the generator discovers responses that cause the evaluator to answer meta-questions in a way that increases reward without truly improving correctness.
Figure 5. RLME eventually suffers a sharp degradation in accuracy despite continued increases in reward, indicative of reward hacking: the generator learns to exploit weaknesses in the evaluator instead of producing correct solutions. Including a small fraction of prompts with ground-truth answers in the evaluator template (10% for RLME-10GT and 1% for RLME-1GT) stabilizes training and prevents collapse.
<details>
<summary>Image 5 Details</summary>

### Visual Description
Three stacked line charts compare RLVR, RLME, RLME-Crowd, RLME-10GT, and RLME-1GT over 500 training steps: accuracy on top and reward in the middle and bottom panels (on different scales).
### Components/Axes
* **X-axis (all charts):** Step (0 to 500)
* **Y-axis (top chart):** Accuracy (0 to 1)
* **Y-axis (middle chart):** Reward (approximately -0.06 to 0)
* **Y-axis (bottom chart):** Reward (approximately -0.35 to -0.2)
* **Legend:** RLVR (dotted purple), RLME (solid blue), RLME-Crowd (dashed blue), RLME-10GT (dotted orange), RLME-1GT (dashed orange)
### Key Observations
* All methods rise from roughly 0.75 to about 0.9 accuracy early in training.
* Around step 300, RLVR, RLME, and RLME-Crowd drop sharply into the 0.15-0.3 range, while RLME-10GT and RLME-1GT remain stable near 0.9 for the full run.
* Rewards are small and negative throughout; middle-chart rewards climb toward roughly -0.01 even where accuracy collapses, and the bottom chart shows RLVR's reward decreasing steadily from about -0.28 to about -0.32 by step 500.
</details>
Method . To examine this effect, we repeat the self-evaluation setup from Section 4.1 but extend training far beyond the point where validation accuracy saturates. With enough optimization time, the generator learns to induce the evaluator to answer 'Yes' to incorrect solutions.
Results . Manual inspection of the resulting outputs reveals that reasoning traces become increasingly formulaic and detached from the task. Common artifacts include vacuous justification phrases such as 'the only logical conclusion is that this is the correct answer' or excessive repetition of the final answer. These behaviors appear to exploit acquiescence bias (Podsakoff et al., 2003) in the evaluator rather than reflect genuine problem solving.
Early stopping based on validation accuracy can avoid this collapse but does not fix the underlying vulnerability. In subsequent experiments, we therefore explore alternative evaluation strategies, such as introducing additional evaluator models or partial ground truth, to alleviate reward hacking.
## 4.5. Can multiple evaluators mitigate reward hacking?
Question . Given the reward-hacking behavior observed in Section 4.4, a natural next step is to ask whether using an ensemble of models to evaluate the solution can make RLME more robust. Intuitively, if reward hacking arises because the generator learns to exploit the weaknesses of a single self-evaluator, then aggregating judgments from multiple evaluators with disparate vulnerabilities might make the reward signal harder to game.
Method . We consider an ensemble of evaluators. For each generated solution, we combine judgments from multiple evaluator models (the Qwen3-4B-Base (Yang et al., 2025) generator itself, plus frozen Llama-3.2-3B (Meta AI, 2024), frozen SmolLM3-3B (Hugging Face, 2025), and frozen Mistral-Nemo-Base-2407 (Mistral AI & NVIDIA, 2024)) by taking the reward to be the average of their independent 'Yes' log-probabilities. The generator is optimized with RLME using this ensemble-derived scalar reward.
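The ensemble reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ensemble_reward` and the per-evaluator callables are hypothetical names, and each callable is assumed to return log P('Yes') for the meta-question under one evaluator model.

```python
from typing import Callable, List


def ensemble_reward(
    solution: str,
    meta_question: str,
    yes_logprob_fns: List[Callable[[str, str], float]],
) -> float:
    """Average the independent 'Yes' log-probabilities across evaluators.

    Each entry of yes_logprob_fns wraps one evaluator model and returns
    log P("Yes" | meta_question, solution) under that model.
    """
    logps = [fn(meta_question, solution) for fn in yes_logprob_fns]
    return sum(logps) / len(logps)
```

Averaging (rather than, say, taking a minimum) keeps the reward scale comparable to the single-evaluator case while smoothing evaluator-specific noise.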
Results . The reward profile with an ensemble evaluator is noticeably smoother than with a single evaluator (see RLME-Crowd vs. RLME in Figure 5). However, we still observe the same late-stage collapse as in the purely self-evaluated setting (see RLME-Crowd in Figure 5). After an initial phase in which accuracy improves, extended training again leads to a regime where the ensemble reward continues to increase even as true GSM8K accuracy declines. Qualitatively, the generator rediscovers pathological reasoning templates that most evaluators agree to endorse, even though the underlying solutions are incorrect.
These results suggest that simply using multiple models to evaluate the solution is not sufficient to prevent reward hacking. Notably, the generator appears to discover strategies that generalize across evaluators, much like how polling a group of humans can reduce noise but cannot fully eliminate systematic bias.
## 4.6. Does having a known answer help ground RLME and prevent reward hacking?
Question . In many practical settings, fully verifiable supervision is scarce but not entirely absent: a small subset of examples may have trusted labels, while the rest do not. Can partial access to ground truth help prevent the reward-hacking behavior observed in Section 4.4?
Method . To study this, we reveal the true answer to the evaluator for a limited number of questions. Concretely, for a fraction p of training prompts, the evaluation template includes the correct integer answer before asking the meta-question 'Is the answer correct?' For the remaining (1 − p) of prompts, standard RLME is applied such that the
Figure 6. RLME enables multi-objective control over both accuracy and brevity.
<details>
<summary>Image 6 Details</summary>

### Visual Description
Two stacked line charts compare RLVR, RLME, and RLME-Concise over roughly 120 training steps: accuracy on top and solution length (character count) below.
### Components/Axes
* **X-axis (both charts):** Step (0 to 120)
* **Y-axis (top chart):** Accuracy (approximately 0.2 to 1.0)
* **Y-axis (bottom chart):** Length in characters (approximately 200 to 1000)
* **Legend:** RLVR (dotted grey), RLME (solid blue), RLME-Concise (dashed magenta)
### Key Observations
* All three methods climb from about 0.3 to roughly 0.9 accuracy within 20 steps and remain there, with RLME-Concise tracking RLME closely (at most slightly lower).
* Length starts near 950 characters for all methods: RLVR settles back into the 850-950 range, RLME stabilizes around 600-700, and RLME-Concise continues shrinking to roughly 300 by step 120.
</details>
evaluator is provided no knowledge of the ground truth.
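The partial ground-truth scheme above can be sketched as a prompt builder; the template wording and function name here are illustrative (the paper's actual templates are in Appendix B), and only the p-fraction mixing logic reflects the described method.

```python
import random


def build_eval_prompt(question: str, solution: str, true_answer: str,
                      p_ground_truth: float, rng: random.Random) -> str:
    """For a fraction p of training prompts, reveal the correct answer to
    the evaluator before the meta-question; otherwise ask it unaided."""
    header = f"Question: {question}\nProposed solution: {solution}\n"
    if rng.random() < p_ground_truth:
        # Grounded prompt: the evaluator sees the trusted label.
        header += f"The correct answer is {true_answer}.\n"
    # The same meta-question is asked in both cases.
    return header + "Is the answer correct?"
```

Even with p as small as 0.01, these grounded prompts act as anchors for the evaluator's notion of correctness.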
Results . We find that including the ground-truth answer in as little as 1% of the evaluation prompts is sufficient to substantially reduce the reward-hacking effect in our experiments. Unlike the purely self-evaluated setting, extended training no longer leads to a late-stage collapse where reward increases while accuracy degrades. As Figure 5 indicates, when 10% of evaluations have ground-truth answers, accuracy remains stable; when we reduce this to 1%, we see only a slight degradation in accuracy. Intuitively, even a small number of fully verifiable examples anchors the evaluator's notion of correctness, preventing bias-exploitation strategies from earning consistent reward.
## 4.7. Can we use meta-questions with multiple objectives?
Question . We next test whether RLME can jointly control correctness and secondary behavioral objectives. In addition to the meta-question 'Is the answer correct?' , we introduce a conciseness objective and study whether multi-objective meta-evaluation can shape reasoning length without sacrificing accuracy.
Method . We keep the meta-question targeting correctness and add a meta-question targeting brevity: 'Is the length of the solution between 200 and 500 characters?' We explicitly include the programmatically measured character
Figure 7. Using an RLME meta-evaluation prioritizing sound reasoning trains the model not to blindly copy a provided answer.
<details>
<summary>Image 7 Details</summary>

### Visual Description
A single line chart plots accuracy against training step (0 to about 160) for RLVR (gray dotted), RLME (blue solid), and RLME-NoCheat (green dashed) under the counterfactual-answer evaluation.
### Key Observations
* RLVR peaks near 0.15 accuracy early and then declines steadily toward 0 by step 150.
* RLME rises to about 0.4 by step 20, falls to roughly 0.05 around step 80, then partially recovers to about 0.35 by step 150.
* RLME-NoCheat climbs monotonically from about 0.3 to roughly 0.8 by step 150, consistently outperforming the other two.
</details>
count in the evaluation prompt. While this injects labels regarding length, the goal of this experiment is to evaluate the applicability of RLME to a simple multi-objective scenario. The log-probabilities from each meta-evaluation are then summed as described in Equation 3, and RLME is applied.
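A minimal sketch of this multi-objective setup follows. The function names and the brevity-prompt wording are ours for illustration; the summation over per-question 'Yes' log-probabilities is the operation the text attributes to Equation 3.

```python
def multi_objective_reward(yes_logprobs: dict) -> float:
    """Sum the evaluator's 'Yes' log-probabilities across meta-questions
    (e.g. one for correctness, one for brevity) into one scalar reward."""
    return sum(yes_logprobs.values())


def brevity_prompt(solution: str) -> str:
    """Brevity meta-question with the programmatically measured character
    count injected into the evaluation prompt (illustrative wording)."""
    n = len(solution)
    return (f"The solution below is {n} characters long.\n{solution}\n"
            "Is the length of the solution between 200 and 500 characters?")
```

Because log-probabilities are summed, each meta-question contributes an independent multiplicative factor to the implied joint 'Yes' probability.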
Results . Figure 6 shows that RLME-Concise substantially reduces generation length while maintaining high accuracy. By the end of training, the average solution length is nearly halved relative to RLME, with no significant degradation in GSM8K performance. Qualitatively, the concise objective compresses reasoning into denser mathematical expressions rather than verbose natural language (see Appendix C).
Although this is a trivial example, it demonstrates that RLME supports multi-objective control through meta-questions, provided the evaluator can reliably assess the targeted property. In Section 4.8 we extend our investigation to a more useful domain: cheat detection.
## 4.8. Can we train the model not to cheat?
Question . We extend the multi-objective evaluation to address a more subtle criterion: cheating abstinence . As we define it, cheating is the act of rationalizing an answer rather than deriving it through a logical process.
Method . To probe whether a model is cheating, we use a counterfactual prompting setup. During training, we provide the generator with the question alongside the true answer. At test time, we instead inject a random answer sampled from the dataset. If, under this counterfactual, the model's reasoning still supports the injected (and incorrect) answer, this indicates that it has learned to justify the provided answer rather than solve the problem logically. This experiment lets us evaluate the model's ability to reason to an answer versus justify an answer post hoc.
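The counterfactual probe can be sketched as below. The prompt wording and helper names are assumptions for illustration; the essential mechanics are sampling a random answer distinct from the true one and presenting it as if it were the provided answer.

```python
import random


def sample_counterfactual(answers_pool: list, true_answer: str,
                          rng: random.Random) -> str:
    """Draw a random answer from the dataset, excluding the true one."""
    candidates = [a for a in answers_pool if a != true_answer]
    return rng.choice(candidates)


def counterfactual_prompt(question: str, injected_answer: str) -> str:
    """Test-time probe: present the injected (likely wrong) answer as the
    provided answer. A model whose reasoning still supports it is
    rationalizing rather than solving."""
    return (f"Question: {question}\n"
            f"The answer is {injected_answer}. Explain the reasoning.")
```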
We first evaluate RLVR under this setup and then RLME with the accuracy-oriented meta-question from Section 4.1. Finally, we train a second RLME model using a meta-question that emphasizes the reasoning process rather than the outcome: 'Does the whole solution logically lead from the question to an answer, even if it does not match the correct answer?' We refer to these two RLME variants as RLME-Base and RLME-NoCheat.
Results . As shown in Figure 7, RLVR and RLME-Base both learn to heavily rely on the injected answer and tend to cheat in the counterfactual condition. In contrast, RLME-NoCheat avoids this behavior and achieves over 80% accuracy in counterfactual tests. Examples of cheating and non-cheating traces are provided in Appendix C.
## 4.9. Stepping outside verifiable domains
Thus far, our experiments have focused on fully verifiable tasks where correctness can be determined using groundtruth labels. We now move to a more realistic setting for which RLME is particularly well suited because explicit supervision is not directly available.
A central objective in training large language models is to ensure that they faithfully adhere to the provided context, avoiding hallucinations or the injection of external biases, and instead basing responses strictly on the given information. Recently, the FaithEval benchmark (Ming et al., 2024) was introduced to measure whether models remain faithful to a supplied context, even when that context conflicts with the model's prior world knowledge.
Question . In this experiment, we investigate whether training with RLME on unrelated datasets using a meta-question targeting faithfulness will generalize to improve performance on the FaithEval-Counterfactual dataset.
Method . We construct a heterogeneous context-question-answer corpus (CQAC) by sampling from public reading-comprehension datasets: SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), BioASQ (Tsatsaronis et al., 2015), DROP (Dua et al., 2019), RACE (Lai et al., 2017), and TextbookQA (Fisch et al., 2019). We take the first 200 examples from each dataset (1,600 total) and truncate contexts to at most 4,000 characters.
As a grounded baseline, we train on CQAC with RLVR using an exact-match reward (after removing punctuation, whitespace, and case).
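The exact-match reward used for this RLVR baseline can be sketched directly from its description; the function name is ours, and we take "removing punctuation" to mean ASCII punctuation characters.

```python
import string


def exact_match_reward(prediction: str, label: str) -> float:
    """RLVR reward on CQAC: 1.0 if prediction equals the label after
    removing punctuation and whitespace and lowercasing, else 0.0."""
    def normalize(s: str) -> str:
        return "".join(ch for ch in s.lower()
                       if ch not in string.punctuation and not ch.isspace())
    return 1.0 if normalize(prediction) == normalize(label) else 0.0
```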
Based on the findings of our previous experiments regarding the inclusion of limited labeled data to avoid reward hacking, we include a combined approach, RLVR+RLME , defined as the sum of (i) the RLVR exact-match reward and (ii) an
Table 1. Base, RLVR, and RLVR+RLME accuracy on CQAC constituent datasets. Both RLVR and RLVR+RLME significantly exceed the performance of the raw base model (Qwen3-4B-Base). As expected, the RLVR which only optimizes for accuracy achieves a slightly higher average accuracy than RLVR+RLME which optimizes for both accuracy and contextual faithfulness.
| | SQuAD | NewsQA | TriviaQA | HotpotQA | BioASQ | DROP | RACE | TextbookQA | Avg |
|-----------|-------|--------|----------|----------|--------|-------|-------|------------|-------|
| Base | 46.2% | 20.7% | 29.2% | 37.7% | 16.5% | 33.3% | 41.2% | 37.5% | 32.8% |
| RLVR | 78.0% | 39.0% | 63.5% | 57.8% | 50.3% | 50.7% | 86.2% | 71.5% | 62.1% |
| RLVR+RLME | 73.8% | 39.5% | 62.2% | 57.7% | 42.0% | 24.2% | 84.5% | 71.8% | 57.0% |
Table 2. Base, RLVR, and RLVR+RLME accuracy on the FaithEval-Counterfactual dataset. RLVR+RLME outperforms RLVR, indicating improved context faithfulness can be obtained without labels.
| | FaithEval-Counterfactual |
|-----------|--------------------------|
| Base | 28.2% |
| RLVR | 61.8% |
| RLVR+RLME | 70.4% |
RLME meta-evaluation reward that measures contextual support. RLVR performs well when labels are available, while RLME enables tuning without known rewards; combining them is expected to let the model benefit more substantially from the limited labeled data. To prevent either component from dominating, we normalize each reward component (zero mean, unit standard deviation) within each batch before summation. For RLVR and RLME we use Qwen3-4B-Base as the generator; for RLME the generator also serves as the live evaluator.
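The per-batch normalization and summation described above can be sketched as follows; the function name is ours, and whether the paper uses a population or sample standard deviation is not specified (we use the population form here).

```python
import statistics


def combine_rewards(rlvr_rewards: list, rlme_rewards: list) -> list:
    """Normalize each reward component to zero mean and unit standard
    deviation within the batch, then sum, so neither dominates."""
    def znorm(xs):
        mu = statistics.fmean(xs)
        sd = statistics.pstdev(xs) or 1.0  # guard against zero variance
        return [(x - mu) / sd for x in xs]
    return [a + b for a, b in zip(znorm(rlvr_rewards), znorm(rlme_rewards))]
```

Normalizing within the batch means each component's contribution to the combined advantage is scale-free, regardless of the raw magnitudes of exact-match rewards versus log-probabilities.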
The meta-evaluation uses prompts such as: 'Is the answer supported by the context, regardless of whether it seems factually correct?' Full templates are provided in Appendix B. This meta-evaluation is expected to drive the model to reason faithfully and correctly even when trained on datasets not explicitly related to a faithfulness objective.
Results . We report results both on the constructed CQAC task, which does not include FaithEval, and on the generalization objective. Tables 1 and 2 summarize the evaluation. We assess 100 held-out examples from each CQAC subset and 300 examples from the FaithEval-Counterfactual split and compare performance.
Both RLVR and RLVR+RLME substantially improve over the raw base model (Qwen3-4B-Base) on CQAC. Relative to RLVR, RLVR+RLME incurs a small average drop on the CQAC exact-match accuracy but yields a substantial improvement on FaithEval-Counterfactual, showing that RLME training generalizes to an out-of-distribution task.
Crucially, the improvement on FaithEval is achieved without training on data from FaithEval. Instead, meta-evaluations of contextual support applied to the unrelated CQAC mixture generalize to the FaithEval benchmark.
## 5. Discussion
We introduced Reinforcement Learning from Meta-Evaluation (RLME), a framework that trains language models using rewards derived from natural-language judgments rather than ground-truth labels. RLME tracks label-based RL in verifiable tasks, enables direct multi-objective behavioral control, and generalizes to open-domain settings where correctness cannot be explicitly verified. Across our experiments, we find that:
- Meta-evaluations provide a learning signal comparable to label-based RL in fully verifiable domains (Section 4.1);
- RLME operates across a range of pretrained generator and evaluator models, with performance substantially more sensitive to generator choice than evaluator choice, and live self-evaluation does not noticeably degrade outcomes (Sections 4.2 and 4.3);
- Meta-evaluation is inherently vulnerable to reward hacking under prolonged optimization (Section 4.4), but this failure mode can be mitigated through early stopping or by incorporating sparse ground-truth anchoring (Section 4.6);
- Carefully designed meta-questions support multi-objective steering (Section 4.7) and give control over the reasoning process itself (Section 4.8);
- RLME and RLVR+RLME generalize to open-domain tasks without labels or explicit in-domain training (Section 4.9).
Taken together, the results suggest that RLME is most effective as a complement to, rather than a replacement for, verifiable rewards: RLVR dominates when labels are available, RLME enables progress without labels, and hybrid objectives offer the best of both regimes.
The primary limitation is reward hacking, in which the generator fools the evaluator; however, even minimal grounded supervision effectively stabilizes training, making hybrid RLME approaches particularly practical.
## Impact Statement
Our work proposes a way to steer language models using natural-language meta-questions answered by the model itself or by other models, rather than relying solely on scalar rewards from task-specific verifiers. When well-chosen, these meta-questions can encourage outputs that are more accurate, concise, and transparent, and can make models easier to probe and audit.
However, because RLME derives rewards from model judgments, it can also amplify biases in the evaluators or in the chosen meta-questions. This may entrench prevailing norms or stylistic preferences, and poorly designed questions could incentivize persuasiveness or conformity over truthfulness. Our experiments are confined to controlled, low-stakes domains; extending this framework to high-stakes applications will require additional safeguards, such as diverse evaluator panels, periodic human or verifier audits, and monitoring for reward hacking or systematic unfairness. We view our methods as a tool for aligning models, not as a replacement for human oversight or normative judgment.
## References
- Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021.
- Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL , 2019.
- Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., and Chen, D. Mrqa 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering , pp. 1-13, Hong Kong, China, 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5801. URL https://aclanthology.org/D19-5801/ .
- Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., and Guo, J. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 , 2024.
- Guo, D. et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 , 2025.
- Hugging Face. Smollm3: smol, multilingual, long-context reasoner. https://huggingface.co/blog/smollm3 , 2025. Accessed: 2025-11-28.
- Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL , 2017.
- Kaufmann, T., Weng, P., Bengs, V., and Hüllermeier, E. A survey of reinforcement learning from human feedback. Transactions on Machine Learning Research , 2024. arXiv:2312.14925.
- Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. Race: Large-scale reading comprehension dataset from examinations. In EMNLP , 2017.
Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K. R., Bishop, C., Hall, E., Carbune, V., Rastogi, A., and Prakash, S. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. In International Conference on Machine Learning (ICML) , 2024.
- Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv. org/abs/2503.20783 .
- Mesnard, T. et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 , 2024.
Meta AI. Llama 3.2 model cards and prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/ , 2024. Accessed: 2025-11-28.
- Ming, Y., Purushwalkam, S., Pandit, S., Ke, Z., Nguyen, X.-P., Xiong, C., and Joty, S. Faitheval: Can your language model stay faithful to context, even if 'the moon is made of marshmallows'. arXiv preprint arXiv:2410.03727 , 2024.
- MiniMax et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention, 2025. URL https://arxiv.org/abs/2506.13585 .
- Mistral AI and NVIDIA. Mistral NeMo. https://mistral.ai/news/mistral-nemo , 2024. Accessed: 2025-11-28.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, L., Miller, F., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730-27744, 2022.
- Pang, J.-C., Wang, P., Li, K., Chen, X.-H., Xu, J., Zhang, Z., and Yu, Y. Language model self-improvement by reinforcement learning contemplation. In International Conference on Learning Representations (ICLR) , 2024.
- Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., and Podsakoff, N. P. Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology , 88(5): 879-903, 2003. doi: 10.1037/0021-9010.88.5.879.
- Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad: 100,000+ questions for machine comprehension of text. In EMNLP , 2016.
- Roberts, J., Moore, K., and Fisher, D. Do large language models learn human-like strategic preferences? In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pp. 97-108, 2025.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017.
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
- Shao, Z., Luo, Y., Lu, C., Ren, Z., Hu, J., Ye, T., Gou, Z., Ma, S., and Zhang, X. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570 , 2025.
- Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., and Perez, E. Towards understanding sycophancy in language models, 2025. URL https://arxiv.org/abs/2310.13548.
- Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., and Suleman, K. Newsqa: A machine comprehension dataset. In Rep4NLP, 2017.
- Tsatsaronis, G. et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. In BMC Bioinformatics, 2015.
- Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Wang, Y., Liang, X., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245 , 2025.
- Yang, A. et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 , 2024.
- Yang, A. et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388 , 2025. URL https://arxiv. org/abs/2505.09388 .
- Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
- Yu, T., Ji, B., Wang, S., Yao, S., Wang, Z., Cui, G., Yuan, L., Ding, N., Yao, Y., Liu, Z., et al. Rlpr: Extrapolating rlvr to general domains without verifiers. arXiv preprint arXiv:2506.18254 , 2025.
- Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. In International Conference on Machine Learning (ICML) , 2024.
- Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Yue, Y., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837 , 2025.
- Zhao, X., Kang, Z., Feng, A., Levine, S., and Song, D. Learning to reason without external rewards. arXiv preprint arXiv:2505.19590 , 2025.
- Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071 , 2025.
- Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Proceedings of
the 37th Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2023.
- Zhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493 , 2025.
## A. Hyperparameters
Unless otherwise noted, all experiments share the configuration below. When a setting differs for a specific experiment (e.g., FaithEval), we mention it in the main text.
## A.1. Training Algorithm
We train with Group Relative Policy Optimization (GRPO), implemented using the GRPOTrainer in Hugging Face TRL, with a CISPO-style objective for importance-weight clipping.
- Loss type: cispo .
- Generations per prompt (group size): 6 candidate completions.
- PPO iterations per batch: 1.
- Importance sampling: sequence-level ratios with clipping:

$$\hat{\rho}_i = \operatorname{clip}\!\left(\rho_i,\; 1-\epsilon_{\mathrm{low}},\; 1+\epsilon_{\mathrm{high}}\right), \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},$$

with $\epsilon_{\mathrm{low}} = 10000.0$ and $\epsilon_{\mathrm{high}} = 5.0$ as suggested by the CISPO paper (MiniMax et al., 2025); the large $\epsilon_{\mathrm{low}}$ effectively disables the lower clipping bound.
- Advantages: sequence-level, $A_i = r_i - \bar{r}$ over the group.
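As an illustrative sketch of these two steps (variable names are ours; this is not the TRL implementation), the group-relative advantages and the clipped importance weight can be computed as:

```python
# Sketch of group-relative advantages and CISPO-style sequence-level
# ratio clipping. All names here are illustrative.

def group_relative_advantages(rewards):
    """A_i = r_i - mean(r) over the group of completions for one prompt."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def cispo_weight(ratio, eps_low=10000.0, eps_high=5.0):
    """Clip the sequence-level importance ratio to [1 - eps_low, 1 + eps_high].
    With eps_low = 10000.0 the lower bound is effectively disabled."""
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))

# Example: rewards for a group of 6 completions of one prompt.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)  # [0.5, -0.5, 0.5, 0.5, -0.5, -0.5]

# In CISPO, each sequence's log-prob gradient is weighted by the clipped
# ratio (treated as a stop-gradient constant) times its advantage.
clipped = cispo_weight(8.0)  # capped at 1 + eps_high = 6.0
```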
## A.2. Optimization
- Optimizer: paged_adamw_32bit.
- Learning rate: 2 × 10⁻⁶, constant schedule.
- Weight decay: 0.0.
- Adam betas: (β₁, β₂) = (0.9, 0.95).
- Adam epsilon: 10⁻¹⁵.
- Batching: per-device batch size 12 prompts, gradient accumulation 8 steps (effective batch size 96 prompts).
## A.3. Generation During RL
Unless otherwise specified, on-policy rollouts for RLME and RLVR use:
- Temperature: 1.0.
- Top-p: 1.0 (effectively disabled).
- Top-k: −1 (disabled).
- Max new tokens: 512.
- Max prompt length: 4096 tokens for GSM8K, 4608 tokens for FaithEval.
- Repetition penalty: 1.0 (disabled).
## A.4. Reward Design
We use a small number of reward components, combined linearly.
- Accuracy reward (RLVR-style): for tasks with ground truth, we extract the final answer (e.g., from \boxed{...} ) using a fixed regex. The reward is 1.0 for exact integer match and 0.0 otherwise.
- Meta-evaluation rewards (RLME): scalar rewards are log-probabilities of target answers to meta-questions (e.g., 'Is the answer correct?') under one or more evaluator models, combined linearly:

$$r = \sum_{j} \sum_{k} w_{j,k} \log p_{j,k},$$

where $p_{j,k}$ is the probability of the target answer (e.g., 'YES') for question $q_k$ from evaluator $j$, and $w_{j,k}$ are fixed linear weights.
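A hedged sketch of this aggregation follows; the weights and probabilities below are invented for illustration:

```python
import math

# Sketch of the meta-evaluation reward: a weighted sum of the
# log-probabilities that each evaluator assigns to the target answer of
# each meta-question. All values here are illustrative.

def rlme_reward(probs, weights=None):
    """probs[j][k]: evaluator j's probability of the target answer (e.g. 'YES')
    for meta-question k. Returns sum over j, k of w[j][k] * log(probs[j][k]).
    Unit weights are used when none are given."""
    if weights is None:
        weights = [[1.0] * len(row) for row in probs]
    return sum(w * math.log(p)
               for p_row, w_row in zip(probs, weights)
               for p, w in zip(p_row, w_row))

# Two evaluators, two meta-questions ("Is the answer correct?",
# "Is the reasoning logically consistent?").
reward = rlme_reward([[0.9, 0.8], [0.7, 0.6]])
```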
For all problems, we extract the final predicted answer using a single-instance \boxed{} pattern. Specifically, we apply the following regex, which matches the last boxed expression in the completion:
<!-- formula-not-decoded -->
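For illustration, a pattern of this kind (not necessarily the exact regex used in our experiments) can be implemented as:

```python
import re

# Illustrative answer extraction: take the last \boxed{...} expression in the
# completion and grant reward 1.0 on exact match with the gold answer. The
# regex here is a representative stand-in.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_final_answer(completion):
    """Return the contents of the last boxed expression, or None."""
    matches = BOXED.findall(completion)
    return matches[-1] if matches else None

completion = "A first try gives \\boxed{12}. Re-checking, the answer is \\boxed{42}."
pred = extract_final_answer(completion)          # "42"
accuracy_reward = 1.0 if pred == "42" else 0.0   # exact-match reward
```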
## A.5. Models and Precision
- Generator (default): Qwen3-4B-Base.
- Evaluators: depending on the experiment, we use the current generator and/or frozen external models, including Llama-3.2-3B, SmolLM3-3B, and Mistral-Nemo-Base-2407.
- Precision: base model weights and LM head are kept in fp32; training uses bf16 with gradient checkpointing.
- Quantization for evaluators: when applicable, external evaluators are loaded in bf16 with 4-bit NF4 quantization.
## A.6. Backend (vLLM)
All generations during RL are served by vLLM in colocated mode.
- Tensor parallel size: 1.
- GPU memory utilization: 0.2 of device memory.
- Importance-sampling correction: enabled, with correction cap 2.0.
## A.7. Computing Environment
All experiments were run on a single NVIDIA H200 GPU using PyTorch 2.0.2 with CUDA 12.8.1 on Ubuntu 24.04. No gradient parallelism or multi-GPU sharding was used.
This configuration is used for all experiments unless explicitly noted otherwise.
## B. Prompts
This appendix provides the exact prompt templates used across experiments. These prompts define how model outputs are interpreted and evaluated through natural-language meta-questions.
All templates contain a fixed Problem section and a fixed Evaluation section. In all cases, the prompt text shown below is reproduced exactly as used in our experiments.
We use a special end-marker token ø because it is rare in natural text and is consistently represented as a single token in our tokenizer. In evaluation questions, we supply the first ø and use the model's prediction of the target answer (e.g., the token sequence YES followed by ø) as the reward, summing the log-probabilities of all tokens in the target answer. This makes the evaluator's target outcome unambiguous at the token level.
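A toy sketch of this token-level scoring follows; `token_logprob` is a stand-in for a real evaluator model, and the probabilities it returns are invented:

```python
import math

# Toy sketch of scoring a target answer at the token level: the reward sums
# the evaluator's log-probabilities of the target tokens (e.g. "YES" followed
# by the end marker). `token_logprob` is an invented stand-in for a model.

def score_target(token_logprob, prompt_tokens, target_tokens):
    """Sum log P(t_i | prompt, t_<i) over the target-answer tokens."""
    total, context = 0.0, list(prompt_tokens)
    for tok in target_tokens:
        total += token_logprob(context, tok)
        context.append(tok)
    return total

# Stand-in evaluator assigning probability 0.9 to every target token.
fake_logprob = lambda ctx, tok: math.log(0.9)
reward = score_target(fake_logprob,
                      ["Is", "the", "answer", "correct", "?"],
                      ["YES", "ø"])
```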
Note: in the interest of full disclosure, the prompts used in our experiments contained several misspellings: 'Evaluation the solution' instead of 'Evaluate the solution', 'explicite' for 'explicit', and 'wether' for 'whether'. These errors have been fixed here for clarity, but we have not rerun the experiments; we do not expect them to materially affect our results.
## B.1. Accuracy-Only (GSM8K)
The generator produces a solution inside the solution block. The meta-reward is based solely on the evaluator's response to a single correctness question.
```
```
This format is used to train RLME without access to ground-truth answers.
## B.2. Dual-Objective: Accuracy + Conciseness
This version augments the evaluation criterion with a length preference. The evaluator receives the solution length explicitly, making compliance with the length constraint directly verifiable.
```
```
```
```
As described in Section 4.7, this allows RLME to control both reasoning quality and brevity through meta-evaluation.
## B.3. Counterfactual Cheating Detection
Here, we intentionally reveal the (ground-truth) answer inside the prompt during training. At test time, we replace this with a random answer. If the model continues to justify that injected value, it is cheating rather than solving the problem from first principles.
We show below the prompt used to train the RLME-NoCheat variant, with the meta-question 'Does the whole solution logically lead from the question to an answer, even if it does not match the correct answer?'. The base variant uses the same template but replaces this meta-question with 'Is the answer correct?'.
```
```
```
```
## B.4. Open-Domain QA and Faithfulness (CQAC + FaithEval)
For faithful open-domain question answering with contextual grounding, we use the same initial prompt for both CQAC and FaithEval datasets:
```
```
```
```
```
```
## C. Qualitative Examples
This appendix provides representative model outputs from each experiment. For each example, we show the full raw generation, including intermediate reasoning and any artifacts. These outputs illustrate typical success modes and common failure cases that are not fully captured by aggregate metrics.
## C.1. GSM8K: Accuracy-Only
This example shows outputs from RLME models trained with correctness as the sole meta-objective. Successful cases demonstrate coherent step-by-step reasoning aligned with the final answer.
```
```
## C.2. GSM8K: Dual-Objective Accuracy + Conciseness
This sample highlights the effect of adding a conciseness reward. Compared to accuracy-only training, the conciseness-accuracy objective tends to reduce repetition and irrelevant elaboration while preserving enough reasoning to reach the correct answer.
```
```
## C.3. Counterfactual Cheating Detection
Here we show examples to illustrate cheating behavior and its suppression. In the base setup, inserting a random answer into the prompt often causes the model to rationalize that injected number.
```
```
Therefore, the correct answer is \boxed{540}.
When the meta-question targets the reasoning itself (RLME-NoCheat), the model frequently rejects the injected answer and derives its own through grounded reasoning.
```
```
## C.4. Reward Hacking
This example highlights what happens when the generator learns to fool the evaluator in order to get high reward. Notice that the generator suggests the answer before generating the reasoning.
```
```
```
```
## C.5. Open-Domain QA and Faithfulness (CQAC + FaithEval)
This example shows the prompt from the CQAC dataset and a typical response.
```
```
- If the context has contradictory information about the answer to the question, put \boxed{conflict} as the answer.
- If the context does not contain enough explicit information to answer the question, put \boxed{unknown} as the answer.

Solution: We need to find information about the state of matter that is most prevalent in the universe. According to the context, "Yet, most of the universe consists of plasma." This directly answers our question.

Therefore, the final answer is: \boxed{plasma}