# Self-Rewarding Language Models
## Abstract
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability of the model to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is still much left to explore, this work opens the door to the possibility of models that can continually improve in both axes.
## 1 Introduction
Aligning Large Language Models (LLMs) using human preference data can vastly improve the instruction following performance of pretrained models (Ouyang et al., 2022; Bai et al., 2022a). The standard approach of Reinforcement Learning from Human Feedback (RLHF) learns a reward model from these human preferences. The reward model is then frozen and used to train the LLM using RL, e.g., via PPO (Schulman et al., 2017). A recent alternative is to avoid training the reward model at all, and directly use human preferences to train the LLM, as in Direct Preference Optimization (DPO; Rafailov et al., 2023). In both cases, the approach is bottlenecked by the size and quality of the human preference data, and in the case of RLHF the quality of the frozen reward model trained from them as well.
In this work, we instead propose to train a self-improving reward model that, rather than being frozen, is continually updated during LLM alignment, in order to avoid this bottleneck. The key to such an approach is to develop an agent that possesses all the abilities desired during training, rather than separating them out into distinct models such as a reward model and a language model. In the same way that pretraining and multitask training on instruction following tasks allow task transfer by training on many tasks at once (Collobert and Weston, 2008; Radford et al., 2019; Ouyang et al., 2022), incorporating the reward model into that same system allows task transfer between the reward modeling task and the instruction following tasks.
We thus introduce Self-Rewarding Language Models, that both (i) act as instruction following models generating responses for given prompts; and (ii) can generate and evaluate new instruction following examples to add to their own training set. We train these models using an Iterative DPO framework similar to that recently introduced in Xu et al. (2023). Starting from a seed model, in each iteration there is a process of Self-Instruction creation whereby candidate responses are generated by the model for newly created prompts, and are then assigned rewards by that same model. The latter is implemented via LLM-as-a-Judge prompting, which can also be seen as an instruction following task. A preference dataset is built from the generated data, and the next iteration of the model is trained via DPO, see Figure 1.
In our experiments, we start with a Llama 2 70B (Touvron et al., 2023) seed model fine-tuned on Open Assistant (Köpf et al., 2023), and then perform the above training scheme. We find that not only does the instruction following performance improve from Self-Rewarding LLM alignment compared to the baseline seed model, but importantly the reward modeling ability, which is no longer fixed, improves as well. This means that the model during iterative training is able, at a given iteration, to provide a higher quality preference dataset to itself than in the previous iteration. While this effect likely saturates in real-world settings, it provides the intriguing possibility of obtaining reward models (and hence LLMs) that are superior to ones that could have been trained from the original human-authored seed data alone.
## 2 Self-Rewarding Language Models
Our approach first assumes access to a base pretrained language model, and a small amount of human-annotated seed data. We then build a model that aims to possess two skills simultaneously:
1. Instruction following: given a prompt that describes a user request, the ability to generate a high quality, helpful (and harmless) response.
1. Self-Instruction creation: the ability to generate and evaluate new instruction-following examples to add to its own training set.
These skills are used so that the model can perform self-alignment, i.e., they are the components used to iteratively train itself using AI Feedback (AIF).
Self-instruction creation consists of generating candidate responses and then the model itself judging their quality, i.e., it acts as its own reward model, replacing the need for an external one. This is implemented via the LLM-as-a-Judge mechanism (Zheng et al., 2023b), i.e., by formulating the evaluation of responses as an instruction following task. This self-created AIF preference data is used as a training set.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Iterative Self-Instruction and Preference-Based Training Pipeline
### Overview
The image is a technical flowchart illustrating a two-stage, iterative machine learning training process. The pipeline consists of a "Self-Instruction creation" phase that generates training data, followed by an "Instruction following training" phase that refines the model. The process is cyclical, with the output model from one iteration becoming the input for the next.
### Components/Axes
The diagram is divided into two main colored regions:
1. **Left Region (Light Orange Background):** Titled **"Self-Instruction creation"**.
2. **Right Region (Light Purple Background):** Titled **"Instruction following training"**.
**Key Components & Labels:**
* **Data Stores (Cylinders):**
* Leftmost cylinder: Label **"Generated new prompts"**. Contains the mathematical set notation **`{x_i}`**.
* Right cylinder: Label **"Preference pairs"**. Contains the set notation **`{x_i, y_i^w, y_i^l}`**.
* **Model Blocks (Blue Rectangles):**
* First model block (left): Labeled **`M_t`**. An annotation above it reads **"Seed model (for t=1)"** in red text.
* Second model block (center): Also labeled **`M_t`**.
* Final model block (right): Labeled **`M_{t+1}`**.
* **Process Labels (Text above arrows/flows):**
* **"Generate responses"**: Positioned above the output of the first `M_t` block.
* **"Generate rewards"**: Positioned above the output of the second `M_t` block.
* **"select"**: Positioned on the arrow leading to the "Preference pairs" cylinder.
* **"DPO training"**: Positioned above the arrow leading to the `M_{t+1}` block.
* **Mathematical Notation:**
* Responses: A vertical set **`{y_i^1, ..., y_i^N}`**.
* Rewards: A vertical set **`{r_i^1, ..., r_i^N}`**.
* **Flow Arrows:** Black arrows indicate the primary data flow. A prominent **red arrow** at the bottom creates a feedback loop, labeled **"Next iteration model"**, pointing from the `M_{t+1}` block back to the initial `M_t` block.
### Detailed Analysis
The process flows as follows:
1. **Self-Instruction Creation Phase:**
* A set of generated prompts `{x_i}` is fed into the current model `M_t`.
* `M_t` generates a set of N responses `{y_i^1, ..., y_i^N}` for each prompt.
* These responses are fed back into the same model `M_t` (or a copy) to generate a corresponding set of rewards `{r_i^1, ..., r_i^N}`.
* Based on these rewards, a selection process ("select") creates a dataset of "Preference pairs" `{x_i, y_i^w, y_i^l}`. Here, `y_i^w` likely denotes a "winning" or preferred response, and `y_i^l` a "losing" or less preferred response for prompt `x_i`.
2. **Instruction Following Training Phase:**
* The curated preference pairs are used to perform **"DPO training"** (Direct Preference Optimization).
* This training updates the model, resulting in a new, improved version: `M_{t+1}`.
3. **Iterative Loop:**
* The red "Next iteration model" arrow indicates that `M_{t+1}` becomes the `M_t` for the next cycle, enabling continuous self-improvement.
### Key Observations
* **Self-Data Generation:** The model `M_t` is used twice in the first phase: once to generate responses and once to generate rewards for those responses. This suggests a self-supervised or self-evaluating mechanism.
* **DPO as the Training Mechanism:** The pipeline explicitly uses Direct Preference Optimization (DPO), a method that aligns models with human preferences using comparison data without needing a separate reward model.
* **Closed-Loop System:** The entire process is designed to be autonomous and iterative. The model bootstraps its own training data and then improves upon it in successive generations (`t`, `t+1`, etc.).
* **Color Coding:** Green is used for data stores, blue for model instances, and red for critical annotations (seed model note, feedback loop).
### Interpretation
This diagram depicts a sophisticated framework for **autonomous AI self-improvement**. It outlines a method where a language model can iteratively enhance its own instruction-following capabilities with minimal human intervention.
* **Core Mechanism:** The system generates its own training examples (prompts and responses), evaluates the quality of those responses to create preference data, and then uses that data to fine-tune itself via DPO. This creates a virtuous cycle where better models generate better training data, leading to even better future models.
* **Significance:** This approach addresses a key challenge in AI scaling: the bottleneck of high-quality, human-labeled data. By generating and curating its own preference data, the model can theoretically continue to improve indefinitely, limited mainly by its own capabilities and computational resources.
* **Underlying Assumption:** The process assumes that the model's own reward generation (`M_t` producing `r_i^N`) is a reliable proxy for quality or human preference, which is a critical and non-trivial assumption for the system's success.
* **Peircean Reading:** The diagram is an **icon** of a learning process, visually representing the cyclical and iterative nature of growth. It is also an **index**, pointing to the specific technical components (DPO, preference pairs) that make this particular self-improvement loop possible. The red feedback loop is the most salient indexical sign, emphasizing recursion as the core principle.
</details>
Figure 1: Self-Rewarding Language Models. Our self-alignment method consists of two steps: (i) Self-Instruction creation: newly created prompts are used to generate candidate responses from model $M_t$ , which also predicts its own rewards via LLM-as-a-Judge prompting. (ii) Instruction following training: preference pairs are selected from the generated data, which are used for training via DPO, resulting in model $M_{t+1}$ . This whole procedure can then be iterated resulting in both improved instruction following and reward modeling ability.
Our overall self-alignment procedure is an iterative one, which proceeds by building a series of such models, with the aim that each improves over the last. Importantly, because the model can both improve its generation ability, and act as its own reward model through the same generation mechanism, this means the reward model itself can improve through these iterations, deviating from standard practices where the reward model is fixed (Ouyang et al., 2022). We believe this can increase the ceiling of the potential for self-improvement of these learning models going forward, removing a constraining bottleneck.
We describe these steps in more detail below. An overview of the approach is illustrated in Figure 1.
### 2.1 Initialization
Seed instruction following data
We are given a seed set of human-authored (instruction prompt, response) general instruction following examples that we use for training in a supervised fine-tuning (SFT) manner, starting from a pretrained base language model. Subsequently this will be referred to as Instruction Fine-Tuning (IFT) data.
Seed LLM-as-a-Judge instruction following data
We also assume we are provided a seed set of (evaluation instruction prompt, evaluation result response) examples which can also be used for training. While this is not strictly necessary, as the model trained on the IFT data will already be capable of acting as an LLM-as-a-Judge, we show that such training data can give improved performance (see Appendix A.3 for supporting results). In this data, the input prompt asks the model to evaluate the quality of a given response to a particular instruction. The provided evaluation result response consists of chain-of-thought reasoning (a justification), followed by a final score (in our experiments out of 5). The exact prompt format we chose is given in Figure 2, which instructs the LLM to evaluate the response using five additive criteria (relevance, coverage, usefulness, clarity and expertise), covering various aspects of quality. Subsequently this will be referred to as Evaluation Fine-Tuning (EFT) data.
We use both these seed sets together during training.
### 2.2 Self-Instruction Creation
Using the model we have trained, we can have it augment its own training set. Specifically, we generate additional training data for the next iteration of training.
This consists of the following steps (a minimal sketch of the loop is given after the list):
1. Generate a new prompt: We generate a new prompt $x_i$ using few-shot prompting, sampling prompts from the original seed IFT data, following the approach of Wang et al. (2023) and Honovich et al. (2023). In our main experiments, the responses and rewards (steps 2 and 3 below) are generated by the model we have trained, whereas new prompts are generated by a model fixed in advance; however, we show in Appendix A.5 that prompts can also be generated by the newly trained model in each iteration.
1. Generate candidate responses: We then generate $N$ diverse candidate responses $\{y_i^1,\dots,y_i^N\}$ for the given prompt $x_i$ from our model using sampling.
1. Evaluate candidate responses: Finally, we use the LLM-as-a-Judge ability of our same model to evaluate its own candidate responses with scores $r_i^n \in [0,5]$ (exact prompt given in Figure 2).
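To make the loop concrete, below is a minimal Python sketch of one round of self-instruction creation. The helper names (`self_instruct_round`, `parse_score`, the `generate` callables) and the truncated `JUDGE_TEMPLATE` are illustrative stand-ins, not our implementation; only the overall structure (few-shot prompt creation, $N$ sampled responses, LLM-as-a-Judge scoring in $[0,5]$) follows the steps above.

```python
import random
import re
from typing import Callable, Dict, List

# Truncated stand-in for the Figure 2 judge prompt (illustrative only).
JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response using the "
    "additive 5-point scoring system described below. [...]\n"
    "User: {x}\n<response>{y}</response>"
)

def parse_score(judgement: str) -> float:
    """Extract the final 'Score: <total points>' from a judge completion (0 if missing)."""
    m = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(m.group(1)) if m else 0.0

def self_instruct_round(
    generate: Callable[[str], str],         # the current model M_t: responses and rewards
    generate_prompt: Callable[[str], str],  # the fixed prompt-generation model
    seed_prompts: List[str],
    n_new: int = 10,
    n_candidates: int = 4,
) -> List[Dict]:
    examples = []
    for _ in range(n_new):
        # (1) New prompt x_i via few-shot prompting over the seed IFT prompts.
        shots = "\n\n".join(random.sample(seed_prompts, k=min(8, len(seed_prompts))))
        x = generate_prompt(shots)
        # (2) N diverse candidate responses {y_i^1, ..., y_i^N} sampled from M_t.
        ys = [generate(x) for _ in range(n_candidates)]
        # (3) Rewards r_i^n in [0, 5] via LLM-as-a-Judge prompting with the same model.
        rs = [parse_score(generate(JUDGE_TEMPLATE.format(x=x, y=y))) for y in ys]
        examples.append({"prompt": x, "responses": ys, "rewards": rs})
    return examples
```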
### 2.3 Instruction Following Training
As previously described, training is initially performed with the seed IFT and EFT data (Section 2.1). This is then augmented with additional data via AI (Self-)Feedback.
AI Feedback Training
After performing the self-instruction creation procedure, we can augment the seed data with additional examples for training, which we refer to as AI Feedback Training (AIFT) data.
To do this, we construct preference pairs, which are training data of the form (instruction prompt $x_i$ , winning response $y_i^w$ , losing response $y_i^l$ ). To form the winning and losing pair we take the highest and lowest scoring responses from the $N$ evaluated candidate responses (see Section 2.2), following Xu et al. (2023), discarding the pair if their scores are the same. These pairs can be used for training with a preference tuning algorithm. We use DPO (Rafailov et al., 2023).
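As an illustration, here is a sketch of this pair selection, consuming the scored examples produced by the sketch in Section 2.2; the field names are again illustrative rather than part of the method.

```python
from typing import Dict, List, Optional

def build_preference_pair(example: Dict) -> Optional[Dict]:
    """Turn one scored example into a DPO triple (x, y^w, y^l), or None if unusable."""
    responses: List[str] = example["responses"]
    rewards: List[float] = example["rewards"]
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None  # discard the pair when the top and bottom scores tie
    return {
        "prompt": example["prompt"],
        "chosen": responses[best],     # y^w: highest-scoring candidate
        "rejected": responses[worst],  # y^l: lowest-scoring candidate
    }
```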
Review the user's question and the corresponding response using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
- Add 1 point if the response is relevant and provides some information related to the user's inquiry, even if it is incomplete or contains some irrelevant content.
- Add another point if the response addresses a substantial portion of the user's question, but does not completely resolve the query or provide a direct answer.
- Award a third point if the response answers the basic elements of the user's question in a useful way, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.
- Grant a fourth point if the response is clearly written from an AI Assistant's perspective, addressing the user's question directly and comprehensively, and is well-organized and helpful, even if there is slight room for improvement in clarity, conciseness or focus.
- Bestow a fifth point for a response that is impeccably tailored to the user's question by an AI Assistant, without extraneous information, reflecting expert knowledge, and demonstrating a high-quality, engaging, and insightful answer.

User: <INSTRUCTION_HERE>

<response> <RESPONSE_HERE> </response>

After examining the user's instruction and the response:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: "Score: <total points>"

Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response in alignment with this additive scoring model, we'll systematically attribute points based on the outlined criteria.
Figure 2: LLM-as-a-Judge prompt for our LLM to act as a reward model and provide self-rewards for its own model generations. The model is initially trained with seed training data of how to perform well at this task, and then improves at this task further through our self-rewarding training procedure.
### 2.4 Overall Self-Alignment Algorithm
Iterative Training
Our overall procedure trains a series of models $M_1,\dots,M_T$ where each successive model $M_t$ uses augmented training data created by the $(t-1)^{\text{th}}$ model. We thus define AIFT( $M_t$ ) to mean AI Feedback Training data created using model $M_t$ .
Model Sequence
We define the models, and the training data they use as follows:
- $M_0$ : Base pretrained LLM with no fine-tuning.
- $M_1$ : Initialized with $M_0$ , then fine-tuned on the IFT+EFT seed data using SFT.
- $M_2$ : Initialized with $M_1$ , then trained with AIFT( $M_1$ ) data using DPO.
- $M_3$ : Initialized with $M_2$ , then trained with AIFT( $M_2$ ) data using DPO.
This iterative training resembles the procedure used in Pairwise Cringe Optimization and specifically is termed Iterative DPO, introduced in Xu et al. (2023); however, an external fixed reward model was used in that work.
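The whole schedule can be summarized in a short driver loop. The sketch below is hypothetical and reuses the helpers sketched in Sections 2.2 and 2.3; `sft_train`, `dpo_train`, and the `.generate` method are stand-ins for whichever SFT/DPO trainer and decoding interface one uses.

```python
from typing import Callable, List

def self_rewarding_training(
    base_model,                       # M_0: pretrained LLM
    sft_train: Callable,              # stand-in: supervised fine-tuning step
    dpo_train: Callable,              # stand-in: DPO training step
    ift_data: List,
    eft_data: List,
    seed_prompts: List[str],
    generate_prompt: Callable[[str], str],
    iterations: int = 3,
):
    """Sketch of the M_0 -> M_1 -> M_2 -> M_3 sequence described above."""
    # M_1: initialized from M_0, fine-tuned on the IFT + EFT seed data with SFT.
    model = sft_train(base_model, ift_data + eft_data)
    for t in range(1, iterations):
        # AIFT(M_t): preference data created and scored by the current model M_t.
        examples = self_instruct_round(model.generate, generate_prompt, seed_prompts)
        pairs = [p for p in map(build_preference_pair, examples) if p is not None]
        # M_{t+1}: initialized from M_t, trained on AIFT(M_t) with DPO.
        model = dpo_train(model, pairs)
    return model
```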
## 3 Experiments
### 3.1 Experimental Setup
Base Model
In our experiments we use Llama 2 70B (Touvron et al., 2023) as our base pretrained model.
#### 3.1.1 Seed Training Data
IFT Seed Data
We use the human-authored examples provided in the Open Assistant dataset (Köpf et al., 2023) for instruction fine-tuning. Following Li et al. (2024) we use 3200 examples, by sampling only first conversational turns in the English language that are high-quality, based on their human annotated rank (choosing only the highest rank 0). In our experiments, we compare to a model fine-tuned from the base model using only this data via supervised fine-tuning, and refer to it as our SFT baseline.
EFT Seed Data
The Open Assistant data also provides multiple ranked human responses per prompt from which we can construct evaluation fine-tuning data. We split this into train and evaluation sets, and use it to create LLM-as-a-Judge data. This is done by placing it in the input format given in Figure 2, which consists of the scoring criteria description, and the given instruction and response to be evaluated. Note that the prompt, derived from Li et al. (2024), mentions "utilizing web search", but our model is not actually capable of this action. For training targets, chain-of-thought justifications and final scores out of 5 are not directly provided, so we use the SFT baseline to generate such output evaluations for each input, and accept them into the training set if the ranking of their scores agrees with the human rankings in the dataset. We resample the training set by discarding some of the data that receives the most common score so that the scores are not too skewed, as we observe many samples receive a score of 4. This results in 1,630 train and 541 evaluation examples (which do not overlap with the IFT data).
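A simplified sketch of this filtering and rebalancing is given below. The acceptance rule (judge scores consistent with the human best-to-worst ordering) and the downsampling of the most common score follow the description above, but the function names, tie handling, and keep fraction are assumptions; in practice the retained training targets are the judge's full chain-of-thought evaluations rather than just the parsed scores.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

def build_eft_examples(
    judge: Callable[[str, str], float],  # stand-in: SFT baseline scoring (prompt, response) -> [0, 5]
    ranked_data: List[Dict],             # each item: {"prompt": ..., "responses": [best, ..., worst]}
) -> List[Dict]:
    """Keep judge evaluations whose score ordering agrees with the human ranking."""
    kept = []
    for item in ranked_data:
        scores = [judge(item["prompt"], y) for y in item["responses"]]
        # Responses are listed best-first, so agreement means non-increasing scores
        # (how exact ties should be treated is not specified; they are allowed here).
        if all(a >= b for a, b in zip(scores, scores[1:])):
            kept.extend(
                {"prompt": item["prompt"], "response": y, "score": s}
                for y, s in zip(item["responses"], scores)
            )
    return kept

def rebalance_scores(examples: List[Dict], keep_frac: float = 0.5) -> List[Dict]:
    """Downsample the most frequent score (a 4 in our experiments) to reduce skew."""
    top_score, _ = Counter(e["score"] for e in examples).most_common(1)[0]
    majority = [e for e in examples if e["score"] == top_score]
    rest = [e for e in examples if e["score"] != top_score]
    return rest + random.sample(majority, k=int(len(majority) * keep_frac))
```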
#### 3.1.2 Evaluation Metrics
We evaluate the performance of our self-rewarding models in two axes: their ability to follow instructions, and their ability as a reward model (ability to evaluate responses).
Instruction Following
We evaluate head-to-head performance between various models using GPT-4 (Achiam et al., 2023) as an evaluator over 256 test prompts (which we refer to as IFT test data) derived from various sources following Li et al. (2024), using the AlpacaEval evaluation prompt (Li et al., 2023). We try the prompt in both orders comparing pairwise, and if the GPT-4 evaluations disagree we count the result as a tie. We also perform a similar evaluation with humans (authors). We additionally report results in the AlpacaEval 2.0 leaderboard format, which is evaluated over 805 prompts, and compute the win rate against the baseline GPT-4 Turbo model based on GPT-4 judgments. Further, we report results on MT-Bench (Zheng et al., 2023b), a set of challenging multi-turn questions in various categories from math and coding to roleplay and writing, which uses GPT-4 to grade the model responses out of 10. Finally, we also test the models on a set of 9 NLP benchmarks: ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), SIQA (Sap et al., 2019), PIQA (Bisk et al., 2020), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), OBQA (Mihaylov et al., 2018) and NQ (Kwiatkowski et al., 2019).
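The order-swapping rule for head-to-head judgments can be written down directly; a small sketch follows, where the `judge` callable is a stand-in for one GPT-4 call with the AlpacaEval comparison prompt.

```python
from typing import Callable

def head_to_head(judge: Callable[[str, str, str], str], prompt: str, a: str, b: str) -> str:
    """Pairwise judgment in both presentation orders; disagreement counts as a tie.

    `judge(prompt, first, second)` stands in for one GPT-4 call with the AlpacaEval
    comparison prompt and is assumed to return "first", "second", or "tie".
    """
    forward = judge(prompt, a, b)   # a shown first
    backward = judge(prompt, b, a)  # b shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if "tie" in (forward, backward) or a_wins_forward != a_wins_backward:
        return "tie"
    return "a" if a_wins_forward else "b"
```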
Reward Modeling
We evaluate the correlation with human rankings on the evaluation set we derived from the Open Assistant dataset, as described in Section 3.1.1. Each instruction has on average 2.85 responses with given rankings. We can thus measure the pairwise accuracy, which is how many times the order of the ranking between any given pair agrees between the model's evaluation and the human ranking. We also measure the exact match count, which is how often the total ordering is exactly the same for an instruction. We also report the Spearman correlation and Kendall's $\tau$ . Finally, we report how often the responses that the model scores a perfect 5 out of 5 are rated as the highest ranked by humans.
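For concreteness, a sketch of the pairwise accuracy metric is given below; the treatment of tied model scores is an assumption, since it is not specified above, and Spearman and Kendall's $\tau$ can be computed analogously, e.g., with `scipy.stats`.

```python
from itertools import combinations
from typing import List

def pairwise_accuracy(model_scores: List[float], human_ranks: List[int]) -> float:
    """Fraction of response pairs whose ordering by model score agrees with the
    human ranking (rank 0 = best); pairs tied in the human ranking are skipped,
    and tied model scores count as disagreements (an assumption)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(model_scores)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # no human preference between this pair
        total += 1
        human_prefers_i = human_ranks[i] < human_ranks[j]
        model_prefers_i = model_scores[i] > model_scores[j]
        if model_scores[i] != model_scores[j] and human_prefers_i == model_prefers_i:
            agree += 1
    return agree / total if total else 0.0
```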
#### 3.1.3 Training Details
Instruction following training
The training hyperparameters we use are as follows. For SFT we use learning rate $5.5\times10^{-6}$ which decays (cosine) to $1.1\times10^{-6}$ at the end of training, batch size $16$ and dropout $0.1$ . We only calculate the loss on target tokens instead of the full sequence. For DPO we use learning rate $1\times10^{-6}$ which decays to $1\times10^{-7}$ , batch size $16$ , dropout $0.1$ , and a $\beta$ value of 0.1. We perform early stopping by saving a checkpoint every 200 steps and evaluating generations using Claude 2 (Anthropic, 2023) on 253 validation examples derived from various sources following Li et al. (2024). This is evaluated pairwise against the previous step's generations using the AlpacaEval evaluation prompt format (Li et al., 2023).
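For reference, the hyperparameters above collected into plain dictionaries; the key names are illustrative rather than a particular trainer's API, and the DPO decay shape is an assumption since the text only says the learning rate "decays".

```python
# Plain-dictionary summary of the hyperparameters listed above; key names are
# illustrative, not a particular trainer's API.
SFT_CONFIG = {
    "learning_rate": 5.5e-6,      # cosine decay to 1.1e-6 by the end of training
    "lr_schedule": "cosine",
    "batch_size": 16,
    "dropout": 0.1,
    "loss_on": "target_tokens",   # loss computed on target tokens only
}

DPO_CONFIG = {
    "learning_rate": 1e-6,        # decays to 1e-7 (decay shape assumed to match SFT)
    "batch_size": 16,
    "dropout": 0.1,
    "beta": 0.1,
    "early_stopping": {
        "checkpoint_every_steps": 200,
        "judge": "Claude 2",       # pairwise vs. previous checkpoint's generations
        "validation_examples": 253,
    },
}
```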
Self-Instruction creation
To generate new prompts we use a fixed model, Llama 2-Chat 70B with 8-shot prompting following Self-Instruct (Wang et al., 2023), where we sample six demonstrations from the IFT data and two from the model generated data, and use decoding parameters $T=0.6$ , $p=0.9$ . We use their prompt template for non-classification tasks and apply the same filtering techniques, including the ROUGE-L (Lin, 2004) similarity check, keyword filtering, and length filtering. Except for the prompt generation part, the other parts of the creation pipeline (generating the response, and evaluating it) use the Self-Rewarding model being trained. For candidate response generation we sample $N=4$ candidate responses with temperature $T=0.7$ , $p=0.9$ . When evaluating candidate responses, as there is variance to these scores, in our experiments we also use sampled decoding (with the same parameters) and generate these evaluations multiple (3) times and take the average. We added 3,964 such preference pairs to form the AIFT( $M_1$ ) dataset used to train $M_2$ via DPO, and 6,942 pairs to form AIFT( $M_2$ ) used to train $M_3$ .
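The variance-reduction step, averaging several sampled judge completions per candidate, can be sketched as follows, reusing `parse_score` from the Section 2.2 sketch; `judge_generate` stands in for sampled decoding of the Figure 2 prompt with $T=0.7$ , $p=0.9$ .

```python
from statistics import mean
from typing import Callable

def averaged_reward(judge_generate: Callable[[str], str], judge_prompt: str, n_samples: int = 3) -> float:
    """Score one candidate by sampling the LLM-as-a-Judge several times and averaging,
    to reduce the variance of the self-assigned reward."""
    return mean(parse_score(judge_generate(judge_prompt)) for _ in range(n_samples))
```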
### 3.2 Results
#### 3.2.1 Instruction Following Ability
Head to head performance results are provided in Figure 3.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Horizontal Stacked Bar Chart: Model Performance Comparison
### Overview
The image displays a horizontal stacked bar chart comparing the performance of three different "Self-Rewarding" models (M3, M2, M1) against a common "SFT Baseline" model. The chart quantifies the outcomes of comparisons in terms of wins for each model type and ties.
### Components/Axes
* **Legend:** Located at the top center of the chart. It defines three categories:
* **Green Bar:** "Self-Rewarding Wins"
* **Light Blue Bar:** "Tie"
* **Red Bar:** "SFT Baseline Wins"
* **Y-Axis (Vertical):** Lists the three model comparisons being made. From top to bottom:
1. `Self-Rewarding M3 vs. SFT Baseline`
2. `Self-Rewarding M2 vs. SFT Baseline`
3. `Self-Rewarding M1 vs. SFT Baseline`
* **X-Axis (Horizontal):** Represents the percentage of outcomes. The axis is not numerically labeled, but the total length of each bar represents 100%, and the segments are labeled with their percentage values.
### Detailed Analysis
Each horizontal bar is divided into three colored segments corresponding to the legend. The values are as follows:
1. **Top Bar (M3 vs. Baseline):**
* **Green (Self-Rewarding Wins):** 62.5% (Left segment)
* **Light Blue (Tie):** 27.7% (Middle segment)
* **Red (SFT Baseline Wins):** 9.8% (Right segment)
* *Trend Check:* The green segment is the largest, indicating M3 wins the majority of comparisons. The red segment is very small.
2. **Middle Bar (M2 vs. Baseline):**
* **Green (Self-Rewarding Wins):** 49.2% (Left segment)
* **Light Blue (Tie):** 36.3% (Middle segment)
* **Red (SFT Baseline Wins):** 14.5% (Right segment)
* *Trend Check:* The green segment is still the largest, but its lead over the tie segment is smaller than in M3. The red segment is larger than for M3.
3. **Bottom Bar (M1 vs. Baseline):**
* **Green (Self-Rewarding Wins):** 30.5% (Left segment)
* **Light Blue (Tie):** 38.7% (Middle segment)
* **Red (SFT Baseline Wins):** 30.9% (Right segment)
* *Trend Check:* The green and red segments are nearly equal in size, with the tie segment being the largest. This indicates a much more balanced performance between M1 and the baseline.
### Key Observations
* **Clear Performance Gradient:** There is a strong, consistent trend across the three models. As we move from M1 to M2 to M3, the "Self-Rewarding Wins" percentage increases substantially (30.5% -> 49.2% -> 62.5%).
* **Inverse Relationship:** Correspondingly, the "SFT Baseline Wins" percentage decreases as the Self-Rewarding model version increases (30.9% -> 14.5% -> 9.8%).
* **Tie Rate Variation:** The percentage of ties is not constant. It peaks with M2 (36.3%) and is lowest with M3 (27.7%).
* **M3 Dominance:** The M3 model shows a clear majority win rate (62.5%) against the baseline, with the baseline winning less than 10% of the time.
* **M1 Parity:** The M1 model performs almost identically to the SFT Baseline, with win rates within 0.4% of each other (30.5% vs. 30.9%).
### Interpretation
This chart demonstrates the progressive improvement of a series of "Self-Rewarding" models (M1, M2, M3) when evaluated against a fixed "SFT Baseline." The data suggests a successful iterative development process.
* **M1** appears to be a foundational model that matches, but does not exceed, the baseline's performance.
* **M2** represents a significant step forward, winning nearly half the comparisons and reducing the baseline's win rate by more than half compared to M1.
* **M3** is the most advanced model shown, achieving a dominant win rate. The very low baseline win rate (9.8%) indicates that the baseline model rarely produces a better outcome than M3.
The varying tie rates suggest that the nature of the comparisons changes with model capability. M1 and the baseline are often indistinguishable (high tie rate). M2 is more frequently distinguishable from the baseline, leading to a higher decisive outcome rate (wins + losses). M3 is so superior that ties become less common again, as it more clearly outperforms the baseline.
**In summary, the chart provides strong visual evidence that the Self-Rewarding modeling approach yields progressively better results with each iteration (M1->M2->M3), culminating in a model (M3) that decisively outperforms the SFT Baseline in the majority of evaluations.**
</details>
<details>
<summary>x3.png Details</summary>

### Visual Description
## Horizontal Stacked Bar Chart: Pairwise Model Performance Comparison
### Overview
The image displays a horizontal stacked bar chart comparing the performance of three different "Self-Rewarding" models (M1, M2, M3) in three pairwise matchups. The chart quantifies the win rate of the left-listed model versus the right-listed model, with categories for "Left Wins," "Tie," and "Right Wins."
### Components/Axes
* **Legend:** Positioned at the top center. It defines three categories:
* **Left Wins (in Left vs. Right):** Represented by a bright green color.
* **Tie:** Represented by a light blue color.
* **Right Wins:** Represented by a red/salmon color.
* **Y-Axis (Vertical):** Lists the three model comparison matchups. From top to bottom:
1. `Self-Rewarding M3 vs. M2`
2. `Self-Rewarding M2 vs. M1`
3. `Self-Rewarding M3 vs. M1`
* **X-Axis (Horizontal):** Implicitly represents percentage (0-100%), though no axis line or labels are drawn. The total length of each bar represents 100% of outcomes.
* **Data Labels:** Numerical percentage values are embedded directly within each colored segment of the bars.
### Detailed Analysis
Each bar is segmented into three parts corresponding to the legend. The values are as follows:
1. **Top Bar: `Self-Rewarding M3 vs. M2`**
* **Left Wins (Green):** 47.7%
* **Tie (Light Blue):** 39.8%
* **Right Wins (Red):** 12.5%
* *Trend Check:* The green segment (Left Wins) is the largest, followed by a substantial tie segment, with the red segment (Right Wins) being the smallest.
2. **Middle Bar: `Self-Rewarding M2 vs. M1`**
* **Left Wins (Green):** 55.5%
* **Tie (Light Blue):** 32.8%
* **Right Wins (Red):** 11.7%
* *Trend Check:* The green segment is larger than in the first bar, the tie segment is smaller, and the red segment is slightly smaller.
3. **Bottom Bar: `Self-Rewarding M3 vs. M1`**
* **Left Wins (Green):** 68.8%
* **Tie (Light Blue):** 22.7%
* **Right Wins (Red):** 8.6%
* *Trend Check:* The green segment is the largest of all three bars, the tie segment is the smallest, and the red segment is also the smallest.
### Key Observations
* **Clear Performance Hierarchy:** The "Left Wins" percentage increases progressively from the top bar (47.7%) to the bottom bar (68.8%). This indicates that the performance gap widens when comparing models further apart in the sequence (M3 vs. M1) compared to adjacent models (M3 vs. M2, M2 vs. M1).
* **Inverse Relationship with Ties:** As the "Left Wins" percentage increases, the "Tie" percentage decreases correspondingly (39.8% -> 32.8% -> 22.7%). This suggests that clearer victories become more common when comparing more dissimilar models.
* **Consistently Low "Right Wins":** The "Right Wins" percentage is low across all comparisons (12.5%, 11.7%, 8.6%), indicating that the model listed on the right (the older model in each pair) rarely outperforms the one on the left.
* **Spatial Layout:** The legend is centered at the top. The bars are left-aligned with their labels. The numerical data labels are centered within their respective colored segments.
### Interpretation
The data strongly suggests a consistent and improving performance trend across the Self-Rewarding model versions M1, M2, and M3. The chart demonstrates that:
1. **Iterative Improvement:** Each subsequent model (M2 > M1, M3 > M2) outperforms its predecessor in a head-to-head comparison.
2. **Magnitude of Improvement:** The improvement is not linear. The performance jump from M1 to M3 (68.8% win rate) is significantly larger than the jump from M1 to M2 (55.5% win rate) or M2 to M3 (47.7% win rate). This could indicate accelerating returns or compounding improvements in the model series.
3. **Reduction in Ambiguity:** The decreasing tie rate implies that as models evolve, their outputs become more distinctly different in quality, making it easier to determine a winner. The oldest model (M1) is almost never judged superior to the newest (M3), as shown by the minimal 8.6% "Right Wins" in that matchup.
In essence, the chart provides clear, quantitative evidence for the progressive superiority of the Self-Rewarding model line, with M3 being the most advanced and M1 the baseline.
</details>
Figure 3: Instruction following ability improves with Self-Training: We evaluate our models using head-to-head win rates on diverse prompts using GPT-4. The SFT Baseline is on par with Self-Rewarding Iteration 1 ( $M_1$ ). However, Iteration 2 ( $M_2$ ) outperforms both Iteration 1 ( $M_1$ ) and the SFT Baseline. Iteration 3 ( $M_3$ ) gives further gains over Iteration 2 ( $M_2$ ), outperforming $M_1$ , $M_2$ and the SFT Baseline by a large margin.
EFT+IFT seed training performs similarly to IFT alone
We find that adding the Evaluation Fine-Tuning (EFT) task to training does not impact instruction following performance compared to using Instruction Fine-Tuning (IFT) data alone, with an almost equal head-to-head result (30.5% wins vs. 30.9% wins). This is a positive result because it means the increased capability of a model to self-reward does not affect its other skills. We can thus use IFT+EFT training as Iteration 1 ( $M_1$ ) of our Self-Rewarding model, and then run further iterations.
Iteration 2 ( $M_2$ ) improves over Iteration 1 ( $M_1$ ) and SFT Baseline
Iteration 2 of Self-Rewarding training ( $M_2$ ) provides superior instruction following to Iteration 1 ( $M_1$ ) with 55.5% wins for $M_2$ compared to only 11.7% for $M_1$ in a head to head evaluation. It provides similar gains over the SFT Baseline as well (49.2% wins vs. 14.5% wins). Clearly, there is a large jump in performance from $M_1$ to $M_2$ by using the preference data AIFT( $M_1$ ) provided by the reward model from Iteration 1.
Iteration 3 ( $M_3$ ) improves over Iteration 2 ( $M_2$ )
We see a further gain in Iteration 3 over Iteration 2, with 47.7% wins for $M_3$ compared to only 12.5% for $M_2$ in a head to head evaluation. Similarly, the win rate over the SFT Baseline for $M_3$ increases to 62.5% wins vs. 9.8%, i.e., winning more often than the $M_2$ model did. Overall, we see large gains from $M_2$ to $M_3$ through training using the preference data AIFT( $M_2$ ) provided by the reward model from Iteration 2.
Self-Rewarding models perform well on AlpacaEval 2 leaderboard
We evaluate our models on the AlpacaEval 2.0 leaderboard format, with results given in Table 1. We observe the same findings as in the head-to-head evaluations: training iterations yield improved win rates over GPT-4 Turbo, from 9.94% in Iteration 1, to 15.38% in Iteration 2, to 20.44% in Iteration 3. Our Iteration 3 model outperforms many existing models in this metric, including Claude 2, Gemini Pro, and GPT-4 0613. We show some selected models from the leaderboard in the table. We note that many of those competing models contain either proprietary alignment data (which is typically large, e.g., over 1M annotations in Touvron et al. (2023)) or use targets that are distilled from stronger models. In contrast, our Self-Rewarding model starts from a small set of seed data from Open Assistant, and then generates targets and rewards from the model itself for further iterations of training.
Table 1: AlpacaEval 2.0 results (win rate over GPT-4 Turbo evaluated by GPT-4). Self-Rewarding iterations yield improving win rates. Iteration 3 ( $M_3$ ) outperforms many existing models that use proprietary training data or targets distilled from stronger models.
| Model | Win Rate | Distilled Alignment Targets | Proprietary Alignment Targets |
| --- | --- | --- | --- |
| Self-Rewarding 70B | | | |
| Iteration 1 ( $M_1$ ) | 9.94% | | |
| Iteration 2 ( $M_2$ ) | 15.38% | | |
| Iteration 3 ( $M_3$ ) | 20.44% | | |
| Selected models from the leaderboard | | | |
| GPT-4 0314 | 22.07% | | ✓ |
| Mistral Medium | 21.86% | | ✓ |
| Claude 2 | 17.19% | | ✓ |
| Gemini Pro | 16.85% | | ✓ |
| GPT-4 0613 | 15.76% | | ✓ |
| GPT 3.5 Turbo 0613 | 14.13% | | ✓ |
| LLaMA2 Chat 70B | 13.87% | | ✓ |
| Vicuna 33B v1.3 | 12.71% | ✓ | |
| Humpback LLaMa2 70B | 10.12% | | |
| Guanaco 65B | 6.86% | | |
| Davinci001 | 2.76% | | ✓ |
| Alpaca 7B | 2.59% | ✓ | |
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Model Win Rates Across Domains
### Overview
This is a multi-series line chart comparing the performance of four different models (M0, M1, M2, M3) across 20 distinct knowledge or activity domains. The performance metric is "Win rate (%)", plotted on the y-axis against the domain categories on the x-axis. The chart reveals significant variability in model performance depending on the subject matter.
### Components/Axes
* **Y-Axis:** Labeled "Win rate (%)". Scale ranges from 0 to 35, with major gridlines at intervals of 5 (0, 5, 10, 15, 20, 25, 30, 35).
* **X-Axis:** Lists 20 domain categories. From left to right: Health, Professional, Linguistics, Other, Entertainment, Technology, Literature, Coding, Science, Gaming, Philosophy, Social Studies, Travel, Arts, Sports, Mathematics, Social Interaction, DIY Projects, Cooking.
* **Legend:** Positioned in the top-right corner of the chart area. It defines four data series:
* **M0:** Dark purple line with circle markers.
* **M1:** Purple line with upward-pointing triangle markers.
* **M2:** Red line with square markers.
* **M3:** Orange line with diamond markers.
### Detailed Analysis
Below are the approximate win rates (%) for each model across all domains. Values are estimated from the chart's gridlines.
| Domain | M0 (Dark Purple, Circle) | M1 (Purple, Triangle) | M2 (Red, Square) | M3 (Orange, Diamond) |
| :--- | :--- | :--- | :--- | :--- |
| **Health** | ~19 | ~19 | ~31 | ~31 |
| **Professional** | ~12 | ~19 | ~28 | ~30 |
| **Linguistics** | ~6 | ~15 | ~22 | ~28 |
| **Other** | ~16 | ~18 | ~26 | ~28 |
| **Entertainment** | ~15 | ~9 | ~21 | ~26 |
| **Technology** | ~10 | ~15 | **~35** | ~25 |
| **Literature** | ~3 | ~3 | ~10 | ~23 |
| **Coding** | ~9 | ~9 | ~14 | ~23 |
| **Science** | ~5 | ~6 | ~14 | ~22 |
| **Gaming** | ~6 | **~0** | ~11 | ~22 |
| **Philosophy** | ~7 | ~13 | **~0** | ~20 |
| **Social Studies** | ~3 | ~13 | ~8 | ~18 |
| **Travel** | ~14 | ~8 | ~15 | ~17 |
| **Arts** | ~7 | ~10 | ~10 | ~17 |
| **Sports** | ~8 | ~8 | ~8 | ~17 |
| **Mathematics** | ~15 | ~11 | ~10 | ~12 |
| **Social Interaction** | ~8 | ~5 | ~10 | ~11 |
| **DIY Projects** | ~4 | ~4 | ~8 | ~9 |
| **Cooking** | **~0** | ~2 | ~2 | ~6 |
**Trend Verification per Series:**
* **M0 (Dark Purple):** Shows a generally volatile, downward trend from left to right, starting around 19% and ending at 0%. It has notable peaks in Health, Other, Travel, and Mathematics.
* **M1 (Purple):** Also volatile with a slight downward trend. It starts around 19%, dips to near 0% at Gaming, recovers, and ends around 2%. Peaks are in Health, Professional, and Philosophy.
* **M2 (Red):** Exhibits the most dramatic fluctuations. It starts very high (~31%), peaks sharply at Technology (~35%), crashes to 0% at Philosophy, and ends very low (~2%). It shows strong performance in technical and professional domains but fails completely in Philosophy.
* **M3 (Orange):** Displays the most consistent and clear trend: a steady, almost linear decline from left to right. It starts as the top performer (~31% in Health) and gradually decreases to the lowest performer in Cooking (~6%). It never drops to zero and maintains a relatively smooth descent.
### Key Observations
1. **Domain Specialization:** No single model dominates all domains. M3 is strongest in the first half (Health to Science), M2 has a massive spike in Technology, and M0 is relatively strong in Mathematics.
2. **Catastrophic Failures:** Three models hit a 0% win rate in specific domains: M1 in Gaming, M2 in Philosophy, and M0 in Cooking. This suggests complete inability or failure in those specific contexts.
3. **Performance Clustering:** In domains like Sports and DIY Projects, all models perform within a narrow, low range (roughly 4-17%), indicating these are challenging areas for all tested models.
* **Inverse Relationships:** In several domains (e.g., Technology, Philosophy), the performance of M2 is inversely related to the others: when M2 peaks or crashes, the other models show more moderate values.
### Interpretation
This chart likely compares the efficacy of different AI models or algorithms on a benchmark suite covering diverse topics. The "Win rate" probably represents the percentage of test cases or prompts where a given model produced the best or a satisfactory answer compared to its peers.
The data suggests that model architecture or training data creates strong domain biases. **M3's** smooth decline might indicate a model trained on a broad but general corpus, performing well on common topics (Health, Professional) but lacking depth in specialized or practical areas (Cooking, DIY). **M2's** profile is that of a "specialist": exceptionally strong in Technology (perhaps due to technical training data) but brittle, failing completely in abstract reasoning (Philosophy). **M0 and M1** show more erratic performance, which could indicate less stable training or sensitivity to specific prompt phrasings within each domain.
The complete failures (0% win rate) are critical findings. They don't just indicate poor performance but a fundamental breakdown, which could be due to training data gaps, algorithmic flaws in handling certain query types, or safety filters being overly triggered. The clustering of low scores in hands-on domains (Sports, DIY, Cooking) across all models highlights a current limitation in AI's ability to reason about physical, procedural, or experiential knowledge compared to textual or technical information.
</details>
Figure 4: AlpacaEval win rate breakdown for instruction categories (full names given in Appendix). Self-Rewarding models give gains across several topics, but tend to give smaller gains on, e.g., mathematics and reasoning tasks.
Fine-grained analysis
As described earlier, the overall performance of the model in AlpacaEval improves with each iteration of training. It would be interesting to break down the overall performance improvement to see exactly what type of tasks these improvements come from. Therefore, we cluster the instructions in the AlpacaEval test set into different groups based on three perspectives: (1) instruction category, (2) instruction complexity, and (3) expected response length. We achieve this by using GPT-4. The detailed statistical information of the breakdown and the prompting techniques we used for getting this breakdown can be found in Appendix A.6. Results for the instruction category are given in Figure 4, and the other two in Appendix Figure 11. From the results we can conclude that (i) Self-Rewarding models can substantially improve the win rate in most categories, but there are some tasks for which this approach does not improve, such as mathematics and logical reasoning, indicating that our current training approach mainly allows the models to better utilize their existing knowledge. (ii) Through Self-Rewarding model training, the model's win rate increases on almost all tasks of different complexity, and especially on slightly more difficult tasks (complexity of 5, 6, 7 out of 10). (iii) The models also show a steady increase in the win rate on tasks with instructions with different expected response lengths.
Data distribution analysis
We perform a t-SNE (Van der Maaten and Hinton, 2008) visualization of the IFT, EFT and AIFT( $M_1$ ) data, shown in Appendix A.1. We find good overlap between the IFT and AIFT( $M_1$ ) examples, which is desired, while the EFT examples lie in a different part of the embedding space, which can help explain why they would not affect IFT performance. We observe that generations from $M_1$ on AlpacaEval have an average length of 1092, for $M_2$ they are 1552, and for $M_3$ they are 2552, so the model is learning to generate longer responses, which we note may be a factor in relative performance.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Horizontal Stacked Bar Chart: Self-Rewarding Model vs. SFT Baseline Performance
### Overview
The image displays a horizontal stacked bar chart comparing the performance of three versions of a "Self-Rewarding" model (M1, M2, M3) against a common "SFT Baseline" model. The chart quantifies the outcomes of comparisons as percentages of wins for each model and ties.
### Components/Axes
* **Chart Type:** Horizontal Stacked Bar Chart.
* **Y-Axis (Vertical):** Lists the three comparison pairs. From top to bottom:
1. `Self-Rewarding M3 vs. SFT Baseline`
2. `Self-Rewarding M2 vs. SFT Baseline`
3. `Self-Rewarding M1 vs. SFT Baseline`
* **X-Axis (Horizontal):** Implicitly represents percentage (0-100%), though no numerical axis labels are present. The total length of each bar represents 100% of the comparison outcomes.
* **Legend:** Positioned at the top of the chart.
* **Green Square:** `Self-Rewarding Wins`
* **Light Blue Square:** `Tie`
* **Red Square:** `SFT Baseline Wins`
* **Data Labels:** Percentage values are embedded directly within each colored segment of the bars.
### Detailed Analysis
Each bar is segmented into three colored parts corresponding to the legend. The values are transcribed as follows:
1. **Top Bar: Self-Rewarding M3 vs. SFT Baseline**
* **Green Segment (Left):** `66.0` (Self-Rewarding Wins)
* **Light Blue Segment (Center):** `16.0` (Tie)
* **Red Segment (Right):** `18.0` (SFT Baseline Wins)
* **Trend:** The green segment is dominant, occupying nearly two-thirds of the bar, indicating a strong performance advantage for the M3 model.
2. **Middle Bar: Self-Rewarding M2 vs. SFT Baseline**
* **Green Segment (Left):** `56.0` (Self-Rewarding Wins)
* **Light Blue Segment (Center):** `24.0` (Tie)
* **Red Segment (Right):** `20.0` (SFT Baseline Wins)
* **Trend:** The green segment is still the largest, but smaller than in M3. The tie segment is notably larger.
3. **Bottom Bar: Self-Rewarding M1 vs. SFT Baseline**
* **Green Segment (Left):** `28.0` (Self-Rewarding Wins)
* **Light Blue Segment (Center):** `26.0` (Tie)
* **Red Segment (Right):** `46.0` (SFT Baseline Wins)
* **Trend:** The red segment is the largest, indicating the SFT Baseline wins more often than the M1 model. The green segment is the smallest of all three bars.
### Key Observations
* **Clear Performance Gradient:** There is a stark, monotonic improvement in the Self-Rewarding model's performance as the version number increases from M1 to M3.
* **M1 Underperformance:** The Self-Rewarding M1 model loses to the SFT Baseline more often than it wins (46.0% vs. 28.0%).
* **M3 Dominance:** The Self-Rewarding M3 model achieves a decisive win rate of 66.0%, losing only 18.0% of the time.
* **Tie Rate Variation:** The percentage of ties is not constant. It is lowest for M3 (16.0%) and highest for M1 (26.0%).
### Interpretation
This chart demonstrates the progressive efficacy of an iterative model development process, likely involving self-rewarding or reinforcement learning techniques. The data suggests that successive versions (M1 → M2 → M3) of the Self-Rewarding model have learned to significantly outperform the fixed SFT (Supervised Fine-Tuning) Baseline.
The inversion of the win/loss ratio from M1 (where the baseline is superior) to M3 (where the self-rewarding model is dominant) indicates a successful training or alignment trajectory. The varying tie rates may reflect changes in model confidence or the distinctness of the outputs between versions. The primary takeaway is that the "Self-Rewarding" approach, as embodied in model M3, yields a substantially more capable model than the standard SFT baseline it was compared against.
</details>
Figure 5: Human evaluation results. Iterations of Self-Rewarding ( $M_1$ , $M_2$ and $M_3$ ) provide progressively better head-to-head win rates compared to the SFT baseline, in agreement with the automatic evaluation results.
Table 2: MT-Bench Results (on a scale of 10). Self-Rewarding iterations yield improving scores across various categories. Math, code & reasoning performance and iteration gains are smaller than for other categories, likely due to the makeup of the Open Assistant seed data we use.
| Model | Overall Score | Math, Code & Reasoning | Humanities, Extraction, STEM, Roleplay, Writing |
| --- | --- | --- | --- |
| $M_1$ | 6.78 | 3.83 | 8.55 |
| $M_2$ | 7.01 | 4.05 | 8.79 |
| $M_3$ | 7.25 | 4.17 | 9.10 |
Table 3: NLP Benchmarks. Self-Rewarding models mostly tend to maintain performance compared to the Llama 2 70B base model and the SFT Baseline, despite being fine-tuned on very different instruction-following prompts.
| Model | ARC-Challenge | HellaSwag | GSM8K | MMLU | NQ |
| --- | --- | --- | --- | --- | --- |
| Llama 2 70B | 57.40 | 85.30 | 56.80 | 68.90 | 25.30 |
| SFT Baseline | 55.97 | 85.17 | 50.72 | 69.76 | 34.35 |
| $M_1$ | 57.51 | 84.99 | 60.27 | 69.34 | 35.48 |
| $M_2$ | 54.51 | 84.27 | 59.29 | 69.31 | 33.07 |
| $M_3$ | 53.13 | 83.29 | 57.70 | 69.37 | 31.86 |
Human evaluation
To examine whether human judgments align with automatic evaluation results, we conduct human evaluations that compare SFT baseline generations with the generations from each iteration of Self-Rewarding training, i.e., models $M_1$ , $M_2$ , and $M_3$ . Specifically, we randomly select 50 instructions from the IFT test set. Each instruction corresponds to three pairs of generations (i.e., baseline vs. $M_1$ , baseline vs. $M_2$ , baseline vs. $M_3$ ). For each pair of generations, we assign them to three different annotators (blind evaluation performed by the authors) to make a pairwise judgment, and take a majority vote to decide which generation is better. The human evaluation results are shown in Figure 5. We find that Self-Rewarding models from later iterations show a larger advantage over the SFT baseline model, which is consistent with GPT-4's judgments, and demonstrates the effectiveness of our iterative training procedure.
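For clarity, the per-pair aggregation used here is a simple majority vote over the three annotators; a minimal sketch follows, where the label strings and the all-disagree fallback are assumptions.

```python
from collections import Counter
from typing import List

def majority_vote(judgments: List[str]) -> str:
    """Aggregate three blind annotator labels for one generation pair
    ("baseline", "self_rewarding", or "tie") into a single outcome."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count >= 2 else "tie"  # all-disagree fallback is an assumption
```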
MT-Bench performance further validates these results
We report performance on MT-Bench in Table 2 for the SFT baseline and iterations of the Self-Rewarding model. We again see improvements across the iterations of training from $M_1$ to $M_3$ , from 6.78 (out of 10) up to 7.25, with larger relative gains in the humanities, STEM, roleplay, writing and extraction categories, and smaller gains in the math, code and reasoning categories. We expect that the latter is due to the seed prompts we use from Open Assistant tending to underemphasize the reasoning-based tasks. We note also that these improvements are in spite of our method using and constructing prompts that only involve a single turn, given the MT-Bench benchmark itself is a multi-turn evaluation.
Self-rewarding models did not lose ability on NLP Benchmarks
As shown in Table 3, performance on most of the NLP benchmark tasks evaluated is roughly similar to the baselines, with further detailed results on more datasets given in Appendix Table 9 that follow the same pattern. We hypothesize that because our training data (seed data and synthetically generated data) are based on the Open Assistant prompts, which may not be especially relevant to the skills needed in the Table 3 tasks, it is expected that the task performance stays roughly similar, or may even drop. For example, in InstructGPT training (Ouyang et al., 2022) they found that "during RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets", which they refer to as an "alignment tax." A clear future direction is to extend the self-rewarding paradigm to these types of tasks, by relying not only on seed prompts from Open Assistant, but also on seed prompts found in a larger variety of datasets.
#### 3.2.2 Reward Modeling Ability
Reward modeling evaluation results are provided in Table 4.
EFT augmentation improves over SFT baseline
Firstly, we find that adding Evaluation Fine-Tuning (EFT) data into training, which gives examples to the model of how to act as an LLM-as-a-Judge, naturally improves its performance compared to training with Instruction Fine-Tuning (IFT) data alone. IFT data covers a wide range of general instruction tasks, and so does endow the SFT Baseline with the ability to evaluate responses; however, EFT data gives more examples of this specific task. We find improvements across all five metrics measured when using IFT+EFT vs. IFT alone, e.g., the pairwise accuracy agreement with humans increases from 65.1% to 78.7%.
Table 4: Reward Modeling ability improves with Self-Training: We evaluate the LLM-as-a-Judge via various metrics which measure alignment with held-out human preference data. Self-Rewarding Iteration 2 (Model $M_2$ ), which is trained using the self-reward model derived from its previous iteration $M_1$ , outperforms Iteration 1 ( $M_1$ ), while $M_1$ itself outperforms a standard SFT baseline model trained on only Instruction Fine-Tuning (IFT) data. Iteration 3 (Model $M_3$ ) gives further improvements over Iteration 2.
| Metric | SFT Baseline | Self-Rewarding Iter 1 ( $M_1$ ) | Iter 2 ( $M_2$ ) | Iter 3 ( $M_3$ ) |
| --- | --- | --- | --- | --- |
| Training data | IFT | IFT+EFT | IFT+EFT+AIFT( $M_1$ ) | IFT+EFT+AIFT( $M_1$ )+AIFT( $M_2$ ) |
| Pairwise acc. $(\uparrow)$ | 65.1% | 78.7% | 80.4% | 81.7% |
| 5-best % $(\uparrow)$ | 39.6% | 41.5% | 44.3% | 43.2% |
| Exact Match % $(\uparrow)$ | 10.1% | 13.1% | 14.3% | 14.3% |
| Spearman corr. $(\uparrow)$ | 0.253 | 0.279 | 0.331 | 0.349 |
| Kendall $\tau$ corr. $(\uparrow)$ | 0.233 | 0.253 | 0.315 | 0.324 |
Reward Modeling ability improves with Self-Training
We find that performing a round of self-reward training improves the ability of the model at providing self-rewards for the next iteration, in addition to its improved instruction following ability. Model $M_2$ (Iteration 2) is trained using the reward model from $M_1$ (Iteration 1), but provides improved performance on all five metrics compared to $M_1$ . For example, pairwise accuracy improves from 78.7% to 80.4%. Iteration 3 ( $M_3$ ) improves several of these metrics further compared to $M_2$ , for example pairwise accuracy increases from 80.4% to 81.7%. This performance gain is achieved despite there being no additional EFT data provided, and the examples created during the Self-Instruction creation loop do not tend to look like LLM-as-a-Judge training examples. We hypothesize that because the model is becoming better at general instruction following, it nevertheless also improves at the LLM-as-a-Judge task.
Importance of the LLM-as-a-Judge Prompt
In these experiments we used the LLM-as-a-Judge prompt format shown in Figure 2. In preliminary experiments we also tried various other prompts to decide the most effective one to use. For example, we tried the prompt proposed in Li et al. (2024) which also proposes a 5-point scale, but describes the options as multiple choice in a range of quality buckets, see Appendix Figure 7. In contrast, our prompt describes the points as additive, covering various aspects of quality. We find a large difference between these two prompts when using the SFT Baseline, e.g. 65.1% pairwise accuracy for ours, and only 26.6% pairwise accuracy for theirs. See Appendix A.2 for further details.
## 4 Related Work
Automatically improving or self-correcting large language models is becoming a major focus of research. A recent survey from Pan et al. (2023) attempts to summarize the topic. However, this is a rapidly moving area, and there are already promising new works not covered there.
Reinforcement Learning from Human Feedback (RLHF)
Preference learning approaches such as in Ziegler et al. (2019); Stiennon et al. (2020); Ouyang et al. (2022); Bai et al. (2022a) train a fixed reward model from human preference data, and then use the reward model to train via reinforcement learning (RL), e.g. via Proximal Policy Optimization (PPO) (Schulman et al., 2017). Thus, the reward signal in a certain sense already comes from a model even in these works, but distilled from human data. Nevertheless, this is commonly referred to as RL from Human Feedback (RLHF). Methods such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) avoid training the reward model entirely, and instead directly train the LLM using human preferences. Several other such competing methods exist as well (Zhao et al., 2023; Zheng et al., 2023a; Yuan et al., 2023), including Pairwise Cringe Optimization (PCO) (Xu et al., 2023). PCO uses an iterative training approach similar to the one in our work, except with a fixed reward model, and that work also showed that Iterative DPO improves over DPO using the same scheme. We note that other works have developed iterative preference training schemes as well, e.g. Adolphs et al. (2023); Gulcehre et al. (2023); Xiong et al. (2023).
Reinforcement Learning from AI Feedback (RLAIF)
Constitutional AI (Bai et al., 2022b) uses an LLM to give feedback and refine responses, and uses this data to train a reward model. This fixed, separate reward model is then used to train the language model via RL, called "RL from AI Feedback" (RLAIF). Lee et al. (2023) compare RLAIF and RLHF procedures and find the methods they compare perform roughly equally. They use an "off-the-shelf" LLM to perform LLM-as-a-Judge prompting to build a training set to train a fixed reward model, which is then used for RL training. They also experiment with using the fixed but separate LLM-as-a-Judge model directly, which the authors report is computationally expensive due to using it within PPO training (rather than the offline step in the iterative approach we use in our work, which is relatively computationally cheap). Finally, SPIN (Chen et al., 2024b) recently showed they can avoid reward models entirely in an Iterative DPO-like framework by using human labels as the winning response in a pair, and the last iteration's generations as the losing response in the pair. The authors note this has the limitation that once the model generations reach human performance, they are bottlenecked. Further, each input prompt is required to have a human annotated response, in contrast to our work.
Improving LLMs via data augmentation (and curation)
Several methods have improved LLMs by (self-)creating training data to augment fine-tuning. Self-Instruct (Wang et al., 2023) is a method for self-instruction creation of prompts and responses, which can be used to improve a base LLM. We make use of a similar technique in our work, and then use our self-reward model to score them. Several approaches have also created training data by distilling from powerful LLMs, and shown a weaker LLM can then perform well. For example, Alpaca (Taori et al., 2023) fine-tuned a Llama 7B model with text-davinci-003 instructions created in the style of self-instruct. Alpagasus (Chen et al., 2024a) employed a strong LLM-as-a-Judge (ChatGPT) to curate the Alpaca dataset and filter to a smaller set, obtaining improved results. Instruction Backtranslation (Li et al., 2024) similarly augments and curates training data, but augmenting via backtranslating from web documents to predict prompts. The curation is done by the LLM(-as-a-Judge) itself, so can be seen as an instance of a self-rewarding model, but in a specialized setting. Reinforced Self-Training (ReST) (Gulcehre et al., 2023) uses a fixed, external reward to curate new high-quality examples to iteratively add to the training set, improving performance. In our experiments, we found that adding only positive examples in a related manner did not help, whereas preference pairs did help (see Appendix Section A.4 for details).
LLM-as-a-Judge
Using LLM-as-a-Judge prompting to evaluate language models has become a standard approach (Dubois et al., 2023; Li et al., 2023; Fernandes et al., 2023; Bai et al., 2023; Saha et al., 2023), and is being used to train reward models or curate data as well, as described above (Lee et al., 2023; Chen et al., 2024a; Li et al., 2024). While some works such as Kim et al. (2023) create training data to train an LLM to perform well as a judge, to our knowledge it is not common to combine this training with general instruction following skills as in our work.
## 5 Conclusion
We have introduced Self-Rewarding Language Models, models capable of self-alignment via judging and training on their own generations. The method learns in an iterative manner, where in each iteration the model creates its own preference-based instruction training data. This is done by assigning rewards to its own generations via LLM-as-a-Judge prompting, and using Iterative DPO to train on the preferences. We showed that this training improves both the instruction following capability of the model and its reward-modeling ability across the iterations. While there are many avenues left unexplored, we believe this is exciting because it means the model is better able to assign rewards in future iterations for improving instruction following: a kind of virtuous circle. While this improvement likely saturates in realistic scenarios, it still allows for the possibility of continual improvement beyond the human preferences that are typically used to build reward models and instruction following models today.
## 6 Limitations
While we have obtained promising experimental results, we currently consider them preliminary because there are many avenues yet to explore, among them the topics of further evaluation, including safety evaluation, and understanding the limits of iterative training.
We showed that the iterations of training improve both instruction following and reward modeling ability, but only ran three iterations in a single setting. A clear line of further research is to understand the "scaling laws" of this effect, both for more iterations and for language models of differing capabilities in different settings.
We observed an increase in the length of model generations, and there is a known correlation between length and estimated quality, a topic that should be understood more deeply in general, and in our results in particular. It would also be good to understand whether so-called "reward hacking" can happen within our framework, and in what circumstances. As we use a language model both as the training reward and for final evaluation (GPT-4) in some of our benchmarks, even if they are different models, this may require a deeper analysis than we have provided. While the human evaluation we conducted did validate the automatic results, further study could bring more insights.
Another clear avenue of study is to conduct safety evaluations, and to explore safety training within our framework. Reward models have been built exclusively for safety in existing systems (Touvron et al., 2023), and a promising avenue here would be to use the LLM-as-a-Judge procedure to evaluate specifically for safety in our self-rewarding training process. Given that we have shown that reward modeling ability improves over training iterations, this could mean that the safety of the model could potentially improve over time as well, with later iterations able to catch and mitigate more challenging safety situations that earlier iterations cannot.
## References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Adolphs et al. (2023) Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The CRINGE loss: Learning what language not to model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8854–8874, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.493. URL https://aclanthology.org/2023.acl-long.493.
- Anthropic (2023) Anthropic. Claude 2. https://www.anthropic.com/index/claude-2, 2023.
- Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.
- Bai et al. (2023) Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking foundation models with language-model-as-an-examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=IiRHQ7gvnq.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
- Chen et al. (2024a) Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. AlpaGasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=FdVXgSJhvz.
- Chen et al. (2024b) Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024b.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
- Dubois et al. (2023) Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
- Fernandes et al. (2023) Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André Martins, Graham Neubig, Ankush Garg, Jonathan Clark, Markus Freitag, and Orhan Firat. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 1066–1083, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.100. URL https://aclanthology.org/2023.wmt-1.100.
- Gulcehre et al. (2023) Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- Honovich et al. (2023) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.
- Kim et al. (2023) Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
- Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. OpenAssistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
- Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
- Li et al. (2024) Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. Self-alignment with instruction backtranslation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=1oijHJBRsT.
- Li et al. (2023) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- Lin (2004) Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pan et al. (2023) Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
- Saha et al. (2023) Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123, 2023.
- Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. CoRR, abs/1904.09728, 2019. URL http://arxiv.org/abs/1904.09728.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(11), 2008.
- Wang et al. (2023) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
- Xiong et al. (2023) Wei Xiong, Hanze Dong, Chenlu Ye, Han Zhong, Nan Jiang, and Tong Zhang. Gibbs sampling from human feedback: A provable kl-constrained framework for rlhf. arXiv preprint arXiv:2312.11456, 2023.
- Xu et al. (2023) Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
- Yuan et al. (2023) Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=EdIGMCHk4l.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1472. URL https://doi.org/10.18653/v1/p19-1472.
- Zhao et al. (2023) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Zheng et al. (2023a) Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable text generation with sequence likelihood contrastive learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1022–1040, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.65. URL https://aclanthology.org/2023.findings-acl.65.
- Zheng et al. (2023b) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b. URL https://openreview.net/forum?id=uccHPGDlao.
- Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
## Appendix A Appendix
### A.1 Distributions of IFT, EFT and AIFT data
[Figure 6(a) image: 2D scatter plot (Dimension 1 vs. Dimension 2) of instruction embeddings. IFT data (blue) and AIFT data (green) overlap substantially, while EFT data (red) forms a dense, well-separated cluster.]
(a) Instruction distribution of IFT, EFT and AIFT data.
[Figure 6(b) image: 2D scatter plot (Dimension 1 vs. Dimension 2) of response embeddings. IFT data (blue) and AIFT data (green) are heavily intermingled, while EFT data (red) forms a separate, denser cluster.]
(b) Response distribution of IFT, EFT, and AIFT data.
Figure 6: Distributions of both instructions and responses for IFT, EFT and AIFT data.
Figure 6 plots the distribution of instructions and the distribution of responses for the IFT, EFT and AIFT($M_1$) data. It is clear that the IFT data and EFT data come from very different distributions, while the IFT and AIFT($M_1$) data come from similar distributions.
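A projection like the one in Figure 6 can be produced along the following lines; the sentence-embedding model and the use of t-SNE here are illustrative assumptions rather than the exact pipeline used for the figure.

```python
# Illustrative sketch: embed texts from each data source and project them to 2D
# with t-SNE (Van der Maaten and Hinton, 2008) for visual comparison.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer  # assumed embedding model

def plot_distributions(datasets):
    """datasets: dict mapping a label (e.g. 'IFT data') to a list of strings."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
    texts = [t for split in datasets.values() for t in split]
    labels = np.array([name for name, split in datasets.items() for _ in split])
    embeddings = np.asarray(encoder.encode(texts))

    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    for name in datasets:
        mask = labels == name
        plt.scatter(points[mask, 0], points[mask, 1], s=5, alpha=0.3, label=name)
    plt.xlabel("Dimension 1"); plt.ylabel("Dimension 2"); plt.legend()
    plt.show()

# plot_distributions({"IFT data": ift, "EFT data": eft, "AIFT data": aift})
```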
### A.2 EFT Prompts
The EFT prompt which we use in our main experiments is shown in Figure 2.
Other EFT prompts we have tried
At first, we took the EFT prompt from Li et al. [2024], shown in Figure 7. However, we found this prompt less effective than our additive score-counting prompt, because it asks the model to treat the task as a multiple-choice problem, and it was difficult for the model to break that multiple-choice problem down into sub-problems that evaluate different aspects of the response. Table 5 compares the performance on the EFT test set of the model trained on the 3,200 IFT examples only, using our additive score-counting prompt versus the prompt from Li et al. [2024].
Below is a question from an user and a candidate response. Please grade the response on a 5-point scale using the following criteria: 1: It means the answer is incomplete, vague, off-topic, controversial, or not exactly what the user asked for. For example, some content seems missing, numbered list does not start from the beginning, the opening sentence repeats user's question. Or the response is from another person's perspective with their personal experience (e.g. taken from blog posts), or looks like an answer from a forum. Or it contains promotional text, navigation text, or other irrelevant information. 2: It means the answer addresses most of the asks from the user. It does not directly address the user's question. For example, it only provides a high-level methodology instead of the exact solution to user's question. 3: It means the answer is helpful but not written by an AI Assistant. It addresses all the basic asks from the user. It is complete and self contained with the drawback that the response is not written from an AI assistant's perspective, but from other people's perspective. The content looks like an excerpt from a blog post, web page, or web search results. For example, it contains personal experience or opinion, mentions comments section, or share on social media, etc. 4: It means the answer is written from an AI assistant's perspective with a clear focus of addressing the instruction. It provide a complete, clear, and comprehensive response to user's question or instruction without missing or irrelevant information. It is well organized, self-contained, and written in a helpful tone. It has minor room for improvement, e.g. more concise and focused. 5: It means it is a perfect answer from an AI Assistant. It has a clear focus on being a helpful AI Assistant, where the response looks like intentionally written to address the user's question or instruction without any irrelevant sentences. The answer provides high quality content, demonstrating expert knowledge in the area, is very well written, logical, easy-to-follow, engaging and insightful. User: <INSTRUCTION_HERE> < response > <RESPONSE_HERE> < /response > Please first briefly describe your reasoning (in less than 100 words), and then write "Score: < rating >" in the last line. Answer in the style of an AI Assistant, with knowledge from web search if needed. To derive the final score based on the criteria, let's think step-by-step.
Figure 7: LLM-as-a-Judge prompt taken from Li et al. [2024].
| | Prompt from Li et al. [2024] | Ours |
| --- | --- | --- |
| Pairwise accuracy $(\uparrow)$ | 26.6% | 65.1% |
| 5-best % $(\uparrow)$ | 23.5% | 39.6% |
| Exact Match % $(\uparrow)$ | 1.1% | 10.1% |
| Spearman corr. $(\uparrow)$ | -0.18 | 0.25 |
| Kendall $\tau$ corr. $(\uparrow)$ | -0.16 | 0.23 |
Table 5: We tried various LLM-as-a-Judge prompts with the model trained on the 3,200 IFT examples only, and found that our additive score-counting prompt worked best, yielding significantly better EFT performance than the prompt used by Li et al. [2024].
### A.3 Self-rewarding Models Using IFT Data Only
To demonstrate the importance of the EFT data, we also trained a series of models starting with the model trained only on the IFT data. The following is the model sequence.
- $M_0$: Base pretrained LLM with no fine-tuning.
- $M_1^\prime$: Initialized with $M_0$, then fine-tuned on the IFT seed data only using SFT.
- $M_2^\prime$: Initialized with $M_1^\prime$, then trained with AIFT( $M_1^\prime$ ) data using DPO.
- $M_3^\prime$: Initialized with $M_2^\prime$, then trained with AIFT( $M_2^\prime$ ) data using DPO.
Since we did not use EFT data to train this series of models, they were not always able to score responses in the required format, and even when they did, the scores given typically converged to 4. Therefore, even when starting from the same number of generated new prompts, we could only collect a very small number of valid training samples for DPO. In total, we collected 541 pairs to form the AIFT( $M_1^\prime$ ) dataset used to train $M_2^\prime$ via DPO, and 429 pairs to form AIFT( $M_2^\prime$ ) used to train $M_3^\prime$. The win rates are shown in Figure 8. From the figure we conclude that EFT data helps the model reach better performance within the same number of iterations, and that the gap between the models trained with and without EFT data widens in later iterations.
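For clarity, the preference-pair construction used in this ablation can be sketched as follows: the highest- and lowest-scoring candidate responses for a prompt form a (chosen, rejected) pair, and prompts whose candidates all receive the same score are dropped, which is why score collapse toward 4 yields so few valid pairs. The data layout below is an illustrative assumption.

```python
# Illustrative sketch: turn judge-scored candidate responses into DPO
# preference pairs, dropping prompts where no usable preference exists.
def build_preference_pairs(scored_generations):
    """scored_generations: list of (prompt, [(response, score), ...])."""
    pairs = []
    for prompt, candidates in scored_generations:
        valid = [(r, s) for r, s in candidates if s is not None]
        if len(valid) < 2:
            continue  # judge failed to produce parseable scores
        best = max(valid, key=lambda rs: rs[1])
        worst = min(valid, key=lambda rs: rs[1])
        if best[1] == worst[1]:
            continue  # all candidates tied (e.g. everything scored 4): no signal
        pairs.append({"prompt": prompt, "chosen": best[0], "rejected": worst[0]})
    return pairs
```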
[Figure 8, left panel image: head-to-head win rates on the IFT test set. Self-Rewarding $M_3^\prime$ vs. SFT Baseline: 50.4% wins / 32.8% ties / 16.8% losses. Self-Rewarding $M_2^\prime$ vs. SFT Baseline: 46.5% wins / 34.8% ties / 18.8% losses.]
[Figure 8, right panel image: head-to-head win rates on the IFT test set. Self-Rewarding $M_3$ vs. $M_3^\prime$: 38.7% wins / 44.5% ties / 16.8% losses. Self-Rewarding $M_2$ vs. $M_2^\prime$: 34.8% wins / 36.7% ties / 28.5% losses.]
Figure 8: EFT data helps the self-rewarding loop: We evaluated the series of models trained using self-reward loops starting from the model trained using only IFT data. We performed head-to-head win rate comparisons on the IFT test set. While $M_2^\prime$ can improve over the SFT baseline and $M_3^\prime$ can improve even more over the SFT baseline, they lag far behind the corresponding models ( $M_2$ , $M_3$ ) that started from a base model trained using both IFT and EFT data, see Figure 3.
<LIST ALL ALPACAEVAL INSTRUCTIONS> Given the above list of possible instructions, define a maximum of 20 categories that would cover the types of instructions, for example recipes, reasoning tasks, general knowledge etc. Try to cover as many of the instructions as possible with the maximum 20 categories, while keeping the categories high-level, simple and easy to understand.
Figure 9: Prompt used to obtain instruction categories on the AlpacaEval test set.
Instruction: <INSTRUCTION> Given the above, categorize it into one of the following 20 categories: <LIST ALL CATEGORIES> Secondly, score the instruction in terms of complexity: how complex you think it is to answer from 1-10 (where 10 is a complex question whereby first reasoning or breaking down the question into multiple subquestions for example might help improve the answer). Thirdly, indicate how long you think the response to the instruction should be, either (a) 1 sentence, (b) 1-3 sentences, (c) 1 paragraph, (d) 2 paragraphs, or (e) 3 or more paragraphs. Provide your final response in the following format: Category: < one of the 20 categories > Complexity: < score out of 10 > Length: < length category >. Do not provide the actual response.
Figure 10: Prompt for categorizing instructions based on their topics, complexities and expected response lengths.
Table 6: Breakdown of AlpacaEval test set instructions by instruction category.
| Category | Count | Percentage |
| --- | --- | --- |
| Science / Technology / Engineering | 134 | 16.65% |
| Professional / Business / Marketing | 77 | 9.57% |
| Social Interaction / Relationships / Human Behavior | 68 | 8.45% |
| Miscellaneous / Other | 61 | 7.58% |
| Mathematics / Logical Reasoning | 52 | 6.46% |
| Cooking / Recipes | 48 | 5.96% |
| Software Development / Coding / Algorithms | 44 | 5.47% |
| Travel / Geography / Exploration | 41 | 5.09% |
| Literature / Writing / Communication | 39 | 4.84% |
| History / Social Studies | 38 | 4.72% |
| Entertainment / Media Analysis | 34 | 4.22% |
| Language Learning / Linguistics | 32 | 3.98% |
| Music / Audio / Arts | 30 | 3.73% |
| DIY Projects / Hobbies | 24 | 2.98% |
| Technology / Gadgets / Consumer Products | 20 | 2.48% |
| Gaming / Game Development | 18 | 2.24% |
| Exercise / Health / Wellness | 16 | 1.99% |
| Philosophy / Ethics / Ideology | 15 | 1.86% |
| Sports / Athletics / Physical Activity | 12 | 1.49% |
| Strategy / Problem-Solving / Critical Thinking | 2 | 0.24% |
Table 7: Breakdown of AlpacaEval test set instructions by instruction complexity. Complexity is rated on a scale from 1 to 10, where 10 denotes a complex question that requires first reasoning or breaking the problem into sub-problems before it can be solved; the test set instructions span levels 1 to 9.
| Complexity | Count | Percentage |
| --- | --- | --- |
| 3 | 238 | 29.57% |
| 2 | 206 | 25.59% |
| 4 | 122 | 15.16% |
| 6 | 79 | 9.81% |
| 5 | 68 | 8.45% |
| 7 | 41 | 5.09% |
| 1 | 34 | 4.22% |
| 8 | 14 | 1.74% |
| 9 | 3 | 0.37% |
Table 8: Breakdown of AlpacaEval test set instructions by expected response length.
| Expected response length | Count | Percentage |
| --- | --- | --- |
| 1-3 sentences | 361 | 44.84% |
| 1 paragraph | 269 | 33.42% |
| 1 sentence | 143 | 17.76% |
| 2 paragraphs | 31 | 3.85% |
| 3 or more paragraphs | 1 | 0.13% |
[Figure 11, left panel image: line chart of AlpacaEval win rate (%) by instruction complexity (levels 1 to 8) for $M_0$, $M_1$, $M_2$ and $M_3$. $M_3$ attains the highest win rate at nearly every complexity level; all models dip around complexity 6.]
[Figure 11, right panel image: line chart of AlpacaEval win rate (%) by expected response length (1 sentence, 1-3 sentences, 1 paragraph, 2 paragraphs) for $M_0$, $M_1$, $M_2$ and $M_3$. Win rates are lowest at the 1 paragraph bucket, with $M_3$ generally highest across all lengths.]
Figure 11: AlpacaEval win rate breakdown for instruction complexities (left) and expected response lengths (right). Self-Rewarding models give gains across most complexities and all response length ranges.
### A.4 Preference optimization outperforms augmenting with positive examples only
We also tried an alternative self-training procedure: adding high-quality self-instruction creation examples to supervised fine-tuning, rather than building preference pairs for DPO. In this variant, following other approaches [Li et al., 2024, Adolphs et al., 2023, Gulcehre et al., 2023], we add additional (instruction prompt, response) examples curated by the model to the seed set for supervised fine-tuning, rather than constructing preference data. In this setup we only add examples where the candidate response was evaluated to give a perfect score of $r_i^n=5$. Unfortunately, we could not find a configuration where this approach helped. For example, adding 11,254 such examples that scored 5 out of 5, and optimizing the mixing weight in training, still yielded a head-to-head against the SFT Baseline of 29% wins vs. 30% wins, i.e., no improvement.
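A minimal sketch of this positive-only variant, assuming scored generations are available as (prompt, response, score) triples, is given below; only perfectly scored responses are appended to the supervised fine-tuning pool.

```python
# Illustrative sketch: positive-example augmentation, keeping only generations
# the judge scored a perfect 5 and adding them to the supervised fine-tuning set.
def augment_sft_with_positives(seed_sft_examples, scored_generations, perfect=5):
    """scored_generations: iterable of (prompt, response, score) triples."""
    positives = [
        {"prompt": prompt, "response": response}
        for prompt, response, score in scored_generations
        if score == perfect
    ]
    return seed_sft_examples + positives
```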
### A.5 Augmented Prompt Generation Using Newly Trained Models
In our experiments, for time efficiency, we created a fixed pool of augmented prompts in advance using Llama 2-Chat 70B. In a real interactive system, those prompts could ideally come from real users, ensuring that the models are trained to align with real user requirements. Here, we also examine whether our newly trained Self-Rewarding models in each iteration can generate new prompts themselves through in-context learning, instead of relying on Llama 2-Chat 70B. To check this, we constructed 30 prompts with in-context examples using the original seed IFT data, as described in Section 2.2, and tested whether $M_1$, $M_2$ and $M_3$ still possess in-context learning ability and can generate high-quality instructions. According to manual inspection, all models can generate novel instructions given in-context examples in all 30 cases. However, for $M_2$ and $M_3$, the model is likely to first generate a few instructions, then generate a separator, and then start responding to the instructions, so some postprocessing might be necessary.
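As an illustration of how such in-context prompts can be assembled from the seed IFT data, here is a hedged sketch; the header text, the `Task i:` format, and the postprocessing rule are assumptions for illustration, not the paper's exact template.

```python
import random
from typing import List


def build_prompt_generation_prompt(seed_instructions: List[str], num_shots: int = 8) -> str:
    """Assemble a few-shot prompt that asks the model to write one new instruction."""
    shots = random.sample(seed_instructions, num_shots)
    lines = ["Come up with a series of tasks:"]  # illustrative header, not the exact wording used
    for i, instruction in enumerate(shots, start=1):
        lines.append(f"Task {i}: {instruction}")
    lines.append(f"Task {num_shots + 1}:")  # the model is expected to continue with a new task
    return "\n".join(lines)


def extract_new_instruction(generation: str) -> str:
    """Keep only the first generated instruction; later iterations ($M_2$, $M_3$) may append more tasks or answers."""
    return generation.split("Task")[0].strip()
```

The second helper reflects the postprocessing noted above: truncating the generation at the first separator keeps only the new instruction and discards any self-answered continuation.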
### A.6 AlpacaEval Test Sample Clustering
We used GPT-4 (gpt-4-1106-preview) to categorize the instructions in the AlpacaEval test set into clusters from three perspectives: (1) instruction category, (2) instruction complexity, and (3) expected response length. To obtain instruction categories for the AlpacaEval test set, we used the prompt in Figure 9 and obtained 20 categories in total. Then, to cluster the instructions into different groups, we used the prompt in Figure 10 for each test example. The corresponding statistics are given in Table 6, Table 7, and Table 8. The fine-grained results on instruction complexity and expected response length are given in Figure 11.
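For concreteness, a minimal sketch of this categorization step using the OpenAI Python client; the prompt text below is an illustrative stand-in, since the actual templates are those shown in Figure 9 and Figure 10.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stand-in for the clustering prompt of Figure 10; the real template
# is not reproduced here.
CLUSTER_PROMPT = (
    "For the instruction below, answer with three labels:\n"
    "1) instruction category, 2) instruction complexity,\n"
    "3) expected response length (e.g., 1 sentence, 1-3 sentences, 1 paragraph, 2 paragraphs).\n\n"
    "Instruction: {instruction}"
)


def cluster_instruction(instruction: str) -> str:
    """Ask gpt-4-1106-preview to assign cluster labels to one AlpacaEval test instruction."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": CLUSTER_PROMPT.format(instruction=instruction)}],
        temperature=0,
    )
    return response.choices[0].message.content
```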
Table 9: NLP Benchmarks. Self-Rewarding models mostly tend to maintain performance compared to the Llama 2 base model and the SFT Baseline, despite being fine-tuned on very different instruction-following prompts.
| Model | ARC (easy) | ARC (challenge) | HellaSwag | SIQA | PIQA | GSM8K | MMLU | OBQA | NQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 | 80.20 | 57.40 | 85.30 | 50.70 | 82.80 | 56.80 | 68.90 | 60.20 | 25.30 |
| SFT Baseline | 76.49 | 55.97 | 85.17 | 51.48 | 82.59 | 50.72 | 69.76 | 57.80 | 34.35 |
| $M_1$ | 78.14 | 57.51 | 84.99 | 53.02 | 82.92 | 60.27 | 69.34 | 57.60 | 35.48 |
| $M_2$ | 74.84 | 54.51 | 84.27 | 51.23 | 81.94 | 59.29 | 69.31 | 57.60 | 33.07 |
| $M_3$ | 72.35 | 53.13 | 83.29 | 49.28 | 80.79 | 57.70 | 69.37 | 58.40 | 31.86 |
Table 10: MT-Bench Fine-grained Results. We list our models' performance on each problem category. Self-reward is especially effective in improving the model's ability in writing, role-playing, extraction, and STEM tasks.
| Model | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SFT Baseline | 8.83 | 8.15 | 5.30 | 3.00 | 3.50 | 6.90 | 9.18 | 9.95 | 6.85 |
| $M_1$ | 9.10 | 7.65 | 4.35 | 3.05 | 4.10 | 7.20 | 8.93 | 9.85 | 6.78 |
| $M_2$ | 9.10 | 8.00 | 4.60 | 3.30 | 4.25 | 7.65 | 9.40 | 9.80 | 7.01 |
| $M_3$ | 9.58 | 8.73 | 4.80 | 3.50 | 4.20 | 7.80 | 9.45 | 9.95 | 7.25 |
### A.7 NLP Benchmark Results and MT-Bench Results
We provide detailed model performance on a number of NLP benchmarks in Table 9 and on MT-Bench in Table 10. Several of the NLP benchmarks, including ARC-Challenge, HellaSwag, SIQA, PIQA, and OBQA, are text completion tasks: given the multiple-choice options, we choose the option to which the model assigns the highest log probability as the final answer. As such, the objective of these tasks is quite different from what our algorithm optimizes, so the results on them may not reflect the true capability of our models.
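To make this scoring rule concrete, here is a minimal sketch using a Hugging Face causal LM; the tokenization alignment at the question/option boundary and the choice of summing (rather than length-normalizing) log probabilities are simplifying assumptions, and the checkpoint name in the usage comment is only an example.

```python
from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def pick_option_by_logprob(model, tokenizer, question: str, options: List[str]) -> int:
    """Return the index of the option whose tokens receive the highest total log probability."""
    scores = []
    prompt_len = tokenizer(question + " ", return_tensors="pt")["input_ids"].shape[1]
    for option in options:
        inputs = tokenizer(question + " " + option, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits  # (1, seq_len, vocab_size)
        # Position t predicts token t+1, so align shifted targets with shifted logits.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = inputs["input_ids"][0, 1:]
        token_logps = log_probs.gather(1, targets.unsqueeze(-1)).squeeze(-1)
        # Sum log-probabilities over the option tokens only; some evaluation harnesses
        # length-normalize instead, which can change the chosen answer.
        scores.append(token_logps[prompt_len - 1:].sum().item())
    return int(torch.tensor(scores).argmax())


# Usage sketch (checkpoint name is just an example):
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", device_map="auto")
# best = pick_option_by_logprob(model, tokenizer, "The capital of France is", ["Paris.", "Rome."])
```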