# LsrIF: Logic-Structured Reinforcement Learning for Instruction Following
Abstract
Instruction following is critical for large language models, but real-world instructions often contain logical structures such as sequential dependencies and conditional branching. Existing methods typically construct datasets with parallel constraints and optimize average rewards, ignoring logical dependencies and yielding noisy signals. We propose LsrIF, a logic-structured training framework that explicitly models instruction logic. We first construct LsrInstruct, a dataset with parallel, sequential, and conditional constraint structures, and then design LsRM, a structure-aware reward modeling method that applies average aggregation to parallel structures, failure-penalty propagation to sequential structures, and selective rewards to conditional branches. Experiments show that LsrIF brings significant improvements in instruction following (in-domain and out-of-domain) and in general reasoning. Analysis reveals that learning with explicit logic structures concentrates parameter updates in attention layers and sharpens token-level attention to constraints and logical operators.
Qingyu Ren 1, Qianyu He 1, Jingwen Chang 1, Jie Zeng 1, Jiaqing Liang 2\*, Yanghua Xiao 1\*, Han Xia 3, Zeye Sun 3, Fei Yu 3
1 Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University, 2 School of Data Science, Fudan University, 3 Ant Group
{qyren24,qyhe21,jwchang24,jzeng23}@m.fudan.edu.cn, {liangjiaqing,shawyh}@fudan.edu.cn
\* Corresponding author.
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Complex Instruction Breakdown
### Overview
The image is a diagram illustrating the components of a "Complex Instruction." It breaks down a complex instruction into its constituent parts: Constraints and Logic. The diagram also provides an example of a complex instruction and how it relates to the constraints and logic components.
### Components/Axes
* **Header:** "Complex Instruction = Constraint + Logic" (Grey box)
* **Main Text Block:** A paragraph describing the complex instruction.
* **Constraint Block:** A blue rounded rectangle containing a list of constraints.
* **Logic Block:** A yellow rounded rectangle containing a list of logical steps.
* **Arrows:** Two blue arrows pointing from the main text block to the Constraint and Logic blocks.
* **Checklist Icon:** A checklist icon is present to the left of the Constraint block.
* **Logic Flow Diagram:** A small diagram showing a flow of logic from a to b and c to d.
### Content Details
**1. Header:**
* Text: "Complex Instruction = Constraint + Logic"
* Background: Grey
**2. Main Text Block:**
* Text: "First, generate a short instructional paragraph and ensure the total length does not exceed three sentences; then, append a clearly separated checklist section using bullet points; if the word 'error' appears anywhere in the output, all checklist items must be written in lowercase English, else the instructional paragraph must begin with a bolded core idea; finally, apply a formal, technical writing style to the entire output."
* Highlights: The words "First", "then", and "finally" are highlighted in yellow. The phrases "a short instructional paragraph", "total length does not exceed three sentences", "if the word 'error' appears anywhere in the output, all checklist items must be written in lowercase English", "else the instructional paragraph must begin with a bolded core idea", and "apply a formal, technical writing style to the entire output" are highlighted in blue.
**3. Constraint Block:**
* Label: "Constraint" (Blue box)
* Content:
1. "a short instructional"
2. "length does not exceed three sentences"
3. "......" (Indicates more constraints exist but are not listed)
**4. Logic Block:**
* Label: "Logic" (Yellow box)
* Content:
1. "First, then, finally"
2. "And"
3. "If, else"
4. "......" (Indicates more logic steps exist but are not listed)
**5. Logic Flow Diagram:**
* Diagram: A simple flow diagram with nodes labeled "a", "b", "c", and "d".
* Flow: "a" -> "b", "c" -> "d"
### Key Observations
* The diagram clearly separates the "Complex Instruction" into two main components: "Constraint" and "Logic."
* The main text block provides a specific example of a complex instruction, which is then broken down into its constraints and logical steps.
* The highlighting in the main text block emphasizes the key elements that correspond to either constraints or logic.
### Interpretation
The diagram illustrates a structured approach to understanding and implementing complex instructions. By breaking down the instruction into constraints and logic, it becomes easier to manage and execute. The example provided demonstrates how a complex instruction can be deconstructed into specific, actionable steps. The diagram suggests that effective complex instructions should clearly define both the limitations (constraints) and the sequence of actions (logic) required to achieve the desired outcome. The logic flow diagram suggests a dependency between steps, where 'a' must precede 'b' and 'c' must precede 'd'.
</details>
Figure 1: Essentially, a complex instruction is a logical composition of constraints.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Logic-Structured Dataset Construction and Structure-Aware Reward Modeling
### Overview
The image presents two distinct diagrams: "Logic-Structured Dataset Construction" and "Structure-Aware Reward Modeling." The first diagram illustrates three approaches to dataset construction: Parallel, Sequential, and Conditional. The second diagram details three methods for reward modeling: Average Aggregation, Penalty Propagation, and Branch Selection.
### Components/Axes
**Logic-Structured Dataset Construction:**
* **Title:** Logic-Structured Dataset Construction
* **Nodes:** C1, C2, C3 (representing content components)
* **Edges:** Arrows indicating flow or relationships between nodes.
* **Textual Annotations:**
* "Do not use any commas and limit the length to no more than...and the target audience is..." (associated with Parallel)
* "First, generate a list... Then, for each point, write about. Finally, content ...more than 120 words." (associated with Sequential)
* "If the response discusses..., output in JSON format; else use an ... style." (associated with Conditional)
* **Types:**
* Parallel: C1, C2, and C3 are connected in a triangular formation.
* Sequential: C1 -> C2 -> C3 (linear flow)
* Conditional: C1 branches to C2 and C3.
**Structure-Aware Reward Modeling:**
* **Title:** Structure-Aware Reward Modeling
* **Sub-titles:** Response1, Response2, Response3
* **Nodes:** R1, R2, R3 (representing response components)
* **Edges:** Arrows indicating flow or relationships between nodes. Dashed arrows indicate "Ci not followed".
* **Y:** Denotes decay coefficient.
* **Reward Model Icon:** A cartoon bear face.
* **Code Icon:** A python logo.
* **Types:**
* Average Aggregation: R1, R2, and R3 are connected in a triangular formation. R = Avg(R1, R2, R3)
* Penalty Propagation: R1 -> R2 -> R3, with decay coefficient Y applied between each node. R = Avg(R1, Y^mR2, Y^nR3)
* Branch Selection: R1 branches to R2 and R3. R = R2 (R1=1), R = R3 (R1=0)
### Content Details
**Logic-Structured Dataset Construction:**
* **Parallel:** The diagram shows C1, C2, and C3 connected in a triangular shape, indicating a parallel relationship. The associated text emphasizes avoiding commas, limiting length, and targeting the audience.
* **Sequential:** The diagram shows a linear flow from C1 to C2 to C3. The associated text describes generating a list, writing about each point, and ensuring the content exceeds 120 words.
* **Conditional:** The diagram shows C1 branching to C2 and C3, indicating a conditional relationship. The associated text describes outputting in JSON format if the response discusses a certain topic, otherwise using a different style.
**Structure-Aware Reward Modeling:**
* **Average Aggregation:** R1, R2, and R3 are interconnected. The reward is calculated as the average of R1, R2, and R3.
* **Penalty Propagation:** The reward propagates from R1 to R2 to R3, with a decay coefficient (Y) applied at each step. The reward is calculated as the average of R1, Y^mR2, and Y^nR3.
* **Branch Selection:** The reward is selected based on the value of R1. If R1=1, the reward is R2. If R1=0, the reward is R3.
### Key Observations
* The "Logic-Structured Dataset Construction" section focuses on different ways to structure content components (C1, C2, C3) based on specific guidelines.
* The "Structure-Aware Reward Modeling" section focuses on different ways to calculate a reward based on response components (R1, R2, R3) and their relationships.
* The use of arrows indicates the flow of information or dependencies between components.
* The decay coefficient (Y) in "Penalty Propagation" suggests a diminishing effect as the reward propagates through the response components.
### Interpretation
The diagrams illustrate different strategies for constructing datasets and modeling rewards. The "Logic-Structured Dataset Construction" section provides guidelines for organizing content based on parallel, sequential, or conditional relationships. The "Structure-Aware Reward Modeling" section presents methods for calculating rewards based on the structure of responses, including averaging, penalizing propagation, and selecting branches. The choice of method depends on the specific requirements of the task and the desired behavior of the system. The decay coefficient in "Penalty Propagation" suggests a mechanism for prioritizing earlier response components over later ones. The branch selection method allows for conditional rewards based on the value of a specific component (R1).
</details>
Figure 2: Our framework LsrIF consists of two components: (LsrInstruct) logic-structured dataset construction, and (LsRM) structure-aware reward modeling with corresponding methods.
Instruction following is a core capability of large language models (LLMs) and is essential for their use in real-world applications Zhang et al. (2025); Lu et al. (2025); Ye et al. (2025). User instructions are often complex and may span multiple turns or agent-based interactions Qi et al. (2025); Deshpande et al. (2025). Beyond producing fluent text, effective instruction following requires models to correctly understand and satisfy multiple constraints, which are often expressed through structured and interdependent conditions He et al. (2024); An et al. (2025).
In essence, complex instructions are composed of multiple constraints connected by logical structures. Correct instruction following therefore requires not only satisfying individual constraints, but also adhering to the logical relationships between them. As shown in Fig. 1, the complex instruction contains three common types of logical relationships. Parallel (And) structures require all constraints to be satisfied simultaneously. Sequential (First-Then-Finally) structures impose an execution order, where later constraints depend on the successful completion of earlier ones. Conditional (If-Else) structures introduce branching logic, where the model must first evaluate a condition and then follow the correct branch.
Existing approaches for improving instruction following still face clear limitations when dealing with logically structured instructions. From the perspective of data construction, most training data simplify instructions by treating all constraints as parallel Sun et al. (2024); Huang et al. (2025). Although some datasets include logical structure, they are mainly used for evaluation rather than training Wen et al. (2024); Wang et al. (2025). In terms of reward modeling, the reward for the entire instruction is often computed as the average of the rewards for individual constraints Qin et al. (2025). This assumes that constraints are independent. However, for sequential or conditional instructions, failure at an early step makes later constraints irrelevant, and simple averaging can produce incorrect training signals. Finally, regarding interpretability for performance improvements, prior work typically shows gains in instruction-following performance and the preservation of general reasoning abilities Peng et al. (2025), yet the underlying reasons remain unexplored. Furthermore, it remains unclear whether gains in logically structured instruction following actually transfer to reasoning ability.
To address these limitations, we propose a logic-structured training framework LsrIF that explicitly models instruction logic in both data construction and reward design. (1) Logic-Structured Data (LsrInstruct). We define instruction structures using three basic logical forms: parallel, sequential, and conditional. Based on these forms, we construct a dataset of multi-constraint instructions covering multiple logical structures. (2) Logic-Structured Reward Modeling (LsRM). We design reward modeling methods that reflect the execution semantics of different structures. For parallel structures, rewards are aggregated by averaging. For sequential structures, we apply a decay mechanism so that failures in earlier steps reduce rewards for later ones. For conditional structures, rewards are assigned only to the constraints in the correct branch. (3) Interpretability for Performance Improvements. We further analyze how logic-structured training affects the model. We observe larger parameter updates in attention layers than in MLP layers. At the token level, trained models place more attention on logical connectors and constraint-related tokens. These changes also appear in general reasoning tasks, indicating that the learned ability transfers beyond instruction following.
Our contributions are summarized as follows: (1) We propose LsrIF, a logic-structured training framework. (2) LsrIF includes LsrInstruct, an instruction dataset capturing parallel, sequential, and conditional constraint logic structures, and LsRM, structure-aware reward modeling that aligns reward signals with logical execution semantics. (3) LsrIF improves both in-domain and out-of-domain instruction-following performance and general reasoning ability, with attention and token-level interpretability analysis.
2 Related Work
2.1 Instruction Following Data Construction
Existing work constructs datasets with multi-constraint instructions to improve instruction-following capabilities Qin et al. (2025); Cheng et al. (2024). However, these approaches directly concatenate constraints, ignoring potential structures among them, which fails to simulate real-world user instructions. While some datasets consider logical structures Wen et al. (2024); Wang et al. (2025), they are primarily designed for evaluation rather than training. In contrast, we construct a training dataset where constraints show explicit logical structures.
2.2 Reward Modeling for Instruction Following
Training paradigms for instruction following have evolved from supervised fine-tuning Sun et al. (2024) to Direct Preference Optimization Huang et al. (2025); Qi et al. (2024) and Reinforcement Learning with Verifiable Rewards (RLVR) Peng et al. (2025); Qin et al. (2025). Existing RLVR methods aggregate constraint-level rewards through simple averaging. However, this averaging strategy fails when constraint logical structures are not parallel (e.g., sequential or conditional). We propose structure-aware reward modeling, where different structures employ distinct reward modeling methods.
3 Method
Our approach consists of two main components: logic-structured dataset construction (LsrInstruct) and structure-aware reward modeling (LsRM). As illustrated in Fig. 2, we organize instructions into three logical structures (Parallel, Sequential, and Conditional) and employ a structure-aware reward model with three corresponding methods: Average Aggregation for parallel structures, Penalty Propagation for sequential structures, and Branch Selection for conditional structures.
3.1 Logic-Structured Dataset Construction
To move beyond flat constraint concatenation, we formalize three logic structure types:
- Parallel Structure. A set of constraints $C=\{c_{1},c_{2},...,c_{n}\}$ that must all be satisfied simultaneously. This structure corresponds to the flat assumption commonly adopted in prior work, where constraints are treated as independent (e.g., "Respond in English and use no commas and limit the length to 100 words").
- Sequential Structure. An ordered sequence of constraints $S=(c_{1},c_{2},...,c_{n})$, where each constraint $c_{t}$ is meaningful only if all preceding constraints $(c_{1},...,c_{t-1})$ are successfully satisfied (e.g., "First generate an outline, then write a summary, finally translate it into English").
- Conditional Structure. A branching structure governed by a trigger constraint $c_{p}$. The active execution branch is determined by whether $c_{p}$ is satisfied: if $c_{p}$ holds, the model must satisfy the true-branch constraint $c_{\text{true}}$; else, it must satisfy the false-branch constraint $c_{\text{false}}$ (e.g., "If the input text contains code, explain its functionality; else, summarize the text").
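The three structure types above can be represented as simple data records; a minimal sketch (class and field names are illustrative, not from the paper's codebase):

```python
from dataclasses import dataclass

# Illustrative records for the three logic structure types;
# names are hypothetical, not from the LsrIF implementation.

@dataclass
class Parallel:
    constraints: list  # all must hold simultaneously

@dataclass
class Sequential:
    constraints: list  # c_t matters only if c_1..c_{t-1} are satisfied

@dataclass
class Conditional:
    trigger: str   # c_p
    on_true: str   # c_true, checked when the trigger is satisfied
    on_false: str  # c_false, checked otherwise

instr = Sequential(["generate an outline",
                    "write a summary",
                    "translate it into English"])
```

Keeping the structure explicit in the data record is what later allows the reward to be aggregated differently per type.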
We construct the dataset by collecting seed instructions from Infinity-Instruct Li et al. (2025), Open Assistant Köpf et al. (2024), Self-Instruct Wang et al. (2022a) and Super-Natural Wang et al. (2022b), defining constraint types (hard constraints in Tab. 5, soft constraints in Tab. 6), and using GPT-4.1 to generate multi-constraint instructions that instantiate these logical structures. Each instruction follows a logical structure with multiple constraints organized accordingly, enabling controlled analysis and structure-aware training. Detailed statistics of LsrInstruct are shown in Tab. 1.
| Logic Type | # Inst. | # Cons. Types | # Cons. | Evaluation |
| --- | --- | --- | --- | --- |
| Parallel | 17510 | 48 | 52106 | Code & Reward Model |
| Sequential | 10435 | 25 | 31295 | Reward Model |
| Conditional | 10574 | 25 | 42152 | Reward Model |
Table 1: Statistics of LsrInstruct. #Inst., #Cons. Types, #Cons. and Evaluation refer to the number of instructions, constraint types, total constraints, and evaluation methods.
3.2 Structure-Aware Reward Modeling
We adopt Group Relative Policy Optimization (GRPO) Shao et al. (2024) for training, where model optimization is driven by automatically computed signals indicating constraint satisfaction. For hard constraints, we use programmatic verification. For soft constraints, we employ a reward model to assess adherence. We train Qwen2.5-7B-Instruct as the reward model, where we exploit the natural partial order in and-type multi-constraint instructions to construct binary preference pairs and train the model via supervised fine-tuning with a binary classification objective following Ren et al. (2025).
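The partial-order construction of preference pairs can be sketched as follows (a minimal illustration of our reading of the text; the function and variable names are hypothetical):

```python
# Hypothetical sketch: derive binary preference pairs from the partial
# order over responses to an and-type (parallel) instruction.
# A response whose satisfied-constraint set strictly contains another's
# is preferred (chosen) over it (rejected).
def preference_pairs(responses):
    """responses: list of (text, set_of_satisfied_constraint_ids)."""
    pairs = []
    for a_text, a_sat in responses:
        for b_text, b_sat in responses:
            if b_sat < a_sat:  # strict subset: a dominates b
                pairs.append((a_text, b_text))  # (chosen, rejected)
    return pairs
```

Responses with incomparable satisfied sets yield no pair, which is exactly what a partial order (rather than a total score) provides.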
Given constraint-level verification results, we aggregate these rewards according to the logical structure of each instruction. Formally, let $o$ denote a model output and $c$ denote an atomic constraint. We define a binary verification function $r(o,c)\in\{0,1\}$, where $r(o,c)=1$ if output $o$ satisfies constraint $c$, and $0$ otherwise. The aggregation of rewards according to logical structures is described as follows.
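For hard constraints, programmatic verification of $r(o,c)$ can be as simple as string checks; a hypothetical sketch of two such verifiers (the constraint choices mirror the parallel example above, the function names are ours):

```python
import re

# Hypothetical programmatic verifiers; each returns 1 if output o
# satisfies the constraint, else 0, matching r(o, c) in {0, 1}.
def verify_no_commas(o: str) -> int:
    return int("," not in o)

def verify_max_words(o: str, limit: int = 100) -> int:
    # Count whitespace-delimited tokens as words.
    return int(len(re.findall(r"\S+", o)) <= limit)
```

Soft constraints (e.g., style or tone) cannot be checked this way, which is why the trained reward model supplies their $r(o,c)$ instead.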
Reward for Parallel Structure (Average Aggregation).
For parallel constraint set $C=\{c_{1},...,c_{n}\}$ , we define:
$$
R_{\text{par}}(o,C)=\frac{1}{|C|}\sum_{c_{i}\in C}r(o,c_{i}). \tag{1}
$$
This coincides with standard RLVR aggregation under flat constraint assumptions.
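Eq. (1) is a plain average over the per-constraint verification results; a minimal sketch (function name illustrative):

```python
def reward_parallel(rewards):
    """Eq. (1): average of per-constraint results r(o, c_i) for c_i in C."""
    return sum(rewards) / len(rewards)
```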
Reward for Sequential Structure (Penalty Propagation).
For sequential structure $S=(c_{1},...,c_{n})$ , we introduce penalty propagation that discounts downstream rewards when earlier steps fail. The adjusted reward for $c_{i}$ is:
$$
r^{\prime}_{i}(o,S)=r(o,c_{i})\cdot\prod_{j<i}\gamma^{(1-r(o,c_{j}))}, \tag{2}
$$
where $\gamma\in[0,1)$ is a decay coefficient. The overall reward is:
$$
R_{\text{seq}}(o,S)=\frac{1}{|S|}\sum_{i=1}^{|S|}r^{\prime}_{i}(o,S). \tag{3}
$$
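Eqs. (2)-(3) can be sketched as a running count of earlier failures, since $\prod_{j<i}\gamma^{(1-r(o,c_j))}=\gamma^{\#\text{failures before }i}$ (function name and default $\gamma$ are illustrative; the paper does not fix a value here):

```python
def reward_sequential(rewards, gamma=0.5):
    """Eqs. (2)-(3): discount r(o, c_i) by gamma once per earlier failure,
    then average the adjusted rewards."""
    failures = 0
    adjusted = []
    for r in rewards:
        adjusted.append(r * gamma ** failures)  # r'_i = r_i * gamma^failures
        failures += 1 - r                        # a failed step adds a penalty
    return sum(adjusted) / len(adjusted)
```

For example, with `gamma=0.5` and results `[1, 0, 1]`, the third constraint earns only 0.5 because one earlier step failed, so the overall reward is 0.5 rather than the flat average 2/3.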
Reward for Conditional Structure (Branch Selection).
For conditional structure with trigger $c_{p}$ and branches $c_{\text{true}}$ , $c_{\text{false}}$ :
$$
R_{\text{cond}}(o,c_{p},c_{\text{true}},c_{\text{false}})=\begin{cases}r(o,c_{\text{true}}),&r(o,c_{p})=1,\\
r(o,c_{\text{false}}),&r(o,c_{p})=0.\end{cases} \tag{4}
$$
This ensures optimization focuses exclusively on the logically valid branch.
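Eq. (4) reduces to a branch select on the trigger result; a minimal sketch (names illustrative):

```python
def reward_conditional(r_trigger, r_true, r_false):
    """Eq. (4): reward only the branch selected by the trigger constraint c_p."""
    return r_true if r_trigger == 1 else r_false
```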
4 Experiment
| Models | Method | IFEval (Pr.(L)) | CFBench (ISR) | FollowBench (HSR) | ComplexBench (Overall) | WritingBench (Avg.) | Collie (Avg.) | AgentIF (CSR) | MultiChallenge (Overall) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Baseline | 84.8 | 65.3 | 70.4 | 71.6 | 75.5 | 49.8 | 58.5 | 12.9 |
| QwQ-32B | Baseline | 83.9 | 68.0 | 62.2 | 73.3 | 79.1 | 52.4 | 58.1 | 38.5 |
| Self-Supervised-7B | Baseline | 78.9 | 52.0 | 57.5 | 68.7 | 58.5 | 38.0 | 56.7 | 15.6 |
| VERIF-8B | Baseline | 87.1 | 41.0 | 56.9 | 54.7 | 50.8 | 28.3 | 56.6 | 15.0 |
| RAIF-7B | Baseline | 74.1 | 43.0 | 56.2 | 68.7 | 61.7 | 20.2 | 51.9 | 14.4 |
| SPAR-8B-DPO | Baseline | 82.4 | 37.0 | 56.1 | 63.8 | 47.0 | 27.7 | 53.6 | 17.1 |
| Crab-7B-DPO | Baseline | 57.7 | 25.0 | 49.4 | 59.0 | 45.4 | 19.6 | 47.2 | 14.1 |
| Conifer-7B-DPO | Baseline | 52.3 | 25.0 | 50.0 | 48.1 | 32.2 | 17.8 | 44.3 | 8.0 |
| Qwen2.5-1.5B-Instruct | Base | 43.6 | 22.0 | 34.6 | 45.9 | 44.8 | 13.0 | 42.8 | 12.0 |
| | SFT | 64.0 | 24.0 | 37.4 | 49.8 | 44.4 | 16.1 | 46.4 | 10.2 |
| | LsrIF | 68.8 (+25.2) | 28.0 (+6.0) | 38.9 (+4.3) | 52.4 (+6.5) | 46.8 (+2.0) | 19.3 (+6.3) | 51.5 (+8.7) | 14.4 (+2.4) |
| Qwen2.5-7B-Instruct | Base | 73.9 | 47.0 | 55.1 | 66.1 | 57.2 | 36.3 | 54.2 | 15.2 |
| | SFT | 75.2 | 43.0 | 55.7 | 68.5 | 51.2 | 30.5 | 55.5 | 14.5 |
| | LsrIF | 79.7 (+5.8) | 54.0 (+7.0) | 57.5 (+2.4) | 70.0 (+3.9) | 63.2 (+6.0) | 37.3 (+1.0) | 56.5 (+2.3) | 18.7 (+3.5) |
| Distill-Qwen-7B | Base | 61.7 | 36.0 | 41.7 | 55.2 | 53.0 | 25.2 | 47.2 | 13.9 |
| | SFT | 65.1 | 40.0 | 43.1 | 55.8 | 53.6 | 28.3 | 44.2 | 14.2 |
| | LsrIF | 71.5 (+9.8) | 47.0 (+11.0) | 44.0 (+2.3) | 61.1 (+5.9) | 55.0 (+2.0) | 30.0 (+4.8) | 46.7 (-0.5) | 15.0 (+1.1) |
| Llama-3.1-8B-Instruct | Base | 73.8 | 34.0 | 53.8 | 63.6 | 47.5 | 46.5 | 53.4 | 16.2 |
| | SFT | 77.4 | 36.0 | 52.2 | 61.1 | 46.9 | 34.5 | 55.2 | 14.9 |
| | LsrIF | 81.5 (+7.7) | 40.0 (+6.0) | 58.4 (+4.6) | 63.9 (+0.3) | 48.0 (+0.5) | 47.6 (+1.1) | 57.8 (+4.4) | 18.7 (+2.5) |
| Distill-Qwen-14B | Base | 74.9 | 55.0 | 51.2 | 72.7 | 61.0 | 34.4 | 54.5 | 17.2 |
| | SFT | 79.3 | 56.0 | 56.8 | 70.5 | 59.2 | 36.1 | 59.2 | 16.4 |
| | LsrIF | 82.1 (+7.2) | 60.0 (+5.0) | 58.2 (+7.0) | 75.5 (+2.8) | 63.8 (+2.8) | 38.8 (+4.4) | 61.7 (+7.2) | 18.3 (+1.1) |
| Qwen3-8B | Base | 87.8 | 66.0 | 56.4 | 78.5 | 75.1 | 45.5 | 64.4 | 29.8 |
| | SFT | 80.6 | 62.0 | 53.2 | 74.3 | 74.7 | 35.0 | 63.3 | 25.6 |
| | LsrIF | 90.2 (+2.4) | 68.0 (+2.0) | 58.1 (+1.7) | 79.2 (+0.7) | 75.6 (+0.5) | 48.1 (+2.6) | 65.0 (+0.6) | 32.3 (+2.5) |
Table 2: Model performance on in-domain (IFEval, CFBench, FollowBench) and out-of-domain (ComplexBench, WritingBench, Collie, AgentIF, MultiChallenge) instruction-following benchmarks.
4.1 Set-up
Models.
We conduct experiments on models of different scales from 1.5B to 14B to evaluate the effectiveness of our method across different architectures and parameter scales. Specifically, we evaluate on: (1) 1.5B: Qwen2.5-1.5B-Instruct; (2) 7B: Qwen2.5-7B-Instruct and Distill-Qwen-7B; (3) 8B: Llama-3.1-8B-Instruct and Qwen3-8B; (4) 14B: Distill-Qwen-14B. This diverse set of models allows us to assess the generalizability of our approach across different model families and scales.
Baselines.
We compare against both strong general-purpose models and specialized instruction-following models. General-purpose baselines include GPT-4o and QwQ-32B. Specialized instruction-following baselines include RAIF-7B, Self-Supervised-7B, VERIF-8B, SPAR-8B-DPO, Conifer-7B-DPO, and Crab-7B-DPO, which are specifically optimized for instruction-following tasks using various training paradigms including supervised fine-tuning, self-supervised learning, verification-based reinforcement learning, and direct preference optimization.
Training Methods.
We compare three training methods: Base uses the original model directly without any additional training; SFT fine-tunes the model on the dataset generated by the strong model GPT-4.1 using supervised fine-tuning; LsrIF is our logic-structured reinforcement learning training method that employs structure-aware reward modeling to align optimization signals with logical constraint structure execution semantics. For each model scale, we evaluate all three methods to demonstrate the effectiveness of our approach.
Evaluation Benchmarks.
We evaluate models on both in-domain and out-of-domain instruction following benchmarks. In-domain benchmarks include IFEval Zhou et al. (2023) (Pr.(L)), CFBench Zhang et al. (2024) (ISR), and FollowBench Jiang et al. (2023) (HSR). Out-of-domain benchmarks include ComplexBench Wen et al. (2024) (Overall), WritingBench Wu et al. (2025) (Avg.), Collie Yao et al. (2023) (Avg.), AgentIF Qi et al. (2025) (CSR), and MultiChallenge Deshpande et al. (2025) (Overall). Details of the experiment set-up are provided in Appx. A.4.
4.2 Performance
Instruction Following Performance.
As shown in Tab. 2, LsrIF significantly improves instruction following capabilities across different models on both in-domain and out-of-domain benchmarks. LsrIF consistently outperforms Base and SFT across all model scales, with improvements on various metrics.
On in-domain benchmarks, LsrIF achieves substantial gains across all model scales. For smaller models, Qwen2.5-1.5B-Instruct shows remarkable improvements, improving by 25.2 on IFEval and 6.0 on CFBench. For 7B models, Qwen2.5-7B-Instruct improves by 5.8 on IFEval and 7.0 on CFBench. For stronger models, Qwen3-8B achieves strong performance with improvements of 2.4 on IFEval and 2.0 on CFBench. On out-of-domain benchmarks, LsrIF demonstrates consistent improvements across diverse evaluation scenarios. Qwen2.5-7B-Instruct improves by 6.0 on WritingBench and 3.5 on MultiChallenge. Qwen2.5-1.5B-Instruct shows improvements of 6.5 on ComplexBench and 8.7 on AgentIF.
Notably, LsrIF enables models to outperform specialized baselines even when the base model initially underperforms them. For instance, Qwen2.5-7B-Instruct underperforms RAIF-7B and Self-Supervised-7B, but after LsrIF training it exceeds both baselines by substantial margins. After LsrIF, Qwen3-8B achieves 90.2 on IFEval, higher than GPT-4o (84.8) and VERIF-8B (87.1), demonstrating state-of-the-art performance on this benchmark.
Logical Reasoning Performance.
We evaluate logical reasoning capabilities using Enigmata Chen et al. (2025), a comprehensive benchmark suite designed to assess logical reasoning abilities of large language models. Enigmata comprises 36 tasks distributed across seven categories, with each task equipped with generators that can produce infinite examples and rule-based verifiers. The benchmark evaluates four key reasoning subcategories: Logic (formal logical inference), Arithmetic (mathematical computation and reasoning), Graph (graph-based problem solving) and Search (path-finding task).
As shown in Tab. 3, LsrIF effectively enhances both logical reasoning and general capabilities. On Enigmata, LsrIF outperforms base models across all subcategories, with particularly strong gains on Arithmetic. For Distill-Qwen-7B, Arithmetic improves by 10.6, while Logic increases by 2.7 and Graph by 6.4. For Distill-Qwen-14B, Arithmetic shows the most substantial improvement, increasing by 18.0, with Logic improving by 3.7 and Graph by 2.2. The significant improvements on Arithmetic suggest that LsrIF's structure-aware reward modeling effectively captures mathematical constraint satisfaction, enabling models to better follow numerical and computational requirements in instructions.
| Model | Logic | Arithmetic | Graph | Search | Overall | AIME2024 | AIME2025 | GPQA-Diamond | MT-Bench | AlpacaEval2.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Distill-Qwen-7B | 10.9 | 3.7 | 11.1 | 4.4 | 9.9 | 53.4 | 38.7 | 49.1 | 5.9 | 5.0 |
| Distill-Qwen-7B-LsrIF | **13.6** | **14.3** | **17.5** | **4.6** | **12.4** | **55.1** | **41.2** | **52.5** | **6.3** | **5.8** |
| Distill-Qwen-14B | 44.7 | 21.0 | 31.1 | 10.5 | 22.4 | 69.3 | 49.0 | 58.6 | 6.6 | 26.7 |
| Distill-Qwen-14B-LsrIF | **48.4** | **39.0** | **33.3** | **14.1** | **24.4** | **70.2** | **49.6** | **60.1** | **7.0** | **30.3** |
Table 3: Model performance on logical reasoning (the first five columns, from Enigmata) and general capabilities benchmarks. AIME is evaluated with Avg@30. Bolded values indicate the best result for each model on each benchmark.
On general capabilities benchmarks, which encompass mathematics (AIME2024, AIME2025), science (GPQA-Diamond), and general instruction following (MT-Bench, AlpacaEval2.0), LsrIF brings consistent improvements across all evaluated benchmarks. These results demonstrate that LsrIF not only enhances logical reasoning capabilities but also improves general model performance across diverse evaluation domains.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Comparison
### Overview
The image is a bar chart comparing the performance of four different models (Distill-Qwen-7B, Llm as a Judge, Our-RM-7B (Inst.-Level), and Our-RM-7B (Const.-Level)) across three evaluation benchmarks (IFEval, AIME, and CFBench). The y-axis represents performance, ranging from 0 to 80.
### Components/Axes
* **X-axis:** Evaluation benchmarks: IFEval, AIME, CFBench
* **Y-axis:** Performance, with a scale from 0 to 80 in increments of 10.
* **Legend (Top-Right):**
* Orange: Distill-Qwen-7B (Base)
* Light Blue: Llm as a Judge (Const.-Level)
* Light Green: Our-RM-7B (Inst.-Level)
* Light Yellow: Our-RM-7B (Const.-Level)
### Detailed Analysis
**IFEval Benchmark:**
* Distill-Qwen-7B (Base) (Orange): Approximately 62
* Llm as a Judge (Const.-Level) (Light Blue): Approximately 66
* Our-RM-7B (Inst.-Level) (Light Green): Approximately 70
* Our-RM-7B (Const.-Level) (Light Yellow): Approximately 72
**AIME Benchmark:**
* Distill-Qwen-7B (Base) (Orange): Approximately 54
* Llm as a Judge (Const.-Level) (Light Blue): Approximately 55
* Our-RM-7B (Inst.-Level) (Light Green): Approximately 53
* Our-RM-7B (Const.-Level) (Light Yellow): Approximately 56
**CFBench Benchmark:**
* Distill-Qwen-7B (Base) (Orange): Approximately 36
* Llm as a Judge (Const.-Level) (Light Blue): Approximately 42
* Our-RM-7B (Inst.-Level) (Light Green): Approximately 44
* Our-RM-7B (Const.-Level) (Light Yellow): Approximately 47
### Key Observations
* Across all benchmarks, Our-RM-7B (Const.-Level) generally shows the highest performance.
* Distill-Qwen-7B (Base) consistently shows the lowest performance among the four models.
* The performance difference between the models is most pronounced in the IFEval benchmark.
* All models perform worst on the CFBench benchmark.
### Interpretation
The bar chart provides a comparative analysis of the performance of four language models across three different evaluation benchmarks. The data suggests that the "Our-RM-7B (Const.-Level)" model generally outperforms the other models, while "Distill-Qwen-7B (Base)" model generally underperforms. The varying performance across different benchmarks indicates that the models have different strengths and weaknesses depending on the type of evaluation. The IFEval benchmark seems to be the most discriminating, showing the largest performance differences between the models. The CFBench benchmark appears to be the most challenging for all models.
</details>
Figure 3: LsrIF performance on different reward forms. Const.-Level and Inst.-Level refer to constraint-level and instruction-level, respectively.
4.3 Ablation Studies
As shown in Tab. 4, removing any component degrades performance relative to the full LsrIF. Replacing LsRM with structure-agnostic averaging of rewards across all constraints results in the largest drop, indicating its critical importance: without LsRM, performance decreases by 2.9 on IFEval, 5.0 on CFBench, and 2.7 on AIME2024. This demonstrates that structure-aware reward modeling is essential for effectively capturing logical constraint relationships.
Removing sequential data from LsrInstruct also leads to performance decreases, with drops of 1.6 on IFEval and 3.0 on CFBench. Similarly, removing conditional data results in decreases of 1.8 on IFEval, 3.0 on CFBench, and 3.5 on AIME2024.
All ablation variants still outperform the base model, indicating that even partial components of LsrIF provide substantial benefits. These results demonstrate that each component, the logic-structured reward modeling and the logic-structured dataset construction, plays a crucial role in the overall effectiveness of LsrIF.
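In toy form, the three aggregation rules of LsRM (average aggregation for parallel constraints, failure-penalty propagation for sequential constraints, selective scoring for conditional branches) can be sketched as follows. The binary satisfaction threshold and the exact propagation rule (zero credit after the first failed step) are illustrative assumptions, not the paper's implementation:

```python
def lsrm_reward(structure, rewards, condition_met=None):
    """Aggregate per-constraint rewards under one logic structure.

    rewards: per-constraint scores in [0, 1]; for "conditional",
             a pair [if_branch_rewards, else_branch_rewards].
    """
    if structure == "parallel":
        # Average aggregation: constraints are independent.
        return sum(rewards) / len(rewards)
    if structure == "sequential":
        # Failure-penalty propagation (assumed rule): once a step
        # is not fully satisfied, later steps earn no credit.
        total = 0.0
        for r in rewards:
            if r < 1.0:
                break
            total += r
        return total / len(rewards)
    if structure == "conditional":
        # Selective reward: only the triggered branch is scored.
        if_branch, else_branch = rewards
        branch = if_branch if condition_met else else_branch
        return sum(branch) / len(branch)
    raise ValueError(f"unknown structure: {structure}")
```

Under this sketch, an unmet second step in a three-step sequence zeroes credit for all remaining steps, whereas the same failure under parallel aggregation only lowers the average.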
4.4 Robustness of LsrIF
| Config | IFEval | CFBench | AIME2024 | Enigmata |
| --- | --- | --- | --- | --- |
| Distill-Qwen-7B | 61.7 | 36.0 | 53.4 | 9.9 |
| Distill-Qwen-7B-LsrIF | **71.5** | **47.0** | **55.1** | **12.4** |
| w/o LsRM | 68.6 | 42.0 | 52.4 | 10.5 |
| w/o Sequential Data | 69.9 | 44.0 | 54.0 | 11.0 |
| w/o Conditional Data | 69.7 | 44.0 | 51.6 | 10.9 |
Table 4: Ablation study results on different abilities. Bolded values indicate the best performance.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Chart: Model Performance vs. Depth
### Overview
The image is a line chart comparing the performance (Score) of two models, Distill-Qwen-7B and Distill-Qwen-14B, at different depths (1, 2, and 3). Each model has two versions: a "Base" version and an "LsrIF" version. The chart shows how the score changes with depth for each of these versions.
### Components/Axes
* **X-axis:** Depth, with markers at 1, 2, and 3.
* **Y-axis:** Score, ranging from 40 to 70, with gridlines at intervals of 5.
* **Legend (top):**
* Distill-Qwen-7B (Base): Dashed blue line
* Distill-Qwen-7B (LsrIF): Solid blue line
* Distill-Qwen-14B (Base): Dashed green line
* Distill-Qwen-14B (LsrIF): Solid green line
### Detailed Analysis
* **Distill-Qwen-7B (Base):** (Dashed blue line) Starts at approximately 53 at depth 1, decreases to approximately 43 at depth 2, and further decreases to approximately 40 at depth 3.
* **Distill-Qwen-7B (LsrIF):** (Solid blue line) Starts at approximately 62 at depth 1, decreases to approximately 44 at depth 2, and increases slightly to approximately 46 at depth 3.
* **Distill-Qwen-14B (Base):** (Dashed green line) Starts at approximately 72 at depth 1, decreases to approximately 64 at depth 2, and decreases to approximately 55 at depth 3.
* **Distill-Qwen-14B (LsrIF):** (Solid green line) Starts at approximately 72 at depth 1, decreases slightly to approximately 69 at depth 2, and remains approximately at 69 at depth 3.
### Key Observations
* The "LsrIF" versions of both models generally outperform their "Base" counterparts.
* The performance of all models tends to decrease as depth increases, except for Distill-Qwen-7B (LsrIF), which shows a slight increase from depth 2 to depth 3.
* Distill-Qwen-14B models generally outperform Distill-Qwen-7B models.
### Interpretation
The chart suggests that the "LsrIF" modification improves the performance of both Distill-Qwen models, and that performance generally degrades as the nesting depth of the constraint structure increases. The Distill-Qwen-14B models consistently outperform the Distill-Qwen-7B models, indicating that larger model size contributes to better performance. The "LsrIF" versions also degrade more gracefully with depth, with Distill-Qwen-7B (LsrIF) even recovering slightly from depth 2 to depth 3.
</details>
Figure 4: Performance on nested structures from Wen et al. (2024).
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap: Attention and MLP Layer Analysis
### Overview
The image is a heatmap visualizing the activity or importance of different layers in a neural network, specifically focusing on attention (attn.) and multilayer perceptron (mlp.) layers. The heatmap uses a color gradient from blue to orange to represent values, with blue indicating lower values and orange indicating higher values. The y-axis represents different layers, numbered from 0 to 27. The x-axis represents different components within the attention and MLP layers.
### Components/Axes
* **Y-axis:** Represents the layer number, ranging from 0 to 27 in increments of 4, with visible markers at 0, 4, 8, 12, 16, 20, 24, and 27.
* **X-axis:** Represents the different components of the attention and MLP layers:
* attn. q (attention query)
* attn. k (attention key)
* attn. v (attention value)
* attn. o (attention output)
* mlp. up (MLP up-projection)
* mlp. down (MLP down-projection)
* mlp. gate (MLP gate)
* **Color Legend:** Located on the right side of the heatmap.
* Orange: Represents a value of approximately 0.105.
* White: Represents a value between 0.090 and 0.105.
* Light Blue: Represents a value of approximately 0.090.
* Dark Blue: Represents a value of approximately 0.075.
### Detailed Analysis
* **attn. q:** The values are generally low (blue) across all layers, with a slight increase towards the top layers (24-27).
* **attn. k:** Similar to attn. q, the values are low (blue) across all layers.
* **attn. v:** The values are generally higher (orange) in the top layers (24-27) and decrease towards the bottom layers.
* **attn. o:** The values are mixed, with some layers showing higher values (orange) and others showing lower values (blue).
* **mlp. up:** The values are generally higher (orange) across all layers.
* **mlp. down:** The values are generally higher (orange) across all layers.
* **mlp. gate:** The values are mixed, with some layers showing higher values (orange) and others showing lower values (blue).
### Key Observations
* The attention query (attn. q) and key (attn. k) components consistently show lower values across all layers.
* The attention value (attn. v) component shows higher values in the top layers.
* The MLP up-projection (mlp. up) and down-projection (mlp. down) components consistently show higher values across all layers.
### Interpretation
The heatmap suggests that the attention query and key components might have less influence or activity compared to the attention value component, especially in the higher layers of the network. The consistent high values in the MLP up-projection and down-projection components indicate their importance across all layers. The mixed values in the attention output and MLP gate components suggest that their activity might be more layer-dependent. This visualization can help in understanding the flow of information and the relative importance of different components within the neural network architecture.
</details>
(a) Qwen2.5-7B-Instruct
<details>
<summary>x6.png Details</summary>

### Visual Description
## Heatmap: Attention and MLP Layer Analysis
### Overview
The image is a heatmap visualizing the activity or importance of different components within attention and Multilayer Perceptron (MLP) layers. The heatmap uses a color gradient from blue to orange, representing values from approximately 0.12 to 0.18. The y-axis represents layers, numbered from 0 to 27. The x-axis represents different components: attention query (attn. q), attention key (attn. k), attention value (attn. v), attention output (attn. o), MLP up, MLP down, and MLP gate.
### Components/Axes
* **Y-axis:** Layers, ranging from 0 to 27 in increments of 4, with labels at 0, 4, 8, 12, 16, 20, 24, and 27.
* **X-axis:** Components of attention and MLP layers:
* attn. q (attention query)
* attn. k (attention key)
* attn. v (attention value)
* attn. o (attention output)
* mlp. up (MLP up)
* mlp. down (MLP down)
* mlp. gate (MLP gate)
* **Color Scale (Legend):** Located on the right side of the heatmap.
* Blue: ~0.12
* White: ~0.15
* Orange: ~0.18
### Detailed Analysis
* **attn. q (attention query):** The top layers (20-27) show high activity (orange), decreasing towards the bottom layers (0-8), where activity is low (white/light orange).
* **attn. k (attention key):** Similar to attn. q, the top layers (20-27) show high activity (orange), decreasing towards the bottom layers (0-8), where activity is low (white/light orange).
* **attn. v (attention value):** Shows low activity (blue) across all layers, with some slight increases (lighter blue) in the middle layers (8-16).
* **attn. o (attention output):** Shows low activity (blue) across all layers, with some slight increases (lighter blue) in the middle layers (8-16).
* **mlp. up (MLP up):** Shows moderate activity (light orange) across all layers, with some slight increases (orange) in the middle layers (12-20).
* **mlp. down (MLP down):** Shows moderate activity (light orange) across all layers, with some slight increases (orange) in the middle layers (12-20).
* **mlp. gate (MLP gate):** Shows moderate activity (light orange) across all layers, with some slight increases (orange) in the middle layers (12-20).
### Key Observations
* Attention queries and keys (attn. q and attn. k) are most active in the top layers.
* Attention values and outputs (attn. v and attn. o) show relatively low activity across all layers.
* MLP components (mlp. up, mlp. down, and mlp. gate) show moderate activity across all layers, with some increased activity in the middle layers.
### Interpretation
The heatmap suggests that in the analyzed model, attention queries and keys are more important in the higher layers, while attention values and outputs have a more consistent, lower-level impact. The MLP components show a relatively uniform distribution of activity across layers, with a slight increase in the middle layers, suggesting a consistent role throughout the network with some increased importance in the middle layers. The distinct patterns for attention and MLP components indicate different roles and importance within the model's architecture. The color scale indicates the relative importance or activity level of each component at each layer.
</details>
(b) Distill-Qwen-7B
Figure 5: Parameter change rates of LLMs to the original ones across different modules. Darker orange colors indicate larger parameter changes.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: Instruction Execution Steps
### Overview
The image presents a set of instructions for generating text, visualized both as a paragraph of text and as a stacked block diagram. The diagram breaks down the instructions into three steps, each associated with a specific action.
### Components/Axes
* **Textual Instructions:** A paragraph describing the steps to generate a text.
* **Stacked Block Diagram:** A visual representation of the steps, with each block representing a step. The blocks are numbered 1, 2, and 3 from top to bottom.
* **Labels:** Text associated with each block, describing the action to be performed.
* **Crown:** A golden crown is placed above the first block.
### Detailed Analysis
**Textual Instructions:**
"First, generate a short instructional paragraph and ensure the total length does not exceed three sentences; then, append a clearly separated checklist section using bullet points; if the word "error" appears anywhere in the output, all checklist items must be written in lowercase English, else the instructional paragraph must begin with a bolded core idea; finally, apply a formal, technical writing style to the entire output."
**Stacked Block Diagram:**
* **Block 1 (Top, Dark Blue):** Numbered "1". Labeled "First/then/else..." in red text to the right.
* **Block 2 (Middle, Light Blue):** Numbered "2". Labeled "bullet/lowercase/bolded..." in red text to the right.
* **Block 3 (Bottom, Very Light Blue):** Numbered "3". Labeled "apply/formal/generate..." in red text to the right.
### Key Observations
* The textual instructions provide a detailed description of the steps.
* The stacked block diagram simplifies the instructions into three key actions.
* The crown above the first block may indicate the most important step or the starting point.
* The red text labels next to the blocks summarize the actions associated with each step.
### Interpretation
The image illustrates a process for generating text that involves multiple steps. The textual instructions provide the full details, while the stacked block diagram offers a simplified, visual representation of the process. The diagram emphasizes the sequential nature of the steps, with each block building upon the previous one. The crown suggests that the first step is crucial, possibly setting the foundation for the entire process. The red labels provide a concise summary of each step, making the diagram easy to understand at a glance. The use of "else" in the first step indicates a conditional branch in the process, depending on whether the word "error" appears in the output.
</details>
(a) Instruction Following: more attention on constraints and their underlying logic
<details>
<summary>x8.png Details</summary>

### Visual Description
## Text Analysis with Diagram
### Overview
The image presents a logical reasoning problem alongside a diagram that visually represents different levels of logical connection. The problem consists of premises and a conclusion, and the diagram seems to offer a key to interpreting the strength or type of logical links within the premises. The text on the left presents a series of premises and a conclusion, while the diagram on the right provides a visual key to understanding the relationships between the premises and the conclusion.
### Components/Axes
**Left Side (Textual Problem):**
* **Instruction:** "Please determine whether the conclusion is true, false, or uncertain based on these premises."
* **Premises:** A series of statements about people in a club, their activities, and their relationships to school events and community involvement.
* **Conclusion:** "Bonnie performs in school talent shows often."
**Right Side (Diagram):**
* A three-tiered structure, resembling stacked platforms or steps. Each level is numbered 1, 2, and 3 from top to bottom.
* Each level is colored in a gradient of blue, with the top level being the lightest and the bottom level being the darkest.
* A golden crown sits atop the first level.
* Labels are associated with each level, written in red text:
* Level 1: "or/and..."
* Level 2: "either/not..."
* Level 3: "often/attends..."
### Detailed Analysis or ### Content Details
**Textual Problem Transcription:**
"Please determine whether the conclusion is true, false, or uncertain based on these premises.
Premises are: People in this club who perform in school talent shows often attend and are very engaged with school events. People in this club either perform in school talent shows often or are inactive and disinterested community members. People in this club who chaperone high school dances are not students who attend the school. All people in this club who are inactive and disinterested members of their community chaperone high school dances. All young children and teenagers in this club who wish to further their academic careers and educational opportunities are students who attend the school.
Bonnie is in this club and she either both attends and is very engaged with school events and is a student who attends the school or is not someone who both attends and is very engaged with school events and is not a student who attends the school.
Conclusion is: Bonnie performs in school talent shows often."
**Diagram Details:**
* **Level 1:** Lightest blue, labeled "1" and "or/and..."
* **Level 2:** Medium blue, labeled "2" and "either/not..."
* **Level 3:** Darkest blue, labeled "3" and "often/attends..."
### Key Observations
* The diagram appears to provide a key for interpreting the logical strength or type of connections within the premises.
* The color gradient in the diagram may indicate a hierarchy or strength of logical connection, with darker colors potentially representing stronger or more direct relationships.
* The keywords associated with each level ("or/and," "either/not," "often/attends") suggest different types of logical relationships.
* The text on the left has blue highlights over certain words.
### Interpretation
The image presents a logical reasoning problem where the task is to determine the validity of a conclusion based on a set of premises. The diagram on the right provides a visual aid for interpreting the logical relationships within the premises. The three levels of the diagram, each associated with different logical keywords, likely represent different types or strengths of connections between the premises and the conclusion. The user is expected to analyze the premises, identify the relevant logical connections using the diagram as a guide, and then determine whether the conclusion is true, false, or uncertain based on this analysis. The blue highlights on the left side are likely to indicate the words that are most important to the logical argument.
</details>
(b) Logic Reasoning: more attention on logical connectors
Figure 6: Comparison of attention importance changes for each token position in Qwen2.5-7B-Instruct before and after training on instruction following and logic reasoning tasks. Darker colors indicate greater increases.
4.4.1 Robustness to Reward Modeling
We compare our reward model with alternative reward methods to demonstrate the robustness of our approach to different reward forms. As shown in Fig. 3, all reward methods outperform the baseline, indicating that our method is robust to different reward forms. LLM-as-a-Judge (Qwen2.5-7B-Instruct) with constraint-level rewards improves over the base model on IFEval and CFBench. Our reward model with instruction-level rewards further improves performance on IFEval and CFBench, while our constraint-level variant achieves the best performance across all evaluated benchmarks.
Furthermore, our RM consistently outperforms LLM-as-a-Judge, demonstrating the superior effectiveness of our reward model. The constraint-level variant achieves substantial improvements over LLM-as-a-Judge on both IFEval and CFBench. Both instruction-level and constraint-level variants of our RM achieve competitive performance, with the constraint-level variant achieving the best overall results, indicating that our method is effective for different reward granularity. The superior performance of constraint-level rewards suggests that fine-grained constraint evaluation enables more precise optimization signals compared to instruction-level aggregation.
4.4.2 Generalization to Nested Structures
We conduct experiments to evaluate the performance of our method under nested logical-structure constraints. Although our training data only contains non-nested structures, LsrIF still improves performance on nested constraint structures: Selection_1 (depth 1), Selection_and_Chain_2 (depth 2), and Selection_and_Chain_3 (depth 3) from ComplexBench. As shown in Fig. 4, LsrIF maintains better performance across all depths compared to Base models. These results indicate that the improvements gained from training on non-nested structures generalize effectively to nested constraint structures, with the benefits becoming more pronounced at higher nesting depths.
5 Interpretability Analysis
5.1 Parameter Change Patterns
Fig. 5 presents the relative parameter change rates across layers and modules after LsrIF training. The change rate is measured using the normalized Frobenius norm:
$$
\Delta=\frac{\|W_{\text{after}}-W_{\text{before}}\|_{F}}{\|W_{\text{before}}\|_{F}}\times 100\%, \tag{5}
$$
where $W_{\text{before}}$ and $W_{\text{after}}$ denote the parameters before and after training. For a model with $L$ layers, let $\Delta_{m}^{(l)}$ denote the change rate for module $m$ at layer $l$ .
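As a concrete illustration of Eq. 5, the change rate for one weight matrix (stored here as a plain nested list for self-containedness) can be computed as:

```python
import math

def frobenius_change_rate(w_before, w_after):
    """Normalized Frobenius-norm parameter change of Eq. 5, in percent."""
    def fro(m):
        # Frobenius norm: sqrt of the sum of squared entries.
        return math.sqrt(sum(x * x for row in m for x in row))
    diff = [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(w_after, w_before)]
    return fro(diff) / fro(w_before) * 100.0
```

In practice this would be evaluated once per module $m$ and layer $l$ to obtain the $\Delta_{m}^{(l)}$ values plotted as heatmaps in Fig. 5.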
Attention vs. MLP Modules. A clear pattern observed in Fig. 5 is that attention modules undergo substantially larger parameter changes than MLP modules across most layers. In particular, the query and key projection matrices exhibit the highest change rates, while MLP up and down projections show comparatively smaller and more uniform updates:
$$
\Delta^{(l)}_{\text{attn.q}},\ \Delta^{(l)}_{\text{attn.k}}\;>\;\Delta^{(l)}_{\text{mlp.up}},\ \Delta^{(l)}_{\text{mlp.down}}. \tag{6}
$$
Layer-wise Trends. This discrepancy between attention and MLP updates is consistent across layers. Although both module types display some variation along depth, attention-related parameters consistently dominate the overall magnitude of change, especially in lower and upper layers. In contrast, MLP parameters remain relatively stable throughout the network.
Model Consistency. The same trend holds for both Qwen2.5-7B-Instruct and Distill-Qwen-7B. While the distilled model shows larger absolute change magnitudes, the relative dominance of attention parameter updates over MLP updates remains consistent.
Overall, these results indicate that LsrIF primarily induces stronger updates in attention mechanisms, whereas MLP layers are affected to a much lesser extent.
5.2 Token-Level Information Flow Analysis
We analyze token-level information flow using gradient-based saliency attribution to quantify how training redirects attention to semantically critical tokens. For token $x_{i}$ with embedding $E_{i}$ , the attribution score is defined as
$$
S_{i}=\left|\sum_{d=1}^{D}\frac{\partial L}{\partial E_{i,d}}\cdot E_{i,d}\right|. \tag{7}
$$
The sequence-level loss function is defined as
$$
L(x,y)=\sum_{t=1}^{|y|}\log P(y_{t}\mid y_{<t},x). \tag{8}
$$
The change in attention importance is measured as
$$
\Delta S_{i}=S_{i}^{\text{after}}-S_{i}^{\text{before}}, \tag{9}
$$
where higher values indicate greater increases in attention importance.
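In toy form, Eqs. 7 and 9 reduce to a gradient-times-input product per token; the gradients are assumed to be precomputed (e.g., by autograd against the loss in Eq. 8) and passed in as lists:

```python
def saliency(grads, embeds):
    """Eq. 7: S_i = | sum_d (dL/dE_{i,d}) * E_{i,d} | for each token i."""
    return [abs(sum(g * e for g, e in zip(g_i, e_i)))
            for g_i, e_i in zip(grads, embeds)]

def attention_shift(s_before, s_after):
    """Eq. 9: per-token change in attention importance after training."""
    return [a - b for a, b in zip(s_after, s_before)]
```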
As shown in Fig. 6, training shifts attention from diffuse to concentrated patterns, directly corresponding to parameter changes in attention query and key modules (Fig. 5). For instruction following tasks, we observe a hierarchical attention increase across three token categories: logical connectors ("First", "then", "else") show the highest increase, constraint tokens ("bullet", "lowercase", "bolded") show moderate increase, and action verbs ("apply", "formal") show lower increase. For logic reasoning tasks, we observe a similar hierarchical pattern: logical operators ("or", "and") show the highest increase, followed by choice/negation terms ("either", "not") and descriptive predicates ("attends").
This hierarchical pattern indicates that the model prioritizes structural elements encoding logical relationships, aligning with structure-aware reward modeling. The substantial updates to $\Delta^{(l)}_{\text{attn.q}}$ and $\Delta^{(l)}_{\text{attn.k}}$ enable query and key representations that prioritize tokens encoding logical structures. The attention mechanism computes query-key similarities where query and key projections are updated to maximize attention weights for structural tokens, validating that LsrIF adapts attention mechanisms to capture constraint relationships rather than merely adjusting output representations.
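The mechanism described above can be made concrete with a single-query softmax attention sketch: if updated query and key projections increase the dot product between a query and the key of a structural token, that token's attention weight rises at the expense of the others. The vectors below are purely illustrative:

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled query-key dot products for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

A key aligned with the query (e.g., that of a logical-operator token after training) receives the larger share of the attention mass.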
6 Conclusion
In this work, we propose LsrIF, a logic-structured training framework. We construct LsrInstruct, a multi-constraint instruction dataset covering parallel, sequential, and conditional constraint logic structures, and design LsRM, structure-aware reward modeling that aligns training signals with logical execution semantics. LsrIF improves instruction following in both in-domain and out-of-domain settings, while also enhancing general reasoning ability. We also conduct attention and token-level interpretability analysis for model performance improvements.
7 Limitations
Our study has the following main limitations. First, due to computational constraints, we do not evaluate our method on larger models (70B+); validation at this scale would further strengthen the credibility and robustness of our approach. Second, our training data is primarily English. While results on CFBench indicate that logic-structured training can generalize to other languages, we encourage the community to construct multilingual logic-structured instruction datasets to more systematically assess and extend cross-lingual generalization.
References
- K. An, L. Sheng, G. Cui, S. Si, N. Ding, Y. Cheng, and B. Chang (2025) UltraIF: advancing instruction following from the wild. arXiv preprint arXiv:2502.04153. Cited by: §1.
- J. Chen, Q. He, S. Yuan, A. Chen, Z. Cai, W. Dai, H. Yu, Q. Yu, X. Li, J. Chen, et al. (2025) Enigmata: scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914. Cited by: §A.4.3, §4.2.
- J. Cheng, X. Liu, C. Wang, X. Gu, Y. Lu, D. Zhang, Y. Dong, J. Tang, H. Wang, and M. Huang (2024) Spar: self-play with tree-search refinement to improve instruction-following in large language models. arXiv preprint arXiv:2412.11605. Cited by: §A.4.2, §2.1.
- K. Deshpande, V. Sirdeshmukh, J. B. Mols, L. Jin, E. Hernandez-Cardona, D. Lee, J. Kritz, W. E. Primack, S. Yue, and C. Xing (2025) MultiChallenge: a realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 18632–18702. Cited by: §A.4.3, §1, §4.1.
- Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2024) Length-controlled alpacaeval: a simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475. Cited by: §A.4.3.
- S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, et al. (2022) Folio: natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840. Cited by: §A.4.3.
- Q. He, J. Zeng, Q. He, J. Liang, and Y. Xiao (2024) From complex to simple: enhancing multi-constraint complex instruction following ability of large language models. arXiv preprint arXiv:2404.15846. Cited by: §1.
- H. Huang, J. Liu, Y. He, S. Li, B. Xu, C. Zhu, M. Yang, and T. Zhao (2025) Musc: improving complex instruction following with multi-granularity self-contrastive training. arXiv preprint arXiv:2502.11541. Cited by: §1, §2.2.
- Y. Jiang, Y. Wang, X. Zeng, W. Zhong, L. Li, F. Mi, L. Shang, X. Jiang, Q. Liu, and W. Wang (2023) Followbench: a multi-level fine-grained constraints following benchmark for large language models. arXiv preprint arXiv:2310.20410. Cited by: §A.4.3, §4.1.
- A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, et al. (2024) OpenAssistant conversations: democratizing large language model alignment. Advances in Neural Information Processing Systems 36. Cited by: §3.1.
- J. Li, L. Du, H. Zhao, B. Zhang, L. Wang, B. Gao, G. Liu, and Y. Lin (2025) Infinity instruct: scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116. Cited by: §3.1.
- K. Lu, Z. Chen, S. Fu, C. H. Yang, J. Balam, B. Ginsburg, Y. F. Wang, and H. Lee (2025) Developing instruction-following speech language model without speech instruction-tuning data. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. Cited by: §1.
- MAA (2024) American invitational mathematics examination - aime. Note: Accessed in February 2024 External Links: Link Cited by: §A.4.3.
- MAA (2025) American invitational mathematics examination - aime. Note: Accessed in February 2025 External Links: Link Cited by: §A.4.3.
- H. Peng, Y. Qi, X. Wang, B. Xu, L. Hou, and J. Li (2025) VerIF: verification engineering for reinforcement learning in instruction following. arXiv preprint arXiv:2506.09942. Cited by: §A.4.2, §1, §2.2.
- Y. Qi, H. Peng, X. Wang, A. Xin, Y. Liu, B. Xu, L. Hou, and J. Li (2025) Agentif: benchmarking instruction following of large language models in agentic scenarios. arXiv preprint arXiv:2505.16944. Cited by: §A.4.3, §1, §4.1.
- Y. Qi, H. Peng, X. Wang, B. Xu, L. Hou, and J. Li (2024) Constraint back-translation improves complex instruction following of large language models. arXiv preprint arXiv:2410.24175. Cited by: §A.4.2, §2.2.
- Y. Qin, G. Li, Z. Li, Z. Xu, Y. Shi, Z. Lin, X. Cui, K. Li, and X. Sun (2025) Incentivizing reasoning for advanced instruction-following of large language models. arXiv preprint arXiv:2506.01413. Cited by: §A.4.2, §1, §2.1, §2.2.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: §A.4.3.
- Q. Ren, Q. He, B. Zhang, J. Zeng, J. Liang, Y. Xiao, W. Zhou, Z. Sun, and F. Yu (2025) Instructions are all you need: self-supervised reinforcement learning for instruction following. arXiv preprint arXiv:2510.14420. Cited by: §A.4.2, §3.2.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §3.2.
- H. Sun, L. Liu, J. Li, F. Wang, B. Dong, R. Lin, and R. Huang (2024) Conifer: improving complex constrained instruction-following ability of large language models. arXiv preprint arXiv:2404.02823. Cited by: §A.4.2, §1, §2.2.
- C. Wang, Y. Zhou, Q. Wang, Z. Wang, and K. Zhang (2025) Complexbench-edit: benchmarking complex instruction-driven image editing via compositional dependencies. In Proceedings of the 33rd ACM International Conference on Multimedia, pp. 13391–13397. Cited by: §1, §2.1.
- Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2022a) Self-instruct: aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560. Cited by: §3.1.
- Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. (2022b) Super-naturalinstructions: generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705. Cited by: §3.1.
- B. Wen, P. Ke, X. Gu, L. Wu, H. Huang, J. Zhou, W. Li, B. Hu, W. Gao, J. Xu, et al. (2024) Benchmarking complex instruction-following with multiple constraints composition. Advances in Neural Information Processing Systems 37, pp. 137610–137645. Cited by: §A.4.3, §1, §2.1, Figure 4, §4.1.
- Y. Wu, J. Mei, M. Yan, C. Li, S. Lai, Y. Ren, Z. Wang, J. Zhang, M. Wu, Q. Jin, et al. (2025) Writingbench: a comprehensive benchmark for generative writing. arXiv preprint arXiv:2503.05244. Cited by: §A.4.3, §4.1.
- S. Yao, H. Chen, A. W. Hanjie, R. Yang, and K. Narasimhan (2023) Collie: systematic construction of constrained text generation tasks. arXiv preprint arXiv:2307.08689. Cited by: §A.4.3, §4.1.
- J. Ye, C. Huang, Z. Chen, W. Fu, C. Yang, L. Yang, Y. Wu, P. Wang, M. Zhou, X. Yang, et al. (2025) A multi-dimensional constraint framework for evaluating and improving instruction following in large language models. arXiv preprint arXiv:2505.07591. Cited by: §1.
- J. Zhang, R. Xie, Y. Hou, X. Zhao, L. Lin, and J. Wen (2025) Recommendation as instruction following: a large language model empowered recommendation approach. ACM Transactions on Information Systems 43 (5), pp. 1–37. Cited by: §1.
- T. Zhang, C. Zhu, Y. Shen, W. Luo, Y. Zhang, H. Liang, F. Yang, M. Lin, Y. Qiao, W. Chen, et al. (2024) Cfbench: a comprehensive constraints-following benchmark for llms. arXiv preprint arXiv:2408.01122. Cited by: §A.4.3, Table 6, §4.1.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623. Cited by: §A.4.3.
- Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma (2024) Llamafactory: unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372. Cited by: §A.3.
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. Cited by: §A.4.3, Table 5, §4.1.
Appendix A Appendix
A.1 Dataset
A.1.1 Constraint Types
As shown in Tab. 5 and Tab. 6, we distinguish between soft and hard constraints on LLM outputs. Soft constraints cannot be reliably verified by fixed symbolic rules, as they target high-level, often subjective properties such as semantic focus, tone and emotion, stylistic form, audience- or author-specific style, and syntactic patterns. In contrast, hard constraints are explicitly rule-checkable: they specify concrete requirements on keywords and their frequencies, lengths (in words, sentences, or paragraphs), detectable formats (e.g., numbered bullets, titles, JSON), presence of placeholders or postscripts, and strict start/end markers or punctuation usage. Together, these constraint types provide a comprehensive taxonomy for characterizing both high-level communicative behavior and strictly verifiable surface properties in our instruction formulations.
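Because hard constraints are explicitly rule-checkable, they can be verified with simple symbolic rules. The following is an illustrative sketch of such checkers for three constraint types from Tab. 5 (the function names and exact matching rules are our assumptions, not the paper's verifier code):

```python
import re

def check_no_commas(response: str) -> bool:
    """Punctuation constraint: prohibits any commas in the response."""
    return "," not in response

def check_keyword_frequency(response: str, keyword: str, n: int) -> bool:
    """Keyword Frequency: the word must appear exactly n times (case-insensitive)."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", response.lower())) == n

def check_title(response: str) -> bool:
    """Title: the answer must include a title wrapped in double angular brackets."""
    return re.search(r"<<[^<>]+>>", response) is not None
```

Soft constraints (e.g., tone, authorial style) admit no such fixed rules and require a learned verifier, which motivates the reward model in §A.2.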
| Instruction Group | Instruction | Description |
| --- | --- | --- |
| Keywords | Include Keywords | Response must include specified keywords (e.g., {keyword1}, {keyword2}). |
| Keyword Frequency | A particular word should appear a certain number of times ({N} times). | |
| Forbidden Words | Prohibits the inclusion of specified keywords ({forbidden words}). | |
| Letter Frequency | Requires a specific letter to appear a certain number of times ({N} times). | |
| Response Language | Entire response must be in a specified language ({language}) and no other. | |
| Length Constraints | Number Paragraphs | Specifies the exact number of paragraphs ({N}), separated by markdown divider ***. |
| Number Words | Constraint on the number of words: "at least / around / at most {N} words". | |
| Number Sentences | Constraint on the number of sentences: "at least / around / at most {N} sentences". | |
| Number Paragraphs + First Word | Requires {N} paragraphs (separated by two line breaks), with the {i}-th paragraph starting with a specified word ({first_word}). | |
| Detectable Content | Postscript | Requires an explicit postscript at the end, starting with a specified marker ({postscript marker}). |
| Number Placeholder | Response must contain at least {N} placeholders in square brackets (e.g., [address]). | |
| Detectable Format | Number Bullets | Requires exactly {N} bullet points using markdown format (e.g., * This is a point.). |
| Title | Answer must include a title wrapped in double angular brackets (e.g., <<poem of joy>>). | |
| Choose From | Response must be one of the provided options ({options}). | |
| Minimum Number Highlighted Section | Requires at least {N} sections highlighted using markdown (e.g., *highlighted section*). | |
| Multiple Sections | Response must have {N} sections, with each section's beginning marked by a splitter (e.g., {section_splitter} X). | |
| JSON Format | Entire output must be wrapped in JSON format. | |
| Combination | Repeat Prompt | First repeat the request without change, then provide the answer. |
| Two Responses | Requires two different responses, separated by six asterisk symbols (******). | |
| Change Cases | All Uppercase | Entire response must be in English, using only capital letters. |
| All Lowercase | Entire response must be in English, using only lowercase letters, with no capital letters allowed. | |
| Frequency of All-capital Words | Words with all capital letters should appear "at least / around / at most {N} times". | |
| Start with / End with | End Checker | Response must end with a specific phrase ({end_phrase}), with no other words following it. |
| Quotation | Entire response must be wrapped in double quotation marks. | |
| Punctuation | No Commas | Prohibits the use of any commas in the entire response. |
Table 5: Hard Constraint Types Zhou et al. (2023).
| Constraint Type | Definition | Example |
| --- | --- | --- |
| Lexical content constraint | Requires specific terms or symbols with precise placement. | "…must include the word 'beautiful'." |
| Element constraint | Requires inclusion of specific entities or scenarios. | "…highlights the Great Wall." |
| Semantic constraint | Focuses on themes, tone, or stance. | "Write a poem about London." |
| Word Count | Limits the number of words. | "A 50-word poem." |
| Sentence Count | Limits the number of sentences. | "…three sentences." |
| Paragraph Count | Limits the number of paragraphs. | "…divided into 3 sections." |
| Document Count | Limits the number of documents. | "…list 3 articles." |
| Tone and emotion | Conforms to specific emotional tone. | "Write a letter in an angry and sarcastic tone." |
| Form and style | Uses specified stylistic form and perception. | "Write a passage in an encyclopedic style." |
| Audience-specific | Tailored to a specific audience group. | "Write a poem for a 6-year-old." |
| Authorial style | Emulates specific authors' styles. | "Write a passage in the style of Shakespeare." |
| Fundamental format | Follows standard formats like JSON, HTML, etc. | "Output in JSON format." |
| Bespoke format | Uses custom formatting protocols. | "Bold the main idea and output in unordered list." |
| Specialized format | Tailored for specific applications or domains. | "Convert to electronic medical record format." |
| Pragmatic constraint | Adapts to context like dialects or language policy. | "Output in English, classical Chinese, etc." |
| Syntactic constraint | Follows specific phrase and clause structures. | "Use imperatives with nouns and verb phrases." |
| Morphological constraint | Controls affixes, roots, and word formation. | "Output all content in lowercase English." |
| Phonological constraint | Focuses on sounds, tone, and intonation. | "Single-syllable tongue twisters." |
| Role-based constraint | Responds with specific role identity. | "You are Confucius, how do you decide?" |
| Task-specific constraint | Addresses a defined situational task. | "Work from home, how to report?" |
| Complex context constraint | Involves multi-faceted and nested reasoning. | "On the left, 10 total, what to do?" |
| Example constraint | Conforms to patterns from example pairs. | "input:x…, output:{…}; input:y…, output?" |
| Inverse constraint | Narrows response space via exclusions. | "No responses about political topics." |
| Contradictory constraint | Combines requirements that are hard to satisfy simultaneously. | "A five-character quotation, 1000 words." |
| Rule constraint | Follows symbolic or logical operation rules. | "Each answer adds 1+1=3, then 2+2=5." |
Table 6: Soft Constraint Types Zhang et al. (2024).
/* Task Description */ 1. I currently have a seed question, but the seed questions are relatively simple. To make the instructions more complex, I want you to identify and return three composition constraints that can be added to the seed question. 2. I will provide [Seed Question] and [Constraint References], and you can use these references to propose the composition constraint that would increase the difficulty of the seed question. 3. You may choose one or more constraints from the [Constraint References] list, and combine them using the following composition rules. 4. Do not modify or rewrite the seed question. Your task is only to generate the new composite constraint that can be added to it. 5. Return the added constraint(s) in the JSON format described below, including all sub-constraints and their logical composition types. 6. Do not return anything else. No explanation, no reformulated question, no analysis; only the JSON structure. /* Logical Composition Types */ And: The output is required to satisfy multiple constraints simultaneously. Template: C1 and C2 and C3. Example: summarize the news in bullet points and within 100 words. Chain: The output is required to complete multiple tasks sequentially, each with its own constraints. Template: first C1, then C2, finally C3. Example: introduce "Mona Lisa": year of creation, then background, then impact. Selection: The output is required to select different branches according to conditions, fulfilling the constraints of the corresponding branch. Template: if C1 then C2 otherwise C3. Example: if the painting has an animal, describe it in Chinese; otherwise, give year, background, and impact. /* JSON Output Format */ Return { "composite_constraints": [ ... ] } where each element contains a "composite_constraint" with fields "type": "<And/Chain/Selection>" and "sub_constraints" ("c1", "c2", "c3"), each holding a "constraint" string that specifies one atomic constraint. /* Constraint References */ 1. 
Lexical content constraint: must include specific terms or symbols with precise placement. 2. Element constraint: include specific entities or scenarios. 3. Semantic constraint: focus on themes, tone, or stance. 4. Word Count: limit the number of words. 5. Sentence Count: limit the number of sentences. 6. Paragraph Count: limit the number of paragraphs. 7. Document Count: limit the number of documents. 8. Tone and emotion: conform to specific emotional tone. 9. Form and style: use specified stylistic form and perception. 10. Audience-specific: tailored to a specific audience group. 11. Authorial style: emulate specific authors' styles. 12. Fundamental format: follow standard formats like JSON, HTML, etc. 13. Bespoke format: use custom formatting protocols. 14. Specialized format: tailored for specific applications or domains. 15. Pragmatic constraint: adapt to context like dialects or language policy. 16. Syntactic constraint: follow specific phrase and clause structures. 17. Morphological constraint: control over affixes, roots, and word formation. 18. Phonological constraint: focus on sounds, tone, and intonation. 19. Role-based constraint: respond with specific role identity. 20. Task-specific constraint: address a defined situational task. 21. Complex context constraint: involve multi-faceted and nested reasoning. 22. Example constraint: conform to patterns from example pairs. 23. Inverse constraint: narrow response space via exclusions. 24. Contradictory constraint: combine requirements that are hard to satisfy simultaneously. 25. Rule constraint: follow symbolic or logical operation rules. /* Seed Question */ [Seed Question]: {} /* Modified Question */ [Modified Question]: (the seed question plus one of the generated composite constraints).
Table 7: Prompt template for constructing logically structured multi-constraint instructions.
A.1.2 Prompt for constructing logically structured multi-constraint instructions
As shown in Tab. 7, the template takes a seed question and a reference list of 25 constraint types. The model selects atomic constraints and combines them using three logical composition types (And, Chain, Selection): And requires satisfying multiple constraints simultaneously, Chain requires sequential task completion, and Selection requires conditional branch selection. The model generates three composite constraints in JSON format, each specifying its composition type and sub-constraints. These constraints are added to the seed question to form complex multi-constraint instructions.
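The JSON structure requested by the template in Tab. 7 can be consumed downstream by flattening each composite constraint into its composition type and atomic sub-constraints. The sketch below assumes the field names given in the template; the `flatten` helper and the sample payload are illustrative:

```python
import json

# Sample payload following the template's schema: a "Chain" composition
# with three atomic sub-constraints (content is a made-up example).
raw = '''{
  "composite_constraints": [
    {"composite_constraint": {
       "type": "Chain",
       "sub_constraints": {
         "c1": {"constraint": "first give the year of creation"},
         "c2": {"constraint": "then describe the background"},
         "c3": {"constraint": "finally explain the impact"}}}}
  ]
}'''

def flatten(raw_json: str):
    """Return a (type, [atomic constraint strings]) pair per composite."""
    out = []
    for item in json.loads(raw_json)["composite_constraints"]:
        comp = item["composite_constraint"]
        # Sort by key so c1, c2, c3 preserve the intended order.
        subs = [v["constraint"] for _, v in sorted(comp["sub_constraints"].items())]
        out.append((comp["type"], subs))
    return out
```

Keeping the composition type attached to each constraint list is what lets the reward stage in §A.4.1 apply structure-specific aggregation.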
A.2 Reward Model Training
We fine-tune Qwen2.5-7B-Instruct for a binary classification task to determine whether a response satisfies a given constraint. Training data consists of response-constraint pairs. Each sample is tokenized by concatenating the response and constraint into a single text sequence. We use full-parameter fine-tuning (not LoRA) with the HuggingFace Trainer framework. Training hyperparameters: learning rate 5e-6, batch size 1 per device, gradient accumulation steps 1, 3 epochs, FP16 precision, gradient checkpointing enabled, and DeepSpeed optimization configured via JSON. We use accuracy as the evaluation metric, computed by comparing predicted labels with ground truth labels. The training is performed on 8 NVIDIA H200 GPUs.
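The data construction and evaluation metric described above can be sketched as follows. The concatenation format and separator are our assumptions (the paper only states that response and constraint are joined into one text sequence), and `accuracy` mirrors the stated label-matching metric:

```python
def build_example(response: str, constraint: str, satisfied: bool) -> dict:
    """Build one binary-classification example: response + constraint
    concatenated into a single text sequence, labeled 1 if satisfied."""
    text = f"Response: {response}\nConstraint: {constraint}"
    return {"text": text, "label": int(satisfied)}

def accuracy(preds, labels):
    """Evaluation metric: fraction of predicted labels matching ground truth."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)
```

Examples in this form can be tokenized and passed to the HuggingFace Trainer for sequence classification, as described above.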
A.3 SFT Training
We perform supervised fine-tuning (SFT) on six models: Qwen2.5-1.5B-Instruct, Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Distill-Qwen-7B, Distill-Qwen-14B, and Qwen3-8B. The training is performed on 8 NVIDIA H200 GPUs. The training data consists of instruction-response pairs where responses are generated by the teacher model GPT-4.1. Training is conducted using LLaMA-Factory Zheng et al. (2024) with LoRA fine-tuning (rank=8, targeting all linear layers). We use a maximum sequence length of 20480 tokens, and employ 16 preprocessing workers and 4 dataloader workers. Training hyperparameters include: batch size of 1 per device with gradient accumulation of 8 steps, learning rate of 1.0e-4, 3 training epochs, cosine learning rate scheduler with 10% warmup ratio, and bfloat16 precision. Model-specific templates are applied according to each model's architecture.
A.4 RL Training
A.4.1 Implementation Details
We apply GRPO training using the VeRL framework. Training is conducted on 8 NVIDIA H200 GPUs. Maximum prompt length is 2048 tokens, and maximum response length is 8192 tokens. The rollout batch size is set to 384, and data is shuffled with a random seed of 1. The algorithm employs the GRPO advantage estimator with KL penalty enabled. We use a low-variance KL penalty formulation with a coefficient of 1.0e-2. Training batches are organized with a global batch size of 96, micro-batches of size 4 per device for updates, and micro-batches of size 8 per device for experience generation. Gradient clipping is applied with a max norm of 1.0. The optimizer uses a learning rate of 1.0e-6 with a weight decay of 1.0e-2, and no learning rate warm-up. We leverage FSDP with full sharding enabled. For rollouts, we generate responses with a temperature of 1.0 and a group size of 5. Tensor parallelism of size 2 is applied. The maximum number of batched tokens is set to 16000. Different models are trained for varying numbers of epochs: Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct and Distill-Qwen-14B are trained for 1 epoch; Distill-Qwen-7B and Qwen2.5-1.5B-Instruct are trained for 5 epochs; and Qwen3-8B is trained for 4 epochs. For the sequential rewards, we set the decay coefficient $\gamma$ to $0.5$ , which represents a moderate penalty propagation strength. With $\gamma=0.5$ , each failed earlier step reduces the effective reward of subsequent steps by half, encouraging correct early decisions while still allowing partial credit for later successes. Empirically, this choice provides stable training dynamics and avoids overly aggressive reward suppression observed with smaller $\gamma$ values.
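The structure-aware reward aggregation described above (average for parallel structures, failure-penalty propagation with decay $\gamma$ for sequential structures, and branch-selective rewards for conditional structures) can be sketched as follows. Function names and the exact normalization are illustrative, not the paper's implementation:

```python
def parallel_reward(scores):
    """And-composition: average the per-constraint scores."""
    return sum(scores) / len(scores)

def sequential_reward(scores, gamma=0.5):
    """Chain-composition: each failed earlier step (score == 0) multiplies
    the effective reward of all subsequent steps by gamma. With gamma=0.5,
    one early failure halves the credit for every later success."""
    total, failures = 0.0, 0
    for s in scores:
        total += s * (gamma ** failures)
        if s == 0:
            failures += 1
    return total / len(scores)

def selection_reward(condition_holds, branch_scores):
    """Selection-composition: only the constraints on the active branch
    are rewarded; the inactive branch contributes nothing."""
    active = branch_scores["then" if condition_holds else "otherwise"]
    return parallel_reward(active)
```

Under this sketch, a chain `[0, 1, 1]` yields (0 + 0.5 + 0.5) / 3, illustrating how an early failure suppresses but does not erase credit for later steps, consistent with the moderate-penalty rationale for $\gamma=0.5$.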
A.4.2 Baselines
RAIF-7B Qin et al. (2025): RAIF-7B (Incentivizing Reasoning) proposes a systematic approach to enhance large language models' ability to handle complex instructions by incentivizing reasoning processes during test-time computation scaling. The method encourages models to engage in explicit reasoning steps when processing complex instructions, thereby improving instruction-following performance through enhanced computational reasoning capabilities.
Conifer-7B-DPO Sun et al. (2024): Conifer addresses complex constrained instruction-following through a two-stage training pipeline. The method first constructs a curriculum dataset organized from simple to complex instructions and performs supervised fine-tuning (SFT) on this dataset. Subsequently, it applies Direct Preference Optimization (DPO) training using an open-source preference dataset to further refine the model's ability to follow complex constraints.
Crab-7B-DPO Qi et al. (2024): Crab employs a constraint back-translation strategy to improve complex instruction following. The method leverages Llama3-70B-Instruct as a strong teacher model to back-translate constraints into high-quality instruction-response pairs. This process creates a comprehensive dataset with complex constraints, which is then used for DPO training to enhance the model's instruction-following capabilities.
SPAR-8B-DPO Cheng et al. (2024): SPAR (Self-play with tree-search refinement) introduces a self-play framework that integrates tree-search-based self-refinement mechanisms. The framework enables an LLM to play against itself, employing tree-search strategies to iteratively refine responses with respect to given instructions. This approach generates valid and comparable preference pairs while minimizing unnecessary variations, facilitating effective DPO training for instruction-following tasks.
VERIF Peng et al. (2025): VERIF (Verification Engineering for Reinforcement Learning) combines multiple verification approaches to enhance instruction following through reinforcement learning. The method integrates rule-based code verification with LLM-based verification from large reasoning models (RLVR), providing comprehensive verification signals that guide the reinforcement learning process toward better instruction-following performance.
Self-Supervised-7B Ren et al. (2025): Self-Supervised-7B presents a self-supervised reinforcement learning framework for instruction following that eliminates the need for external supervision. The method extracts reward signals directly from instructions and generates pseudo-labels for reward model training, thereby removing dependencies on human-annotated preference data. The framework introduces constraint decomposition strategies and efficient constraint-level binary classification methods to address sparse reward problems while maintaining computational efficiency. Experimental results demonstrate significant performance improvements across multiple datasets, including complex agentic tasks and multi-turn instruction-following scenarios.
A.4.3 Benchmarks
We evaluate instruction-following ability on various benchmarks:
IFEval Zhou et al. (2023): IFEval (Instruction-Following Evaluation) focuses on verifiable instructions that can be automatically checked for compliance. The benchmark includes instructions such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", covering 25 distinct types of verifiable instructions across approximately 500 prompts. Each prompt may contain one or more verifiable instructions, enabling systematic evaluation of models' ability to follow explicit, rule-based constraints.
CFBench Zhang et al. (2024): CFBench (Comprehensive Constraints Following Benchmark) is a large-scale benchmark featuring 1,000 carefully curated samples that span more than 200 real-life scenarios and over 50 natural language processing tasks. The benchmark systematically compiles constraints from real-world instructions and establishes a comprehensive framework for constraint categorization, including 10 primary categories and over 25 subcategories. Each constraint is seamlessly integrated within instructions to reflect realistic usage scenarios.
FollowBench Jiang et al. (2023): FollowBench provides a comprehensive evaluation framework covering five distinct types of fine-grained constraints: Content, Situation, Style, Format, and Example. To enable precise constraint-following assessment across varying difficulty levels, the benchmark introduces a multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level, allowing for granular analysis of model performance.
ComplexBench Wen et al. (2024): ComplexBench is designed to comprehensively evaluate LLMs' ability to follow complex instructions composed of multiple constraints. The benchmark proposes a hierarchical taxonomy for complex instructions, encompassing 4 constraint types, 19 constraint dimensions, and 4 composition types. It includes a manually collected high-quality dataset that systematically covers various constraint combinations and logical structures.
WritingBench Wu et al. (2025): WritingBench is a comprehensive benchmark designed to evaluate LLMs across diverse writing domains. The benchmark covers 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. It provides systematic evaluation of models' ability to generate high-quality written content that adheres to various stylistic and content requirements.
Collie Yao et al. (2023): Collie (Constrained Language Generation) employs a grammar-based framework that enables the specification of rich, compositional constraints across multiple generation levels, including word, sentence, paragraph, and passage levels. The benchmark encompasses diverse modeling challenges such as language understanding, logical reasoning, counting, and semantic planning, providing a systematic approach to evaluating constraint-following capabilities.
AgentIF Qi et al. (2025): AgentIF is the first benchmark specifically designed for systematically evaluating LLM instruction-following ability in agentic scenarios. The benchmark features three key characteristics: (1) Realistic: constructed from 50 real-world agentic applications; (2) Long: averaging 1,723 words with a maximum of 15,630 words; (3) Complex: averaging 11.9 constraints per instruction, covering diverse constraint types including tool specifications and condition constraints.
MultiChallenge Deshpande et al. (2025): MultiChallenge is a pioneering benchmark evaluating large language models on conducting multi-turn conversations with human users, a crucial yet underexamined capability. The benchmark identifies four categories of challenges in multi-turn conversations that are both common in real-world human-LLM interactions and challenging to current frontier LLMs. All four challenge categories require accurate instruction-following, context allocation, and in-context reasoning simultaneously.
We assess general reasoning and knowledge capabilities with the following datasets:
GPQA-Diamond Rein et al. (2024): GPQA-Diamond is a specialized subset of the GPQA (Graduate-Level Google-Proof Q&A) benchmark, comprising 198 meticulously crafted multiple-choice questions across biology, chemistry, and physics. These questions are designed to be exceptionally challenging, requiring domain expertise that makes them a rigorous test for AI models' scientific knowledge and reasoning capabilities.
AIME2024 MAA (2024) and AIME2025 MAA (2025): The AIME (American Invitational Mathematics Examination) datasets consist of problems from the 2024 and 2025 AIME competitions, respectively. These datasets are commonly used to evaluate the mathematical reasoning ability of large language models. The AIME is a prestigious mathematics competition for high school students in the United States, and its problems require sophisticated mathematical reasoning and problem-solving skills.
FOLIO Han et al. (2022): FOLIO (First-Order Logic Inference Over Text) is a benchmark dataset developed to assess the logical reasoning capabilities of large language models. It consists of human-annotated examples that require deductive reasoning grounded in first-order logic (FOL). The benchmark evaluates modelsâ ability to perform formal logical inference over natural language text, bridging natural language understanding and formal reasoning.
Enigmata Chen et al. (2025): Enigmata is a comprehensive suite designed to enhance logical reasoning capabilities of large language models. The benchmark comprises 36 tasks distributed across seven categories, with each task equipped with generators that can produce infinite examples and rule-based verifiers. This generator-verifier design supports scalable multi-task reinforcement learning training, fine-grained analysis, and seamless integration with reinforcement learning with verifiable rewards (RLVR). Models trained on Enigmata demonstrate strong performance across multiple puzzle reasoning benchmarks and exhibit good generalization to advanced mathematical and STEM reasoning tasks.
MT-Bench Zheng et al. (2023): MT-Bench (Multi-Turn Benchmark) is an evaluation framework designed to assess chat assistants' performance in multi-turn conversations. The benchmark contains 80 high-quality multi-turn open-ended questions covering diverse topics such as writing, role-playing, mathematics, and coding. Model responses are scored by GPT-4, providing direct scores without requiring pairwise comparisons. MT-Bench enables systematic evaluation of models' conversational abilities and their capacity to maintain context and coherence across multiple interaction turns.
AlpacaEval 2.0 Dubois et al. (2024): AlpacaEval 2.0 is an automated evaluation benchmark designed to assess instruction-following capabilities of language models. The benchmark leverages GPT-4 as an automated annotator to compare model-generated responses with reference outputs, evaluating how well models adhere to user instructions. The benchmark is characterized by its efficiency, low cost, and reliability, enabling rapid assessment of model performance. AlpacaEval 2.0 provides a standardized evaluation protocol for comparing instruction-following models across diverse tasks and scenarios.
A.4.4 Case Study
Tab. 8 illustrates the model's transformation in constraint-aware reasoning and output generation. Before training, the model's reasoning process shows awareness of constraints (e.g., "Let me make sure it's clear without commas") but fails to translate this awareness into compliant output: the generated riddle contains commas and uses keywords inappropriately (e.g., "afternoon embroidered" rather than natural keyword integration). The constraint compliance is [False, False]. After training, the model's reasoning becomes more systematic, explicitly planning constraint satisfaction (e.g., "I'll have to structure the sentences carefully to avoid [commas]") and naturally incorporating keywords. This improved reasoning directly translates to compliant output: the riddle contains no commas and integrates both keywords naturally. The constraint compliance improves to [True, True], demonstrating effective alignment between reasoning and constraint satisfaction.
| Case | Content |
| --- | --- |
| Before Training | Input: Write a riddle about embroidery that has the answer "needle". Include keywords "afternoon" and "distressed" in the response. Don't use any commas in your answer. Think: Alright, I need to create a riddle about embroidery where the answer is "needle." The riddle should include the keywords "afternoon" and "distressed." Let me break this down. First, I'll think about what embroidery entails. Embroidery involves stitching, so maybe using words like "stitch," "needle," or "cloth." Since the answer is "needle," I need to hint at it without stating it directly. For "afternoon," perhaps I can use something related to the time of day, like "quietly" or "evening," but in a way that ties into embroidery. Maybe "afternoon" can relate to when someone is stitching, so "quietly" could work, as it's a gentle activity. "Distressed" can be tricky. Maybe I can imply it through the feeling of the fabric or the process. Perhaps something like "snag" or "torn," but I need to tie it back to embroidery. Maybe the needle moves too fast, causing something to snag, making the work look distressed. Putting it together, I want a riddle that flows naturally. Let me try: "What does an afternoon embroidered with a needle bring?" Hmm, that's close. Maybe tweak it to add the distressed element. Wait, "What does an afternoon embroidered with a needle bring?" The answer is "a needle," but maybe I can add a clue about the distress. So, "What does an afternoon embroidered with a needle bring? A stitched, but maybe a little snag, a sewing that's a bit hurried, so a small, worried look." That uses "afternoon" as "embroidered" and "needle" as the answer. The "distressed" part is hinted at by "snag" and "hurried," which can make someone worried about the work. I think that works. Let me make sure it's clear without commas. The response is: "What does an afternoon embroidered with a needle bring? 
A stitched, but maybe a little snag, a sewing that's a bit hurried, so a small, worried look." That includes both keywords and hints at the needle being the answer. </think> Output: What does an afternoon embroidered with a needle bring? A stitched, but maybe a little snag, a sewing that's a bit hurried, so a small, worried look. Constraint Compliance: [False, False] |
| After Training | Input: Write a riddle about embroidery that has the answer "needle". Include keywords "afternoon" and "distressed" in the response. Don't use any commas in your answer. Think: Alright, I need to help the user create a riddle about embroidery with the answer "needle." They specifically want me to include the keywords "afternoon" and "distressed." Hmm, let me break this down. First, I should understand what a riddle entails. It's a question or statement that requires thinking to solve, often with a clever twist. Since the answer is "needle," I need to think about how to incorporate that into a riddle. The user mentioned not using any commas, so I'll have to structure the sentences carefully to avoid that. Let me start brainstorming some ideas. Embroidery is about stitching, so maybe I can relate "needle" to the tool used. Including "afternoon"… perhaps something about the time of day. Maybe the needle is used in the afternoon? Or the result of using it in the afternoon. "Distressed" could relate to something that's been repaired or fixed, like a hole that's been mended. Putting it together, I think of the process of sewing. If someone is sewing in the afternoon and notices a hole, they might use a needle. So, maybe the riddle is about something that's fixed in the afternoon with a needle. That way, the keyword "distressed" can describe the hole that's being mended. So, the riddle could go like: "What is used in the afternoon to mend a hole that's been distressed?" The answer would be the needle. Let me make sure it flows without commas and makes sense. Yes, that should work. I'll present it as a riddle, so it's a question that leads to the answer with the given keywords. </think> Output: What do you use in the afternoon to mend a hole that's been distressed? The answer is a needle. Constraint Compliance: [True, True] |
Table 8: Comparison of the outputs of R1-Distill-Qwen-7B before and after training for the instruction in IFEval.
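The constraint-compliance vectors reported in Table 8 (e.g., [False, False] improving to [True, True]) come from rule-based verification of each constraint. A minimal sketch of such a checker for this example's two constraints, keyword inclusion and the no-comma rule (the function name and return convention are illustrative, not the paper's evaluation code):

```python
def check_compliance(response: str, keywords: list[str]) -> list[bool]:
    """Rule-based verification of two IFEval-style constraints.

    Returns one boolean per constraint: [keywords included, no commas used],
    mirroring the compliance vectors shown in Table 8.
    """
    keywords_ok = all(kw.lower() in response.lower() for kw in keywords)
    no_commas = "," not in response
    return [keywords_ok, no_commas]
```

Applied to the case above with `keywords=["afternoon", "distressed"]`, the pre-training riddle fails both checks (it omits "distressed" and contains commas), while the post-training riddle passes both.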
<details>
<summary>figures/files/qwen-7b-reward.png Details</summary>

### Visual Description
## Line Chart: Reward/Overall
### Overview
The image is a line chart displaying the "reward/overall" value over a series of steps. The chart shows an initial increase in reward, followed by a period of stabilization with fluctuations.
### Components/Axes
* **Title:** reward/overall
* **X-axis:** Step
* Axis markers at approximately: 20, 40, 60, 80, 100
* **Y-axis:** Reward Value
* Axis markers at: 0.4, 0.5, 0.6, 0.7
* **Data Series:**
* A single light blue line representing the reward/overall value.
### Detailed Analysis
* **Light Blue Line (reward/overall):**
* The line starts at approximately 0.35 at step 0.
* The line increases rapidly to approximately 0.6 around step 20.
* The line fluctuates between approximately 0.6 and 0.7 from step 20 to step 110.
* The line ends at approximately 0.69 at step 110.
### Key Observations
* The reward increases sharply in the initial steps.
* The reward stabilizes after approximately 20 steps, with minor fluctuations.
* There are no significant outliers.
### Interpretation
The chart suggests that the system or model being evaluated experiences a rapid learning phase initially, as indicated by the sharp increase in reward. After this initial phase, the learning plateaus, and the reward fluctuates around a stable value. This could indicate that the system has reached a point of diminishing returns, where further training yields only marginal improvements. The fluctuations could be due to the inherent variability in the environment or the learning process.
</details>
(a) Reward Dynamics of Qwen2.5-7B-Instruct.
<details>
<summary>figures/files/qwen-7b-length.png Details</summary>

### Visual Description
## Line Chart: Response Length/Mean vs. Step
### Overview
The image is a line chart showing the trend of "response_length/mean" over "step". The chart displays fluctuations in response length, with a general upward trend followed by a slight decrease towards the end.
### Components/Axes
* **Title:** response\_length/mean
* **X-axis:** Step, with markers at approximately 20, 40, 60, 80, and 100.
* **Y-axis:** Values ranging from approximately 340 to 400, with markers at 340, 360, 380, and 400.
* **Data Series:** A single line in light blue representing the "response_length/mean".
### Detailed Analysis
* **Data Series Trend:** The light blue line starts at approximately 340 at step 0. It fluctuates significantly, reaching a peak around step 60 at approximately 405. After step 60, the line generally trends downward, ending at approximately 355 at step 110.
* **Specific Data Points:**
* Step 0: ~340
* Step 20: ~370
* Step 40: ~380
* Step 60: ~405 (peak)
* Step 80: ~360
* Step 100: ~390
* Step 110: ~355
### Key Observations
* The response length/mean exhibits considerable volatility throughout the steps.
* A significant peak occurs around step 60.
* The response length/mean decreases towards the end of the observed steps.
### Interpretation
The chart illustrates how the average response length changes over a series of steps. The fluctuations suggest variability in the responses, possibly due to changes in input or system behavior. The peak around step 60 could indicate a period of particularly long responses, while the subsequent decrease might reflect an adjustment or stabilization in the system. The data suggests that the response length is not constant and is influenced by factors that vary over time.
</details>
(b) Response Length Dynamics of Qwen2.5-7B-Instruct.
<details>
<summary>figures/files/distill-7b-reward.png Details</summary>

### Visual Description
## Line Chart: Critic Score Mean vs. Step
### Overview
The image is a line chart showing the trend of the "critic/score/mean" over a series of steps. The x-axis represents the "Step" number, ranging from 0 to 500. The y-axis represents the "critic/score/mean" value, ranging from 0.55 to 0.7. The line, colored red, shows an upward trend initially, then fluctuates around a relatively stable value.
### Components/Axes
* **Title:** critic/score/mean
* **X-axis:**
* Label: Step
* Scale: 0 to 500, with markers at 100, 200, 300, 400, and 500.
* **Y-axis:**
* Scale: 0.55 to 0.7, with markers at 0.55, 0.6, 0.65, and 0.7.
* **Legend:** A short red line next to the title "critic/score/mean" indicates the color of the data series.
### Detailed Analysis
* **Data Series:** critic/score/mean (red line)
* **Trend:** The line starts at approximately 0.54 at Step 0. It increases relatively rapidly until approximately Step 100, reaching a value of approximately 0.63. From Step 100 to Step 500, the line fluctuates, but generally remains between 0.63 and 0.7. The final value at Step 500 is approximately 0.7.
### Key Observations
* The critic score mean increases significantly in the first 100 steps.
* After 100 steps, the score fluctuates but does not show a strong upward or downward trend.
* The score appears to stabilize around 0.65 after the initial increase.
### Interpretation
The chart suggests that the critic's score improves significantly during the initial steps of the process, indicating a learning phase. After this initial learning phase, the critic's score stabilizes, suggesting that the critic has reached a certain level of performance and is no longer improving significantly with each step. The fluctuations after the initial learning phase could be due to various factors, such as the inherent variability in the data or the critic's exploration of different strategies. The fact that the score does not decrease significantly suggests that the critic is not forgetting what it has learned.
</details>
(c) Reward Dynamics of Distill-Qwen-7B.
<details>
<summary>figures/files/distill-7b-length.png Details</summary>

### Visual Description
## Line Chart: Response Length/Mean
### Overview
The image is a line chart showing the trend of "response_length/mean" over "Step". The chart displays fluctuations in response length, with an overall decreasing trend in the initial steps, followed by stabilization.
### Components/Axes
* **Title:** response\_length/mean
* **X-axis:** Step, with markers at approximately 0, 100, 200, 300, 400, and 500.
* **Y-axis:** Values ranging from 700 to 850, with markers at 700, 750, 800, and 850.
* **Legend:** A red line represents "response\_length/mean". The legend is located at the top of the chart, directly below the title.
### Detailed Analysis
* **Data Series:** response\_length/mean (Red Line)
* **Trend:** The red line shows a fluctuating pattern. Initially, the line decreases from approximately 840 at step 0 to around 750 at step 100. After step 100, the line fluctuates between approximately 720 and 800, showing a relatively stable trend.
* **Data Points:**
* Step 0: Approximately 840
* Step 100: Approximately 750
* Step 200: Approximately 770
* Step 300: Approximately 730
* Step 400: Approximately 760
* Step 500: Approximately 780
### Key Observations
* The response length/mean experiences a significant drop in the first 100 steps.
* After the initial drop, the response length/mean stabilizes and fluctuates within a narrower range.
* There are no significant outliers after the initial drop.
### Interpretation
The chart suggests that the response length/mean decreases initially, possibly due to an initial learning phase or adjustment. After the initial phase, the response length/mean stabilizes, indicating a more consistent behavior. The fluctuations after the initial drop could be due to variations in the input data or the model's internal dynamics. The overall trend suggests that the system reaches a stable state after the initial adjustment period.
</details>
(d) Response Length Dynamics of Distill-Qwen-7B.
<details>
<summary>figures/files/distill-14b-reward.png Details</summary>

### Visual Description
## Line Chart: Critic Score Mean vs. Step
### Overview
The image is a line chart displaying the mean critic score over a series of steps. The chart shows the trend of the critic's performance, with the x-axis representing the step number and the y-axis representing the critic score mean. The line is colored red.
### Components/Axes
* **Title:** critic/score/mean
* **X-axis:**
* Label: Step
* Scale: 0 to 400, with markers at approximately 100, 200, 300, and 400.
* **Y-axis:**
* Scale: 0.6 to 0.7, with markers at 0.6, 0.65, and 0.7.
* **Legend:** A short red horizontal line is present next to the title, indicating the color of the data series.
### Detailed Analysis
* **Data Series:** The red line represents the "critic/score/mean".
* **Trend:** The line generally slopes upward from approximately step 0 to step 400. The line fluctuates significantly, but the overall trend is positive.
* **Values:**
* At step 0, the value is approximately 0.58.
* At step 100, the value fluctuates around 0.65.
* At step 200, the value fluctuates around 0.68.
* At step 300, the value fluctuates around 0.67.
* At step 400, the value is approximately 0.72.
### Key Observations
* The critic score mean generally increases over the steps, indicating an improvement in the critic's performance.
* There are significant fluctuations in the critic score mean, suggesting variability in the critic's performance at different steps.
* The most significant increase in the critic score mean occurs in the early steps (0-100).
### Interpretation
The chart suggests that the critic's performance improves over time, as indicated by the upward trend of the critic score mean. The fluctuations in the score may be due to the inherent variability in the training process or the complexity of the task. The initial rapid increase in the score suggests that the critic learns quickly in the early stages of training. The data demonstrates the learning process of the critic, showing how its performance evolves over a series of steps.
</details>
(e) Reward Dynamics of Distill-Qwen-14B.
<details>
<summary>figures/files/distill-14b-length.png Details</summary>

### Visual Description
## Line Chart: Response Length/Mean vs. Step
### Overview
The image is a line chart displaying the "response_length/mean" over a series of "steps". The chart shows fluctuations in response length, with a general trend of stability, except for a notable spike at the end.
### Components/Axes
* **Title:** response\_length/mean
* **X-axis:** Step, with markers at approximately 0, 100, 200, 300, and 400.
* **Y-axis:** Response Length/Mean, with markers at 700, 750, 800, and 850.
* **Legend:** A single entry, represented by a short red line, corresponding to the red line in the chart.
### Detailed Analysis
* **Data Series:** The chart contains a single data series, represented by a red line.
* **Trend:** The red line fluctuates between approximately 720 and 850 for most of the steps. There is a noticeable spike at the end (around step 400), reaching a value of approximately 840.
* **Specific Values:**
* At Step 0, the response length/mean is approximately 810.
* At Step 100, the response length/mean is approximately 770.
* At Step 200, the response length/mean is approximately 750.
* At Step 300, the response length/mean is approximately 770.
* At Step 400, the response length/mean is approximately 840.
### Key Observations
* The response length/mean is generally stable, fluctuating within a range of approximately 130 units.
* The most significant change occurs at the end of the series, with a sharp increase in response length/mean.
### Interpretation
The chart illustrates the variability in response length/mean over a series of steps. The stability of the response length/mean for most of the steps suggests a consistent process. The spike at the end indicates a potential change or anomaly in the process that warrants further investigation. The data suggests that the system or process being measured experienced a significant change in its response length towards the end of the observed period.
</details>
(f) Response Length Dynamics of Distill-Qwen-14B.
<details>
<summary>figures/files/qwen3-8b-reward.png Details</summary>

### Visual Description
## Line Chart: Critic Score Mean vs. Step
### Overview
The image is a line chart showing the trend of the "critic/score/mean" over a series of steps. The chart displays fluctuations in the critic score mean, with an overall slight upward trend in the initial steps, followed by a relatively stable but volatile pattern.
### Components/Axes
* **Title:** critic/score/mean
* **X-axis:** Step, with markers at 0, 100, 200, 300, and 400.
* **Y-axis:** Numerical scale ranging from 0.62 to 0.74, with markers at 0.62, 0.64, 0.66, 0.68, 0.7, 0.72, and 0.74.
* **Legend:** A blue line represents the "critic/score/mean".
### Detailed Analysis
* **Data Series:** critic/score/mean (blue line)
* **Trend:** The blue line starts at approximately 0.65 at step 0. It generally increases to approximately 0.70 by step 100. From step 100 to 400, the line fluctuates between approximately 0.66 and 0.73, showing no clear upward or downward trend, but rather a volatile pattern.
* **Data Points:**
* Step 0: ~0.65
* Step 100: ~0.70
* Step 200: ~0.68
* Step 300: ~0.71
* Step 400: ~0.69
### Key Observations
* The critic score mean shows an initial increase in the first 100 steps.
* After the initial increase, the score fluctuates significantly, indicating variability in the critic's performance.
* There are no significant outliers, but the volatility remains consistent throughout the later steps.
### Interpretation
The chart suggests that the critic's performance, as measured by the score mean, improves initially during the first 100 steps. However, after this initial phase, the critic's performance becomes more variable, with no clear sustained improvement or decline. The fluctuations could be due to various factors, such as changes in the training data, exploration-exploitation trade-offs, or inherent instability in the learning process. The initial increase indicates a learning phase, while the subsequent fluctuations suggest a period of refinement or adaptation without significant overall progress.
</details>
(g) Reward Dynamics of Qwen3-8B.
<details>
<summary>figures/files/qwen3-8b-length.png Details</summary>

### Visual Description
## Line Chart: response_length/mean
### Overview
The image is a line chart displaying the "response_length/mean" over a series of steps. The chart shows fluctuations in response length, with a general trend around an average value.
### Components/Axes
* **Title:** response_length/mean
* **X-axis:** Step, with markers at 0, 100, 200, 300, and 400.
* **Y-axis:** Values ranging from 900 to 1200, with markers at 900, 1000, 1100, and 1200.
* **Legend:** A blue line representing "response_length/mean". The legend is positioned directly below the title.
### Detailed Analysis
* **Data Series:** The blue line represents the "response_length/mean".
* **Trend:** The line fluctuates significantly throughout the steps.
* **Values:**
* At Step 0, the value is approximately 1075.
* At Step 100, the value fluctuates around 1075.
* At Step 200, there is a peak at approximately 1250.
* At Step 300, the value fluctuates around 1075.
* At Step 400, the value is approximately 1025.
### Key Observations
* The "response_length/mean" fluctuates between approximately 950 and 1250.
* There is a notable spike around Step 200.
* The final value at Step 400 is lower than the initial value at Step 0.
### Interpretation
The chart illustrates the variability in response length over the course of the steps. The spike at Step 200 could indicate a specific event or change in the system that caused a temporary increase in response length. The overall fluctuation suggests that the response length is not constant and may be influenced by various factors. The slight decrease in response length from the beginning to the end of the series may indicate a slight change in the system over time.
</details>
(h) Response Length Dynamics of Qwen3-8B.
Figure 7: Training dynamics of reward and response length.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Heatmap: Layer Attribution
### Overview
The image is a heatmap visualizing layer attribution across different components of a neural network. The heatmap displays the relative importance or contribution of each layer (y-axis, ranging from 0 to 27) to different parts of the model (x-axis: attn.q, attn.k, attn.v, attn.o, mlp.up, mlp.down, mlp.gate). The color intensity represents the magnitude of attribution, with orange indicating higher values (around 0.16) and blue indicating lower values (around 0.12).
### Components/Axes
* **Y-axis (Layers):** Numerical labels from 0 to 27, incrementing by 3.
* **X-axis (Model Components):** Categorical labels: attn.q, attn.k, attn.v, attn.o, mlp.up, mlp.down, mlp.gate.
* **Color Scale (Attribution):** A color bar on the right side of the heatmap indicates the attribution values. Orange represents higher attribution (approximately 0.16), white represents intermediate values (approximately 0.14), and blue represents lower attribution (approximately 0.12).
### Detailed Analysis
The heatmap displays the attribution scores for each layer (0-27) across the different model components.
* **attn.q:** Lower layers (0-6) show lower attribution (blue), while higher layers (18-27) show moderate attribution (white/light orange).
* **attn.k:** Similar to attn.q, lower layers (0-6) show lower attribution (blue), while higher layers (18-27) show moderate attribution (white/light orange).
* **attn.v:** Shows high attribution (orange) for most layers, especially layers 0-24.
* **attn.o:** Shows high attribution (orange) for layers 0-6 and 12-15, with lower attribution (blue) for layers 24-27.
* **mlp.up:** Shows low attribution (blue) for lower layers (0-6), and moderate attribution (white/light orange) for higher layers (18-27).
* **mlp.down:** Shows low attribution (blue) for most layers, with slightly higher attribution (white) for layers 12-18.
* **mlp.gate:** Shows low attribution (blue) for lower layers (0-6), and moderate attribution (white/light orange) for higher layers (18-27).
### Key Observations
* The 'attn.v' component exhibits the highest attribution across most layers.
* The 'attn.q' and 'attn.k' components show a similar pattern of lower attribution in lower layers and moderate attribution in higher layers.
* The 'mlp.down' component generally has the lowest attribution across all layers.
* The lower layers (0-6) generally have lower attribution for 'attn.q', 'attn.k', 'mlp.up', and 'mlp.gate' components.
### Interpretation
The heatmap visualizes the contribution of each layer to different parts of the model. The high attribution of 'attn.v' suggests that this component is crucial for the model's performance. The lower attribution of 'mlp.down' might indicate that this component plays a less significant role. The varying attribution patterns across layers and components provide insights into the model's internal workings and can be used for model optimization or understanding its behavior. The data suggests that the attention value component ('attn.v') is the most important, while the MLP down-projection ('mlp.down') is the least important. The attention query and key ('attn.q' and 'attn.k') and MLP up-projection and gate ('mlp.up' and 'mlp.gate') have similar attribution patterns, suggesting they may be related in the model's architecture.
</details>
(a) Qwen2.5-1.5B-Instruct
<details>
<summary>x10.png Details</summary>

### Visual Description
## Heatmap: Attention and MLP Weights
### Overview
The image is a heatmap visualizing the weights associated with different components of an attention mechanism and a multilayer perceptron (MLP). The heatmap displays the magnitude of these weights using a color gradient, where darker orange represents higher values (close to 0.09) and darker blue represents lower values (close to 0.05). The x-axis represents different components (attention query, key, value, output, and MLP layers), while the y-axis represents indices ranging from 0 to 47.
### Components/Axes
* **X-axis:**
* attn. q (Attention Query)
* attn. k (Attention Key)
* attn. v (Attention Value)
* attn. o (Attention Output)
* mlp. up (MLP Up)
* mlp. down (MLP Down)
* mlp. gate (MLP Gate)
* **Y-axis:** Numerical indices from 0 to 47, incrementing by 3 (0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 47).
* **Color Legend:** Located on the right side of the heatmap.
* Dark Orange: 0.09
* Orange: 0.08
* Light Orange: 0.07
* Light Blue: 0.06
* Dark Blue: 0.05
### Detailed Analysis
The heatmap displays the weight distribution across different components.
* **attn. q (Attention Query):** The weights are generally in the orange range (0.07-0.09), indicating relatively high values. There are some variations, with a few rows showing slightly lower values (lighter orange).
* **attn. k (Attention Key):** Similar to the query, the key also shows predominantly orange values (0.07-0.09), with some rows exhibiting slightly lower values (lighter orange).
* **attn. v (Attention Value):** The weights are mostly in the orange range (0.07-0.09), indicating relatively high values.
* **attn. o (Attention Output):** The weights are predominantly in the blue range (0.05-0.06), indicating relatively low values.
* **mlp. up (MLP Up):** The weights are mostly in the light orange range (0.07), indicating medium values.
* **mlp. down (MLP Down):** The weights are mostly in the light orange range (0.07), indicating medium values.
* **mlp. gate (MLP Gate):** The weights are mostly in the light orange range (0.07), indicating medium values.
### Key Observations
* The attention query, key, and value components have relatively high weights compared to the attention output.
* The MLP components (up, down, gate) have intermediate weight values.
* There is a clear distinction in weight distribution between the attention components and the MLP components.
### Interpretation
The heatmap provides insights into the relative importance of different components within the attention mechanism and the MLP. The higher weights associated with the attention query, key, and value suggest that these components play a more significant role in the model's performance compared to the attention output. The intermediate weights of the MLP components indicate their contribution to the overall model, but to a lesser extent than the attention's query, key, and value. The heatmap can be used to identify potential areas for optimization or further investigation, such as exploring the reasons for the lower weights in the attention output or analyzing the specific roles of the MLP components.
</details>
(b) Distill-Qwen-14B
<details>
<summary>x11.png Details</summary>

### Visual Description
## Heatmap: Attention and MLP Weights
### Overview
The image is a heatmap visualizing the weights associated with different components of an attention mechanism and a multilayer perceptron (MLP). The heatmap uses a color gradient from blue to orange, where blue represents lower weights (around 0.11) and orange represents higher weights (around 0.13). The y-axis represents a numerical scale from 0 to 35, and the x-axis represents different components: attention query (attn. q), attention key (attn. k), attention value (attn. v), attention output (attn. o), MLP up, MLP down, and MLP gate.
### Components/Axes
* **Y-axis:** Numerical scale from 0 to 35, with tick marks at intervals of 3 (0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35).
* **X-axis:** Categorical labels representing different components:
* attn. q (attention query)
* attn. k (attention key)
* attn. v (attention value)
* attn. o (attention output)
* mlp. up (MLP up)
* mlp. down (MLP down)
* mlp. gate (MLP gate)
* **Color Legend (Right Side):**
* Orange: ~0.13
* White: ~0.12
* Blue: ~0.11
### Detailed Analysis
* **attn. q (attention query):** Predominantly orange, indicating higher weights across the entire range of the y-axis.
* **attn. k (attention key):** Similar to attn. q, mostly orange, indicating higher weights.
* **attn. v (attention value):** Predominantly blue, indicating lower weights, especially between y-axis values of approximately 6 and 24.
* **attn. o (attention output):** A mix of orange and blue, with higher weights (orange) concentrated at the top (y > 27) and bottom (y < 6), and lower weights (blue) in the middle.
* **mlp. up (MLP up):** Mostly blue, indicating lower weights, with a slight increase towards orange around y = 30.
* **mlp. down (MLP down):** Mostly orange, indicating higher weights, with some blue regions.
* **mlp. gate (MLP gate):** Blue at the top (y > 24), orange in the middle (6 < y < 24), and blue again at the bottom (y < 6).
### Key Observations
* Attention query and key components (attn. q and attn. k) consistently show higher weights across all y-axis values.
* Attention value (attn. v) shows significantly lower weights compared to query and key.
* MLP up consistently shows lower weights, while MLP down shows higher weights.
* MLP gate exhibits a mixed pattern, with lower weights at the extremes and higher weights in the middle.
### Interpretation
The heatmap visualizes the relative importance or contribution of different components in an attention mechanism and an MLP. The higher weights for attention query and key suggest that these components play a crucial role in the attention process. The lower weights for attention value might indicate a different scaling or transformation applied to this component. The differences in weights between MLP up and MLP down could reflect the flow of information or the specific function of these layers within the MLP. The varying weights of the MLP gate suggest it modulates the flow of information differently depending on the input. Overall, the heatmap provides insights into the internal workings of the model and the relative importance of its different components.
</details>
(c) Qwen3-8B
Figure 8: Parameter change rates of LLMs to the original ones across different modules.
A.4.5 Training Dynamics Analysis
As shown in Fig. 7, we present the reward and response length dynamics during training across four models: Qwen2.5-7B-Instruct, Distill-Qwen-7B, Distill-Qwen-14B, and Qwen3-8B. For reward scores, all models exhibit a consistent pattern: an initial rapid increase followed by stabilization with oscillations. Qwen2.5-7B-Instruct shows the steepest initial improvement, rising from 0.4 to 0.6 within 20 steps, while Distill-Qwen models demonstrate more gradual increases over 200-400 steps, reaching stable scores around 0.65-0.7. Qwen3-8B displays higher volatility with scores fluctuating between 0.62 and 0.74. In contrast, response length shows high variability across all models with no clear monotonic trend. Response lengths vary substantially by model scale: Qwen2.5-7B-Instruct generates shorter responses (340-400 tokens), while Distill-Qwen models and Qwen3-8B produce longer outputs (700-850 and 900-1200 tokens, respectively). The high variance in response length suggests that the training process maintains flexibility in output generation while improving constraint satisfaction, as evidenced by the stable reward trends.
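The "stabilization with oscillations" pattern described above is easiest to read off smoothed curves; dashboards typically apply an exponential moving average before plotting. A minimal smoothing utility of this kind (an illustrative helper, not part of the training pipeline):

```python
def ema_smooth(values: list[float], alpha: float = 0.9) -> list[float]:
    """Exponentially weighted moving average of a training curve.

    Higher alpha gives a smoother curve; alpha=0 returns the raw values.
    """
    smoothed, prev = [], values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        smoothed.append(prev)
    return smoothed
```

Smoothing the raw reward series this way makes the initial rapid rise and the subsequent plateau visible without the step-to-step noise.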
A.4.6 Full Parameter Change Patterns
As shown in Fig. 8, we extend the parameter change analysis to three additional models: Qwen2.5-1.5B-Instruct, Distill-Qwen-14B, and Qwen3-8B. The analysis reveals consistent patterns across all models: attention query (attn.q) and key (attn.k) modules exhibit the highest parameter change rates, particularly concentrated in the bottom and top layers, while attention value (attn.v) and output (attn.o) modules consistently show minimal changes across all layers. MLP modules (mlp.up, mlp.down, mlp.gate) demonstrate moderate change rates, falling between the high changes in attn.q/attn.k and the low changes in attn.v/attn.o.
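The per-module change rates in Fig. 8 can be computed as a relative distance between the weights of the trained and original checkpoints. A sketch assuming the rate is the relative L2 change per flattened parameter tensor (one common choice; the paper's exact metric may differ):

```python
import math

def change_rate(theta_before: list[float], theta_after: list[float]) -> float:
    """Relative L2 parameter change: ||theta' - theta|| / ||theta||."""
    diff = math.sqrt(sum((a - b) ** 2 for b, a in zip(theta_before, theta_after)))
    base = math.sqrt(sum(b ** 2 for b in theta_before))
    return diff / base

def module_change_rates(sd_before: dict, sd_after: dict) -> dict:
    """Change rate per named module, e.g. {'layers.0.attn.q': 0.12, ...}."""
    return {name: change_rate(sd_before[name], sd_after[name]) for name in sd_before}
```

Aggregating these rates by module type (attn.q, attn.k, attn.v, attn.o, mlp.up, mlp.down, mlp.gate) and layer index yields a heatmap of the form shown in Fig. 8.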
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: Text Highlighting
### Overview
The image is a heatmap visualization of text, where the intensity of the color (ranging from light orange to dark brown) indicates the relative importance or focus on each word. The text appears to be a set of instructions given to an AI assistant.
### Components/Axes
* **Text Content:** The text is a sequence of instructions and prompts.
* **Color Gradient:** The color gradient represents the level of attention or importance assigned to each word. Lighter shades of orange indicate lower importance, while darker shades of brown indicate higher importance.
### Detailed Analysis or ### Content Details
Here's a transcription of the highlighted text: `<begin_of_sentence><begin_of_sentence>< User > First, generate a short instructional paragraph and ensure the total length does not exceed three sentences; then, append a clearly separated checklist section using bullet points; if the word "error" appears anywhere in the output, all checklist items must be written in lowercase English, else the instructional paragraph must begin with a bolded core idea; finally, apply a formal, technical writing style to the entire output. < Assistant > <think>\n`. Every token is rendered in light orange except `paragraph` (in the first clause), which is dark brown.
### Key Observations
* The word "paragraph" is highlighted in dark brown, indicating it is a key focus of the instructions.
* Most of the other words are highlighted in light orange, suggesting a relatively even distribution of attention across the rest of the text.
### Interpretation
The heatmap suggests that the AI assistant should pay particular attention to the concept of a "paragraph" when following these instructions. The instructions are directing the AI to generate a short instructional paragraph, followed by a checklist. The instructions also specify formatting rules based on the presence of the word "error". The highlighting emphasizes the importance of the paragraph generation task.
</details>
(a) Before Training - Distill-Qwen-7B
<details>
<summary>x13.png Details</summary>

### Visual Description
## Heatmap: Text Highlighting
### Overview
The image is a heatmap highlighting specific words and phrases within a block of text. The intensity of the orange color indicates the relative importance or focus on each word/phrase. The text appears to be a set of instructions or guidelines for generating text.
### Components/Axes
* **Text Content:** The main component is the block of text itself.
* **Highlighting:** Words and phrases are highlighted with varying shades of orange. Darker shades indicate higher importance.
* **Contextual Markers:** The text includes markers like `< begin_of_sentence >`, `< User >`, `< Assistant >`, and `<think> \n`, which seem to denote context or speaker.
### Content Details
The prompt is the same instruction as in panel (a), wrapped across several display lines. Most tokens are light orange, with function words and punctuation in very light orange; the tokens `instructional`, `checklist`, `section`, `using`, `"error"`, `style`, `< Assistant >`, and `<think>` are dark orange.
### Key Observations
* The words "instructional", "checklist", "section", "error", "style", `< Assistant >`, and `<think>` are the most emphasized, suggesting they are key elements of the instructions.
* The phrases `< begin_of_sentence >` and `< User >` are less emphasized.
* Grammatical words like "the", "a", "in", and punctuation marks are the least emphasized.
### Interpretation
The heatmap visually represents the relative importance of different words and phrases within a set of instructions. The instructions likely guide an AI assistant on how to generate text, emphasizing the need for an instructional paragraph, a checklist section, and a formal technical writing style. The handling of the word "error" is also highlighted, suggesting a specific rule or procedure related to error handling. The `<think>` tag suggests an internal thought process or action by the assistant.
</details>
(b) After Training - Distill-Qwen-7B
<details>
<summary>x14.png Details</summary>

### Visual Description
## Text Block: Instructional Prompt
### Overview
The image contains a text block representing an instructional prompt given to an AI assistant. The prompt outlines specific requirements for generating a response, including length constraints, formatting guidelines (checklist), and conditional styling based on the presence of the word "error."
### Components/Axes
The text is presented as a series of instructions. There are no axes or scales.
### Content Details
The text of the prompt is as follows:
`<|im_start|>` user \n First, generate a short instructional paragraph and ensure the total length does not exceed three sentences; then, append a clearly separated checklist section using bullet points; if the word "error" appears anywhere in the output, all checklist items must be written in lowercase English, else the instructional paragraph must begin with a bolded core idea; finally, apply a formal, technical writing style to the entire output. `<|im_end|>` \n `<|im_start|>` assistant \n
### Key Observations
- The prompt specifies a combination of paragraph generation and checklist creation.
- The styling of the checklist items (lowercase vs. default) is conditional based on the presence of the word "error".
- The instructional paragraph has a conditional styling requirement (bolded core idea).
- The overall tone is formal and technical.
- There are special tokens `<|im_start|>` and `<|im_end|>` which likely denote the start and end of the user's instruction.
### Interpretation
The prompt is designed to guide an AI assistant in generating a structured and styled response. The conditional formatting adds complexity, requiring the assistant to analyze its own output before applying the final styling. The use of special tokens suggests this is part of a larger system for managing AI interactions.
</details>
(c) Before Training - Qwen3-8B
<details>
<summary>x15.png Details</summary>

### Visual Description
## Heatmap: Instruction Following
### Overview
The image is a heatmap visualizing the importance or focus areas within a set of instructions. The instructions guide an assistant on how to generate a short instructional paragraph, including a checklist. The heatmap highlights specific words and phrases, indicating their relative significance within the instructions.
### Components/Axes
There are no explicit axes in the traditional sense. The heatmap is applied directly to the text of the instructions. The intensity of the color (ranging from light to dark orange) represents the relative importance or focus on each word or phrase.
### Content Details
The prompt is the same instruction, delimited by `<|im_start|> user` and `<|im_end|> \n <|im_start|> assistant` chat tokens. Most tokens are light orange; `must` (both occurrences), `be written`, `begin`, and `to` are orange; `the` (before `entire`) and the trailing `\n` after `assistant` are dark orange.
### Key Observations
* The phrases "must be written" and "must begin" have a higher intensity than the surrounding words.
* The word "the" before "entire" has the highest intensity.
* The newline character `\n` before `<|im_start|>` and after "assistant" also has higher intensity.
### Interpretation
The heatmap highlights the key constraints and actions the assistant must follow. The emphasis on "must be written" and "must begin" suggests these are critical instructions regarding the format and content of the generated text. The high intensity on "the" before "entire output" might indicate the importance of applying the specified writing style consistently throughout the entire output. The newline character emphasis is likely related to formatting requirements. The instructions emphasize generating a short instructional paragraph with a checklist, where the checklist items are in lowercase English if the word "error" appears in the output. The instructional paragraph must begin with a bolded core idea, and the entire output should adhere to a formal, technical writing style.
</details>
(d) After Training - Qwen3-8B
Figure 9: Token-level information flow analysis. Darker orange indicates higher attention importance.
A.4.7 Full Token-Level Information Flow Analysis
As shown in Fig. 9, we extend the token-level information flow analysis to the Distill-Qwen-7B and Qwen3-8B models on complex instruction-following tasks. Before training, both models exhibit relatively diffuse attention patterns, with only a subset of tokens (e.g., "instructional", "paragraph", "checklist", "error") showing moderate importance. After training, both models shift markedly toward high attention importance across virtually all prompt tokens, including conjunctions, prepositions, and constraint keywords.
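The token-level importance scores visualized in Fig. 9 can be computed by aggregating attention weights. A minimal sketch (an assumed aggregation, not necessarily the paper's exact recipe): score each prompt token by the mean attention it receives from generated positions, averaged over layers and heads.

```python
# Hypothetical sketch: `attn` is a nested list of attention weights with
# shape [layers][heads][target][source], as returned by a decoder's
# attention maps; positions >= prompt_len are generated tokens.

def token_importance(attn, prompt_len):
    """Mean attention each prompt token receives from generated positions,
    averaged over layers and heads."""
    n_layers, n_heads = len(attn), len(attn[0])
    seq_len = len(attn[0][0])
    n_gen = seq_len - prompt_len
    scores = [0.0] * prompt_len
    for layer in attn:
        for head in layer:
            for t in range(prompt_len, seq_len):   # generated positions
                for s in range(prompt_len):        # prompt tokens
                    scores[s] += head[t][s]
    return [v / (n_layers * n_heads * n_gen) for v in scores]

# Toy example: 1 layer, 1 head, 3 positions, 2-token prompt; the single
# generated token attends mostly to the first prompt token.
attn = [[[[1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0],
          [0.7, 0.2, 0.1]]]]
scores = token_importance(attn, prompt_len=2)
```

Normalizing these scores per prompt and mapping them to a color scale produces heatmaps of the kind shown in Fig. 9.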