## Diagram: Data Processing Pipeline for DeepSeek-R1 Distilling
### Overview
The image is a technical flowchart illustrating a three-stage data processing pipeline for training or refining a model named "DeepSeek-R1." The pipeline flows from left to right, beginning with raw data collection, moving through a distillation/filtering phase, and concluding with a rejection sampling phase. The diagram uses a consistent visual language: beige header boxes for stage titles, light purple ovals for processes or criteria, and gray rectangles for input sources or rule categories. Arrows indicate the flow of data and decision logic.
### Components/Axes
The diagram is segmented into three primary vertical panels, each with a header:
1. **Left Panel: Raw Data Collection**
* **Header:** "Raw Data Collection"
* **Central Process:** "Collection Rules" (gray rectangle), which feeds into three data type categories (purple ovals).
* **Data Type Categories:**
* "Domain" (with sub-label: "Math/Code/Science/Chat")
* "Infomation" (sic; likely a typo for "Information"). Sub-label: "Refenrence Answer/Test Case/Ground Truth" (sic; "Refenrence" is likely a typo for "Reference").
* "Response With Reasoning Chains"
* **Input Sources (Bottom Row):** A series of gray rectangles listing specific datasets or sources:
* "NuminaMath"
* "Codeforces"
* "OpenMathR1"
* "Orca"
* "APPS"
* "Natural Reasoning"
* "CodeContest"
* "......" (indicating additional, unspecified sources)
2. **Center Panel: Distilling**
* **Header:** "Distilling"
* **Primary Process Flow:** Two gray rectangles ("Dedup" and "Select Rules") feed into a central block labeled "DeepSeek-R1 Distilling."
* **Dedup Sub-processes (Purple Ovals):**
* "Embedding Model(m3e/gte/bge...)"
* "Similarity Engine (faiss/trt...)"
* **Select Rules Sub-processes (Purple Ovals):**
* "w/o reference answer test case"
* "Difficulty"
* "Quality"
* "Language"
* "Category"
3. **Right Panel: Rejection Sampling**
* **Header:** "Rejection Sampling"
* **Primary Method Categories (Gray Rectangles):** Three distinct approaches, each with sub-criteria (purple ovals).
* **Rule-Based:**
* "Ngram/Length/Format/"
* "Safety"
* **Model-Based:**
* "Reward Model: correctness/verbosity/coherence/complexity/helpfulness..."
* "LLM-as-a-judge"
* **Verify-Based:**
* "Code sandbox"
* "Math verify"
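
For readers who prefer a machine-readable view, the three panels can be transcribed as a nested mapping. This is only a transcription of the labels listed above: the variable name `PIPELINE` and the key layout are illustrative choices, not something the diagram defines, and the elided "......" source entry is deliberately left out rather than filled in.

```python
# Transcription of the diagram's labels. The nesting and the name PIPELINE
# are illustrative assumptions; labels are spelled as they appear in the diagram.
PIPELINE = {
    "Raw Data Collection": {
        "Collection Rules": [
            "Domain",
            "Infomation",  # spelled as in the diagram
            "Response With Reasoning Chains",
        ],
        "Input Sources": [
            "NuminaMath", "Codeforces", "OpenMathR1", "Orca",
            "APPS", "Natural Reasoning", "CodeContest",
        ],
    },
    "Distilling": {
        "Dedup": ["Embedding Model", "Similarity Engine"],
        "Select Rules": [
            "w/o reference answer test case",
            "Difficulty", "Quality", "Language", "Category",
        ],
    },
    "Rejection Sampling": {
        "Rule-Based": ["Ngram/Length/Format/", "Safety"],
        "Model-Based": ["Reward Model", "LLM-as-a-judge"],
        "Verify-Based": ["Code sandbox", "Math verify"],
    },
}

# Print each stage with the groups it contains, in left-to-right panel order.
for stage, groups in PIPELINE.items():
    print(stage, "->", ", ".join(groups))
```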
### Detailed Analysis
The diagram details a sequential and branching workflow:
1. **Stage 1 - Collection:** Raw data is gathered according to "Collection Rules." The data is characterized along three axes: domain (Math, Code, Science, Chat), accompanying verification information (reference answers, test cases, ground truth), and whether the response includes reasoning chains. The named sources are diverse, including programming datasets (Codeforces, CodeContest, APPS), math datasets (NuminaMath, OpenMathR1), and broader instruction or reasoning corpora (Orca, Natural Reasoning).
2. **Stage 2 - Distilling:** The collected data undergoes two parallel filtering processes before being used for "DeepSeek-R1 Distilling."
* **Dedup (Deduplication):** Uses embedding models (like m3e, gte, bge) and similarity engines (like faiss, trt) to identify and remove duplicate or near-duplicate entries.
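* **Select Rules:** Applies multiple filters to curate the dataset. One rule, labeled "w/o reference answer test case," apparently concerns samples that lack a reference answer or test case; the diagram does not state whether such samples are kept or dropped. The remaining rules select by difficulty, quality, language, and category.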
3. **Stage 3 - Rejection Sampling:** The distilled dataset is further refined using three evaluation methods to reject low-quality samples.
* **Rule-Based:** Applies heuristic rules concerning n-grams, length, format, and safety.
* **Model-Based:** Uses a Reward Model to score responses on dimensions like correctness, verbosity, coherence, complexity, and helpfulness. It also employs an "LLM-as-a-judge" for evaluation.
* **Verify-Based:** Uses programmatic checks rather than learned scores: a "Code sandbox" that presumably executes programming solutions, and "Math verify" for checking answers to mathematical problems.
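
The Dedup step in Stage 2 can be illustrated with a minimal sketch. The diagram names embedding models (m3e, gte, bge) and similarity engines (faiss, trt) but gives no algorithm or threshold, so everything below is an assumption: the vectors are hand-written stand-ins for real embeddings, the search is brute force rather than an index lookup, and the 0.95 cosine threshold and greedy keep-first policy are illustrative choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(embeddings, threshold=0.95):
    """Greedy near-duplicate removal.

    An item is kept only if its similarity to every previously kept item
    is below `threshold` (an assumed value). Returns the kept indices.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors standing in for embedding-model output (hypothetical data).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # points almost the same way as item 0
    [0.0, 1.0, 0.0],
]

kept_indices = dedup(toy_embeddings)
print(kept_indices)
```

The brute-force loop compares each item against everything kept so far, which is quadratic in the worst case. A similarity engine of the kind the diagram names would normally replace that inner comparison with an approximate nearest-neighbor lookup.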
### Key Observations
* **Process-Oriented:** The diagram is purely a process flowchart. It contains no numerical data, charts, or quantitative metrics. Its purpose is to outline the methodology, not present results.
* **Typographical Errors:** The label "Infomation" in the first panel is a clear misspelling of "Information," and its sub-label "Refenrence Answer" is a misspelling of "Reference Answer."
* **Technical Specificity:** The diagram names specific tools and models (e.g., faiss, m3e, gte, bge), indicating a concrete technical implementation rather than a conceptual overview.
* **Comprehensive Filtering:** The pipeline employs a multi-faceted approach to data quality, moving from broad collection rules to specific deduplication, attribute-based selection, and finally, multi-method rejection sampling.
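
The attribute-based selection mentioned above can be sketched as a simple metadata filter. The diagram lists only the rule names (difficulty, quality, language, category, and the reference-answer rule), so the `Sample` fields, the score scales, and every threshold here are hypothetical. Because the diagram does not say how samples without a reference answer are treated, that rule is exposed as an optional flag rather than a fixed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    prompt: str
    category: str      # e.g. "math" or "code"
    language: str      # e.g. "en"
    difficulty: int    # hypothetical 1-5 scale
    quality: float     # hypothetical score in [0, 1]
    reference_answer: Optional[str] = None


def select(samples, *, categories, languages, min_difficulty, min_quality,
           require_reference=False):
    """Keep samples that satisfy every attribute rule (all thresholds assumed)."""
    selected = []
    for s in samples:
        if s.category not in categories or s.language not in languages:
            continue
        if s.difficulty < min_difficulty or s.quality < min_quality:
            continue
        if require_reference and s.reference_answer is None:
            continue
        selected.append(s)
    return selected


samples = [
    Sample("Prove that sqrt(2) is irrational.", "math", "en", 4, 0.9),
    Sample("What is 1 + 1?", "math", "en", 1, 0.8, reference_answer="2"),
    Sample("Reverse a linked list.", "code", "en", 3, 0.4),
    Sample("Sum the digits of n.", "code", "en", 3, 0.7, reference_answer="see tests"),
]

selected = select(samples, categories={"math", "code"}, languages={"en"},
                  min_difficulty=3, min_quality=0.5)
print([s.prompt for s in selected])
```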
### Interpretation
This diagram outlines a sophisticated data curation pipeline designed to create a high-quality training dataset for the "DeepSeek-R1" model. The process emphasizes **quality over quantity**.
* **The "Why":** The multi-stage filtering (Distilling + Rejection Sampling) suggests that the initial raw data is noisy or varied. The goal is to distill it down to a core set of high-value examples that are diverse (by domain/category), challenging (by difficulty), correct (via reference answers and verification), and well-formed (via rule and model-based checks).
* **Relationship Between Stages:** The stages are interdependent. "Raw Data Collection" defines the input universe. "Distilling" performs coarse-grained filtering to manage scale and remove obvious duplicates and low-relevance items. "Rejection Sampling" performs fine-grained, often computationally expensive, quality assessment on the remaining data to ensure only the best examples are used for model training.
* **Notable Methodology:** The inclusion of "Response With Reasoning Chains" as a primary data type and "LLM-as-a-judge" as an evaluation method indicates a focus on training the model for **chain-of-thought reasoning** and complex problem-solving, not just final answer accuracy. The "Verify-Based" methods (Code sandbox, Math verify) add a layer of objective, ground-truth validation that is crucial for technical domains like math and coding.
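
A rough sketch of what the two verify-based checks might look like follows. The diagram gives only the labels "Code sandbox" and "Math verify," so the function names, the numeric-equivalence rule, and the stdin/stdout test-case format are all assumptions. Note in particular that a plain subprocess with a timeout is not a security sandbox; it stands in here only to show the control flow of running a candidate solution against a test case.

```python
import subprocess
import sys
from fractions import Fraction


def answers_match(candidate: str, reference: str) -> bool:
    """Compare final answers numerically when possible, else as exact strings.

    Hypothetical rule: "0.5" and "1/2" count as the same answer.
    """
    try:
        return Fraction(candidate.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return candidate.strip() == reference.strip()


def passes_test_case(code: str, stdin_text: str, expected_stdout: str,
                     timeout_s: float = 5.0) -> bool:
    """Run candidate Python code in a child process and compare its stdout.

    This is NOT an isolated sandbox; a real one would restrict file,
    network, and memory access as well.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


math_ok = answers_match("0.5", "1/2")
code_ok = passes_test_case("print(int(input()) * 2)", "21\n", "42")
code_bad = passes_test_case("print(int(input()) + 2)", "21\n", "42")
print(math_ok, code_ok, code_bad)
```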
In essence, this pipeline is engineered to transform large, heterogeneous raw data into a refined, high-signal dataset capable of teaching a model robust reasoning and problem-solving skills.
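
Of the rejection checks described earlier, the rule-based ones are the simplest to sketch. The diagram names only "Ngram/Length/Format/" and "Safety," so the concrete rules below are assumptions: a cap on repeated word trigrams, a word-count limit, and a format check that expects reasoning wrapped in `<think>...</think>` followed by a non-empty answer. The tag convention and all thresholds are hypothetical, and the safety check is omitted.

```python
from collections import Counter


def max_ngram_repeat(text, n=3):
    """Highest count of any single word n-gram in `text` (0 if too short)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return max(Counter(grams).values(), default=0)


def has_valid_format(response):
    """Assumed format: one <think>...</think> block, then a non-empty answer."""
    if response.count("<think>") != 1 or response.count("</think>") != 1:
        return False
    head, _, answer = response.partition("</think>")
    return head.lstrip().startswith("<think>") and bool(answer.strip())


def passes_rules(response, *, max_words=2000, max_repeat=3):
    """Apply the length, n-gram, and format rules (thresholds are assumed)."""
    words = response.split()
    if not words or len(words) > max_words:
        return False
    if max_ngram_repeat(response) > max_repeat:
        return False
    return has_valid_format(response)


def rejection_sample(candidates):
    """Keep only the candidate responses that pass every rule."""
    return [c for c in candidates if passes_rules(c)]


good = "<think>2 + 2 is 4</think> The answer is 4."
no_answer = "<think>hmm</think>"
looping = "<think>" + "wait let me check " * 10 + "</think> 4"

accepted = rejection_sample([good, no_answer, looping])
print(len(accepted))
```

Cheap checks like these would plausibly run before the reward-model and verification steps, since they need no model inference or code execution, though the diagram does not specify an order among the three method categories.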