# 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training
**Authors**: Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li
> a-m-team
## Abstract
The AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, trained with only simple Supervised Fine-Tuning (SFT) on this data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all four benchmarks. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
(Figure 1: a grouped bar chart of Pass@1 accuracy (%) on AIME 2024, MATH-500, GPQA Diamond, and LiveCodeBench for AM-Distill-Qwen-32B, DeepSeek-R1-Distill-Qwen-32B, AM-Distill-Qwen-72B, and DeepSeek-R1-Distill-Llama-70B; the underlying numbers are reported in Table 2.)
Figure 1: Overall performance of AM models.
## 1 Introduction
OpenAI’s o1 series models (OpenAI, 2024) were the pioneers in introducing inference-time scaling by extending the length of the Chain-of-thought reasoning process (Wei et al., 2023; Snell et al., 2024; Wu et al., 2025). This approach has yielded remarkable improvements across various reasoning tasks, including mathematics, coding, and scientific reasoning (Lightman et al., 2023; Hwang et al., 2024).
Subsequently, the introduction of DeepSeek-R1 (DeepSeek-AI et al., 2025) significantly propelled the open-source community forward, enabling deeper insights into inference-time scaling. DeepSeek also introduced the DeepSeek-R1-distilled series of models. These models solely utilized distilled data with reasoning chains for Supervised Fine-Tuning (SFT), yet they achieved outstanding results on diverse benchmarks. In the training pipeline of DeepSeek-R1, unlike that of DeepSeek-R1-Zero, 800,000 carefully selected data entries were used for SFT. This is a crucial factor in DeepSeek-R1’s superiority over DeepSeek-R1-Zero: SFT on carefully selected data can effectively improve model performance. This not only highlights the importance of data selection in SFT but also validates the positive impact of well-executed SFT on a model’s reasoning ability.
Building upon prior work, the open-source community has recently introduced numerous datasets distilled from reasoning models, chiefly DeepSeek-R1 (OpenThoughts, 2025; Xu et al., 2025). However, the scale of these datasets is generally smaller than the 800,000 samples employed by DeepSeek for its distilled series of models. To date, few open-source initiatives have matched the performance achieved by the DeepSeek-R1-distilled series models on the corresponding base models. Therefore, we have constructed the AM-DeepSeek-R1-Distilled dataset, which encompasses 1.4 million high-quality data entries with reasoning chains. Among these, 0.5 million entries are sourced entirely from open-source datasets, and 0.9 million entries are distilled by AM from DeepSeek-R1, as denoted by “am-0309” in the response sources. The AM-DeepSeek-R1-Distilled dataset exhibits significant advantages in terms of data scale, quality, and diversity. Through our meticulous data processing and stringent verification procedures, this dataset can offer robust support for the long chain-of-thought (CoT) training of large language models.
In terms of data collection, we comprehensively gathered diverse types of reasoning problems from numerous open-source datasets and implemented semantic deduplication and cleaning to guarantee the high quality and purity of the data (Li et al., 2023; Tirumala et al., 2023). Simultaneously, we conducted strict verification of all responses, including validating mathematical problems through answer checking, verifying code problems via test cases, and evaluating other tasks using a reward model, thereby ensuring the accuracy and reliability of the data.
Regarding data scale, the AM dataset, with its 1.4 million entries, substantially exceeds other recent open-source datasets. Among these entries, 500,000 are fully derived from open-source datasets and span a wide range of knowledge domains and problem types. For the remaining 900,000, the instruction part is sourced from open-source datasets, while the response part is distilled by the AM team from DeepSeek-R1. All of these data have passed through our data pipeline and are of high quality.
In terms of diversity, our dataset not only encompasses problems from common domains such as math, code, and science but also includes cross-domain and comprehensive reasoning tasks. This can comprehensively exercise the reasoning and generalization abilities of models (Song et al., 2024). Moreover, we meticulously processed the instruction part: we utilized a large language model to label all instructions with difficulty and category, and performed strict semantic deduplication according to these labels to ensure the high quality and diversity of the instructions (Xu et al., 2024).
In addition, our dataset adopts a unified format, and each data entry is annotated in detail, including user-assistant interaction information, reasoning processes, final answers, reference answers, test cases, and other metadata. This standardized format renders the dataset easy to use and understand, facilitating researchers in conducting data processing and model training.
We believe that the release of the AM-DeepSeek-R1-Distilled dataset will offer crucial resource support for the research of reasoning-oriented large language models and is anticipated to drive further development and innovation in this field. We look forward to the research community leveraging this dataset to achieve more research breakthroughs and jointly promote the progress of AGI.
## 2 Approach
The core criteria for our data selection mainly include three aspects: diversity, complexity, and accuracy. We constructed our data pipeline around how to improve these three core indicators. As demonstrated in Figure 2, the entire pipeline can be divided into (1) Raw Data Collection, (2) Distilling, and (3) Rejection Sampling. The subsequent sections will elaborate on these components in detail.
(Figure 2: a flowchart of the three-stage data pipeline. (1) Raw Data Collection gathers domain-labeled problems (Math/Code/Science/Chat) with reference answers, test cases, ground truth, or reasoning chains from sources such as NuminaMath, Codeforces, OpenMathR1, Orca, APPS, Natural Reasoning, and CodeContest. (2) Distilling applies embedding-based deduplication (m3e/gte/bge embeddings with a faiss similarity engine) and selection rules (reference answer/test case availability, difficulty, quality, language, category) before DeepSeek-R1 distillation. (3) Rejection Sampling filters responses with rule-based checks (n-gram, length, format, safety), model-based checks (a reward model scoring correctness, verbosity, coherence, complexity, and helpfulness, plus LLM-as-a-judge), and verification-based checks (code sandbox, math verify).)
Figure 2: Construction process of data pipeline.
### 2.1 Raw Data
#### 2.1.1 Data Sources
We divided the data selection into four major categories: math, code, scienceQA, and general chat, and classified high-quality open-source datasets into these four categories. For Math, Code, and ScienceQA, we prioritized datasets with reference answers or test cases, such as NuminaMath (LI et al., 2024), MetaMathQA (Yu et al., 2023), natural_reasoning (Yuan et al., 2025), OpenCoder (Huang et al., 2024), Omni-MATH (Gao et al., 2024), PRIME (Yuan et al., 2024), CodeIO (Li et al., 2025), and MATH-lighteval (Hendrycks et al., 2021). We also selected datasets with reasoning chains generated by DeepSeek-R1 from the open-source community, such as Openthoughts (OpenThoughts, 2025), OpenR1Math (Open-R1, 2025), KodCode (Xu et al., 2025), Bespoke-Stratos-17k (Bespoke, 2025), GeneralThought (Reasoning, 2025), Dolphin-R1 (cognitivecomputations, 2025), data_ablation_full59K (Muennighoff et al., 2025), s1K (Muennighoff et al., 2025), and LIMO (Ye et al., 2025). Additionally, to enhance the model’s chatting ability, we obtained chat data from general-purpose SFT datasets, such as InfinityInstruct (BAAI, 2024) and Orca (Lian et al., 2023). The distribution of reference answers and test cases can be found in Appendix A.2.
#### 2.1.2 Categories
The initial four categories alone were insufficient, especially for general chat data. Thus, we designed more detailed categories, such as creative writing and instruction following. To facilitate data matching and enhance the diversity of the AM dataset, we used the Qwen2.5-7B-Instruct model (Qwen, 2024) to label the data. Details of the categories can be found in Appendix A.3 and Appendix B.2.
#### 2.1.3 Difficulty
For the training of long-CoT models, more challenging data can effectively extend the length of the reasoning chains generated by the model and improve its reasoning ability. We therefore used a large language model to score the difficulty of all instructions, subsequently screening the data and downsampling easy- and medium-difficulty examples. This ensures that the AM dataset emphasizes more challenging data while maintaining its diversity. The difficulty distribution of the data can be found in Appendix A.4 and Appendix B.1.
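The screening step above can be sketched as a probabilistic downsampler. The `downsample_by_difficulty` helper and its keep probabilities are illustrative assumptions; the paper does not report the exact sampling rates it used.

```python
import random

def downsample_by_difficulty(entries, keep_prob=None, seed=0):
    """Keep all hard examples but randomly drop a fraction of easy and
    medium ones. The keep probabilities are illustrative, not the paper's."""
    if keep_prob is None:
        keep_prob = {"easy": 0.2, "medium": 0.5, "hard": 1.0}
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    return [e for e in entries
            if rng.random() < keep_prob.get(e["difficulty"], 1.0)]

# Hard examples always survive; easy ones survive with probability ~0.2.
hard_kept = downsample_by_difficulty([{"difficulty": "hard"}] * 5)
easy_kept = downsample_by_difficulty([{"difficulty": "easy"}] * 1000)
```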
#### 2.1.4 Deduplication
We performed strict semantic deduplication on the collected data. We calculated an embedding for each data entry and computed text similarity based on these embeddings to obtain the semantic similarity of different entries. For entries with high semantic similarity, we designed priority strategies and ultimately retained only one representative entry. This process ensures the uniqueness and diversity of the dataset and prevents the negative impact of near-duplicate data during model training.
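The deduplication described above can be sketched as a greedy filter over embedding vectors. The cosine-similarity threshold, the priority values, and the `semantic_dedup` helper are illustrative assumptions; the actual pipeline uses embedding models such as m3e/gte/bge with a faiss similarity engine, whereas this sketch uses plain NumPy.

```python
import numpy as np

def semantic_dedup(embeddings, priorities, threshold=0.95):
    """Greedy semantic deduplication: visit entries in descending priority
    order and keep one only if its cosine similarity to every already-kept
    entry is below the threshold. Returns indices of retained entries."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows -> unit vectors
    order = np.argsort(priorities)[::-1]              # highest priority first
    kept = []
    for i in order:
        if not kept or np.dot(X[kept], X[i]).max() < threshold:
            kept.append(int(i))
    return sorted(kept)

# Toy example: entries 0 and 1 are near-duplicates; entry 1 has higher priority,
# so it is the representative that survives.
kept = semantic_dedup(
    embeddings=[[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]],
    priorities=[1, 2, 1],
)
```

At production scale the pairwise comparison would be replaced by an approximate nearest-neighbor index, but the keep/drop logic is the same.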
### 2.2 Distilled Data
We obtained responses to prompts in two ways: filtering existing responses and creating new ones. For prompts with existing responses, we retained the original response if it passed reference-answer or test-case verification. For data without reasoning chains, we generated new responses using DeepSeek-R1.
#### 2.2.1 Ground Truth Verification
For problems with available reference answers, we conducted verification through a combination of rule-based methods and a large language model. Initially, we applied math-verify (Kydlíček and Gandenberger, 2025) to assess whether the response matched the reference answer in terms of format and calculation results. Subsequently, we used Qwen2.5-7B-Instruct to further evaluate the correctness and consistency of these responses; the prompt can be found in Appendix B.3. For code-related problems with test cases, we verified responses within a sandbox environment. We ultimately removed the data that did not pass verification to ensure the accuracy and reliability of the dataset.
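A heavily simplified stand-in for the rule-based part of this check might normalize both answers and compare them as exact rationals. `normalize_answer` and `answers_match` are hypothetical helpers for illustration only, not the actual math-verify logic, which handles far richer LaTeX expressions.

```python
from fractions import Fraction

def normalize_answer(ans: str):
    """Simplified normalization: strip whitespace, $ delimiters, and a
    trailing period, then try to parse the result as an exact rational."""
    s = ans.strip().strip("$").rstrip(".").replace(" ", "")
    try:
        return Fraction(s)      # handles "3/4", "0.75", "12"
    except (ValueError, ZeroDivisionError):
        return s.lower()        # fall back to case-insensitive string match

def answers_match(model_answer: str, reference: str) -> bool:
    """True when both answers normalize to the same value."""
    return normalize_answer(model_answer) == normalize_answer(reference)
```

Parsing to `Fraction` lets superficially different but numerically equal answers (e.g. "3/4" and "0.75") match exactly, without floating-point tolerance issues.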
#### 2.2.2 Reward
We used two methods to evaluate the answer_content part of the model output: Decision-Tree-Reward-Llama-3.1-8B (Rlhflow, 2025) as a reward model, and Qwen2.5-7B-Instruct for large-language-model scoring. We set a score threshold based on the score distribution and removed the data with lower scores. The reward model evaluates responses across five dimensions (correctness, helpfulness, coherence, complexity, and verbosity) to ensure that the selected responses improve the overall quality of the dataset.
#### 2.2.3 Rule Verification
We established verification rules, such as format-template conformity and n-gram repetition checks. For format verification, we ensured that each response adhered strictly to the format specified in the prompt, i.e., clearly indicating <think>reasoning process here</think><answer>final answer here</answer>. For n-gram repetition verification, we checked responses for excessive consecutive word repetition. Responses failing these rule-based verifications were excluded to guarantee dataset quality and consistency.
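The two rules above can be sketched as follows; the regular expression, n-gram size, and repetition limit are illustrative assumptions rather than the paper's exact settings.

```python
import re
from collections import Counter

# Response must be exactly <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)

def passes_format(response: str) -> bool:
    """Check conformity to the <think>/<answer> template."""
    return bool(THINK_ANSWER.match(response.strip()))

def passes_ngram_check(response: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Reject responses in which any single word n-gram occurs too often,
    a cheap signal of degenerate repetition."""
    words = response.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return all(c <= max_repeats for c in grams.values())
```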
#### 2.2.4 Labels
We additionally annotated the data with supplementary information, such as length and language. For length annotation, we calculated the number of words or tokens per data entry, providing insights into the complexity and scale of the dataset. The length distribution of the data can be found in Appendix A.1. For language annotation, we primarily annotated entries as Chinese, English, or other languages. These labels facilitate effective data screening and analysis.
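The length and language annotation can be sketched as below. The CJK-character-ratio heuristic for language detection is our own assumption for illustration, not necessarily the method the authors used.

```python
def annotate(entry_text: str) -> dict:
    """Attach length and a coarse language label to one data entry.
    Language is guessed from the share of CJK characters (heuristic)."""
    cjk = sum(1 for ch in entry_text if "\u4e00" <= ch <= "\u9fff")
    ratio = cjk / max(len(entry_text), 1)
    language = "zh" if ratio > 0.3 else ("en" if entry_text.isascii() else "other")
    return {"length_words": len(entry_text.split()),
            "length_chars": len(entry_text),
            "language": language}
```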
## 3 Experiment
### 3.1 Evaluation
#### 3.1.1 Benchmark
We evaluated the reasoning ability of the model using LiveCodeBench (Jain et al., 2024) (2024-08–2025-01), GPQA-Diamond (Rein et al., 2023), AIME 2024 (MAA, 2024), and MATH-500 (Lightman et al., 2023). These benchmarks span multiple fields and difficulty levels, enabling a thorough assessment of the model’s reasoning performance across diverse scenarios.
#### 3.1.2 Evaluation Methodology
We set the maximum generation length to 32,768 tokens. For benchmarks requiring sampling, the temperature was uniformly set to 0.6, and the top-p value to 0.95. For AIME 2024 (MAA, 2024), we generated 16 samples per query to estimate pass@1. For LiveCodeBench (Jain et al., 2024), MATH-500 (Lightman et al., 2023) and GPQA Diamond (Rein et al., 2023), we generated 4 responses per query, also to estimate pass@1. The evaluation metric across these benchmarks was the globally averaged accuracy.
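Under this protocol, the pass@1 estimate reduces to the fraction of correct samples per query, averaged over queries (equivalent to the global average accuracy when every query receives the same number of samples). A minimal sketch:

```python
def pass_at_1(correct_flags_per_query):
    """Estimate pass@1: for each query, the fraction of its k sampled
    responses judged correct, averaged over all queries."""
    per_query = [sum(flags) / len(flags) for flags in correct_flags_per_query]
    return sum(per_query) / len(per_query)

# Two queries with 4 samples each: 3/4 and 1/4 correct -> 0.5 overall.
score = pass_at_1([[1, 1, 1, 0], [0, 0, 1, 0]])
```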
### 3.2 Main Result
We performed SFT on Qwen2.5-32B, producing a model named AM-Distill-Qwen-32B; the system prompt used is shown in Table 1. Compared with DeepSeek-R1-Distill-Qwen-32B, our model achieved consistent improvements. Evaluation results are shown in Table 2. Specifically, on AIME2024, accuracy increased from 72.6% to 72.7%; on MATH-500, from 94.3% to 96.2%; on GPQA-Diamond, from 62.1% to 64.3%; and on LiveCodeBench, from 57.2% to 59.1%. Overall, the average accuracy improved from 71.6% to 73.1%.
| You are a helpful assistant. To answer the user’s question, you first think about the reasoning process and then provide the user with the answer. The reasoning process and answer are enclosed within <think> and <answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. |
| --- |
Table 1: System prompt in training process.
We further performed training based on the Qwen2.5-72B model to obtain AM-Distill-Qwen-72B. Compared with DeepSeek-R1-Distill-Llama-70B, our 72B model achieved notable improvements across all evaluation benchmarks. Specifically, accuracy on AIME2024 significantly increased from 70.0% to 76.5%; MATH-500 improved from 94.5% to 97.0%; GPQA-Diamond rose from 65.2% to 65.9%; and LiveCodeBench increased from 57.5% to 59.7%.
Experimental results demonstrate that models trained on our constructed AM-DeepSeek-R1-Distilled-1.4M dataset exhibit substantial enhancements in reasoning ability.
| Model | AIME2024 | MATH-500 | GPQA-Diamond | LiveCodeBench | Average |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 94.3 | 62.1 | 57.2 | 71.6 |
| AM-Distill-Qwen-32B | 72.7 | 96.2 | 64.3 | 59.1 | 73.1 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 94.5 | 65.2 | 57.5 | 71.8 |
| AM-Distill-Qwen-72B | 76.5 | 97.0 | 65.9 | 59.7 | 74.8 |
Table 2: Model performance.
## 4 Limitation
Since the responses in this dataset are generated by large language models and their factual accuracy has not been rigorously verified, deficiencies may remain in factual accuracy and other aspects; careful examination is necessary when using this dataset. This dataset is mainly intended to enhance the reasoning capabilities of large language models (LLMs). We have not thoroughly filtered harmful instructions or responses within it. We require developers to use the open-sourced code, data, models, and any other artifacts generated through this project for research purposes only; commercial use and other potentially harmful use cases are not permitted. In addition, due to nested relationships among some data sources, the recorded source attributions may be inaccurate in places.
## 5 Conclusion
In this study, we constructed and released the AM-DeepSeek-R1-Distilled dataset, a large-scale general reasoning dataset with 1.4 million data entries and rich thinking traces. It was created through meticulous selection, semantic deduplication, and strict cleaning of a large number of open-source datasets.
Furthermore, the AM-Distill-Qwen-32B model, developed by performing SFT on Qwen2.5-32B using our constructed dataset, has exhibited remarkable performance enhancements. This compellingly demonstrates that our dataset serves as a significant asset for training the reasoning capabilities of models. We are optimistic that our endeavors will play a substantial and catalytic role in research on reasoning-oriented Large Language Models, propelling forward development in this field.
## References
- BAAI (2024) BAAI. Infinity instruct. 2024. URL https://huggingface.co/datasets/BAAI/Infinity-Instruct.
- Bespoke (2025) Bespoke. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation, 2025. Accessed: 2025-01-22.
- cognitivecomputations (2025) cognitivecomputations. dolphin-r1. https://huggingface.co/datasets/cognitivecomputations/dolphin-r1, 2025.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Gao et al. (2024) Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07985.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Huang et al. (2024) Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models. 2024. URL https://arxiv.org/pdf/2411.04905.
- Hwang et al. (2024) Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore: Enhancing mathematical reasoning in language models with fine-grained rewards, 2024. URL https://arxiv.org/abs/2404.10346.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Kydlíček and Gandenberger (2025) Hynek Kydlíček and Greg Gandenberger. Math-verify, 2025. URL https://github.com/huggingface/math-verify.
- LI et al. (2024) Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT, 2024.
- Li et al. (2025) Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. Codei/o: Condensing reasoning patterns via code input-output prediction. arXiv preprint arXiv:2502.07316, 2025.
- Li et al. (2023) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. Synthetic data generation with large language models for text classification: Potential and limitations, 2023. URL https://arxiv.org/abs/2310.07849.
- Lian et al. (2023) Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/datasets/Open-Orca/OpenOrca, 2023.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.
- MAA (2024) MAA. American invitational mathematics examination - aime. https://maa.org/math-competitions/american-invitational-mathematics-examination-aime, February 2024. Accessed February 2024.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393.
- Open-R1 (2025) Open-R1. Openr1-math-220k. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025.
- OpenAI (2024) OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
- OpenThoughts (2025) Team OpenThoughts. Open Thoughts. https://open-thoughts.ai, January 2025.
- Qwen (2024) Qwen. Team qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Reasoning (2025) General Reasoning. Generalthought-feb25. https://huggingface.co/datasets/GeneralReasoning/GeneralThought-Feb25, February 2025.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
- Rlhflow (2025) Rlhflow. Decision tree reward model. https://rlhflow.github.io/posts/2025-01-22-decision-tree-reward-model/, 2025. Accessed: 2025-03-11.
- Snell et al. (2024) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
- Song et al. (2024) Feifan Song, Bowen Yu, Hao Lang, Haiyang Yu, Fei Huang, Houfeng Wang, and Yongbin Li. Scaling data diversity for fine-tuning language models in human alignment, 2024. URL https://arxiv.org/abs/2403.11124.
- Tirumala et al. (2023) Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving llm pretraining via document de-duplication and diversification, 2023. URL https://arxiv.org/abs/2308.12284.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
- Wu et al. (2025) Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2025. URL https://arxiv.org/abs/2408.00724.
- Xu et al. (2024) Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing, 2024. URL https://arxiv.org/abs/2406.08464.
- Xu et al. (2025) Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025. URL https://arxiv.org/abs/2503.02951.
- Ye et al. (2025) Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yuan et al. (2024) Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.
- Yuan et al. (2025) Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, and Xian Li. Naturalreasoning: Reasoning in the wild with 2.8m challenging questions, 2025. URL https://arxiv.org/abs/2502.13124.
## Appendix A Data Analysis
### A.1 Length Distribution
<details>
<summary>extracted/6308700/length_dist_1.png Details</summary>

### Visual Description
## Histogram: Distribution of Token Counts in a Dataset
### Overview
The image displays a histogram illustrating the frequency distribution of token counts within a dataset. The chart shows a strongly right-skewed distribution, where the vast majority of data samples contain a relatively low number of tokens, with a long tail extending towards higher token counts.
### Components/Axes
* **Chart Type:** Histogram (vertical bar chart).
* **X-Axis (Horizontal):**
* **Label:** "Number of tokens in dataset"
* **Scale:** Linear scale with major tick marks at intervals of 2048 (2048, 4096, 6144, 8192, 10240, 12288, 14336, 16384). The axis starts at 0.
* **Y-Axis (Vertical):**
* **Label:** Not explicitly labeled, but represents the frequency or count of samples.
* **Scale:** Linear scale with major tick marks at intervals of 100,000, ranging from 0 to 600,000.
* **Data Series:** A single data series represented by light blue bars with black outlines.
* **Legend:** No legend is present, as there is only one data series.
* **Title:** No chart title is visible within the image frame.
### Detailed Analysis
The histogram consists of approximately 30-35 contiguous bars (bins), each representing a range of token counts. The width of each bin appears consistent.
**Trend Verification:** The visual trend is a sharp peak at the lower end of the x-axis, followed by a steady, exponential-like decay as the number of tokens increases. The height of the bars decreases monotonically after the peak.
**Approximate Data Points (Key Bins):**
* **Bin 1 (Leftmost, ~0-512 tokens):** ~75,000 samples.
* **Bin 2 (~512-1024 tokens):** ~360,000 samples.
* **Bin 3 (~1024-1536 tokens):** **Peak of the distribution.** Approximately 620,000 samples. This is the mode of the dataset.
* **Bin 4 (~1536-2048 tokens):** ~450,000 samples.
* **Bin 5 (~2048-2560 tokens):** ~250,000 samples.
* **Bin 6 (~2560-3072 tokens):** ~170,000 samples.
* **Bin 7 (~3072-3584 tokens):** ~130,000 samples.
* **Bin 8 (~3584-4096 tokens):** ~105,000 samples.
* **At 4096 tokens:** The bar height is approximately 90,000.
* **At 6144 tokens:** The bar height is approximately 50,000.
* **At 8192 tokens:** The bar height is approximately 25,000.
* **At 10240 tokens:** The bar height is approximately 15,000.
* **At 12288 tokens:** The bar height is approximately 8,000.
* **At 16384 tokens (rightmost visible):** The bar height is very low, approximately 1,000-2,000 samples.
The distribution has a very long tail, with a non-zero number of samples containing up to at least 16,384 tokens.
### Key Observations
1. **Strong Right Skew:** The distribution is not symmetric. The mass of the data is concentrated on the left (shorter sequences).
2. **Clear Mode:** The most common token count range is between approximately 1024 and 1536 tokens.
3. **Rapid Initial Drop-off:** After the peak, the frequency drops by nearly half within the next two bins (by ~2560 tokens).
4. **Long Tail:** While the frequency is low, there is a significant number of samples with very high token counts (8000+), indicating the presence of long documents or concatenated texts in the dataset.
5. **No Gaps:** The histogram bars are contiguous, suggesting the data is continuous or has been binned without gaps.
### Interpretation
This histogram characterizes the composition of a text dataset used for training or evaluating a language model. The data suggests:
* **Dataset Composition:** The dataset is dominated by short to medium-length text samples (under 4096 tokens). This is typical for many web-scraped or curated text corpora.
* **Model Context Implications:** A model with a context window of 2048 or 4096 tokens would be able to process the majority of samples in their entirety. However, a substantial minority of samples (the long tail) would require truncation or would not fit, potentially leading to information loss for those specific examples.
* **Potential Data Curation:** The sharp peak and smooth decay might indicate intentional filtering or a natural property of the source data (e.g., social media posts, short articles). The absence of a secondary peak at very high token counts suggests the dataset is not heavily composed of concatenated documents or books.
* **Outliers:** The samples beyond 12,288 tokens are outliers in terms of length. Their presence, while numerically small, could be important for tasks requiring very long-range context understanding.
In summary, this visual provides a crucial profile of a dataset's sequence length distribution, which is fundamental for understanding model training dynamics, selecting appropriate context window sizes, and anticipating data preprocessing needs like truncation or packing.
</details>
Figure 3: Token length distribution of data entries in the dataset. Most data entries contain fewer than 4096 tokens, with the highest concentration at approximately 2048 tokens. The frequency gradually decreases as the token count increases, indicating fewer samples with longer contexts.
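As a rough illustration of how a histogram like Figure 3 can be produced, the sketch below bins per-entry token counts into fixed-width buckets. The toy counts and the 512-token bin width are illustrative assumptions; a real pass would use counts from the actual training tokenizer.

```python
from collections import Counter

def length_histogram(token_counts, bin_size=512):
    """Bucket per-entry token counts into fixed-width bins.

    Returns a dict mapping each bin's lower edge to its sample count,
    sorted by bin edge.
    """
    hist = Counter((n // bin_size) * bin_size for n in token_counts)
    return dict(sorted(hist.items()))

# Toy counts standing in for real tokenizer output.
counts = [300, 700, 1200, 1300, 1500, 2100, 5000]
print(length_histogram(counts))  # {0: 1, 512: 1, 1024: 3, 2048: 1, 4608: 1}
```

Plotting the returned dict as a bar chart (bin edges on the x-axis, counts on the y-axis) reproduces the shape of the figure.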
### A.2 Reference Distribution
<details>
<summary>extracted/6308700/test_case_distribution.png Details</summary>

### Visual Description
## Pie Chart: Distribution of Reference Answer and Test Case
### Overview
The image is a pie chart titled "Distribution of Reference Answer and Test Case". It displays the proportional distribution of a dataset across three mutually exclusive categories related to the presence of reference answers and test cases. The chart includes a legend, segment labels with percentages, and a title.
### Components/Axes
* **Chart Type:** Pie Chart
* **Title:** "Distribution of Reference Answer and Test Case" (located at the bottom center of the image).
* **Legend:** Positioned in the top-left corner. It is titled "w/o Reference Answer and Test Case" and defines three categories with associated colors and absolute counts.
* **Pink Square:** "Null (549,238)"
* **Blue Square:** "Have reference answer (543,935)"
* **Green Square:** "Have test case (306,818)"
* **Data Segments:** The pie is divided into three slices, each labeled with its category name and percentage of the total.
* **Pink Slice (Left/Top-Left):** Labeled "Null (39.2%)". This is the largest segment.
* **Blue Slice (Bottom):** Labeled "Have reference answer (38.9%)". This is the second-largest segment.
* **Green Slice (Top-Right):** Labeled "Have test case (21.9%)". This is the smallest segment.
### Detailed Analysis
The chart presents the following data distribution:
| Category (Legend Label) | Color | Absolute Count (from Legend) | Percentage (from Slice Label) | Visual Proportion |
| :--- | :--- | :--- | :--- | :--- |
| **Null** | Pink | 549,238 | 39.2% | Largest slice, occupying the left and upper-left portion of the pie. |
| **Have reference answer** | Blue | 543,935 | 38.9% | Second-largest slice, occupying the bottom portion of the pie. |
| **Have test case** | Green | 306,818 | 21.9% | Smallest slice, occupying the upper-right portion of the pie. |
**Trend Verification:** The visual trend confirms the numerical data. The pink "Null" slice is visually the largest, followed very closely by the blue "Have reference answer" slice. The green "Have test case" slice is distinctly smaller than the other two.
**Total Count:** Summing the absolute counts from the legend (549,238 + 543,935 + 306,818) gives a total of **1,399,991** items in the dataset.
### Key Observations
1. **Near-Equal Split Between "Null" and "Have reference answer":** The two largest categories are almost identical in size, differing by only 0.3 percentage points (39.2% vs. 38.9%) and approximately 5,303 in absolute count.
2. **Significant "Null" Category:** The largest single category (39.2%) represents items that have neither a reference answer nor a test case.
3. **Minority with Test Cases:** Only 21.9% of the items in the dataset possess a test case, which is roughly half the proportion of items that have a reference answer.
4. **No Overlap Implied:** The chart's title "w/o Reference Answer and Test Case" and the mutually exclusive slices suggest the categories are non-overlapping. An item is classified into one, and only one, of these three states.
### Interpretation
This chart likely visualizes the composition of a dataset used for evaluating or training a system (e.g., a question-answering model, a code generation tool, or an automated grader). The categories suggest a focus on the availability of ground-truth data ("reference answer") and validation mechanisms ("test case").
* **Data Quality & Coverage Gap:** The fact that the largest group is "Null" (39.2%) indicates a substantial portion of the dataset lacks both a definitive correct answer and a means to programmatically test a solution. This represents a potential gap in supervision or evaluation capability for those items.
* **Asymmetry in Annotation:** There is a clear asymmetry: items are nearly twice as likely to have a reference answer (38.9%) as they are to have a test case (21.9%). This suggests that creating formal test cases is a more resource-intensive or less common practice than providing a reference answer within this context.
* **Implication for System Development:** For tasks represented by this dataset, a system's performance could be directly measured against a reference answer for about 39% of cases and validated via test cases for only about 22% of cases. The largest segment (39.2%) would require alternative evaluation methods, such as human judgment or indirect metrics.
* **Potential for Improvement:** The data highlights an opportunity to improve dataset richness by converting some "Null" items into ones with reference answers or test cases, and potentially by developing test cases for items that currently only have reference answers.
</details>
Figure 4: Distribution of reference answers and test cases in the dataset. Among the entries, 38.9% have reference answers, 21.9% include test cases, and 39.2% have neither reference answers nor test cases.
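The three slices in Figure 4 map directly onto the paper's three verification paths: entries with test cases are verified by execution, entries with reference answers by answer matching, and the remainder by a reward model. A minimal routing sketch follows; the field names (`test_cases`, `reference_answer`) are illustrative, not the dataset's actual schema.

```python
def verification_route(entry):
    """Pick a verification method from the fields an entry carries.

    Field names are hypothetical stand-ins for the dataset schema.
    """
    if entry.get("test_cases"):
        return "execute_test_cases"      # code problems: run the test cases
    if entry.get("reference_answer"):
        return "match_reference_answer"  # math problems: compare final answers
    return "reward_model"                # everything else: score with a reward model

print(verification_route({"reference_answer": "42"}))  # match_reference_answer
```

Counting the route chosen for each of the 1.4M entries would reproduce the 21.9% / 38.9% / 39.2% split shown in the figure.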
### A.3 Category Distribution
<details>
<summary>extracted/6308700/category_distribution.png Details</summary>

### Visual Description
## Pie Chart: Distribution of Different Categories
### Overview
This image is a pie chart titled "Distribution of Different Categories." It displays the proportional distribution of seven distinct categories, each represented by a colored slice. The chart includes a legend in the top-left corner that lists each category with its corresponding color and an absolute count. Each slice is also labeled with its category name and percentage of the total.
### Components/Axes
* **Chart Title:** "Distribution of Different Categories" (located at the bottom center).
* **Legend:** Positioned in the top-left corner. It is a box titled "Categories" containing seven entries, each with a colored square and a label in the format "Category Name (Count)".
* **Data Slices:** Seven colored slices radiating from the center. Each slice has an external label showing the category name and its percentage of the total.
* **Data Representation:** The chart uses area (slice size) to represent the proportion of each category within the whole dataset.
### Detailed Analysis
The chart presents the following data, cross-referenced between the legend and the slice labels:
1. **Math**
* **Color:** Light red/salmon.
* **Legend Count:** 410,756.
* **Slice Label:** "Math (29.3%)".
* **Spatial Position:** The largest slice, occupying the upper-left quadrant of the pie.
2. **Coding**
* **Color:** Light blue.
* **Legend Count:** 339,907.
* **Slice Label:** "Coding (24.3%)".
* **Spatial Position:** The second-largest slice, located in the lower-left quadrant, adjacent to the Math slice.
3. **Information Seeking**
* **Color:** Light green.
* **Legend Count:** 311,427.
* **Slice Label:** "Information Seeking (22.2%)".
* **Spatial Position:** The third-largest slice, located in the lower-right quadrant, adjacent to the Coding slice.
4. **Reasoning**
* **Color:** Light orange/peach.
* **Legend Count:** 145,329.
* **Slice Label:** "Reasoning (10.4%)".
* **Spatial Position:** A medium-sized slice in the upper-right quadrant, adjacent to the Information Seeking slice.
5. **Planning**
* **Color:** Light purple/lavender.
* **Legend Count:** 31,757.
* **Slice Label:** "Planning (2.3%)".
* **Spatial Position:** A very thin slice in the upper-right quadrant, between the Reasoning and Creative Writing slices.
6. **Creative Writing**
* **Color:** Light pink.
* **Legend Count:** 30,144.
* **Slice Label:** "Creative Writing (2.2%)".
* **Spatial Position:** A very thin slice in the upper-right quadrant, between the Planning and Others slices.
7. **Others (combined)**
* **Color:** Very light green/mint.
* **Legend Count:** 130,671.
* **Slice Label:** "Others (combined) (9.3%)".
* **Spatial Position:** A medium-sized slice in the upper-right quadrant, adjacent to the Math slice.
**Total Count Verification:** Summing the counts from the legend (410,756 + 339,907 + 311,427 + 145,329 + 31,757 + 30,144 + 130,671) equals **1,400,000**. The percentages (29.3% + 24.3% + 22.2% + 10.4% + 2.3% + 2.2% + 9.3%) sum to **100%**.
### Key Observations
* **Dominant Categories:** The distribution is heavily skewed towards three categories: **Math (29.3%)**, **Coding (24.3%)**, and **Information Seeking (22.2%)**. Together, they constitute **75.8%** of the total dataset.
* **Long Tail:** There is a significant "long tail" of less frequent categories. **Reasoning (10.4%)** is the only other category with a double-digit percentage.
* **Minimal Categories:** **Planning (2.3%)** and **Creative Writing (2.2%)** represent very small fractions of the data, each with counts near 30,000.
* **Aggregated "Others":** The "Others (combined)" category (9.3%) is larger than both Planning and Creative Writing individually, indicating it aggregates multiple smaller, unspecified categories.
### Interpretation
This pie chart visualizes the composition of a dataset containing 1.4 million items, likely prompts, queries, or tasks, categorized by primary intent or domain.
* **Data Suggestion:** The data suggests a strong emphasis on **technical, analytical, and factual tasks**. Math, Coding, and Information Seeking are all domains requiring structured problem-solving, logic, or fact retrieval. This could reflect the primary use cases of the system or platform from which this data was drawn.
* **Relationship Between Elements:** The chart clearly shows a hierarchy of frequency. The top three categories form a core group, followed by a mid-tier (Reasoning), and then a group of niche applications (Planning, Creative Writing, Others). This structure is common in user behavior datasets, where a few primary functions dominate.
* **Notable Anomalies/Trends:** The very low representation of **Creative Writing** is notable, especially if the dataset is from a general-purpose AI assistant. It may indicate that creative tasks are a minor use case compared to technical ones, or that such tasks are categorized differently (e.g., under "Others"). The aggregation of "Others" into a single 9.3% slice hides the diversity of less common tasks, which could include categories like conversation, advice, or entertainment.
* **Implication:** For a technical document, this distribution is crucial for understanding system load, resource allocation, and model training priorities. The system is predominantly used for math, coding, and information retrieval, which should inform where optimization and evaluation efforts are focused. The minimal share of planning and creative writing might suggest these are either underutilized features or areas where the system's capability is less developed or less sought after by users.
</details>
Figure 5: Distribution of data entries across different task categories. The dataset primarily consists of Math (29.3%), Coding (24.3%), and Information Seeking (22.2%) tasks, followed by Reasoning (10.4%), Planning (2.3%), Creative Writing (2.2%), and other combined categories (9.3%).
### A.4 Difficulty Distribution
<details>
<summary>extracted/6308700/difficulty_distribution.png Details</summary>

### Visual Description
## Pie Chart: Distribution of Difficulty
### Overview
This image is a pie chart titled "Distribution of Difficulty." It visually represents the proportional distribution of items (likely problems, tasks, or questions) across five distinct difficulty levels. The chart includes a legend in the top-left corner and direct labels on each slice showing the percentage of the total.
### Components/Axes
* **Chart Title:** "Distribution of Difficulty" (located at the bottom center of the image).
* **Legend:** Positioned in the top-left corner. It is titled "Difficulty" and lists five categories with associated colors and absolute counts.
* **Data Slices:** Five colored slices, each labeled with a category name and its percentage of the whole. The slices are arranged clockwise starting from the top.
* **Language:** All text in the image is in English.
### Detailed Analysis
The chart breaks down a total dataset of 1,400,000 items (sum of counts from legend) into five difficulty categories. The data is as follows, listed clockwise from the top of the pie:
1. **Very easy**
* **Color:** Light purple/lavender (matches legend).
* **Position:** Top slice, starting at the 12 o'clock position.
* **Percentage:** 4.7%
* **Absolute Count (from legend):** 65,332
2. **Very hard**
* **Color:** Light orange/peach (matches legend).
* **Position:** Immediately clockwise from "Very easy."
* **Percentage:** 6.5%
* **Absolute Count (from legend):** 91,576
3. **Easy**
* **Color:** Light green (matches legend).
* **Position:** Immediately clockwise from "Very hard."
* **Percentage:** 11.2%
* **Absolute Count (from legend):** 157,053
4. **Hard**
* **Color:** Medium blue (matches legend).
* **Position:** Immediately clockwise from "Easy," occupying the lower-right quadrant.
* **Percentage:** 25.7%
* **Absolute Count (from legend):** 360,137
5. **Medium**
* **Color:** Light red/salmon (matches legend).
* **Position:** The largest slice, occupying the entire left half and part of the lower-right quadrant, from approximately the 6 o'clock position back to 12 o'clock.
* **Percentage:** 51.8%
* **Absolute Count (from legend):** 725,892
**Data Verification:** The percentages sum to 99.9% (4.7 + 6.5 + 11.2 + 25.7 + 51.8), which is within rounding error of 100%. The absolute counts sum to 1,400,000. The color of each slice corresponds exactly to its entry in the legend.
### Key Observations
* **Dominant Category:** The "Medium" difficulty level is the overwhelming majority, comprising more than half (51.8%) of all items.
* **Secondary Category:** "Hard" is the second-largest segment at 25.7%, making up about a quarter of the total.
* **Skew Towards Higher Difficulty:** Combining "Hard" and "Very hard" accounts for 32.2% of items. In contrast, combining "Easy" and "Very easy" accounts for only 15.9%. The distribution is skewed towards medium and harder difficulties.
* **Smallest Segment:** "Very easy" is the smallest category at 4.7%.
### Interpretation
This chart demonstrates a clear design or natural clustering of difficulty within the analyzed dataset. The central tendency is strongly towards "Medium" difficulty, suggesting this is the standard or most common level. The significant proportion of "Hard" items indicates a substantial subset designed to challenge users or separate performance levels. The relatively small percentage of "Easy" and "Very easy" items implies they may serve as introductory or baseline tasks rather than the core focus. The distribution is not symmetrical; there is a heavier tail on the harder side of the spectrum. This pattern is common in educational testing, skill assessments, or puzzle sets where the goal is to measure a range of abilities with a focus on differentiating at the intermediate to advanced levels.
</details>
Figure 6: Difficulty distribution of the data entries. Most of the dataset entries are classified as Medium (51.8%) or Hard (25.7%). A smaller proportion falls into the Easy (11.2%), Very Hard (6.5%), and Very Easy (4.7%) categories.
## Appendix B Prompt
### B.1 Difficulty Rating
To rate difficulty, we use the prompt shown in Table 3.
````
# Instruction
You first need to analyze the given user intent and then label the difficulty level of the user query based on the content of the user query.

## User Query
```
{input}
```

## Evaluation Criteria
Given the user query, you first need to analyze the user intent and the knowledge needed to solve the task in the user query. Then, rate the difficulty level of the user query as `very easy`, `easy`, `medium`, `hard`, or `very hard`.
Classify the difficulty of the query into one of five levels:
- very easy: Basic, straightforward questions requiring minimal reasoning.
- easy: Simple factual queries with slightly more depth.
- medium: Requires moderate reasoning, explanation, or multi-step processing.
- hard: Involves advanced concepts, deeper analysis, or multiple interrelated steps.
- very hard: Expert-level queries demanding significant domain expertise, synthesis, or novel problem-solving.

## Output Format
Just output the json format answer, don't provide additional explanation.
Now, please output the difficulty level below in a json format by filling in the placeholders in []:
```json
{
  "difficulty": "[very easy / easy / medium / hard / very hard]"
}
```
````
Table 3: Difficulty rating prompt.
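Since the prompt asks the model to reply with a (possibly fenced) JSON object, a small parser is needed to recover the label. Below is a minimal sketch; the tolerance for a ```json fence and the fallback to `None` on malformed replies are our assumptions, not a procedure stated in the paper.

```python
import json
import re

DIFFICULTY_LEVELS = {"very easy", "easy", "medium", "hard", "very hard"}

def parse_difficulty(raw_response):
    """Extract the difficulty label from a model reply to the Table 3 prompt.

    Tolerates a ```json fence around the payload; returns None if the reply
    is malformed or the label is not one of the five allowed levels.
    """
    match = re.search(r"\{.*?\}", raw_response, re.DOTALL)
    if not match:
        return None
    try:
        label = json.loads(match.group(0)).get("difficulty", "").strip().lower()
    except json.JSONDecodeError:
        return None
    return label if label in DIFFICULTY_LEVELS else None

reply = '```json\n{ "difficulty": "medium" }\n```'
print(parse_difficulty(reply))  # medium
```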
### B.2 Category Classification
To label categories, we use the prompt shown in Table 4.
````
# Instruction
Please label the task tags for the user query.

## User Query
```
{input}
```

## Tagging the user input
Please label the task tags for the user query. You will need to analyze the user query and select the most relevant task tag from the list below.

all_task_tags = [
    "Logic",          # Queries involving logical puzzles, riddles, or formal deductive reasoning.
    "Information",    # Users ask for specific information or facts about various topics.
    "Editing",        # Involves editing, rephrasing, proofreading, or other tasks related to the composition of general written content.
    "Coding",         # Users seek help with writing, reviewing, or fixing code in programming.
    "Math",           # Queries related to mathematical concepts, problems, and calculations.
    "Brainstorming",  # Involves generating ideas, creative thinking, exploring possibilities, or assisting with decision-making processes.
    "Others"          # Any queries that do not fit into the above categories or are of a miscellaneous nature.
]

## Output Format:
Note that you can only select a single primary tag. Other applicable tags can be added to the list of other tags.
Now, please output your tags below in a json format by filling in the placeholders in <...>:
```
{{
  "primary_tag": "<primary tag>",
  "other_tags": ["<tag 1>", "<tag 2>", ...]
}}
```
````
Table 4: Category classification prompt.
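Once each entry carries a `primary_tag`, aggregating over the corpus yields a category breakdown like Figure 5. The sketch below tallies primary tags and folds anything outside the prompt's tag list into `Others`; the entry format is a hypothetical stand-in for the labeled data.

```python
from collections import Counter

# Tag list taken verbatim from the Table 4 prompt.
ALL_TASK_TAGS = {"Logic", "Information", "Editing", "Coding",
                 "Math", "Brainstorming", "Others"}

def tally_primary_tags(labeled_entries):
    """Count primary tags across labeled entries.

    Tags outside ALL_TASK_TAGS (or missing) are folded into "Others",
    mirroring the prompt's catch-all category.
    """
    tags = (e.get("primary_tag") for e in labeled_entries)
    return Counter(t if t in ALL_TASK_TAGS else "Others" for t in tags)

sample = [{"primary_tag": "Math"}, {"primary_tag": "Coding"},
          {"primary_tag": "Math"}, {"primary_tag": "Poetry"}]
print(tally_primary_tags(sample))
```

Dividing each count by the total number of entries gives the percentages reported in the figure.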
### B.3 Correctness Rating
To rate correctness, we use the prompt shown in Table 5.
````
# Instruction
You are an evaluation expert tasked with assessing the correctness of answers provided by a relatively small-sized Language Model (such as a 7B model) based on three inputs. Assign a score from 1 to 5 according to the following criteria:
- Score 5: Completely correct, fully matches the reference answer or accurately addresses the query when the reference answer is not provided.
- Score 4: Mostly correct, minor deviations or insignificant errors that do not affect overall meaning.
- Score 3: Partially correct, includes key information but contains noticeable errors or omissions.
- Score 2: Minimally correct, significant errors or major omissions, answer barely meets requirements.
- Score 1: Completely incorrect, fails to address the question or content severely mismatches the query.

### Please score based on the following inputs:
- **Query**: {input_query}
- **Reference Answer:** (May be empty) {reference_answer}
- **LLM Answer**: {llm_answer}

### Provide your score strictly following the output format below:
```
{{
  "correctness": "<correctness score>",
}}
```
**Justification** (briefly explain your scoring decision):
````
Table 5: Correctness Rating prompt.
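Because the Table 5 prompt asks for a JSON object followed by a free-text justification, the score has to be pulled out of a mixed reply. A minimal parsing sketch, under the assumption that the judge model emits valid JSON for the first `{...}` block and that out-of-range or unparseable scores should be discarded:

```python
import json
import re

def parse_correctness(raw_response):
    """Pull the 1-5 correctness score out of a judge-model reply.

    Grabs the first {...} block (the justification text follows it) and
    validates the score range; returns None on any parsing failure.
    """
    match = re.search(r"\{.*?\}", raw_response, re.DOTALL)
    if not match:
        return None
    try:
        score = int(str(json.loads(match.group(0)).get("correctness", "")).strip())
    except (json.JSONDecodeError, ValueError):
        return None
    return score if 1 <= score <= 5 else None

reply = '{ "correctness": "4" }\n**Justification**: minor wording deviations only.'
print(parse_correctness(reply))  # 4
```

Entries whose responses fail this check could then be filtered out or re-scored, though the exact filtering policy is not specified in the paper.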