# 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training
**Authors**: Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li
> a-m-team
Abstract
AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from numerous open-source datasets, semantically deduplicated, and meticulously cleaned to eliminate test-set contamination. All responses in the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification: mathematical problems are validated against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, trained with only simple Supervised Fine-Tuning (SFT) on this data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME 2024, MATH-500, GPQA-Diamond, and LiveCodeBench. The AM-Distill-Qwen-72B model likewise surpassed the DeepSeek-R1-Distill-Llama-70B model on all four benchmarks. We are releasing these 1.4 million problems and their corresponding responses to the research community to foster the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
[Figure: Grouped bar chart of Pass@1 accuracy on AIME 2024, MATH-500, GPQA Diamond, and LiveCodeBench for AM-Distill-Qwen-32B, DeepSeek-R1-Distill-Qwen-32B, AM-Distill-Qwen-72B, and DeepSeek-R1-Distill-Llama-70B; the values correspond to Table 2.]
Figure 1: Overall performance of the AM models.
1 Introduction
OpenAI’s o1 series models (OpenAI, 2024) pioneered inference-time scaling by extending the length of the chain-of-thought reasoning process (Wei et al., 2023; Snell et al., 2024; Wu et al., 2025). This approach has yielded remarkable improvements across various reasoning tasks, including mathematics, coding, and scientific reasoning (Lightman et al., 2023; Hwang et al., 2024).
Subsequently, the introduction of DeepSeek-R1 (DeepSeek-AI et al., 2025) significantly propelled the open-source community forward, enabling deeper insights into inference-time scaling. DeepSeek also introduced the DeepSeek-R1-Distill series of models, which relied solely on distilled data with reasoning chains for Supervised Fine-Tuning (SFT), yet achieved outstanding results on diverse benchmarks. Unlike DeepSeek-R1-Zero, the training pipeline of DeepSeek-R1 used 800,000 carefully selected data entries for SFT, a crucial factor contributing to DeepSeek-R1’s superiority over DeepSeek-R1-Zero. This demonstrates that SFT on carefully selected data can effectively improve model performance, highlighting both the importance of data selection in SFT and the positive impact of well-executed SFT on a model’s reasoning ability.
Building upon prior work, the open-source community has recently introduced numerous datasets distilled from reasoning models such as DeepSeek-R1 (OpenThoughts, 2025; Xu et al., 2025). However, these datasets are generally smaller than the 800,000 samples DeepSeek employed for its distilled series of models, and to date few open-source initiatives have matched the performance of the DeepSeek-R1-Distill series on the corresponding base models. We have therefore constructed the AM-DeepSeek-R1-Distilled dataset, which comprises 1.4 million high-quality data entries with reasoning chains. Of these, 0.5 million entries are sourced entirely from open-source datasets, while 0.9 million entries are distilled by the AM team from DeepSeek-R1, denoted by “am-0309” in the response sources. The dataset offers significant advantages in scale, quality, and diversity, and through our meticulous data processing and stringent verification procedures it can provide robust support for long chain-of-thought (CoT) training of large language models.
In terms of data collection, we comprehensively gathered diverse types of reasoning problems from numerous open-source datasets and implemented semantic deduplication and cleaning to guarantee the high quality and purity of the data (Li et al., 2023; Tirumala et al., 2023). Simultaneously, we conducted strict verification of all responses, including validating mathematical problems through answer checking, verifying code problems via test cases, and evaluating other tasks using a reward model, thereby ensuring the accuracy and reliability of the data.
Regarding data scale, the AM dataset, with its 1.4 million entries, substantially exceeds other recent open-source datasets in size. Of these entries, 500,000 are fully derived from open-source datasets and span a wide range of knowledge domains and problem types. For the remaining 900,000, the instructions are sourced from open-source datasets, while the responses are distilled by the AM team from DeepSeek-R1. All of these data have been processed through our data pipeline and are of high quality.
In terms of diversity, our dataset not only encompasses problems from common domains such as math, code, and science but also includes cross-domain and comprehensive reasoning tasks, which comprehensively exercise the reasoning and generalization abilities of models (Song et al., 2024). Moreover, we meticulously processed the instruction part: we used a large language model to score all instructions by difficulty and category and performed strict semantic deduplication according to these labels, ensuring the high quality and diversity of the instructions (Xu et al., 2024).
In addition, our dataset adopts a unified format, and each data entry is annotated in detail, including user-assistant interaction information, reasoning processes, final answers, reference answers, test cases, and other metadata. This standardized format renders the dataset easy to use and understand, facilitating researchers in conducting data processing and model training.
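To make this format concrete, a single entry might resemble the sketch below. The field names are illustrative placeholders rather than the exact schema of the released files; they should be checked against the dataset card on Hugging Face.

```python
# Hypothetical example of one dataset entry; field names are illustrative only
# and may differ from the released AM-DeepSeek-R1-Distilled format.
example_entry = {
    "messages": [
        {"role": "user", "content": "Find all real x such that x^2 - 5x + 6 = 0."},
        {
            "role": "assistant",
            "reasoning_content": "Factor the quadratic: (x - 2)(x - 3) = 0 ...",
            "answer_content": "x = 2 or x = 3",
        },
    ],
    "reference_answer": "x = 2, 3",   # used for ground-truth verification
    "test_cases": None,               # populated for code problems
    "metadata": {
        "source": "am-0309",          # response source
        "category": "Math",
        "difficulty": "easy",
        "language": "en",
        "num_tokens": 412,
    },
}
```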
We believe that the release of the AM-DeepSeek-R1-Distilled dataset will offer crucial resource support for research on reasoning-oriented large language models and is anticipated to drive further development and innovation in this field. We look forward to the research community leveraging this dataset to achieve more research breakthroughs and jointly promote progress toward AGI.
2 Approach
The core criteria for our data selection include three aspects: diversity, complexity, and accuracy. We constructed our data pipeline around improving these three core indicators. As demonstrated in Figure 2, the entire pipeline can be divided into (1) Raw Data Collection, (2) Distilling, and (3) Rejection Sampling. The subsequent sections elaborate on these components in detail.
[Figure: Diagram of the data pipeline with three stages: Raw Data Collection (domains, collection rules, reference answers/test cases, source datasets), Distilling (embedding-based deduplication and similarity search with FAISS, selection rules by difficulty/quality/language/category, DeepSeek-R1 distillation), and Rejection Sampling (rule-based checks on n-grams/length/format, model-based filtering with a reward model and LLM-as-a-judge, and verify-based filtering with a code sandbox and math verification).]
Figure 2: Construction process of the data pipeline.
2.1 Raw Data
2.1.1 Data Sources
We divided the data selection into four major categories: math, code, scienceQA, and general chat, and classified high-quality open-source datasets into these categories. For math, code, and scienceQA, we prioritized datasets with reference answers or test cases, such as NuminaMath (LI et al., 2024), MetaMathQA (Yu et al., 2023), natural_reasoning (Yuan et al., 2025), OpenCoder (Huang et al., 2024), Omni-MATH (Gao et al., 2024), PRIME (Yuan et al., 2024), CodeIO (Li et al., 2025), and MATH-lighteval (Hendrycks et al., 2021). We also selected datasets with reasoning chains generated by DeepSeek-R1 from the open-source community, such as OpenThoughts (OpenThoughts, 2025), OpenR1Math (Open-R1, 2025), KodCode (Xu et al., 2025), Bespoke-Stratos-17k (Bespoke, 2025), GeneralThought (Reasoning, 2025), Dolphin-R1 (cognitivecomputations, 2025), data_ablation_full59K (Muennighoff et al., 2025), s1K (Muennighoff et al., 2025), and LIMO (Ye et al., 2025). Additionally, to enhance the model’s chatting ability, we obtained chat data from general SFT datasets such as InfinityInstruct (BAAI, 2024) and Orca (Lian et al., 2023). The distribution of reference answers and test cases can be found in Appendix A.2.
2.1.2 Categories
The initial four categories alone were insufficient, especially for general chat data. Thus, we designed some more detailed categories, such as creative writing and instruction following. To facilitate data matching and enhance the diversity of the AM dataset, we used the Qwen2.5-7B-Instruct model (Qwen, 2024) to label the data. The details of the categories can be found in the Appendix A.3 and Appendix B.2.
2.1.3 Difficulty
For the training of long-CoT models, more challenging data can effectively extend the length of the reasoning chains generated by the model and improve its reasoning ability. We therefore used a large language model to score the difficulty of all instructions, then screened the data and downsampled easy- and medium-difficulty examples. This ensures that the AM dataset emphasizes more challenging data while maintaining its diversity. The difficulty distribution of the data can be found in Appendix A.4, and the difficulty rating prompt in Appendix B.1.
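As a rough illustration of this downsampling step, the sketch below keeps all hard examples and randomly drops a fraction of easier ones. The keep ratios and the `difficulty` field are assumptions for illustration, not the exact values or schema used for the AM dataset.

```python
import random

# Down-sample easy and medium items so the final mix emphasizes harder problems.
# The keep ratios below are illustrative; the paper does not report exact values.
KEEP_RATIO = {"very easy": 0.2, "easy": 0.3, "medium": 0.7, "hard": 1.0, "very hard": 1.0}

def downsample_by_difficulty(entries, seed=0):
    rng = random.Random(seed)
    kept = []
    for entry in entries:
        ratio = KEEP_RATIO.get(entry["difficulty"], 1.0)
        if rng.random() < ratio:
            kept.append(entry)
    return kept
```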
2.1.4 Deduplication
We performed strict semantic deduplication on the collected data. We calculated an embedding for each data entry and computed pairwise text similarity from these embeddings to obtain the semantic similarity between entries. For groups of entries with high semantic similarity, we applied priority strategies and ultimately retained only one representative entry. This process ensures the uniqueness and diversity of the dataset and prevents the negative impact of near-duplicate data during model training.
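A minimal sketch of this deduplication step is given below, assuming a BGE-style sentence encoder and cosine-similarity search with FAISS (as suggested by Figure 2). The similarity threshold and the keep-earliest priority rule are illustrative, not the exact settings used to build the dataset.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_dedup(texts, threshold=0.92):
    # Encode all entries and L2-normalize so inner product equals cosine similarity.
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    emb = model.encode(texts, normalize_embeddings=True).astype("float32")

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    scores, neighbors = index.search(emb, 10)  # inspect the 10 nearest neighbors of each entry

    removed = set()
    for i in range(len(texts)):
        if i in removed:
            continue
        for score, j in zip(scores[i], neighbors[i]):
            # Drop later entries that are near-duplicates of an earlier, kept entry.
            if j != i and j > i and score >= threshold:
                removed.add(int(j))
    return [t for k, t in enumerate(texts) if k not in removed]
```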
2.2 Distilled Data
We obtained responses to prompts in two ways: filtering existing responses and generating new ones. For prompts with existing responses, we retained the original response if it passed reference-answer or test-case verification. For data without reasoning chains, we generated new responses using DeepSeek-R1.
2.2.1 Ground Truth Verification
For problems with available reference answers, we conducted verification through a combination of rule-based methods and a large language model. We first applied math-verify (Kydlíček and Gandenberger, 2025) to assess whether the response matched the reference answer in format and calculation results. We then used Qwen2.5-7B-Instruct to further evaluate the correctness and consistency of these responses; the prompt can be found in Appendix B.3. For code-related problems with test cases, we verified responses within a sandbox environment. Data that failed verification was removed to ensure the accuracy and reliability of the dataset.
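A minimal sketch of the rule-based part of this check is shown below, assuming the `parse`/`verify` interface of the math-verify library; the fallback to an LLM-based consistency judgment (Appendix B.3) is only indicated, not implemented.

```python
from math_verify import parse, verify

def check_math_answer(model_answer: str, reference_answer: str) -> bool:
    """Rule-based check of a model answer against a reference answer."""
    try:
        gold = parse(reference_answer)
        pred = parse(model_answer)
        return verify(gold, pred)
    except Exception:
        # If rule-based parsing fails, defer to an LLM-based consistency check
        # (see the correctness rating prompt in Appendix B.3).
        return False
```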
2.2.2 Reward
We used two methods to evaluate the answer_content portion of the model output: Decision-Tree-Reward-Llama-3.1-8B (Rlhflow, 2025) as a reward model, and Qwen2.5-7B-Instruct for LLM-based scoring. Based on the score distribution, we set a threshold and removed data with lower scores. The reward model evaluates responses across five dimensions (correctness, helpfulness, coherence, complexity, and verbosity) to ensure the selected responses improve the overall quality of the dataset.
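As an illustration of the thresholding step, the sketch below assumes that per-response scores on these five dimensions have already been produced (for example by the reward model and/or the LLM judge); the threshold value is a placeholder rather than the one derived from the actual score distribution.

```python
# Filtering responses by pre-computed reward scores; the dimensions follow the
# paper, but the threshold is illustrative.
DIMENSIONS = ["correctness", "helpfulness", "coherence", "complexity", "verbosity"]

def passes_reward_filter(scores: dict, min_correctness: float = 3.0) -> bool:
    # Require all five dimensions to be present and correctness above a threshold.
    if not all(d in scores for d in DIMENSIONS):
        return False
    return scores["correctness"] >= min_correctness

def filter_by_reward(entries: list) -> list:
    return [e for e in entries if passes_reward_filter(e.get("reward_scores", {}))]
```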
2.2.3 Rule Verification
We established verification rules such as format-template conformity and n-gram repetition checks. For format verification, we ensured that each response adhered strictly to the specified format, i.e., clearly marking <think>reasoning process here</think><answer>final answer here</answer> as required by the prompt. For n-gram repetition verification, we checked responses for excessive consecutive word repetition. Responses failing these rule-based checks were excluded to guarantee dataset quality and consistency.
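The sketch below illustrates both checks under simple assumptions: a regular expression for the <think>/<answer> template and a repetition test on word n-grams, with illustrative thresholds rather than the exact values used in the pipeline.

```python
import re
from collections import Counter

# Format check: the response must consist of exactly one <think> block
# followed by one <answer> block.
FORMAT_RE = re.compile(r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL)

def passes_format(response: str) -> bool:
    return FORMAT_RE.match(response) is not None

def has_ngram_repetition(response: str, n: int = 8, max_repeats: int = 4) -> bool:
    # Flag responses in which any word n-gram occurs more than max_repeats times.
    words = response.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return any(c > max_repeats for c in counts.values())

def passes_rule_verification(response: str) -> bool:
    return passes_format(response) and not has_ngram_repetition(response)
```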
2.2.4 Labels
We additionally annotated the data with supplementary information, such as length and language. For length annotation, we calculated the number of words or tokens per data entry, providing insights into the complexity and scale of the dataset. The length distribution of the data can be found in Appendix A.1. For language annotation, we primarily annotated entries as Chinese, English, or other languages. These labels facilitate effective data screening and analysis.
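A possible implementation of these annotations is sketched below, assuming a Hugging Face tokenizer (here the Qwen2.5-7B-Instruct tokenizer) for token counting and a simple character-based heuristic for language labeling; the paper does not specify the exact method.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def annotate(entry_text: str) -> dict:
    """Attach token-length and coarse language labels to a data entry."""
    num_tokens = len(tokenizer.encode(entry_text))
    # Crude heuristic: label as Chinese if CJK characters dominate,
    # English if pure ASCII, otherwise "other".
    num_zh = sum("\u4e00" <= ch <= "\u9fff" for ch in entry_text)
    if num_zh > 0.3 * max(len(entry_text), 1):
        language = "zh"
    elif entry_text.isascii():
        language = "en"
    else:
        language = "other"
    return {"num_tokens": num_tokens, "language": language}
```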
3 Experiment
3.1 Evaluation
3.1.1 Benchmark
We evaluated the reasoning ability of the model using LiveCodeBench (Jain et al., 2024) (2024-08–2025-01), GPQA-Diamond (Rein et al., 2023), AIME 2024 (MAA, 2024), and MATH-500 (Lightman et al., 2023). These benchmarks span multiple fields and difficulty levels, enabling a thorough assessment of the model’s reasoning performance across diverse scenarios.
3.1.2 Evaluation Methodology
We set the maximum generation length to 32,768 tokens. For benchmarks requiring sampling, the temperature was uniformly set to 0.6, and the top-p value to 0.95. For AIME 2024 (MAA, 2024), we generated 16 samples per query to estimate pass@1. For LiveCodeBench (Jain et al., 2024), MATH-500 (Lightman et al., 2023) and GPQA Diamond (Rein et al., 2023), we generated 4 responses per query, also to estimate pass@1. The evaluation metric across these benchmarks was the globally averaged accuracy.
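Under this protocol, the pass@1 estimate for a benchmark is simply the per-query average correctness of the k sampled responses (k = 16 for AIME 2024, k = 4 for the others), averaged over all queries, as in the sketch below.

```python
def pass_at_1(per_query_correct: list[list[bool]]) -> float:
    """Estimate pass@1: average per-sample correctness per query, then average over queries."""
    per_query = [sum(c) / len(c) for c in per_query_correct if c]
    return sum(per_query) / len(per_query)

# Example: two queries with 4 sampled responses each.
print(pass_at_1([[True, True, False, True], [False, False, True, False]]))  # 0.5
```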
3.2 Main Result
We performed SFT on Qwen2.5-32B to produce a model named AM-Distill-Qwen-32B; the system prompt used is shown in Table 1. Compared with DeepSeek-R1-Distill-Qwen-32B, our model achieved improvements on all four benchmarks, as shown in Table 2. Specifically, on AIME 2024 accuracy increased from 72.6% to 72.7%; on MATH-500, from 94.3% to 96.2%; on GPQA-Diamond, from 62.1% to 64.3%; and on LiveCodeBench, from 57.2% to 59.1%. Overall, the average accuracy improved from 71.6% to 73.1%.
| You are a helpful assistant. To answer the user’s question, you first think about the reasoning process and then provide the user with the answer. The reasoning process and answer are enclosed within <think> and <answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. |
| --- |
Table 1: System prompt in training process.
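For illustration, a training example could be assembled from this system prompt as follows; the chat-message layout is an assumption about the training setup, not code released with the paper.

```python
# Hypothetical construction of an SFT training example using the Table 1 system prompt.
SYSTEM_PROMPT = (
    "You are a helpful assistant. To answer the user's question, you first think about "
    "the reasoning process and then provide the user with the answer. The reasoning "
    "process and answer are enclosed within <think> and <answer> tags, respectively, "
    "i.e., <think> reasoning process here </think> <answer> answer here </answer>."
)

def build_sft_example(question: str, reasoning: str, answer: str) -> list[dict]:
    target = f"<think> {reasoning} </think> <answer> {answer} </answer>"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
        {"role": "assistant", "content": target},
    ]
```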
We further performed training based on the Qwen2.5-72B model to obtain AM-Distill-Qwen-72B. Compared with DeepSeek-R1-Distill-Llama-70B, our 72B model achieved notable improvements across all evaluation benchmarks. Specifically, accuracy on AIME 2024 increased markedly from 70.0% to 76.5%; MATH-500 improved from 94.5% to 97.0%; GPQA-Diamond rose from 65.2% to 65.9%; and LiveCodeBench increased from 57.5% to 59.7%.
Experimental results demonstrate that models trained on our constructed AM-DeepSeek-R1-Distilled-1.4M dataset exhibit substantial enhancements in reasoning ability.
| Model | AIME 2024 | MATH-500 | GPQA-Diamond | LiveCodeBench | Average |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 94.3 | 62.1 | 57.2 | 71.6 |
| AM-Distill-Qwen-32B | 72.7 | 96.2 | 64.3 | 59.1 | 73.1 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 94.5 | 65.2 | 57.5 | 71.8 |
| AM-Distill-Qwen-72B | 76.5 | 97.0 | 65.9 | 59.7 | 74.8 |
Table 2: Model performance (Pass@1 accuracy, %).
4 Limitation
Since the responses in this dataset are generated by large language models and cannot be exhaustively verified, deficiencies in factual accuracy and other aspects may remain, so careful examination is necessary when using this dataset. The dataset is mainly intended to enhance the reasoning capabilities of large language models (LLMs); we have not thoroughly filtered harmful instructions or responses. We require developers to use the open-sourced code, data, models, and any other artifacts generated through this project for research purposes only; commercial use and other potentially harmful use cases are not permitted. In addition, due to nested relationships among some data sources, the recorded data sources may be inaccurate.
5 Conclusion
In this study, we have constructed and released the AM-DeepSeek-R1-Distilled dataset, a large-scale general reasoning dataset with 1.4 million entries and rich thinking traces, created through meticulous selection, semantic deduplication, and strict cleaning of a large number of open-source datasets.
Furthermore, the AM-Distill-Qwen-32B model, developed by performing SFT on Qwen2.5-32B with our constructed dataset, exhibits remarkable performance enhancements. This demonstrates that our dataset is a significant asset for training the reasoning capabilities of models. We are optimistic that our efforts will play a substantial and catalytic role in research on reasoning-oriented Large Language Models, propelling forward development in this field.
References
- BAAI (2024) BAAI. Infinity instruct. 2024. URL https://huggingface.co/datasets/BAAI/Infinity-Instruct.
- Bespoke (2025) Bespoke. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation, 2025. Accessed: 2025-01-22.
- cognitivecomputations (2025) cognitivecomputations. dolphin-r1. https://huggingface.co/datasets/cognitivecomputations/dolphin-r1, 2025.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Gao et al. (2024) Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07985.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Huang et al. (2024) Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models. 2024. URL https://arxiv.org/pdf/2411.04905.
- Hwang et al. (2024) Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore: Enhancing mathematical reasoning in language models with fine-grained rewards, 2024. URL https://arxiv.org/abs/2404.10346.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Kydlíček and Gandenberger (2025) Hynek Kydlíček and Greg Gandenberger. Math-verify, 2025. URL https://github.com/huggingface/math-verify.
- LI et al. (2024) Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT, 2024.
- Li et al. (2025) Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. Codei/o: Condensing reasoning patterns via code input-output prediction. arXiv preprint arXiv:2502.07316, 2025.
- Li et al. (2023) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. Synthetic data generation with large language models for text classification: Potential and limitations, 2023. URL https://arxiv.org/abs/2310.07849.
- Lian et al. (2023) Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/datasets/Open-Orca/OpenOrca, 2023.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.
- MAA (2024) MAA. American invitational mathematics examination - AIME. https://maa.org/math-competitions/american-invitational-mathematics-examination-aime, February 2024. Accessed February 2024.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393.
- Open-R1 (2025) Open-R1. Openr1-math-220k. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025.
- OpenAI (2024) OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
- OpenThoughts (2025) Team OpenThoughts. Open Thoughts. https://open-thoughts.ai, January 2025.
- Qwen (2024) Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Reasoning (2025) General Reasoning. Generalthought-feb25. https://huggingface.co/datasets/GeneralReasoning/GeneralThought-Feb25, 02 2025.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
- Rlhflow (2025) Rlhflow. Decision tree reward model. https://rlhflow.github.io/posts/2025-01-22-decision-tree-reward-model/, 2025. Accessed: 2025-03-11.
- Snell et al. (2024) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
- Song et al. (2024) Feifan Song, Bowen Yu, Hao Lang, Haiyang Yu, Fei Huang, Houfeng Wang, and Yongbin Li. Scaling data diversity for fine-tuning language models in human alignment, 2024. URL https://arxiv.org/abs/2403.11124.
- Tirumala et al. (2023) Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving llm pretraining via document de-duplication and diversification, 2023. URL https://arxiv.org/abs/2308.12284.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
- Wu et al. (2025) Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2025. URL https://arxiv.org/abs/2408.00724.
- Xu et al. (2024) Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing, 2024. URL https://arxiv.org/abs/2406.08464.
- Xu et al. (2025) Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025. URL https://arxiv.org/abs/2503.02951.
- Ye et al. (2025) Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yuan et al. (2024) Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.
- Yuan et al. (2025) Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, and Xian Li. Naturalreasoning: Reasoning in the wild with 2.8m challenging questions, 2025. URL https://arxiv.org/abs/2502.13124.
Appendix A Data Analysis
A.1 Length Distribution
[Figure: Histogram of token counts per data entry; x-axis: number of tokens (2048 to 16384), y-axis: count. The distribution is heavily right-skewed, peaking around 2048 tokens and decaying steadily toward longer entries.]
Figure 3: Token length distribution of data entries in the dataset. Most data entries contain fewer than 4096 tokens, with the highest concentration around approximately 2048 tokens. The distribution gradually decreases as the token count increases, indicating fewer samples with longer contexts.
A.2 Reference Distribution
[Figure: Pie chart of reference answer and test case availability: have reference answer (543,935; 38.9%), have test case (306,818; 21.9%), neither (549,238; 39.2%).]
Figure 4: Distribution of reference answers and test cases in the dataset. Among the entries, 38.9% have reference answers, 21.9% include test cases, and 39.2% have neither reference answers nor test cases.
A.3 Category Distribution
[Figure: Pie chart of task category distribution: Math (410,756; 29.3%), Coding (339,907; 24.3%), Information Seeking (311,427; 22.2%), Reasoning (145,329; 10.4%), Planning (31,757; 2.3%), Creative Writing (30,144; 2.2%), Others combined (130,671; 9.3%).]
Figure 5: Distribution of data entries across different task categories. The dataset primarily consists of Math (29.3%), Coding (24.3%), and Information Seeking (22.2%) tasks, followed by Reasoning (10.4%), Planning (2.3%), Creative Writing (2.2%), and other combined categories (9.3%).
A.4 Difficulty Distribution
[Figure: Pie chart of difficulty distribution: Medium (725,892; 51.8%), Hard (360,137; 25.7%), Easy (157,053; 11.2%), Very Hard (91,576; 6.5%), Very Easy (65,332; 4.7%).]
Figure 6: Difficulty distribution of the data entries. Most of the dataset entries are classified as Medium (51.8%) or Hard (25.7%). A smaller proportion falls into the Easy (11.2%), Very Hard (6.5%), and Very Easy (4.7%) categories.
Appendix B Prompt
B.1 Difficulty Rating
To grade difficulty, we use the prompt shown in Table 3.
| # Instruction You first need to analyze the given user intent and then label the difficulty level of the user query based on the content of the user query. ## User Query ‘‘‘ {input} ‘‘‘ ## Evaluation Criteria Given the user query, you first need to analyze the user intent and the knowledge needed to solve the task in the user query. Then, rate the difficulty level of the user query as ‘very easy‘, ‘easy‘, ‘medium‘, ‘hard‘, or ‘very hard‘. Classify the difficulty of the query into one of five levels: - very easy: Basic, straightforward questions requiring minimal reasoning. - easy: Simple factual queries with slightly more depth. - medium: Requires moderate reasoning, explanation, or multi-step processing. - hard: Involves advanced concepts, deeper analysis, or multiple interrelated steps. - very hard: Expert-level queries demanding significant domain expertise, synthesis, or novel problem-solving. ## Output Format Just output the json format answer, don't provide additional explanation. Now, please output the difficulty level below in a json format by filling in the placeholders in []: ‘‘‘json { "difficulty": "[very easy / easy / medium / hard / very hard]" } ‘‘‘ |
| --- |
Table 3: Difficulty rating prompt.
B.2 Category Classification
To label categories, we use the prompt shown in Table 4.
| # Instruction Please label the task tags for the user query. ## User Query ‘‘‘ {input} ‘‘‘ ## Tagging the user input Please label the task tags for the user query. You will need to analyze the user query and select the most relevant task tag from the list below. all_task_tags = [ "Logic", # Queries involving logical puzzles, riddles, or formal deductive reasoning. "Information", # Users ask for specific information or facts about various topics. "Editing", # Involves editing, rephrasing, proofreading, or other tasks related to the composition of general written content. "Coding", # Users seek help with writing, reviewing, or fixing code in programming. "Math", # Queries related to mathematical concepts, problems, and calculations. "Brainstorming", # Involves generating ideas, creative thinking, exploring possibilities, or assisting with decision-making processes. "Others" # Any queries that do not fit into the above categories or are of a miscellaneous nature. ] ## Output Format: Note that you can only select a single primary tag. Other applicable tags can be added to the list of other tags. Now, please output your tags below in a json format by filling in the placeholders in <...>: ‘‘‘ {{ "primary_tag": "<primary tag>", "other_tags": ["<tag 1>", "<tag 2>", ...] }} ‘‘‘ |
| --- |
Table 4: Category classification prompt.
B.3 Correctness Rating
To rate correctness, we use the prompt shown in Table 5.
| # Instruction You are an evaluation expert tasked with assessing the correctness of answers provided by a relatively small-sized Language Model (such as a 7B model) based on three inputs. Assign a score from 1 to 5 according to the following criteria: - Score 5: Completely correct, fully matches the reference answer or accurately addresses the query when the reference answer is not provided. - Score 4: Mostly correct, minor deviations or insignificant errors that do not affect overall meaning. - Score 3: Partially correct, includes key information but contains noticeable errors or omissions. - Score 2: Minimally correct, significant errors or major omissions, answer barely meets requirements. - Score 1: Completely incorrect, fails to address the question or content severely mismatches the query. ### Please score based on the following inputs: - **Query**: {input_query} - **Reference Answer:** (May be empty) {reference_answer} - **LLM Answer**: {llm_answer} ### Provide your score strictly following the output format below: ‘‘‘ {{ "correctness": "<correctness score>", }} ‘‘‘ **Justification** (briefly explain your scoring decision): |
| --- |
Table 5: Correctness Rating prompt.