arXiv:2406.13975
# MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
> 1 Chinese University of Hong Kong 2 University of Cambridge 3 University of Edinburgh 4 City University of Hong Kong 5 Tsinghua University 6 University of Texas at Austin 7 University of Hong Kong 8 Nanyang Technological University 9 Massachusetts Institute of Technology
## Abstract
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on step-by-step chain-of-thought reasoning. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective at tracking meaningful progress. To address this, we present a process-based benchmark, MR-Ben, that demands meta-reasoning skills: LLMs are asked to locate and analyze potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited to system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (both open-source and closed-source). For example, while models like the o1 series from OpenAI demonstrate strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies. Our dataset and code are available at https://randolph-zeng.github.io/Mr-Ben.github.io.
> Correspondence to: Zhijiang Guo (zg283@cam.ac.uk) and Jiaya Jia (leojia@cse.cuhk.edu.hk).
## 1 Introduction
Reasoning, the cognitive process of using evidence, arguments, and logic to reach conclusions, is crucial for problem-solving, decision-making, and critical thinking [65, 19]. With the rapid advancement of Large Language Models (LLMs), there is an increasing interest in exploring their reasoning capabilities [30, 57]. Consequently, evaluating reasoning in LLMs reliably becomes paramount. Current evaluation methodologies primarily focus on the final result [16, 28, 22, 60], disregarding the intricacies of the reasoning process. While effective to some extent, such evaluation practices may conceal underlying issues like logical errors or unnecessary steps that compromise the accuracy and efficiency of reasoning [68, 41].
Therefore, it is important to complement outcome-based evaluation with an intrinsic evaluation of the quality of the reasoning process. However, current benchmarks for evaluating LLMs' reasoning capabilities have certain limitations in terms of their scope and size. For instance, PRM800K [38] categorizes each reasoning step as positive, negative, or neutral. Similarly, BIG-Bench Mistake [64] focuses on identifying errors in step-level answers. We follow the same meta-reasoning paradigm as MR-GSM8K [77] and MR-Math [68], which go a step further by providing the error reason for the first negative step in the reasoning chain. However, these benchmarks are limited to a narrower task scope: MR-GSM8K and MR-Math focus solely on mathematical reasoning, while BIG-Bench Mistake mainly assesses logical reasoning. To ensure a comprehensive evaluation of reasoning abilities, it is crucial to identify reasoning errors and assess the LLMs' capacity to elucidate them across wider domains.
To bridge this gap, we construct a comprehensive benchmark MR-Ben comprising 6k questions covering a wide range of subjects, including natural sciences like math, biology, and physics, as well as coding and logic. One unique aspect of MR-Ben is its meta-reasoning paradigm, which involves challenging LLMs to reason about different forms of reasoning. In this paradigm, LLMs take on the role of a teacher, evaluating the reasoning process by assessing correctness, analyzing potential errors, and providing corrections, as depicted in Figure 1.
Our analysis of various LLMs [50, 51, 5, 33, 47] uncovers distinct limitations and previously unidentified weaknesses in their reasoning abilities. While many LLMs are capable of generating correct answers, they often struggle to identify errors within their reasoning processes and explain the underlying rationale. To excel under our meta-reasoning paradigm, models must meticulously scrutinize assumptions, conditions, calculations, and logical steps, even inferring step outcomes counterfactually. These requirements align with the characteristics of "System-2" slow thinking [35, 9], which we believe remains underdeveloped in most of the state-of-the-art models we evaluated.
We suspect that a key reason for this gap lies in current fine-tuning paradigms, which prioritize correct solutions and limit effective exploration of the broader solution space. Echoing this hypothesis, we observed that models like o1-preview [52], which reportedly incorporate effective search and disambiguation techniques across trajectories in the solution space, outperform other models by a large margin. Moreover, we found that leveraging high-quality and diverse synthetic data [1] significantly mitigates this issue, offering a promising path to enhance performance regardless of model size. Additionally, our results indicate that different LLMs excel in distinct reasoning paradigms, challenging the notion that domain-specific enhancements necessarily yield broad cognitive improvements. We hope that MR-Ben will guide researchers in comprehensively evaluating their models' capabilities and foster the development of more robust AI reasoning frameworks.
Our key contributions are summarized as follows:
- We introduce MR-Ben, which includes around 6k questions across a wide range of subjects, from natural sciences to coding and logic, and employs a unique meta-reasoning paradigm.
- We conduct an extensive analysis of various LLMs on MR-Ben, revealing various limitations and previously unidentified weaknesses in their reasoning abilities.
- We offer potential pathways for enhancing the reasoning abilities of LLMs and challenge the assumption that domain-specific enhancements necessarily lead to broad improvements.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparative Error Analysis in Three Reasoning Types
### Overview
The image is a structured diagram comparing three types of reasoning tasks (Arithmetic, Logical, and Algorithmic), each presented in a dedicated vertical column. For each reasoning type, the diagram illustrates a typical problem ("Questions"), a Chain-of-Thought (CoT) solution attempt ("CoT Answers") containing a deliberate error, and a subsequent "Error Analysis" that identifies and corrects the mistake. The diagram uses color-coding (orange for Arithmetic, green for Logical, blue for Algorithmic) and icons to visually segment the information flow.
### Components/Axes
The diagram is organized into a 3x3 grid structure:
* **Columns (Reasoning Types):**
1. **Arithmetic Reasoning** (Left column, orange header and borders)
2. **Logical Reasoning** (Center column, green header and borders)
3. **Algorithmic Reasoning** (Right column, blue header and borders)
* **Rows (Process Stages):**
1. **Questions** (Top row): Presents the initial problem or instruction.
2. **CoT Answers** (Middle row): Shows a step-by-step solution attempt, with an erroneous step highlighted in a red box.
3. **Error Analysis** (Bottom row): Diagnoses the error and provides the correct reasoning or calculation.
### Detailed Analysis
#### **Column 1: Arithmetic Reasoning (Orange)**
* **Questions Box:**
* **Instruction:** "Question: Helium effuses through a pinhole 5.33 times faster than an unknown gas. That gas is most likely:"
* **Options:** "A: CO2", "B: CH4", "C: C5H12", "D: C8H18"
* **CoT Answers Box:**
* **Solution Steps:**
* "Step 1: Recall Graham's law of effusion ..."
* "Step 2: Calculate the molar masses of helium and the unknown gas ..."
* "Step 3: The molar mass of each candidate gas:"
* "- CO2 has a molar mass of 44 g/mol."
* **"Step 4: Calculate the rate ratios for each candidate gas compared to helium ..."** (This step is highlighted in a red box).
* "- Choice A: â(4/44) â 0.316"
* "......"
* **Final Answer:** "Solution: Choice A" (Marked with a red 'X' icon, indicating it is incorrect).
* **Error Analysis Box:**
* **Analysis:**
* "Error Step: Step 4"
* "Error Reason: The ratio should be the square root of unknown gas over helium based on the Graham's law of effusion."
* **Correction:** "Choice A: â(44/4) â 3.32"
#### **Column 2: Logical Reasoning (Green)**
* **Questions Box:**
* **Instruction:** "Question: F, G, J, K, I and M apply for a position. If interview G, interview J. If interview J, interview L. F won't be interviewed unless K does. K won't be interviewed unless M does. Which of the followings might be true?"
* **Options:** "A: Only F, J and M were interviewed.", "B: Only F, J and K were interviewed.", "C: Only G and another candidate were interviewed.", "D: Only G and three other candidates were interviewed."
* **CoT Answers Box:**
* **Solution Steps:**
* "Step 1: If F was interviewed, then K must have been interviewed since F will not be hired unless K is interviewed."
* "......"
* **"Step 3: Since M was interviewed, this means that at least four candidates (F, K, M, and one more) were interviewed."** (This step is highlighted in a red box).
* "......"
* **Final Answer:** "Solution: Choice D" (Marked with a green checkmark icon, indicating it is correct).
* **Error Analysis Box:**
* **Analysis:**
* "Error Step: Step 3"
* "Error Reason: If M is interviewed, it does not imply F and K are interviewed."
* **Correction:** "There could be three possible interview combinations of candidates F, K and M, which are (M), (K, M) or (F, K, M)."
#### **Column 3: Algorithmic Reasoning (Blue)**
* **Questions Box:**
* **Instruction:** A code snippet defining a function `def expectation_number(scores):` with a docstring: `"""Recruitment has begun and a total of n candidates have been selected. Alice and Bob are responsible for reviewing the candidates ...... Calculate the expected value of X. """`
* **CoT Answers Box:**
* **Solution Code:**
* `def expectation_number(scores):`
* ` """Calculates E(X), the expected number of resumes that appear at the same position in A and B's review order."""`
* ` ......`
* ` for i, score in enumerate(scores):`
* ` ......`
* **` # Probability a resume of this score value`** (Highlighted in red box)
* **` # appears at same position i is 1 / k!`** (Highlighted in red box)
* **` p = 1 / factorial(k)`** (Highlighted in red box)
* ` ......`
* **Error Analysis Box:**
* **Analysis:**
* "Error Step: Line 16"
* "Error Reason: The probability of a resume of a score appears at position i should be 1/k instead of 1/k!"
* **Correction:** "p = 1 / k"
### Key Observations
1. **Consistent Error Pattern:** In all three reasoning types, the error occurs in a middle step of the solution process (Step 4, Step 3, Line 16), not in the initial problem understanding or final answer selection.
2. **Nature of Errors:**
* **Arithmetic:** A formula application error (inverting the ratio in Graham's Law).
* **Logical:** A flawed logical deduction (over-constraining the possibilities from a given condition).
* **Algorithmic:** A probability calculation error (using factorial where a simple reciprocal was needed).
3. **Visual Coding:** Red highlighting is used consistently to pinpoint the exact erroneous line or step across all columns. The final answer in the CoT box is marked with an icon (✗ or ✓) to indicate its correctness relative to the flawed reasoning.
4. **Correction Format:** The "Error Analysis" section provides a concise, direct correction to the specific erroneous step, rather than re-solving the entire problem.
### Interpretation
This diagram serves as a pedagogical or analytical tool for understanding common failure modes in different forms of reasoning. It demonstrates that errors are often not random but stem from specific, identifiable misconceptions or oversights in the application of rules, whether mathematical (Graham's Law), logical (conditional constraints), or algorithmic (probability theory).
The structure emphasizes the value of **stepwise verification** and **targeted error analysis**. By isolating the faulty step, the diagram shows how precise correction is more efficient than complete rework. The parallel presentation across reasoning types suggests a universal framework for debugging thought processes: identify the step, diagnose the rule misapplication, and apply the precise correction. This approach is crucial in fields like education, AI training, and software debugging, where understanding the *why* behind an error is as important as finding the right answer.
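The arithmetic correction in the first column can be verified numerically. A minimal sketch (candidate gases and molar masses filled in for illustration):

```python
from math import sqrt

# Molar masses in g/mol (helium plus the four candidate gases from the question).
MOLAR_MASS = {"He": 4.0, "CO2": 44.0, "CH4": 16.0, "C5H12": 72.0, "C8H18": 114.0}

def rate_ratio(gas: str) -> float:
    """Graham's law: rate(He) / rate(gas) = sqrt(M_gas / M_He)."""
    return sqrt(MOLAR_MASS[gas] / MOLAR_MASS["He"])

# The corrected Choice A value from the Error Analysis box: sqrt(44/4) ≈ 3.32.
choice_a = round(rate_ratio("CO2"), 2)

# Helium effuses 5.33x faster, so the unknown gas is the candidate whose ratio
# is closest to 5.33 (C8H18, M = 114 g/mol, gives sqrt(28.5) ≈ 5.34).
best = min(["CO2", "CH4", "C5H12", "C8H18"], key=lambda g: abs(rate_ratio(g) - 5.33))
```

This confirms both the corrected Step 4 value and that the intended final answer is Choice D, not A.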
</details>
Figure 1: Overview of the evaluation paradigm and representative examples in MR-Ben. Each data point encompasses three key elements: a question, a Chain-of-Thought (CoT) answer, and an error analysis. The CoT answer is generated by various LLMs. Human experts annotate the error analyses, which include error steps, reasons behind the error, and subsequent corrections. The three examples shown are selected to represent arithmetic, logical, and algorithmic reasoning types.
## 2 Related Works
#### Reasoning Benchmarks
Evaluating the reasoning capabilities of LLMs is crucial for understanding their potential and limitations. Existing benchmarks typically assess reasoning by measuring performance, such as accuracy, on tasks that require it, and they often focus on specific reasoning types like arithmetic, knowledge, logical, or algorithmic reasoning. Arithmetic reasoning, involving mathematical concepts and operations, has been explored in benchmarks ranging from elementary word problems [37, 4, 55, 16] to more complex and large-scale tasks [28, 48]. Knowledge reasoning, on the other hand, requires internal (commonsense) or external knowledge, or a combination of both [14, 62, 22]. Logical reasoning benchmarks encompass deductive reasoning, which uses synthetic rule bases [15, 61, 18], and inductive reasoning, which formulates general principles from specific observations [78, 71]. Algorithmic reasoning often involves understanding a coding problem description and performing multi-step reasoning to solve it [17, 25]. Benchmarks like BBH [59] and MMLU [27] indirectly assess reasoning by evaluating performance on tasks that require it. However, these benchmarks primarily focus on final results, neglecting the analysis of potential errors in the reasoning process. Unlike prior efforts, MR-Ben goes beyond accuracy by assessing the ability to locate potential errors in the reasoning process and provide explanations and corrections. Moreover, MR-Ben covers different types of reasoning, offering a more comprehensive assessment.
#### Evaluation Beyond Accuracy
Many recent studies have shifted their focus from using only the final result to evaluating the reasoning quality beyond accuracy. This shift has led to the development of two approaches: reference-free and reference-based evaluation. Reference-free methods aim to assess reasoning quality without relying on human-provided solutions. For example, ROSCOE [23] evaluates reasoning chains by quantifying reasoning errors such as redundancy and hallucination. Other approaches convert reasoning steps into structured forms, like subject-verb-object frames [56] or symbolic proofs [58], allowing for automated analysis. Reference-based methods depend on human-generated step-by-step solutions. For instance, PRM800K [38] offers solutions to MATH problems [28], categorizing each reasoning step as positive, negative, or neutral. Building on this, MR-GSM8K [77] and MR-Math [68] further provide the error reason behind the first negative step. MR-GSM8K focuses on elementary math problems, sampling questions from GSM8K [16]. MR-Math samples a smaller set of 459 questions from MATH [28]. Using the same annotation scheme, BIG-Bench Mistake [64] focuses on symbolic reasoning. It encompasses 2,186 instances from 5 tasks in BBH [59]. Despite the progress made by these datasets, limitations in scope and size remain. To address this, we introduce MR-Ben, a benchmark consisting of 5,975 manually annotated instances covering a wide range of subjects, including natural sciences, coding, and logic. MR-Ben also features more challenging questions, spanning high school, graduate, and professional levels.
## 3 MR-Ben: Dataset Construction
### 3.1 Dataset Structure
To comprehensively evaluate the reasoning capabilities of LLMs, MR-Ben employs a meta-reasoning paradigm. This paradigm casts LLMs in the role of a teacher, where they assess the reasoning process by evaluating its correctness, analyzing errors, and providing corrections. As shown in Figure 1, each data point within MR-Ben consists of three key elements: a question, a CoT answer, and an error analysis. The construction pipeline is shown in Figure 6 in Appendix D.
#### Question
The questions in MR-Ben are designed to cover a diverse range of reasoning types and difficulty levels, spanning from high school to professional levels. To ensure this breadth, we curated questions from various subjects, including natural sciences (mathematics, biology, physics), coding, and logic. Specifically, we sampled questions from mathematics, physics, biology, chemistry, and medicine from MMLU [27], which comprehensively assesses LLMs across academic and professional domains. For logic questions, we draw from LogiQA [40], which encompasses a broad spectrum of logical reasoning types, including categorical, conditional, disjunctive, and conjunctive reasoning. Finally, we select coding problems from MHPP [17], which focuses on function-level code generation requiring advanced algorithmic reasoning. Questions in MMLU and LogiQA require a single-choice answer, while MHPP requires a snippet of code as the answer.
#### CoT Answer
We queried GPT-3.5-Turbo-0125 [50], Claude2 [5], and Mistral-Medium [32] (as of February 2024) using a prompt template (provided in Figure 7 in Appendix D) designed to elicit step-by-step solutions [66]. For clarity, all LLMs were instructed to format their solutions with numbered steps, except for coding problems. To encourage diverse solutions, we set the temperature parameter to 1 during sampling. This empirical setting yielded satisfactory instruction following and desirable fine-grained reasoning errors, which annotators and evaluated models are expected to identify.
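The querying setup can be sketched roughly as follows; the prompt wording below is a placeholder, not the actual template from Figure 7 in Appendix D:

```python
def build_cot_request(question: str, model: str) -> dict:
    """Builds a chat-style request eliciting numbered step-by-step solutions,
    sampled at temperature 1 to encourage diverse reasoning paths.
    The prompt text is illustrative only, not the authors' template."""
    prompt = (
        "Solve the following problem step by step, numbering each step "
        "as 'Step 1:', 'Step 2:', and so on.\n\n"
        f"Question: {question}"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1,  # empirical setting used to diversify solutions
    }

req = build_cot_request("What is 17 * 24?", "gpt-3.5-turbo-0125")
```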
### 3.2 Annotation Process
After acquiring the questions and their corresponding Chain-of-Thought (CoT) answers, we engage annotators to provide error analyses. The annotation process is divided into three stages.
#### Answer Correctness
CoT answers that result in a final answer different from the ground truth are automatically flagged as incorrect. However, for cases where the final answer matches the ground truth, manual annotation is required. This is because there are instances where the reasoning process leading to the correct answer is flawed, as illustrated in the middle example of Figure 1. Therefore, annotators are tasked with meticulously examining the entire reasoning path to determine if the correct final answer is a direct result of the reasoning process.
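The triage logic described above can be sketched as follows (function name and labels are illustrative):

```python
def triage_solution(final_answer: str, ground_truth: str) -> str:
    """A mismatched final answer is auto-flagged as incorrect; a matching one
    is not automatically trusted, since the reasoning leading to it may still
    be flawed, so it is routed to human annotators for review."""
    if final_answer != ground_truth:
        return "incorrect"
    return "needs_manual_annotation"
```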
#### Error Step
This stage is applicable for solutions with either an unmatched final output or a matched final output underpinned by flawed reasoning. Following prior work [38], each step in the reasoning process is categorized as positive, neutral, or negative. Positive and neutral steps represent stages where the correct final output remains attainable. Conversely, negative steps indicate a divergence from the path leading to the correct solution. Annotators are required to identify the first step in the reasoning process where the conditions, assumptions, or calculations are incorrect, making the correct final result unreachable for the subsequent reasoning steps.
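Under this positive/neutral/negative labeling, the annotated first error step is simply the first step tagged negative; a minimal sketch:

```python
from typing import Optional

def first_error_step(step_labels: list[str]) -> Optional[int]:
    """Returns the 1-indexed first step labeled 'negative', i.e. the first step
    after which the correct final result becomes unreachable, or None if the
    solution never diverges."""
    for i, label in enumerate(step_labels, start=1):
        if label == "negative":
            return i
    return None

# E.g. the logical-reasoning example in Figure 1, where the error occurs at Step 3:
labels = ["positive", "neutral", "negative", "negative"]
```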
#### Error Reason and Correction
Annotators are tasked with conducting an in-depth analysis of the reasoning that led to the identified error. As shown in Figure 1, annotators are required to provide the error reason and the corresponding correction to this reasoning step. This comprehensive approach ensures a thorough understanding and rectification of errors in the reasoning process.
### 3.3 Data Statistics
Table 1: Statistics of MR-Ben. The lengths of questions and solutions are measured in the number of words. Note that the steps for coding denote the number of lines of code; they are not directly comparable with other subjects.
| | Math | Physics | Biology | Chemistry | Medicine | Logic | Coding | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Correct Solution Ratio | 16.2% | 31.0% | 59.6% | 47.8% | 45.0% | 51.1% | 31.1% | 40.3% |
| Avg Solution Steps | 6.8 | 5.3 | 5.1 | 5.7 | 5.6 | 5.3 | 32.5* | 9.5 |
| Avg First Error Step | 3.1 | 3.0 | 2.7 | 3.1 | 3.0 | 2.8 | 14.0* | 4.5 |
| Avg Length of Questions | 44.3 | 88.7 | 56.3 | 66.6 | 48.1 | 154.8 | 140.1 | 85.6 |
| Avg Length of Solutions | 205.9 | 206.1 | 187.6 | 199.4 | 194.5 | 217.7 | 950.3 | 308.8 |
Table 1 presents the statistics of MR-Ben. The benchmark exhibits a balanced distribution of correct and incorrect solutions, with an overall correct solution rate of 40.3%. Solutions involve 9.5 steps on average, and errors typically manifest around the fourth step (4.5). The questions and solutions are substantial, with average lengths of 85.6 and 308.8 words, respectively. The subject-wise analysis reveals that Math is the most challenging, with a correct solution rate of a mere 16.2%, which could be attributable to the intricacy of the arithmetic operations involved. Conversely, Biology emerges as the least daunting, with a high correct solution rate of 59.6%. Coding problems have the longest solutions, averaging 950.3 words, underscoring the complexity and detailed procedural reasoning inherent in coding tasks. Similarly, Logic problems have the longest questions, averaging 154.8 words, in line with the need for elaborate descriptions in logical reasoning. The step at which the first error occurs is fairly consistent across most subjects, usually around the 3rd step out of a total of 5. Coding, however, deviates from this trend: the first error tends to appear earlier, around the 14th line out of an average of 32.5 lines, suggesting that the problem-solving process in coding has distinct dynamics compared to other subjects.
### 3.4 Quality Control
#### Annotators
Given the complexity of the questions, which span a range of subjects from high school to professional levels, we enlisted the services of an annotation company. This company meticulously recruited annotators, each holding a minimum of a bachelor's degree. Before their trial labeling, annotators are thoroughly trained and are required to review the annotation guidelines. We've included the guidelines for all subjects in Appendix H for reference. The selection of annotators is based on their performance on a balanced, small hold-out set of problems for each subject. In addition to the annotators, a team of 14 quality controllers diligently monitors the quality of the annotation weekly. As a final layer of assurance, we have 4 meta controllers who scrutinize the quality of the work.
#### Quality Assurance
Every problem in MR-Ben undergoes a rigorous three-round quality assurance process to ensure its accuracy and clarity. Initially, each question is labeled by two different annotators. Any inconsistencies in the solution correctness or the first error step are identified and reviewed by a quality controller for arbitration. Following this, every annotated problem is subjected to a secondary review by annotators who were not involved in the initial labeling. This is to ensure that the annotations for different solutions to the same problem are consistent and coherent. In the final phase of the review, 10% of the problems are randomly sampled and reviewed by the meta controllers. Throughout the entire evaluation process, all annotated fields are meticulously examined in multiple rounds for their accuracy and clarity. Any incorrect annotations or those with disagreements are progressively filtered out and rectified, ensuring a high-quality dataset. This rigorous process allows us to maintain a high level of annotation quality.
#### Dataset Artifacts & Biases
Table 1 reveals a relatively balanced distribution of correct and incorrect solutions. However, an exception was observed in mathematical subjects, where the distribution tends to skew towards incorrect solutions. This skew could suggest an inherent complexity or ambiguity in mathematical problem statements. Our analysis of the first error step across all subjects indicated that errors predominantly occur in the initial stages ($n \leq 7$) of problem-solving and are distributed relatively uniformly. This pattern was consistent across most subjects, with no significant skew towards later steps. More detailed discussions of biases are provided in Appendix C.
## 4 Evaluation
For each annotated question-solution pair, the evaluated model is expected to decide the correctness of the solution and, if it is incorrect, report the first error step and the error reason. Solution correctness and the first error step are scored automatically against the manual annotations. Only when the evaluated model correctly identifies an incorrect solution and its first error step is its error reason further examined, either manually or by models. Therefore, to provide a unified and normalized score reflecting the overall competence of the evaluated model, we follow [77] and apply a metric named MR-Score, which consists of three sub-metrics.
The first is the Matthews Correlation Coefficient (MCC) [46] for the binary classification of solution correctness.
$$
MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{1}
$$
where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. The MCC ranges from -1 to +1, where -1 means total disagreement between prediction and observation, 0 indicates near-random performance, and +1 represents perfect prediction. In the context of this paper, we interpret negative values as no better than random guessing and set 0 as the cut-off threshold for normalization purposes.
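Equation 1, together with the cut-off at 0 used for normalization, can be sketched as:

```python
from math import sqrt

def clipped_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for the solution-correctness
    classification, clipped at 0: negative values are treated as no better
    than random guessing."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # degenerate confusion matrix (an all-empty row or column)
    return max(0.0, (tp * tn - fp * fn) / denom)
```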
The second metric is the ratio of solutions with a correctly predicted first error step to the total number of incorrect solutions.
$$
ACC_{step}=\frac{N_{correct\_first\_error\_step}}{N_{incorrect\_sols}} \tag{2}
$$
The third metric is, likewise, the ratio of solutions with both a correctly predicted first error step and a correct error reason to the total number of incorrect solutions.
$$
ACC_{reason}=\frac{N_{correct\_error\_reason}}{N_{incorrect\_sols}} \tag{3}
$$
MR-Score is then a weighted combination of three metrics, given by
$$
MR\mbox{-}Score = w_1\cdot\max(0,MCC)+w_2\cdot ACC_{step}+w_3\cdot ACC_{reason} \tag{4}
$$
The weights $w_1$, $w_2$, and $w_3$ are chosen based on our evaluation results to maximize the differentiation between models. It is important to note that the Matthews Correlation Coefficient (MCC) and the accuracy of locating the first error step can be directly calculated by comparing the responses of the evaluated model with the ground truth annotations. However, assessing the accuracy of the error reason explained by the evaluated model presents more complexity. While consulting domain experts for annotations is a feasible approach, we instead utilize GPT-4-Turbo as a proxy to examine the error reasons, as detailed in Figure 11 in Appendix D.
We operate under the assumption that while our benchmark presents a significant challenge for GPT-4 to evaluate complete solution correctness (identifying the first error step and explaining the error reason), it is comparatively easier for GPT-4 to assess whether the provided error reasons align with the ground truth. Specifically, on a hold-out set of sampled error reasons, there was a 92% agreement rate between the manual annotations by the authors and those generated by GPT-4. For more detailed evaluations of the robustness of MR-Score and its design rationale, please refer to our discussion in Appendix B.
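Putting the sub-metrics together, MR-Score is a single weighted sum; a minimal sketch, with placeholder weights (the actual values of $w_1$, $w_2$, and $w_3$ are chosen empirically as described above):

```python
def mr_score(mcc: float, acc_step: float, acc_reason: float,
             weights: tuple[float, float, float] = (0.2, 0.3, 0.5)) -> float:
    """Weighted combination of the three sub-metrics (Equation 4). The default
    weights are illustrative placeholders, not the values used in the paper."""
    w1, w2, w3 = weights
    # Negative MCC values are clipped to 0, matching the normalization cut-off.
    return w1 * max(0.0, mcc) + w2 * acc_step + w3 * acc_reason
```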
## 5 Experiments
### 5.1 Experiment Setup
Table 2: Evaluation results on MR-Ben. This table presents a detailed breakdown of each model's performance under the MR-Score metric across different subjects, where $k$ stands for the number of demonstration examples.
| Model | Bio. $k$=0 | Bio. $k$=1 | Phy. $k$=0 | Phy. $k$=1 | Math $k$=0 | Math $k$=1 | Chem. $k$=0 | Chem. $k$=1 | Med. $k$=0 | Med. $k$=1 | Logic $k$=0 | Logic $k$=1 | Coding $k$=0 | Coding $k$=1 | Avg. $k$=0 | Avg. $k$=1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-Source LLMs** | | | | | | | | | | | | | | | | |
| Claude3-Haiku | 5.7 | 5.8 | 3.3 | 3.5 | 3.1 | 3.1 | 6.5 | 6.4 | 2.0 | 2.0 | 1.2 | 1.2 | 9.0 | 0.0 | 4.4 | 3.1 |
| GPT-3.5-Turbo | 3.6 | 6.6 | 5.7 | 6.7 | 5.7 | 5.4 | 4.9 | 6.7 | 3.6 | 4.4 | 1.7 | 4.5 | 3.0 | 4.1 | 4.0 | 5.5 |
| Doubao-pro-4k | 8.4 | 13.5 | 10.0 | 11.7 | 12.3 | 15.5 | 10.6 | 17.5 | 5.9 | 10.0 | 4.5 | 5.5 | 9.8 | 7.4 | 8.8 | 11.6 |
| Mistral-Large | 22.2 | 28.0 | 26.7 | 25.4 | 24.3 | 28.2 | 24.0 | 27.0 | 15.9 | 19.3 | 14.7 | 17.1 | 21.1 | 21.4 | 21.3 | 23.8 |
| Yi-Large | 35.3 | 40.7 | 37.2 | 36.8 | 36.5 | 20.6 | 40.0 | 39.1 | 29.3 | 32.1 | 25.1 | 31.3 | 21.9 | 25.7 | 32.2 | 32.3 |
| Moonshot-v1-8k | 35.0 | 36.8 | 33.8 | 33.8 | 34.9 | 33.0 | 36.7 | 35.0 | 29.4 | 32.3 | 25.0 | 29.2 | 32.7 | 31.2 | 32.5 | 33.0 |
| GPT-4o-mini | 37.7 | 38.9 | 38.5 | 37.4 | 44.4 | 40.4 | 39.2 | 37.0 | 33.9 | 25.1 | 23.6 | 17.7 | 41.6 | 34.9 | 37.0 | 33.1 |
| Zhipu-GLM-4 | 40.7 | 46.2 | 37.7 | 42.5 | 38.4 | 36.6 | 43.1 | 44.0 | 34.5 | 41.0 | 37.5 | 32.5 | 38.8 | 32.8 | 38.7 | 39.4 |
| GPT-4-Turbo | 44.7 | 47.3 | 42.8 | 45.2 | 44.3 | 45.4 | 44.0 | 46.0 | 38.8 | 38.4 | 34.1 | 33.6 | 53.6 | 57.3 | 43.2 | 44.7 |
| GPT-4o | 48.3 | 49.1 | 45.5 | 48.2 | 42.6 | 41.3 | 48.2 | 49.1 | 47.9 | 47.7 | 31.9 | 28.4 | 56.5 | 54.6 | 45.8 | 45.5 |
| o1-mini | 45.8 | 46.9 | 56.0 | 53.8 | 68.5 | 67.0 | 55.2 | 56.1 | 45.9 | 47.2 | 30.7 | 28.7 | 55.1 | 55.6 | 51.0 | 50.8 |
| o1-preview | 54.1 | 56.0 | 62.2 | 61.7 | 69.8 | 70.3 | 60.6 | 60.3 | 54.3 | 55.1 | 46.1 | 45.3 | 65.1 | 70.0 | 58.9 | 59.8 |
| **Open-Source LLMs Small** | | | | | | | | | | | | | | | | |
| Qwen1.5-1.8B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gemma-2B | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.1 | 0.0 | 0.0 | 0.4 | 0.0 | 0.2 | 0.7 | 0.0 | 0.1 | 0.2 |
| Qwen2-1.5B | 2.2 | 2.8 | 2.2 | 1.3 | 3.3 | 6.3 | 2.5 | 3.3 | 2.9 | 11.2 | 1.5 | 9.4 | 0.0 | 3.6 | 2.1 | 5.4 |
| Phi3-3.8B | 13.4 | 12.5 | 12.7 | 10.8 | 13.3 | 13.1 | 16.4 | 17.1 | 10.2 | 8.1 | 8.4 | 5.3 | 9.1 | 10.2 | 11.9 | 11.0 |
| **Open-Source LLMs Medium** | | | | | | | | | | | | | | | | |
| GLM-4-9B | 4.4 | 2.4 | 9.6 | 1.2 | 8.1 | 4.7 | 8.7 | 2.9 | 2.3 | 1.9 | 2.5 | 1.6 | 11.4 | 0.0 | 6.7 | 2.1 |
| DeepSeek-7B | 5.7 | 6.2 | 4.7 | 2.6 | 4.9 | 5.2 | 4.2 | 4.9 | 3.1 | 1.6 | 3.0 | 3.8 | 0.0 | 1.2 | 3.7 | 3.6 |
| Deepseek-Coder-33B | 7.4 | 5.5 | 7.8 | 5.6 | 7.2 | 8.6 | 7.8 | 7.4 | 6.0 | 5.5 | 4.6 | 6.7 | 8.4 | 4.9 | 7.0 | 6.3 |
| DeepSeek-Coder-7B | 10.5 | 9.9 | 11.8 | 9.6 | 11.8 | 12.1 | 12.3 | 11.9 | 10.4 | 11.0 | 9.8 | 10.7 | 5.0 | 5.8 | 10.2 | 10.2 |
| LLaMA3-8B | 12.0 | 11.9 | 10.9 | 7.5 | 15.0 | 9.0 | 12.6 | 12.7 | 9.3 | 8.0 | 9.4 | 9.6 | 15.8 | 10.0 | 12.2 | 9.8 |
| Yi-1.5-9B | 10.4 | 14.8 | 11.9 | 12.9 | 12.5 | 15.6 | 13.1 | 14.4 | 9.5 | 14.8 | 9.1 | 9.5 | 4.8 | 6.3 | 10.2 | 12.6 |
| **Open-Source LLMs Large** | | | | | | | | | | | | | | | | |
| Qwen1.5-72B | 15.3 | 19.2 | 12.9 | 13.6 | 12.0 | 10.0 | 13.9 | 16.3 | 11.7 | 14.7 | 10.4 | 12.9 | 3.9 | 5.9 | 11.5 | 13.3 |
| DeepSeek-67B | 17.1 | 19.7 | 14.9 | 17.3 | 15.4 | 16.2 | 16.3 | 20.6 | 14.7 | 12.2 | 13.6 | 14.3 | 14.5 | 15.2 | 15.2 | 16.5 |
| LLaMA3-70B | 20.4 | 27.1 | 17.4 | 20.5 | 14.9 | 15.8 | 19.5 | 25.1 | 16.3 | 19.3 | 16.3 | 16.8 | 29.8 | 16.7 | 19.2 | 20.2 |
| DeepSeek-V2-236B | 30.0 | 37.1 | 32.2 | 36.5 | 32.2 | 30.0 | 32.5 | 35.4 | 26.5 | 32.4 | 23.6 | 27.4 | 34.2 | 27.1 | 30.2 | 32.3 |
| Qwen2-72B | 36.0 | 40.8 | 36.7 | 40.9 | 38.0 | 38.7 | 37.2 | 38.8 | 28.3 | 29.3 | 25.6 | 20.5 | 31.3 | 30.4 | 33.3 | 34.2 |
To evaluate performance on our new benchmark, we selected a diverse array of models based on size and source accessibility. (All models used in our experiments are instruction-finetuned versions, although this is not indicated in their abbreviated names.) These included smaller models like Gemma-2B [63], Phi-3 [1], and Qwen1.5-1.8B [7], as well as larger counterparts such as Llama3-70B [47], Deepseek-67B [10], and Qwen1.5-72B [7]. We also compared open-source models (e.g., models from the Llama3 and Qwen1.5/Qwen2 series) against closed-source models from the GPT [51], Claude [6], Mistral [32], GLM [3], Yi [39], Moonshot [2], and Doubao [12] families. Additionally, models from the Deepseek-Coder [10] series were included to assess the impact of coding-focused pretraining on reasoning performance.
Given the complexity of our benchmark, even larger open-source models like Llama3-70B-Instruct struggle to produce accurate evaluation results without prompting methods, often achieving MR-Scores near zero. Consequently, we employed a step-wise chain-of-thought prompting technique similar to those described in [77, 64]. This approach guides models to reason systematically through solution traces before making final decisions, as detailed in Appendix D.
Considering the complexity of the task, which includes question comprehension, reasoning through the provided solutions, and adhering to format constraints, we also explored few-shot demonstration setups to investigate whether models can benefit from In-Context Learning (ICL) examples. Due to context token limits, we report zero-shot and one-shot results in the main results table (Table 2); for a breakdown of model performance on the sub-tasks, please refer to Table 7. The performance of additional few-shot configurations on a selection of models with various capabilities is further discussed in Section 6.1.
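The step-wise CoT evaluation setup described above can be sketched as follows. The instruction wording, step numbering, and demo format here are illustrative stand-ins, not the paper's exact template (Figure 10 in Appendix D):

```python
# Sketch of assembling a step-wise chain-of-thought evaluation prompt.
# The model grades a candidate solution step by step, then reports whether
# it is correct and, if not, locates and explains the first error.

INSTRUCTIONS = (
    "You are a teacher grading a student's solution. Examine the solution "
    "step by step, then state whether it is correct; if not, give the index "
    "of the first incorrect step and explain the error."
)

def build_prompt(question, steps, demos=None):
    """Assemble a zero-shot (demos=None) or few-shot meta-reasoning prompt."""
    parts = [INSTRUCTIONS]
    for demo in demos or []:  # optional in-context examples, one string each
        parts.append("Example:\n" + demo)
    numbered = "\n".join(
        "Step {}: {}".format(i, s) for i, s in enumerate(steps, start=1)
    )
    parts.append("Question:\n{}\n\nSolution:\n{}".format(question, numbered))
    return "\n\n".join(parts)

prompt = build_prompt(
    "What is 2 + 3 * 4?",
    ["Compute 3 * 4 = 12.", "Add 2 + 12 = 15."],
)
```

Under this scheme, the one-shot setting used in the main experiments simply passes a single worked example via `demos`.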
<details>
<summary>x2.png Details</summary>

Radar chart comparing the MR-Scores of five models (DeepSeek-v2, GPT-4-turbo, O1-Preview, Qwen2-72B, and GLM-4) across seven subject axes: math, chemistry, biology, logic, coding, medicine, and physics. The radial scale runs from 0.0 to 0.7 in increments of 0.1. O1-Preview forms the outermost polygon, indicating the strongest overall performance across subjects.
</details>
Figure 2: Model performance across subjects
<details>
<summary>x3.png Details</summary>

Grouped bar chart titled "MR-Scores of Models on Different Reasoning Paradigms". The y-axis shows MR-Scores from 0.0 to 0.6 with a dashed reference line at 0.5; the x-axis lists five models (DeepSeek-v2, GPT-4-turbo, O1-Preview, Qwen2-72B, GLM-4); the legend distinguishes four paradigms: knowledge, logic, arithmetic, and algorithmic. O1-Preview posts the highest scores overall, while the logic paradigm tends to yield the lowest scores for most models.
</details>
Figure 3: Model performance on different reasoning paradigms
### 5.2 Experiment Results
The MR-Ben benchmark shifts the challenge for state-of-the-art large language models from answering questions to the more nuanced task of scoring question-solution pairs. This section details our findings, emphasizing variations in model performance and their implications.
#### Overall Performance
Among the evaluated models, o1-preview consistently achieves the highest MR-Scores across all subjects, significantly outperforming most competitors from both the open- and closed-source communities. Notably, the open-source Qwen2-72B and Deepseek-V2-236B models perform exceptionally well, surpassing every other open-source model, including Llama3, by a large margin. Their scores are comparable to, or greater than, those of some of the most capable commercial models, such as Mistral, Yi, and Moonshot AI. In the small language model category, Phi3-3.8B exceeds many of the mid-size models, including Deepseek-Coder-33B, which is nearly ten times its size.
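For context, the MR-Score aggregates results on the three sub-tasks (judging solution correctness, locating the first error step, and explaining the error). A minimal sketch of such a composite metric is below; the weights and the clipping convention here are illustrative assumptions, not necessarily the paper's exact definition:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for the binary correctness judgment."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def mr_score(correctness_mcc, step_acc, reason_acc, weights=(0.2, 0.3, 0.5)):
    """Weighted combination of the three sub-task scores (weights illustrative).
    MCC lies in [-1, 1], so it is clipped at 0 to keep the composite in [0, 1]."""
    w1, w2, w3 = weights
    return w1 * max(correctness_mcc, 0.0) + w2 * step_acc + w3 * reason_acc

# Toy example: confusion counts for the correctness sub-task plus the two
# localization/explanation accuracies.
score = mr_score(mcc(tp=40, tn=35, fp=15, fn=10), step_acc=0.42, reason_acc=0.38)
```

A composite of this form rewards models only when they both judge solutions correctly and pinpoint errors, which is why near-random models collapse to scores near zero.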
#### Performance across Model Size and Reasoning Paradigm
Table 2 reveals a general trend in which larger models perform better, highlighting the correlation between model size and efficacy on complex reasoning tasks. However, this relationship is not strictly linear, as demonstrated by models like Phi3-3.8B, which excels despite its smaller size. Since MR-Ben challenges language models to reason about the reasoning in the solution space across a diverse range of domains, models like Phi-3 that are trained with effective data-synthesis techniques and broader coverage of the solution space intuitively achieve higher MR-Scores. This suggests that while larger model sizes generally yield superior performance, techniques like knowledge distillation can also significantly boost reasoning ability. Similarly, although the size of the o1 model series remains undisclosed, these models reportedly employ mechanisms that scale test-time computation through effective exploration, frequent retrospection, and meticulous reflection within the solution space. These characteristics align closely with the principles of "system-2" thinking, which emphasizes deliberate, reflective problem-solving. As a result, the o1 models demonstrate a more effective reasoning process, achieving markedly higher MR-Scores than other models.
#### Performance across Reasoning Types
Our categorization into four reasoning types (knowledge, arithmetic, algorithmic, and logic) illustrates the unique challenges each model faces within these paradigms (Figure 3). Logic reasoning emerges as the most formidable, owing to the intricate logical operations required by questions from the LogiQA dataset. In contrast, o1-Preview and GPT-4-turbo demonstrate exceptional prowess in algorithmic reasoning, where their capabilities markedly surpass those of other models. Notably, models excel in different reasoning paradigms, reflecting their varied strengths and training backgrounds. For instance, despite Deepseek-Coder's specialized pre-training on coding tasks, it does not exhibit superior algorithmic reasoning, underscoring that targeted pretraining does not guarantee enhanced performance across all reasoning types. Comparing Deepseek-Coder with the Phi-3 model, which excels despite its much smaller size, highlights the potential significance of high-quality synthetic data for broad-based reasoning capabilities.
#### Sensitivity to Task Difficulty and Solution Length
An examination across educational levels shows that most models perform better on high school-level questions than on college-level ones, indicating an intuitive sensitivity to question difficulty. Additionally, our analysis finds a minor negative correlation between the length of solution steps and MR-Scores, as detailed in Figure 4 and Figure 5.
#### Summary
MR-Ben effectively differentiates model capabilities that are often obscured in simpler settings. It not only identifies top performers but also underscores the influence of model size on outcomes, while demonstrating that techniques like knowledge distillation and test-time compute scaling, as seen with the Phi-3 and o1 models, can notably enhance smaller models' performance, challenging the dominance of larger models. The analysis further reveals that specialized training, such as in coding, does not guarantee superior algorithmic reasoning, suggesting the potential need for more balanced data approaches or improved data-synthesis methods.
## 6 Further Analysis & Discussion
<details>
<summary>x4.png Details</summary>

Grouped bar chart titled "MR-Scores of Models on Different Difficulty Levels". The y-axis shows MR-Scores from 0.0 to 0.6 with a dashed reference line at 0.5; the x-axis lists five models (DeepSeek-v2, GPT-4-turbo, O1-Preview, Qwen2-72B, GLM-4), each with a pair of bars for high_school and college questions. Most models score higher on high-school questions than on college-level ones, with GLM-4 the only slight exception; O1-Preview is the top performer at both difficulty levels.
</details>
Figure 4: MR-Scores of different models on different levels of difficulty
<details>
<summary>x5.png Details</summary>

Line chart titled "Impact of Solution Length on MR-Scores by Models". The y-axis shows MR-Scores from 0.0 to 0.7; the x-axis bins solutions by step count (<=5, 6, 7, 8, 9, 10, 11, 12, >=13), with one line per model (DeepSeek-v2, GPT-4-turbo, O1-Preview, Qwen2-72B, GLM-4). O1-Preview stays highest across nearly all step counts, and most models trend mildly downward on the longest solutions.
</details>
Figure 5: The MR-Scores of models on solutions with different step numbers.
### 6.1 Few Shot Prompting
As previously discussed and exemplified by our prompt template (Figure 10 in Appendix D), our evaluation method is characterized by its high level of difficulty and complexity. In this experiment, we aimed to determine whether providing a few step-wise chain-of-thought (CoT) examples could improve model performance in terms of format adherence and reasoning quality. The results, presented in Table 9 in the Appendix, do not show a consistent pattern as the number of shots increases. While smaller language models like Gemma-2B exhibit performance improvements with additional shots, the performance of larger language models tends to fluctuate as the number of shots grows. We hypothesize that for our complex tasks, the lengthy few-shot demonstrations may act more as a hindrance, providing distracting information rather than aiding format adherence and reasoning. Our empirical findings suggest that a one-shot demonstration strikes the optimal balance between providing guidance and minimizing distraction. This supports our decision to focus on zero-shot versus one-shot comparisons in our primary experiments, as detailed in Table 2.
### 6.2 Self Refine Prompting
As suggested by [31], large language models typically cannot perform self-correction without external ground truth feedback. To explore whether this phenomenon occurs in our benchmark, we adopted a similar setting by prompting the language model to verify its own answer across a three-round interaction sequence: query, examine, and refine. Our prompting template, detailed in Figure 8 in Appendix D, is minimalistic and designed solely to encourage the model to self-examine.
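The query-examine-refine interaction can be sketched as a three-round loop. Below, `ask` stands in for any chat-completion API, and the follow-up prompts are paraphrases rather than the exact template of Figure 8:

```python
def self_refine(ask, question, solution):
    """Three-round self-refinement: query, examine, refine.
    `ask(history)` is any function mapping a message list to a reply string."""
    history = [
        {"role": "user",
         "content": "{}\n\nEvaluate this solution:\n{}".format(question, solution)}
    ]
    answer = ask(history)                      # round 1: initial verdict
    history.append({"role": "assistant", "content": answer})
    history.append({"role": "user", "content": "Examine your answer for mistakes."})
    critique = ask(history)                    # round 2: self-examination
    history.append({"role": "assistant", "content": critique})
    history.append({"role": "user", "content": "Give your final, refined answer."})
    return ask(history)                        # round 3: refined verdict

# Toy stand-in model that just echoes the current conversation length.
final = self_refine(lambda h: "reply-{}".format(len(h)), "Q", "S")
```

Because no ground-truth feedback enters the loop, any gain or loss comes purely from the model re-reading its own output, which is exactly the setting of [31].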
The results of this self-refinement process are recorded in Table 4. Notably, models smaller than Llama3-70B exhibit performance degradation with self-refinement, while larger models, such as GPT-4, show marginal benefits from the process. Moving from Llama3-8B to Llama3-70B, although a significant portion of correct predictions still shifts to incorrect ones, as previously reported by [31], our benchmark shows that the share of incorrect predictions shifting to correct ones grows with model size. This shift accounts for the significant performance improvement observed for Llama3-70B.
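The correct-to-incorrect and incorrect-to-correct shifts described above can be quantified with a simple transition count; a sketch, assuming per-example gold labels and the predictions before and after refinement:

```python
def flip_counts(gold, before, after):
    """Count correct->incorrect and incorrect->correct transitions
    induced by a self-refinement pass."""
    c2i = i2c = 0
    for g, b, a in zip(gold, before, after):
        if b == g and a != g:
            c2i += 1      # refinement broke a correct prediction
        elif b != g and a == g:
            i2c += 1      # refinement fixed an incorrect prediction
    return c2i, i2c

c2i, i2c = flip_counts(
    gold=[1, 0, 1, 1],
    before=[1, 1, 0, 1],
    after=[0, 0, 1, 1],
)
```

Comparing these two counts across model sizes is what reveals whether refinement helps on net.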
To understand the disproportionate improvement observed in the 70B model, we analyzed the performance breakdown at the task level. These results are visualized and discussed in Figure 9 of Appendix E. In short, we believe the inconsistency does not necessarily indicate a more robust or advanced reasoning ability, despite the increase in the evaluation scores.
### 6.3 Solution Correctness Prior
Table 3: Comparison of average accuracy in identifying the first error step and the corresponding error reason, with and without prior knowledge of the solutions' correctness.

| Model | First Error Step (w/o prior) | First Error Step (w/ prior) | Error Reason (w/o prior) | Error Reason (w/ prior) |
| --- | --- | --- | --- | --- |
| Gemma-2B | 0.3 | 0.1 | 0.1 | 0.0 |
| Llama3-8B | 15.5 | 26.4 | 6.6 | 11.9 |
| Llama3-70B | 14.5 | 34.6 | 9.1 | 25.7 |
| GPT-4-Turbo | 40.9 | 41.6 | 37.9 | 38.0 |
Table 4: Comparison of prompting methods: MR-Scores achieved by zero-shot step-wise CoT and the Self-Refine technique.

| Model | Zero-shot Step-wise CoT | Self-Refine |
| --- | --- | --- |
| Gemma-2B | 0.1 | 0.2 |
| Llama3-8B | 11.7 | 11.3 |
| Llama3-70B | 17.7 | 27.5 |
| GPT-4-Turbo | 43.2 | 45.5 |
To verify the influence of external ground-truth signals, we sampled 100 incorrect solutions from each subject as our test set. By observing the same set of language models under a zero-shot CoT setting, we aim to determine whether knowledge of a solution's incorrectness enhances their ability to identify the first error step and the reason for the error.
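The sampling procedure can be sketched as follows, assuming each record carries a subject tag and a correctness label (the field names are illustrative):

```python
import random

def sample_incorrect(records, per_subject=100, seed=0):
    """Draw up to `per_subject` incorrect solutions from every subject."""
    rng = random.Random(seed)          # fixed seed for a reproducible test set
    by_subject = {}
    for r in records:
        if not r["is_correct"]:
            by_subject.setdefault(r["subject"], []).append(r)
    return {
        subj: rng.sample(pool, min(per_subject, len(pool)))
        for subj, pool in by_subject.items()
    }

# Toy pool: 300 records per subject, of which 200 are incorrect.
records = [{"subject": s, "is_correct": i % 3 == 0}
           for s in ("math", "logic") for i in range(300)]
test_set = sample_incorrect(records, per_subject=100)
```

Sampling per subject keeps the test set balanced, so no single domain dominates the comparison of with-prior versus without-prior performance.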
The results in Table 3 illustrate that the benefit of knowing the solution's correctness generally increases with the model's competence but begins to plateau at the level of sophisticated models like GPT-4. Specifically, the Gemma-2B model struggles significantly on our benchmark, showing nearly zero performance due to its limited ability to follow formats and comprehend complex tasks; consequently, the solution-correctness prior does not improve its performance metrics. In contrast, models with moderate capabilities benefit substantially from this prior knowledge, which aids in accurately locating the first error step and elucidating the error reason. However, as model capabilities improve, the incremental benefits of this prior knowledge quickly diminish. For instance, GPT-4 shows only a marginal improvement in identifying the first error step and an almost negligible impact on error-reason analysis when provided with the prior.
## 7 Conclusion
This paper highlights the importance of evaluating the reasoning capabilities of LLMs with a process-oriented design and presents a comprehensive benchmark called MR-Ben that addresses the limitations of existing evaluation methodologies. MR-Ben consists of questions from a diverse range of subjects and incorporates a meta-reasoning paradigm, where LLMs act as teachers to evaluate the reasoning process. Our evaluation of a diverse suite of LLMs on MR-Ben reveals several key limitations and weaknesses. Many models struggle with identifying and correcting errors within reasoning chains, demonstrating difficulty in performing system-2 style thinking, such as scrutinizing assumptions, calculations, and intermediate steps. Furthermore, even state-of-the-art models often fail to maintain consistency across reasoning paradigms, exposing gaps in their generalization abilities. Additionally, our findings emphasize the importance of searching and reflecting on the solution space during inference. Models like the o1 series showcase the potential of scaling test-time computation, where frequent retrospection and iterative search through multiple solution paths significantly enhance reasoning performance. Nevertheless, improving LLMs' reasoning abilities on complex and nuanced tasks remains an open research question, and we encourage future work to build upon MR-Ben.
## 8 Acknowledgement
This work was supported in part by the Research Grants Council under the Areas of Excellence scheme grant AoE/E-601/22-R.
## References
- Abdin et al. [2024] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- AI [2024a] Moonshot AI. Moonshot ai, 2024a. URL https://www.moonshot.cn/.
- AI [2024b] Zhipu AI. Welcome to glm-4, 2024b. URL https://en.chatglm.cn/.
- Amini et al. [2019] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2357–2367. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1245. URL https://doi.org/10.18653/v1/n19-1245.
- Anthropic [2024a] Anthropic. Claude 2, 2024a. URL https://www.anthropic.com/news/claude-2.
- Anthropic [2024b] Anthropic. Introducing the next generation of claude, 2024b. URL https://www.anthropic.com/news/claude-3-family.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: harmlessness from AI feedback. CoRR, abs/2212.08073, 2022. doi: 10.48550/ARXIV.2212.08073. URL https://doi.org/10.48550/arXiv.2212.08073.
- Bengio [2020] Yoshua Bengio. Deep learning for system 2 processing. Presentation at the AAAI-20 Turing Award Winners 2018 Special Event, February 9 2020.
- Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, Alex X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. doi: 10.48550/ARXIV.2401.02954. URL https://doi.org/10.48550/arXiv.2401.02954.
- Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
- Bytedance [2024] Bytedance. Doubao team - crafting the industry's most advanced LLMs, 2024. URL https://www.doubao.com/chat/.
- Chung et al. [2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022. doi: 10.48550/ARXIV.2210.11416. URL https://doi.org/10.48550/arXiv.2210.11416.
- Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
- Clark et al. [2020] Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3882–3890. ijcai.org, 2020. doi: 10.24963/IJCAI.2020/537. URL https://doi.org/10.24963/ijcai.2020/537.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
- Dai et al. [2024] Jianbo Dai, Jianqiao Lu, Yunlong Feng, Rongju Ruan, Ming Cheng, Haochen Tan, and Zhijiang Guo. Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation. arXiv preprint arXiv:2405.11430, 2024.
- Dalvi et al. [2021] Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. Explaining answers with entailment trees. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7358–7370. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.EMNLP-MAIN.585. URL https://doi.org/10.18653/v1/2021.emnlp-main.585.
- Fagin and Halpern [1994] Ronald Fagin and Joseph Y. Halpern. Reasoning about knowledge and probability. J. ACM, 41(2):340–367, 1994. doi: 10.1145/174652.174658. URL https://doi.org/10.1145/174652.174658.
- Fernandes et al. [2023] Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. de Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, and André F. T. Martins. Bridging the gap: A survey on integrating (human) feedback for natural language generation. CoRR, abs/2305.00955, 2023. doi: 10.48550/ARXIV.2305.00955. URL https://doi.org/10.48550/arXiv.2305.00955.
- Gao et al. [2023] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: researching and revising what language models say, using language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 16477–16508. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.910. URL https://doi.org/10.18653/v1/2023.acl-long.910.
- Geva et al. [2021] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguistics, 9:346–361, 2021. doi: 10.1162/TACL\_A\_00370. URL https://doi.org/10.1162/tacl_a_00370.
- Golovneva et al. [2023] Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=xYlJRpzZtsY.
- Gou et al. [2023] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: large language models can self-correct with tool-interactive critiquing. CoRR, abs/2305.11738, 2023. doi: 10.48550/ARXIV.2305.11738. URL https://doi.org/10.48550/arXiv.2305.11738.
- Gu et al. [2024] Alex Gu, Baptiste Rozière, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Ffpg52swvg.
- Gunasekar et al. [2023] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. CoRR, abs/2306.11644, 2023. doi: 10.48550/ARXIV.2306.11644. URL https://doi.org/10.48550/arXiv.2306.11644.
- Hendrycks et al. [2021a] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- Hendrycks et al. [2021b] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021b. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html.
- Huang et al. [2024] Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Jie M. Zhang, Heming Cui, and Zhijiang Guo. SOAP: enhancing efficiency of generated code via self-optimization. CoRR, abs/2405.15189, 2024. doi: 10.48550/ARXIV.2405.15189. URL https://doi.org/10.48550/arXiv.2405.15189.
- Huang and Chang [2023] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1049–1065. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.FINDINGS-ACL.67. URL https://doi.org/10.18653/v1/2023.findings-acl.67.
- Huang et al. [2023] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
- Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. CoRR, abs/2310.06825, 2023. doi: 10.48550/ARXIV.2310.06825. URL https://doi.org/10.48550/arXiv.2310.06825.
- Jiang et al. [2024] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024. doi: 10.48550/ARXIV.2401.04088. URL https://doi.org/10.48550/arXiv.2401.04088.
- Jung et al. [2022] Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 1266–1279. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.82. URL https://doi.org/10.18653/v1/2022.emnlp-main.82.
- Kahneman [2011] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
- Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html.
- Koncel-Kedziorski et al. [2016] Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1152–1157. The Association for Computational Linguistics, 2016. doi: 10.18653/V1/N16-1136. URL https://doi.org/10.18653/v1/n16-1136.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. CoRR, abs/2305.20050, 2023. doi: 10.48550/ARXIV.2305.20050. URL https://doi.org/10.48550/arXiv.2305.20050.
- LingYiWanWu [2024] LingYiWanWu. Yi ai, 2024. URL https://platform.lingyiwanwu.com/.
- Liu et al. [2020] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3622–3628. ijcai.org, 2020. doi: 10.24963/IJCAI.2020/501. URL https://doi.org/10.24963/ijcai.2020/501.
- Liu et al. [2024a] Yinhong Liu, Zhijiang Guo, Tianya Liang, Ehsan Shareghi, Ivan Vulić, and Nigel Collier. Measuring, evaluating and improving logical consistency in large language models. arXiv preprint arXiv:2410.02205, 2024a.
- Liu et al. [2024b] Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, and Nigel Collier. Aligning with human judgement: The role of pairwise preference in large language model evaluators, 2024b.
- Liu et al. [2024c] Yiqi Liu, Nafise Sadat Moosavi, and Chenghua Lin. Llms as narcissistic evaluators: When ego inflates evaluation scores, 2024c.
- Lu et al. [2024a] Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, Yingjia Wan, Yinya Huang, and Zhijiang Guo. Autocv: Empowering reasoning with automated process labeling via confidence variation. CoRR, abs/2405.16802, 2024a. doi: 10.48550/ARXIV.2405.16802. URL https://doi.org/10.48550/arXiv.2405.16802.
- Lu et al. [2024b] Jianqiao Lu, Zhengying Liu, Yingjia Wan, Yinya Huang, Haiming Wang, Zhicheng Yang, Jing Tang, and Zhijiang Guo. Process-driven autoformalization in lean 4. CoRR, abs/2406.01940, 2024b. doi: 10.48550/ARXIV.2406.01940. URL https://doi.org/10.48550/arXiv.2406.01940.
- Matthews [1975] Brian W. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et biophysica acta, 405(2):442–451, 1975. URL https://api.semanticscholar.org/CorpusID:44596673.
- Meta [2024] Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/.
- Mishra et al. [2022] Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA: A unified benchmark for mathematical reasoning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5807–5832. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.392. URL https://doi.org/10.18653/v1/2022.emnlp-main.392.
- Mukherjee et al. [2023] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. CoRR, abs/2306.02707, 2023. doi: 10.48550/ARXIV.2306.02707. URL https://doi.org/10.48550/arXiv.2306.02707.
- OpenAI [2023a] OpenAI. GPT-3.5 Turbo, 2023a. URL https://platform.openai.com/docs/models/gpt-3-5.
- OpenAI [2023b] OpenAI. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023b. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
- OpenAI [2024] OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/introducing-openai-o1-preview/.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
- Panickssery et al. [2024] Arjun Panickssery, Samuel R. Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations, 2024.
- Patel et al. [2021] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2080–2094. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.NAACL-MAIN.168. URL https://doi.org/10.18653/v1/2021.naacl-main.168.
- Prasad et al. [2023] Archiki Prasad, Swarnadeep Saha, Xiang Zhou, and Mohit Bansal. Receval: Evaluating reasoning chains via correctness and informativeness. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 10066–10086. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-MAIN.622. URL https://doi.org/10.18653/v1/2023.emnlp-main.622.
- Qiao et al. [2023] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 5368–5393. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.294. URL https://doi.org/10.18653/v1/2023.acl-long.294.
- Saparov and He [2023] Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=qFVVBzXxR2V.
- Srivastava et al. [2022] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, and et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR, abs/2206.04615, 2022. doi: 10.48550/ARXIV.2206.04615. URL https://doi.org/10.48550/arXiv.2206.04615.
- Suzgun et al. [2023] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13003–13051. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.FINDINGS-ACL.824. URL https://doi.org/10.18653/v1/2023.findings-acl.824.
- Tafjord et al. [2021] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 3621–3634. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.FINDINGS-ACL.317. URL https://doi.org/10.18653/v1/2021.findings-acl.317.
- Talmor et al. [2019] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4149–4158. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1421. URL https://doi.org/10.18653/v1/n19-1421.
- Team et al. [2024] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Tyen et al. [2023] Gladys Tyen, Hassan Mansoor, Peter Chen, Tony Mak, and Victor Carbune. Llms cannot find reasoning errors, but can correct them! CoRR, abs/2311.08516, 2023. doi: 10.48550/ARXIV.2311.08516. URL https://doi.org/10.48550/arXiv.2311.08516.
- Wason and Johnson-Laird [1972] Peter Cathcart Wason and Philip Nicholas Johnson-Laird. Psychology of reasoning: Structure and content, volume 86. Harvard University Press, 1972.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
- Welleck et al. [2023] Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=hH36JeQZDaO.
- Xia et al. [2024] Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. Evaluating mathematical reasoning beyond accuracy. CoRR, abs/2404.05692, 2024. doi: 10.48550/ARXIV.2404.05692. URL https://doi.org/10.48550/arXiv.2404.05692.
- Xiong et al. [2024] Jing Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, and Xiaodan Liang. Dq-lore: Dual queries with low rank approximation re-ranking for in-context learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=qAoxvePSlq.
- Yang et al. [2022] Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. Re3: Generating longer stories with recursive reprompting and revision. CoRR, abs/2210.06774, 2022. doi: 10.48550/ARXIV.2210.06774. URL https://doi.org/10.48550/arXiv.2210.06774.
- Yang et al. [2024] Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jianfeng Gao, and Furu Wei. Language models as inductive reasoners. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian's, Malta, March 17-22, 2024, pages 209–225. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.eacl-long.13.
- Yao et al. [2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html.
- Yao et al. [2024] Yuxuan Yao, Han Wu, Zhijiang Guo, Biyan Zhou, Jiahui Gao, Sichun Luo, Hanxu Hou, Xiaojin Fu, and Linqi Song. Learning from correctness without prompting makes LLM efficient reasoner. CoRR, abs/2403.19094, 2024. doi: 10.48550/ARXIV.2403.19094. URL https://doi.org/10.48550/arXiv.2403.19094.
- Yasunaga et al. [2023] Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. Large language models as analogical reasoners. CoRR, abs/2310.01714, 2023. doi: 10.48550/ARXIV.2310.01714. URL https://doi.org/10.48550/arXiv.2310.01714.
- Ye and Durrett [2022] Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/c402501846f9fe03e2cac015b3f0e6b1-Abstract-Conference.html.
- Yu et al. [2023] Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. Improving language models via plug-and-play retrieval feedback. CoRR, abs/2305.14002, 2023. doi: 10.48550/ARXIV.2305.14002. URL https://doi.org/10.48550/arXiv.2305.14002.
- Zeng et al. [2023] Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia. Mr-gsm8k: A meta-reasoning benchmark for large language model evaluation. CoRR, abs/2312.17080, 2023. doi: 10.48550/ARXIV.2312.17080. URL https://doi.org/10.48550/arXiv.2312.17080.
- Zhang et al. [2021] Chi Zhang, Baoxiong Jia, Mark Edmonds, Song-Chun Zhu, and Yixin Zhu. ACRE: abstract causal reasoning beyond covariation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 10643–10653. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.01050. URL https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_ACRE_Abstract_Causal_REasoning_Beyond_Covariation_CVPR_2021_paper.html.
- Zhou et al. [2024] Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, and Subhrajit Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=L3FHMoKZcS.
- Zhou et al. [2023] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=92gvk82DE-.
## Appendix A Appendix
### A.1 Limitations
The meta-reasoning evaluation framework in MR-Ben, while innovative, is not without limitations. First, its applicability may be restricted for subjects that are inherently holistic or creative in nature, such as the humanities or sociology. These subjects often require comprehensive understanding and revision (e.g., essay writing), which is difficult to break down into specific, sequential reasoning steps and corrections. Second, MR-Ben is currently confined to questions in English. This may limit the scope of reasoning challenges that can be explored, as different languages present unique cognitive and linguistic hurdles. Lastly, the analysis and correction of errors in the reasoning steps are currently based on solutions generated by three LLMs, namely GPT-3.5, Mistral-Medium, and Claude 2. It is important to note that different LLMs, as well as different individuals, may exhibit distinct reasoning and error patterns. It would therefore be beneficial to broaden the spectrum of solutions analyzed, incorporating a more diverse range of LLMs and even human responses. This would not only enhance the robustness of the evaluation framework but also provide a more nuanced understanding of the reasoning processes at play.
### A.2 Broader Impact
#### Positive Societal Impacts
The proposed dataset MR-Ben has the potential to bring about significant positive societal impacts. It can contribute to the development and enhancement of LLMs by providing a comprehensive benchmark suite, which researchers and developers can use to identify and address the limitations and weaknesses of their models. This can lead to more accurate, efficient, and reliable LLMs. The meta-reasoning paradigm might open new avenues in AI research, leading to a deeper understanding of reasoning capabilities and the development of innovative methodologies for their evaluation and improvement. Moreover, with a wide range of subjects, MR-Ben can be a valuable resource for educational AI tools, providing personalized learning experiences and helping students understand and improve their reasoning skills. AI systems with improved reasoning capabilities can also be instrumental in various sectors, including healthcare, finance, and environmental management, aiding in complex decision-making and problem-solving tasks.
#### Negative Societal Impacts
MR-Ben may also present potential negative societal impacts. As with any technology, there is a risk of LLMs being misused or used maliciously. For instance, LLMs with advanced reasoning capabilities could be used to manipulate information or deceive people. The use of LLMs in decision-making and problem-solving tasks could lead to an over-reliance on these systems, potentially undermining human judgment and critical thinking skills. Advanced LLMs, especially those used in sensitive sectors like healthcare and finance, need to handle vast amounts of data, which can raise privacy and security concerns if not managed properly.
### A.3 Additional Related Work
#### Improving Reasoning Abilities of LLMs
To enhance the reasoning capabilities of LLMs, prior research primarily focuses on specific prompting techniques [11]. Existing efforts include few-shot prompting with demonstrations augmented with intermediate steps [66, 72, 69] and zero-shot prompting with specific instructions [36, 74]. Although these methods have shown promising results, their effectiveness is often constrained by their task-specific nature and the labour-intensive process of designing prompts, leading to inconsistent outcomes across tasks [75, 80]. Another strategy involves instruction tuning or knowledge distillation, which elicits reasoning paths from LLMs without explicit prompting [13, 49, 26, 44]. These approaches typically require resource-intensive fine-tuning of LLMs and a large set of examples annotated with CoT.
#### Learning From Feedback
Improving LLMs through learning from feedback has become a prevalent strategy, notably through reinforcement learning from human feedback, which seeks to align LLMs with human values by refining their outputs based on feedback [53, 8]. However, this method faces challenges such as high costs due to manual labor and a lack of real-time feedback capabilities [20]. An alternative strategy involves using self-correcting LLMs, which rely on automated feedback to iteratively adapt and understand the consequences of their actions without relying on humans. This feedback can be derived from outside sources such as other models [70, 45], tools [24, 29], knowledge bases [21, 76], evaluation metrics [34, 67] or generation logits [73].
## Appendix B Robustness of MR-Score
| Model | Coding | Phy. | Bio. | Med. | Chem. | Logic | Math |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4-Turbo | 83/55 | 137/15 | 164/11 | 305/46 | 194/25 | 166/27 | 192/16 |
| DeepSeek-Coder | 100/38 | 145/7 | 167/8 | 321/30 | 200/19 | 172/21 | 193/15 |
| Qwen2-72B | 99/39 | 142/10 | 167/8 | 312/39 | 195/24 | 172/21 | 200/8 |
Table 5: Scoring of error reasons from different models across subjects.
| Subject | Coding | Phy. | Bio. | Med. | Chem. | Logic | Math |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Agreement Ratio | 7/8 | 12/13 | 21/21 | 12/12 | 15/17 | 15/16 | 10/13 |
Table 6: Agreement ratio between the author and the proxy scoring model across different subjects.
Question: Does the ACC_reason metric's dependency on the judgments of different LLMs or human evaluators lead to variability in scoring?
Answer: We would like to argue that due to the careful design of our evaluation mechanism, the automatic scoring of error reasons is both robust and economically feasible:
- Multiple annotators: During the annotation stage, we collected multiple annotations for the first error reasons and potential error rectification from different annotators who agreed on the solution correctness and the first error step.
- Proxy Model Evaluation: Based on the ground-truth annotations collected from multiple perspectives, the proxy language model (e.g., GPT-4-Turbo) then examines the error reasons provided by the evaluating models. Given the question/solution pair together with the annotated first error step, error reason, and rectification, flaws in the submitted error reasons are easy to diagnose by contrast.
- ACC_reason robustness: Table 5 shows scores of error reasons sampled from our evaluation results. For the same set of error reasons in each subject, three different models predicted correctness or incorrectness, and their predictions are clearly consistent across all subjects. Since MR-Score is a weighted metric, the resulting variability in the final score is less than 1 percent in total.
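The proxy-scoring step described above can be sketched as follows. This is a hypothetical illustration: the prompt wording, the `query_proxy` helper, and the verdict parsing are assumptions for readability, not the paper's actual implementation.

```python
# Hypothetical sketch of the proxy-model grading step; the prompt text and
# query_proxy() interface are illustrative assumptions, not MR-Ben's exact code.

def build_grading_prompt(question, solution, annotation, candidate_reason):
    """Assemble the context the proxy scorer uses to contrast against ground truth."""
    return (
        f"Question:\n{question}\n\n"
        f"Solution:\n{solution}\n\n"
        f"Ground-truth first error step: {annotation['first_error_step']}\n"
        f"Ground-truth error reason: {annotation['error_reason']}\n"
        f"Ground-truth rectification: {annotation['rectification']}\n\n"
        f"Candidate error reason from the evaluated model:\n{candidate_reason}\n\n"
        "Does the candidate error reason match the ground truth? "
        "Answer 'correct' or 'incorrect'."
    )

def score_error_reason(query_proxy, question, solution, annotation, candidate_reason):
    """Return True iff the proxy model judges the candidate error reason correct."""
    prompt = build_grading_prompt(question, solution, annotation, candidate_reason)
    verdict = query_proxy(prompt)  # e.g., a call to a GPT-4-Turbo endpoint
    return verdict.strip().lower().startswith("correct")
```

Because the proxy sees the annotated first error step, reason, and rectification alongside the candidate, the judgment reduces to a comparison task rather than open-ended re-solving, which is what keeps the scoring robust.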
Human-Model Agreement Rate: As mentioned in Section 3, the agreement rate between manual annotations and GPT-4 predictions over 100 samples randomly collected from all subjects is 92%. The exact setup is as follows:
We randomly collected 100 data instances, across all subjects, where the evaluating model correctly identified the solution correctness and the first error step. We then manually examined whether the proxy scoring model (e.g., GPT-4-Turbo-2024-04-09) correctly scored the error reasons of the evaluating models. Table 6 gives the detailed per-subject composition of the cases where the author agrees with the proxy scoring model. The annotation time varies significantly across subjects: some problems, such as coding and chemistry, can take more than 10 minutes to evaluate, while subjects like biology are easier to assess. This high agreement rate further supports the reliability of our evaluation, avoiding the need for manual annotation of potentially 138,000 problems (a 6,000-question benchmark times 23 evaluated models).
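The 92% figure can be reproduced directly from the per-subject counts in Table 6:

```python
from fractions import Fraction

# Per-subject (agreed, total) counts copied from Table 6.
agreement = {
    "Coding": (7, 8), "Phy.": (12, 13), "Bio.": (21, 21), "Med.": (12, 12),
    "Chem.": (15, 17), "Logic": (15, 16), "Math": (10, 13),
}

agreed = sum(a for a, _ in agreement.values())  # 92
total = sum(t for _, t in agreement.values())   # 100
rate = Fraction(agreed, total)
print(f"{agreed}/{total} = {float(rate):.0%}")  # 92/100 = 92%
```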
Table 7: Evaluation results breakdown on MR-Ben: this table presents a detailed breakdown of each model's performance under the MCC/ACC-step/ACC-reason metrics across different subjects. Here $k$ stands for the number of shots, and every model used in this experiment is instruction-tuned.
| Model | Bio. | | Phy. | | Math | | Chem. | | Med. | | Logic | | Coding | | Avg. | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | | $k$=0 | $k$=1 | |
| MR-Scores | | | | | | | | | | | | | | | | | | | | | | | |
| Claude3-Haiku | 5.7 | 5.8 | | 3.3 | 3.5 | | 3.1 | 3.1 | | 6.5 | 6.4 | | 2.0 | 2.0 | | 1.2 | 1.2 | | 9.0 | 0.0 | | 4.4 | 3.1 |
| GPT-3.5-Turbo | 3.6 | 6.6 | | 5.7 | 6.7 | | 5.7 | 5.4 | | 4.9 | 6.7 | | 3.6 | 4.4 | | 1.7 | 4.5 | | 3.0 | 4.1 | | 4.0 | 5.5 |
| Phi3-3.8B | 13.4 | 12.5 | | 12.7 | 10.8 | | 13.3 | 13.1 | | 16.4 | 17.1 | | 10.2 | 8.1 | | 8.4 | 5.3 | | 9.1 | 10.2 | | 11.9 | 11.0 |
| Deepseek-Coder-33B | 7.4 | 5.5 | | 7.8 | 5.6 | | 7.2 | 8.6 | | 7.8 | 7.4 | | 6.0 | 5.5 | | 4.6 | 6.7 | | 8.4 | 4.9 | | 7.0 | 6.3 |
| DeepSeek-Coder-7B | 10.5 | 9.9 | | 11.8 | 9.6 | | 11.8 | 12.1 | | 12.3 | 11.9 | | 10.4 | 11.0 | | 9.8 | 10.7 | | 5.0 | 5.8 | | 10.2 | 10.2 |
| LLaMA3-8B | 12.0 | 11.9 | | 10.9 | 7.5 | | 15.0 | 9.0 | | 12.6 | 12.7 | | 9.3 | 8.0 | | 9.4 | 9.6 | | 15.8 | 10.0 | | 12.2 | 9.8 |
| Qwen1.5-72B | 15.3 | 19.2 | | 12.9 | 13.6 | | 12.0 | 10.0 | | 13.9 | 16.3 | | 11.7 | 14.7 | | 10.4 | 12.9 | | 3.9 | 5.9 | | 11.5 | 13.3 |
| DeepSeek-67B | 17.1 | 19.7 | | 14.9 | 17.3 | | 15.4 | 16.2 | | 16.3 | 20.6 | | 14.7 | 12.2 | | 13.6 | 14.3 | | 14.5 | 15.2 | | 15.2 | 16.5 |
| LLaMA3-70B | 20.4 | 27.1 | | 17.4 | 20.5 | | 14.9 | 15.8 | | 19.5 | 25.1 | | 16.3 | 19.3 | | 16.3 | 16.8 | | 29.8 | 16.7 | | 19.2 | 20.2 |
| Mistral-Large | 22.2 | 28.0 | | 26.7 | 25.4 | | 24.3 | 28.2 | | 24.0 | 27.0 | | 15.9 | 19.3 | | 14.7 | 17.1 | | 21.1 | 21.4 | | 21.3 | 23.8 |
| DeepSeek-V2-236B | 30.0 | 37.1 | | 32.2 | 36.5 | | 32.2 | 30.0 | | 32.5 | 35.4 | | 26.5 | 32.4 | | 23.6 | 27.4 | | 34.2 | 27.1 | | 30.2 | 32.3 |
| GPT-4-Turbo | 44.7 | 47.3 | | 42.8 | 45.2 | | 44.3 | 45.4 | | 44.0 | 46.0 | | 38.8 | 38.4 | | 34.1 | 33.6 | | 53.6 | 57.3 | | 43.2 | 44.7 |
| MCC-Matthews Correlation Coefficient | | | | | | | | | | | | | | | | | | | | | | | |
| Claude3-Haiku | 13.96 | 17.72 | | 16.47 | 13.62 | | 15.09 | 10.74 | | 16.54 | 19.96 | | 8.52 | 8.35 | | 6.21 | 4.94 | | 4.36 | 0 | | 11.59 | 10.76 |
| GPT-3.5-Turbo | 10.72 | 19.44 | | 16.66 | 21.33 | | 17.48 | 17.45 | | 18.24 | 12.6 | | 11.19 | 13.28 | | 4.07 | 0 | | 12.35 | 12.35 | | 12.96 | 13.78 |
| Deepseek-Coder-33B | 7.51 | 8.57 | | 11.73 | 6.81 | | 9.69 | 21.06 | | 9.98 | 7.94 | | 1.62 | 6.28 | | 0 | 0 | | 26.18 | 15.44 | | 9.53 | 9.44 |
| Deepseek-Coder-7B | 4.96 | 9.79 | | 8.77 | 6.72 | | 9.05 | 10.82 | | 10.49 | 9.39 | | 5.02 | 3.17 | | 3.22 | 2.58 | | 10.91 | 6.27 | | 7.49 | 6.96 |
| LlaMA3-8B | 19.37 | 21.15 | | 16.24 | 18.64 | | 26.55 | 21.87 | | 25.99 | 28.6 | | 14.92 | 18.95 | | 11.8 | 16.24 | | 14.54 | 15.72 | | 18.49 | 20.17 |
| Phi3-3.8B | 27.66 | 28.48 | | 21.61 | 21.44 | | 22.29 | 25.17 | | 30.92 | 33.37 | | 17.36 | 14.9 | | 13.03 | 9.56 | | 14.48 | 18.76 | | 21.05 | 21.67 |
| Qwen1.5-72B | 33.64 | 42.44 | | 31.4 | 31.56 | | 29.2 | 23.28 | | 35.47 | 36.47 | | 21.76 | 29.64 | | 24.42 | 27.74 | | 13.8 | 15.69 | | 27.1 | 29.55 |
| Deepseek-67B | 43.61 | 41.73 | | 24.16 | 28.77 | | 24.95 | 23.87 | | 36.58 | 37.29 | | 27.8 | 28.93 | | 26.74 | 25.09 | | 28.23 | 29.06 | | 30.3 | 30.68 |
| LlaMA3-70B | 45.67 | 56.14 | | 40.34 | 41.3 | | 32.76 | 30.94 | | 41.72 | 52.12 | | 33.18 | 37.75 | | 32.0 | 33.87 | | 47.86 | 29.67 | | 39.08 | 40.26 |
| Mistral-Large | 41.67 | 49.0 | | 34.24 | 33.47 | | 29.0 | 37.05 | | 41.99 | 47.07 | | 23.76 | 32.05 | | 25.66 | 33.25 | | 37.05 | 33.52 | | 33.34 | 37.92 |
| Deepseek-v2-236B | 52.96 | 53.38 | | 41.81 | 46.48 | | 43.75 | 40.53 | | 54.32 | 50.15 | | 37.61 | 44.53 | | 36.36 | 35.41 | | 45.89 | 35.7 | | 44.67 | 43.74 |
| GPT-4-Turbo | 63.33 | 62.59 | | 52.9 | 52.7 | | 50.67 | 52.84 | | 53.05 | 54.59 | | 56.79 | 54.66 | | 40.95 | 42.94 | | 52.5 | 57.53 | | 52.88 | 53.98 |
| Accuracy of First Error Step | | | | | | | | | | | | | | | | | | | | | | | |
| Claude3-Haiku | 2.15 | 3.1 | | 1.4 | 1.12 | | 2.38 | 1.59 | | 1.77 | 4.42 | | 1.69 | 0.68 | | 1.01 | 0.29 | | 0.0 | 0.0 | | 1.49 | 1.6 |
| GPT-3.5-Turbo | 2.86 | 4.53 | | 4.2 | 4.76 | | 4.37 | 3.84 | | 2.87 | 8.17 | | 2.37 | 3.05 | | 1.73 | 7.63 | | 0.61 | 2.44 | | 2.72 | 4.92 |
| Deepseek-Coder-33B | 14.83 | 10.29 | | 14.94 | 12.36 | | 14.69 | 10.92 | | 15.67 | 16.31 | | 14.54 | 12.43 | | 12.22 | 18.18 | | 5.49 | 3.05 | | 13.2 | 11.93 |
| Deepseek-Coder-7B | 21.77 | 18.18 | | 23.28 | 19.83 | | 23.41 | 20.03 | | 24.46 | 23.18 | | 23.29 | 26.09 | | 20.03 | 24.72 | | 4.27 | 6.1 | | 20.07 | 19.73 |
| LlaMA3-8B | 14.35 | 14.35 | | 17.53 | 8.62 | | 20.29 | 7.8 | | 14.16 | 11.59 | | 13.13 | 8.41 | | 13.64 | 11.36 | | 17.68 | 9.76 | | 15.83 | 10.27 |
| Phi3-3.8B | 12.68 | 11.48 | | 16.38 | 12.07 | | 17.69 | 16.12 | | 18.03 | 17.17 | | 12.78 | 8.76 | | 10.23 | 6.96 | | 8.54 | 9.15 | | 13.76 | 11.67 |
| Qwen1.5-72B | 11.48 | 15.31 | | 10.63 | 11.49 | | 10.79 | 9.88 | | 12.45 | 14.38 | | 11.03 | 13.49 | | 8.1 | 10.94 | | 1.83 | 4.27 | | 9.47 | 11.39 |
| Deepseek-67B | 13.16 | 19.14 | | 19.25 | 21.84 | | 20.81 | 22.11 | | 17.17 | 23.39 | | 14.71 | 12.08 | | 12.78 | 13.49 | | 12.2 | 14.02 | | 15.72 | 18.01 |
| LlaMA3-70B | 15.79 | 22.25 | | 14.66 | 18.39 | | 13.65 | 15.08 | | 17.81 | 22.32 | | 14.36 | 16.64 | | 14.91 | 14.91 | | 26.83 | 13.41 | | 16.86 | 17.57 |
| Mistral-Large | 18.38 | 25.54 | | 26.33 | 28.29 | | 27.28 | 33.25 | | 26.05 | 27.59 | | 16.92 | 19.29 | | 14.53 | 14.96 | | 19.51 | 19.51 | | 21.29 | 24.06 |
| Deepseek-v2-236B | 27.51 | 35.41 | | 37.64 | 40.23 | | 36.28 | 34.33 | | 33.91 | 37.55 | | 27.5 | 32.57 | | 22.87 | 27.56 | | 33.54 | 26.83 | | 31.32 | 33.5 |
| GPT-4-Turbo | 41.77 | 46.06 | | 42.58 | 46.5 | | 46.49 | 47.81 | | 42.6 | 46.8 | | 37.06 | 36.72 | | 29.93 | 33.24 | | 50.61 | 59.15 | | 41.58 | 45.18 |
| Accuracy of First Error Reason | | | | | | | | | | | | | | | | | | | | | | | |
| Claude3-Haiku | 1.67 | 2.63 | | 0.56 | 0.84 | | 1.19 | 0.93 | | 0.22 | 2.21 | | 1.18 | 0.34 | | 0.72 | 0.29 | | 0.0 | 0.0 | | 0.79 | 1.03 |
| GPT-3.5-Turbo | 1.19 | 2.63 | | 2.24 | 1.96 | | 1.85 | 1.46 | | 0.88 | 3.53 | | 1.35 | 1.69 | | 0.72 | 4.17 | | 0.61 | 1.83 | | 1.26 | 2.47 |
| Deepseek-Coder-33B | 2.87 | 1.44 | | 2.01 | 1.15 | | 1.69 | 2.21 | | 2.15 | 1.93 | | 2.63 | 1.05 | | 1.85 | 2.56 | | 3.05 | 1.83 | | 2.32 | 1.74 |
| Deepseek-Coder-7B | 5.98 | 5.02 | | 6.03 | 4.6 | | 5.98 | 7.93 | | 5.79 | 6.22 | | 4.9 | 5.08 | | 6.25 | 5.54 | | 3.05 | 5.49 | | 5.43 | 5.7 |
| LlaMA3-8B | 7.66 | 6.7 | | 4.89 | 2.3 | | 7.28 | 4.55 | | 6.22 | 7.08 | | 4.73 | 3.33 | | 5.97 | 5.97 | | 15.24 | 7.93 | | 7.43 | 5.41 |
| Phi3-3.8B | 8.13 | 6.7 | | 6.9 | 5.75 | | 7.02 | 6.37 | | 9.66 | 10.52 | | 5.78 | 5.08 | | 5.4 | 2.7 | | 7.32 | 7.32 | | 7.17 | 6.35 |
| Qwen1.5-72B | 10.29 | 12.2 | | 6.9 | 7.76 | | 5.85 | 4.81 | | 6.22 | 9.44 | | 8.06 | 9.46 | | 6.11 | 8.24 | | 1.22 | 3.05 | | 6.38 | 7.85 |
| Deepseek-67B | 8.85 | 11.24 | | 8.62 | 10.06 | | 8.32 | 9.49 | | 7.73 | 12.23 | | 9.46 | 5.6 | | 8.81 | 10.37 | | 10.37 | 10.37 | | 8.88 | 9.91 |
| LlaMA3-70B | 13.16 | 18.42 | | 9.77 | 13.51 | | 8.58 | 10.27 | | 11.59 | 15.88 | | 10.68 | 13.49 | | 10.94 | 11.08 | | 24.39 | 13.41 | | 12.73 | 13.72 |
| Mistral-Large | 15.27 | 21.0 | | 19.05 | 20.45 | | 15.1 | 21.72 | | 16.56 | 18.54 | | 13.03 | 14.21 | | 11.22 | 11.94 | | 17.07 | 17.68 | | 15.33 | 17.94 |
| Deepseek-v2-236B | 22.25 | 31.58 | | 25.57 | 30.17 | | 25.1 | 23.28 | | 22.96 | 28.11 | | 21.54 | 27.5 | | 18.89 | 24.15 | | 29.88 | 23.78 | | 23.74 | 26.94 |
| GPT-4-Turbo | 39.14 | 42.0 | | 38.38 | 41.46 | | 40.4 | 41.06 | | 36.64 | 42.16 | | 32.83 | 32.83 | | 27.63 | 30.07 | | 50.61 | 56.1 | | 37.95 | 40.81 |
Question: Is the MR-Score sensitive to different weightings? Is MR-Score a robust unified metric?
Table 7 shows the per-model breakdown across all four metrics (MR-Score, MCC, ACC_step, and ACC_reason):
1. Metric Robustness: Due to the progressive nature of our subtask definitions (the success of each subtask depends on the previous ones), scores diminish from MCC to ACC_step to ACC_reason. Nevertheless, thanks to the design of our evaluation mechanism and metrics, the score rankings of different models stay in a relatively stable order across metrics. In other words, we have not observed any model that excels at determining solution correctness (and thus scores high on MCC) yet cannot explain the rationale behind it (i.e., scores low on ACC_reason).
2. Task Difficulties: As the breakdown table shows, the ACC_reason metric is more discriminative than MCC for competent models, and vice versa for less competent ones. This aligns with the intuition that more difficult questions better discriminate among strong candidates, while weaker candidates simply cannot solve them. This phenomenon partly explains why MR-Score is generally insensitive to minor changes in the subtask weightings: the discriminative power of the subtask metrics tends to agree across scenarios.
3. Differentiability and Interpretability: The MR-Score weights were ultimately chosen by balancing discriminative ability against interpretability. To best differentiate models, we conducted a thorough grid search over the weightings. Since the search returned several near-optimal weightings, we deliberately selected the one that assigns higher weight to the more difficult subtasks. We believe the current weighting ratio strikes a good balance between interpretability and differentiation: for example, GPT-4-Turbo, Deepseek-v2-236B, and Mistral-Large achieve 86.4%, 78.5%, and 81.2% respectively on MMLU but score 43.2%, 29.4%, and 21.3% on our benchmark.
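As a concrete illustration of the weighting discussion above, MR-Score can be sketched as a weighted sum of the three sub-metrics. The specific weights (0.2/0.3/0.5) and the clamping of negative MCC to zero below follow the companion MR-GSM8K formulation and are assumptions here, not guaranteed to match MR-Ben's exact definition:

```python
# Sketch of MR-Score as a weighted combination of the three sub-metrics.
# Weights (0.2 / 0.3 / 0.5) and max(0, MCC) clamping are assumed, following
# the MR-GSM8K formulation cited in the paper.

def mr_score(mcc, acc_step, acc_reason, weights=(0.2, 0.3, 0.5)):
    """Combine MCC, ACC_step and ACC_reason (all on a 0-100 scale) into one score."""
    w_mcc, w_step, w_reason = weights
    return w_mcc * max(0.0, mcc) + w_step * acc_step + w_reason * acc_reason

# GPT-4-Turbo, Math, k=0 (sub-metric values taken from Table 7):
print(round(mr_score(50.67, 46.49, 40.40), 1))  # 44.3, matching the reported MR-Score
```

Under these assumed weights, a perturbation of a few points in any single weight shifts the composite by well under the gaps separating the strongest models, which is consistent with the insensitivity claim above.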
## Appendix C More Discussion on Biases
To quantitatively assess the relationship between solution length and correctness, we calculated Pearson correlation coefficients, reported in Table 8. The results suggest varying dynamics across disciplines in how solution length affects the likelihood of correctness. For subjects such as coding, chemistry, and math, longer solutions are less likely to be correct, which could suggest that complexity or elaboration in responses may lead to mistakes or incorrect reasoning. For medicine, there is a weak tendency for longer solutions to be slightly more correct, possibly because more detailed or thorough explanations are favourable. For the other subjects, solution length does not appear to significantly affect correctness, indicating that other factors likely play a more dominant role in determining solution quality. Overall, the Pearson coefficients reflect the distinct nature of problem-solving in each field of our benchmark.
Table 8: Pearson Correlation Between Solution Length and Correctness
| Subject | Pearson Coefficient | p-value |
| --- | --- | --- |
| Medicine | 0.094 | 0.0072 |
| Physics | -0.061 | 0.111 |
| Biology | 0.009 | 0.783 |
| Chemistry | -0.127 | 0.00018 |
| Coding | -0.199 | 0.0021 |
| Logic | 0.0002 | 0.995 |
| Math | -0.115 | 0.00049 |
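The coefficients in Table 8 are Pearson correlations between solution length and a binary (0/1) correctness label. A minimal stdlib-only sketch, using made-up toy data rather than the benchmark's actual solutions, looks like this:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example (values made up): longer solutions tending to be wrong yields a
# negative r, the pattern Table 8 reports for coding, chemistry and math.
lengths = [120, 340, 560, 180, 720, 410]  # tokens per solution
correct = [1, 1, 0, 1, 0, 0]              # 1 = correct solution
print(pearson_r(lengths, correct))        # negative correlation
```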
Table 9: Evaluation results of models on MR-Ben in few-shot settings: this table presents a detailed breakdown of each model's performance under the MR-Score metric across different subjects.
| Model | $k$ | Bio. | Phy. | Math | Chem. | Med. | Logic | Coding | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2B | 0 | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.7 | 0.1 |
| | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.4 | 0.2 | 0.0 | 0.2 |
| | 2 | 0.1 | 0.2 | 0.7 | 0.6 | 0.7 | 0.2 | 0.0 | 0.4 |
| | 3 | 0.1 | 0.3 | 1.1 | 0.1 | 0.7 | 0.3 | 0.0 | 0.4 |
| Llama3-8B | 0 | 11.1 | 14.9 | 14.8 | 12.8 | 9.4 | 9.6 | 9.1 | 11.7 |
| | 1 | 11.7 | 8.1 | 7.8 | 12.8 | 7.3 | 10.7 | 5.7 | 9.2 |
| | 2 | 9.7 | 7.8 | 11.1 | 8.8 | 6.4 | 6.2 | 2.4 | 7.5 |
| | 3 | 10.0 | 10.7 | 8.3 | 8.2 | 5.5 | 5.3 | 3.0 | 7.3 |
| Llama3-70B | 0 | 19.9 | 15.4 | 15.0 | 17.6 | 14.6 | 13.5 | 28.2 | 17.7 |
| | 1 | 30.5 | 21.4 | 16.8 | 26.2 | 16.9 | 16.0 | 15.3 | 20.4 |
| | 2 | 27.2 | 19.9 | 16.8 | 22.0 | 15.9 | 17.5 | 19.5 | 19.8 |
| | 3 | 27.2 | 20.6 | 16.3 | 21.1 | 16.0 | 14.6 | 19.4 | 19.3 |
| GPT-4-Turbo | 0 | 44.7 | 42.8 | 44.3 | 44.0 | 38.8 | 34.1 | 53.6 | 43.2 |
| | 1 | 47.3 | 45.2 | 45.4 | 46.0 | 38.4 | 33.6 | 57.3 | 44.7 |
| | 2 | 46.6 | 42.7 | 44.9 | 43.3 | 42.1 | 35.9 | 53.0 | 44.1 |
| | 3 | 44.0 | 44.8 | 46.5 | 44.4 | 41.2 | 33.7 | 56.6 | 44.5 |
<details>
<summary>extracted/6085487/figures/Mr-Ben-Pipeline.jpg Details</summary>

### Visual Description
## Diagram: AI Model Evaluation Across Academic Domains
### Overview
The image is a conceptual diagram illustrating a workflow for evaluating and improving AI models across multiple academic disciplines. It depicts a process where domain knowledge is fed into AI models, which generate step-by-step solutions to problems, followed by human evaluation and feedback for error correction.
### Components/Axes
The diagram is organized into three main vertical sections, flowing from left to right:
1. **Left Section (Input Domains):** A large pie chart divided into seven colored segments, each representing an academic field.
* **Labels (clockwise from top-left):** PHYSICS (light blue), LOGIC (pale green), MATH (olive green), CODING (light orange), BIOLOGY (peach), CHEMISTRY (pink), MEDICINE (lavender).
* **Spatial Grounding:** The pie chart occupies the left third of the image. The labels are placed within their respective colored segments.
2. **Center Section (AI Models):** Three distinct AI model icons arranged vertically, connected by light blue arrows from the pie chart and pointing towards the right section.
* **Top Icon:** A green square with a white, intricate, knot-like symbol (resembling the OpenAI logo).
* **Middle Icon:** A brown square with the letters "AI" in black.
* **Bottom Icon:** A white square with a stylized, pixelated "H" in orange, yellow, and red (resembling the Hugging Face logo).
3. **Right Section (Evaluation & Feedback):** This section shows a parallel process for three example domains (Logic, Math, Coding), with ellipses (`......`) indicating the process repeats for others.
* **For each domain, there are two components:**
* **A. Problem/Solution Card:** A rectangular card with a colored header and border.
* **Header:** The domain name (e.g., "Logic").
* **Content:** A structured Q&A format:
* `Q:` followed by a gray placeholder bar and a question mark `?`.
* `A:` followed by a gray placeholder bar.
* A list of steps: `Step 1: ......;`, `Step 2: ......;`, `......`, `Step n: ......`.
* **B. Feedback Box:** A colored box connected to the card by an arrow, containing evaluation results.
* **Fields:**
* `Correctness:` followed by a symbol (✗ for incorrect, ✓ for correct).
* `First Error Step:` (e.g., "2", "N/A", "5").
* `Error Reason:` (e.g., "......", "N/A").
* `Rectified Step:` (e.g., "......", "N/A").
* **Spatial Grounding & Color Matching:**
* The **Logic** card (top) has a pale green header/border. Its feedback box is also pale green and shows `Correctness: ✗`.
* The **Math** card (middle) has an olive green header/border. Its feedback box is olive green and shows `Correctness: ✓`.
* The **Coding** card (bottom) has a light orange header/border. Its feedback box is light orange and shows `Correctness: ✗`.
* Between the cards and feedback boxes, there are small, faint icons of people with speech bubbles, symbolizing human evaluators.
### Detailed Analysis
The diagram outlines a clear, multi-stage pipeline:
1. **Domain Input:** Knowledge or problems from seven academic domains (Medicine, Physics, Logic, Math, Coding, Biology, Chemistry) serve as the input source.
2. **AI Processing:** This input is processed by one or more AI models (represented by the three central icons).
3. **Solution Generation:** For a given domain (e.g., Logic), the AI generates a structured, step-by-step solution to a question (`Q:`).
4. **Human Evaluation:** Human evaluators (implied by the people icons) assess the AI's solution.
5. **Feedback & Correction:** The evaluation produces a structured feedback report detailing correctness, the step where the first error occurred (if any), the reason for the error, and a rectified version of that step. This feedback loop is designed for iterative model improvement.
### Key Observations
* **Selective Correctness:** In the examples shown, only the Math solution is marked as fully correct (`✓`, `First Error Step: N/A`). The Logic and Coding solutions contain errors, identified at Step 2 and Step 5, respectively.
* **Structured Error Analysis:** The feedback format is consistent and granular, focusing on identifying the *first* error step, which is crucial for efficient debugging and training.
* **Visual Coding:** Colors are used systematically to link each domain's problem card to its corresponding feedback box, ensuring clarity in the parallel workflows.
* **Scalability:** The use of ellipses (`......`) in both the step lists and between the domain examples indicates this is a scalable framework applicable to many problems and domains beyond the three shown.
### Interpretation
This diagram represents a **human-in-the-loop framework for benchmarking and improving AI reasoning capabilities**. It moves beyond simple right/wrong assessment to a diagnostic approach.
* **What it demonstrates:** The process is designed to create high-quality training data. By pinpointing the exact step where an AI's reasoning fails and providing a correction, the system generates targeted examples for fine-tuning models. This is more effective than training on just final answers.
* **Relationship between elements:** The flow is linear but implies a cycle: Domains → AI Generation → Human Evaluation → Feedback. This feedback is presumably used to retrain the AI models (the central icons), creating a closed loop for continuous improvement. The variety of domains suggests the goal is to develop a generalist AI with robust reasoning skills across STEM and professional fields.
* **Notable implication:** The inclusion of "Medicine" and "Coding" alongside pure sciences like "Physics" and "Math" indicates an ambition to apply this rigorous, step-wise evaluation to practical, high-stakes fields where explainable and correct reasoning is critical. The framework treats all domains with the same analytical rigor.
</details>
Figure 6: Illustration of the dataset creation pipeline of MR-Ben. We first compile a set of questions from different subjects and then collect solutions from different LLMs. For each subject, a group of domain experts is recruited to annotate each question/solution pair for solution correctness, first error step, and error reason.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Screenshot: Prompt Template for Solution Generation
### Overview
The image displays a text-based prompt template designed to guide the generation of a step-by-step analysis for a multiple-choice question. The template is presented within a bordered box with a dark blue header. It contains placeholder variables (e.g., `{subject}`, `{df.iloc[idx]['Question']}`) intended to be dynamically populated from a dataset, likely a pandas DataFrame. The primary language is English.
### Content Details
The complete textual content of the image is transcribed below. The text is presented in a monospaced font, with placeholders highlighted in a light blue color.
**Header (Dark Blue Bar):**
```
Prompt for Solution Generation During Dataset Compilation
```
**Main Body (Light Background):**
```
Please generate a step-by-step analysis for the following Question in the subject {subject}.
Question: {df.iloc[idx]['Question']}
Choice_A: {df.iloc[idx]['Choice_A']}
Choice_B: {df.iloc[idx]['Choice_B']}
Choice_C: {df.iloc[idx]['Choice_C']}
Choice_D: {df.iloc[idx]['Choice_D']}
Here is the desired format, please analyse each candidate choice sequentially and then jointly decide which option is the solution in the final step.
Please ensure every newline character follows a step indicator:
Step 1: [The first reasoning of the step by step analysis on the candidate choices here]
Step 2: [The second reasoning of the step by step analysis on the candidate choices here]
...
Step n: [Conclude your analysis and decide which choice to make here]
Solution: Choice_A/B/C/D
Please follow this format without any additional introductory or concluding statements.
```
### Key Observations
1. **Structure:** The template is highly structured, enforcing a specific output format with numbered steps and a final "Solution:" line.
2. **Placeholders:** It uses Python/pandas-style placeholders (`{df.iloc[idx]['Column_Name']}`), indicating it is designed for programmatic use within a data processing pipeline.
3. **Instructions:** The instructions are explicit, requiring sequential analysis of each choice (A, B, C, D) before a joint conclusion. It mandates strict adherence to the format, prohibiting extra text.
4. **Visual Design:** The text uses color differentiation (light blue for placeholders, brownish-orange for static text) to enhance readability for a human reviewing the template code.
### Interpretation
This image is not a data visualization but a **meta-template**âa set of instructions for an AI or a script to generate another piece of content (a step-by-step solution analysis).
* **Purpose:** Its core function is to standardize the output format for solution rationales during the compilation of a dataset, likely for training or evaluating AI models on reasoning tasks. The strict format ensures consistency across many generated examples.
* **Underlying Process:** The placeholders reveal the intended workflow: a script would iterate over rows (`idx`) of a DataFrame (`df`), inject the subject, question, and choices into this template, and then feed the completed prompt to a language model to produce the formatted analysis.
* **Design Intent:** The emphasis on "every newline character follows a step indicator" and the prohibition of additional statements suggests the downstream process expects machine-parsable output. The template is engineered for automation, not human conversation.
* **Context:** This is a behind-the-scenes component of a larger AI/ML dataset creation pipeline, specifically for building a dataset of reasoned solutions to multiple-choice questions.
</details>
Figure 7: The prompt we used for solution generation during the dataset compilation stage. Note that apart from coding, every question in our dataset takes the form of a multiple-choice problem.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: Three-Round Interaction Prompt Template for Self-Refine
### Overview
The image displays a structured text template titled "Three-Round Interaction Prompt Template for Self-Refine." It outlines a three-stage process for evaluating and improving a generated response from an AI model. The template is presented as a block of text with placeholders (indicated by curly braces `{}`) for dynamic content. The text is primarily in English, with no other languages present.
### Components/Axes
The template is a single, continuous block of text contained within a bordered box. It has a distinct header and a body structured into three sequential interaction rounds.
**Header (Top Center):**
* **Title:** "Three-Round Interaction Prompt Template for Self-Refine"
**Body (Left-Aligned Text):**
The body consists of instructional text and placeholders. The placeholders are visually distinct, rendered in an orange-brown color (`#d28a6e` approximate), while the instructional text is in a darker gray.
**Text Transcription (Exact):**
```
Following is a question/solution pair in subject {sol['Subject']}. Your task is to examine the solutions step by step and determine the solution correctness. If the solution is incorrect, please further find out the first error step and explain the error reason.
. . . . . .
{Generated Response From Evaluated Model}
Review your previous answer and find problems with your answer
{Review Response From Evaluated Model}
Based on the problems you found, improve your answer.
Please follow the desired response format:
. . . . . .
{Self-Refined Response From Evaluated Model}
```
### Detailed Analysis
The template defines a clear, three-round interaction flow:
1. **Round 1 - Initial Evaluation:**
* **Prompt:** Instructs the model to examine a solution pair for a given subject (`{sol['Subject']}`), check its correctness step-by-step, and identify the first error with an explanation if incorrect.
* **Expected Output:** `{Generated Response From Evaluated Model}`. This is the model's initial analysis.
2. **Round 2 - Self-Review:**
* **Prompt:** Asks the model to review its own previous answer (from Round 1) and identify problems within it.
* **Expected Output:** `{Review Response From Evaluated Model}`. This is the model's critical self-assessment.
3. **Round 3 - Refinement:**
* **Prompt:** Instructs the model to improve its original answer based on the problems found during the self-review. It specifies adherence to a "desired response format."
* **Expected Output:** `{Self-Refined Response From Evaluated Model}`. This is the final, improved output.
The ellipses (`. . . . . .`) appear to be visual separators or placeholders for additional, unspecified content or formatting.
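The three-round flow can be sketched as a minimal chat loop. Here `call_model` is a hypothetical stand-in for whatever inference API is used; only the two fixed follow-up prompts are taken from the transcription above.

```python
def self_refine(call_model, question_block: str) -> dict:
    """Run the three-round generate -> critique -> refine protocol."""
    history = []

    def turn(prompt: str) -> str:
        # Append the user prompt, query the model, and record its reply.
        history.append({"role": "user", "content": prompt})
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        return reply

    initial = turn(question_block)  # Round 1: initial evaluation
    review = turn("Review your previous answer and find problems with your answer")
    refined = turn("Based on the problems you found, improve your answer.")
    return {"initial": initial, "review": review, "refined": refined}

# Toy model that just echoes the round number, to show the call pattern.
counter = {"n": 0}
def call_model(history):
    counter["n"] += 1
    return f"response to round {counter['n']}"

out = self_refine(call_model, "Following is a question/solution pair ...")
# out["refined"] holds the round-3, self-refined answer.
```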
### Key Observations
* **Iterative Structure:** The template enforces a strict iterative loop: Generate -> Critique -> Refine.
* **Self-Correction Focus:** The core mechanism is having the model critique and improve its own output, a technique known as "self-refinement" or "self-critique."
* **Placeholder Design:** The use of `{key}` syntax (e.g., `{sol['Subject']}`) suggests integration with a data structure (likely a Python dictionary) where `sol` is a variable containing problem data.
* **Visual Hierarchy:** The title is prominent in a dark blue header bar. The instructional text is uniform, while the placeholders are highlighted in a contrasting color to denote variable input/output zones.
### Interpretation
This image is not a data chart but a **procedural diagram for an AI interaction protocol**. It visually documents a method to enhance the reliability and accuracy of AI-generated solutions through structured self-evaluation.
* **What it demonstrates:** The template operationalizes a "chain of thought" or "reasoning trace" approach combined with self-criticism. It breaks down the complex task of solution verification into manageable, sequential steps, forcing the model to articulate its reasoning, find flaws, and then correct them.
* **How elements relate:** The three rounds are causally linked. The output of Round 1 becomes the subject of critique in Round 2, and the findings from Round 2 directly inform the improvements made in Round 3. The `{sol['Subject']}` placeholder is the constant input that grounds all three rounds in the same original problem.
* **Purpose and Utility:** This template is likely used in research or development settings to:
1. **Evaluate model capabilities:** Test a model's ability to perform accurate self-assessment.
2. **Improve output quality:** Generate more correct and refined final answers than a single-pass generation.
3. **Create training data:** The triplets (initial response, critique, refined response) can be used to fine-tune models to be better self-refiners.
* **Notable Design Choice:** The explicit instruction to "find out the **first** error step" is significant. It prevents the model from overwhelming the user with a list of all possible errors and instead focuses the correction process on the most critical, foundational mistake, which is often more efficient for learning and debugging.
</details>
Figure 8: The prompt we used for the self-refine experiment. Note that three consecutive inference calls are made to perform this most basic form of self-correction.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Comparative Pie Chart Grid: AI Model Performance Across Three Tasks
### Overview
The image displays a 3x4 grid of pie charts, comparing the performance of four different AI models (GPT-4-turbo, Llama-3-70B-Instruct, Llama-3-8B-Instruct, and gemma-2b-it) across three distinct tasks (Task 1, Task 2, Task 3). Each pie chart breaks down the model's performance into four categories, representing the transition of answer correctness. A legend at the bottom defines these categories.
### Components/Axes
* **Grid Structure:** The charts are arranged in three rows (one per Task) and four columns (one per Model).
* **Chart Titles:** Each pie chart is titled with the format "Task [Number] - [Model Name]".
* **Legend:** Located at the bottom center of the image. It defines four color-coded categories:
* Light Blue: `Correct to Incorrect`
* Dark Blue: `Incorrect to Correct`
* Light Green: `Correct to Correct`
* Dark Green: `Incorrect to Incorrect`
* **Data Labels:** Each pie chart segment is labeled with a percentage value.
### Detailed Analysis
**Row 1: Task 1**
* **GPT-4-turbo (Top-Left):**
* Correct to Incorrect (Light Blue): 8.5%
* Incorrect to Correct (Dark Blue): 8.5%
* Correct to Correct (Light Green): 69.7%
* Incorrect to Incorrect (Dark Green): 13.4%
* **Llama-3-70B-Instruct (Top-Center-Left):**
* Correct to Incorrect (Light Blue): 12.1%
* Incorrect to Correct (Dark Blue): 21.6%
* Correct to Correct (Light Green): 51.9%
* Incorrect to Incorrect (Dark Green): 14.4%
* **Llama-3-8B-Instruct (Top-Center-Right):**
* Correct to Incorrect (Light Blue): 20.8%
* Incorrect to Correct (Dark Blue): 17.2%
* Correct to Correct (Light Green): 40.9%
* Incorrect to Incorrect (Dark Green): 21.1%
* **gemma-2b-it (Top-Right):**
* Correct to Incorrect (Light Blue): 6.4%
* Incorrect to Correct (Dark Blue): 4.7%
* Correct to Correct (Light Green): 3.8%
* Incorrect to Incorrect (Dark Green): 85.2%
**Row 2: Task 2**
* **GPT-4-turbo (Middle-Left):**
* Correct to Incorrect (Light Blue): 4.8%
* Incorrect to Correct (Dark Blue): 8.6%
* Correct to Correct (Light Green): 37.0%
* Incorrect to Incorrect (Dark Green): 49.5%
* **Llama-3-70B-Instruct (Middle-Center-Left):**
* Correct to Incorrect (Light Blue): 2.1%
* Incorrect to Correct (Dark Blue): 16.6%
* Correct to Correct (Light Green): 14.9%
* Incorrect to Incorrect (Dark Green): 66.4%
* **Llama-3-8B-Instruct (Middle-Center-Right):**
* Correct to Incorrect (Light Blue): 7.0%
* Incorrect to Correct (Dark Blue): 10.1%
* Correct to Correct (Light Green): 11.1%
* Incorrect to Incorrect (Dark Green): 71.9%
* **gemma-2b-it (Middle-Right):**
* Correct to Incorrect (Light Blue): 0.0% (Value label is partially obscured/overlapping with "Incorrect to Correct" label, appears as "0.0% 0.9%").
* Incorrect to Correct (Dark Blue): 0.9% (Approximate, based on overlapping label).
* Correct to Correct (Light Green): 0.0% (Not visibly labeled, inferred from chart).
* Incorrect to Incorrect (Dark Green): 99.1%
**Row 3: Task 3**
* **GPT-4-turbo (Bottom-Left):**
* Correct to Incorrect (Light Blue): 9.8%
* Incorrect to Correct (Dark Blue): 0.0%
* Correct to Correct (Light Green): 76.6%
* Incorrect to Incorrect (Dark Green): 13.6%
* **Llama-3-70B-Instruct (Bottom-Center-Left):**
* Correct to Incorrect (Light Blue): 7.2%
* Incorrect to Correct (Dark Blue): 0.0%
* Correct to Correct (Light Green): 49.9%
* Incorrect to Incorrect (Dark Green): 42.9%
* **Llama-3-8B-Instruct (Bottom-Center-Right):**
* Correct to Incorrect (Light Blue): 15.3%
* Incorrect to Correct (Dark Blue): 0.0%
* Correct to Correct (Light Green): 21.6%
* Incorrect to Incorrect (Dark Green): 63.1%
* **gemma-2b-it (Bottom-Right):**
* Correct to Incorrect (Light Blue): 6.2%
* Incorrect to Correct (Dark Blue): 0.0%
* Correct to Correct (Light Green): 0.0% (Not visibly labeled, inferred from chart).
* Incorrect to Incorrect (Dark Green): 93.8%
### Key Observations
1. **Model Performance Hierarchy:** Across all tasks, GPT-4-turbo and Llama-3-70B-Instruct consistently show higher percentages of "Correct to Correct" (Light Green) responses compared to the smaller models, Llama-3-8B-Instruct and gemma-2b-it.
2. **Task Difficulty:** Task 2 appears to be the most challenging for all models, as indicated by the significantly lower "Correct to Correct" percentages and higher "Incorrect to Incorrect" (Dark Green) percentages compared to Tasks 1 and 3.
3. **gemma-2b-it Performance:** The gemma-2b-it model shows extreme results. For Task 1 and Task 2, it is overwhelmingly in the "Incorrect to Incorrect" category (85.2% and 99.1%, respectively). For Task 3, it is 93.8% "Incorrect to Incorrect" with a small 6.2% "Correct to Incorrect" segment. It shows minimal to no "Correct to Correct" or "Incorrect to Correct" responses.
4. **Stability vs. Change:** The "Correct to Correct" and "Incorrect to Incorrect" categories represent stable states (the model's answer did not change correctness), while "Correct to Incorrect" and "Incorrect to Correct" represent changes. Larger models show more stability in correctness (higher "Correct to Correct").
5. **Data Anomaly:** The pie chart for "Task 2 - gemma-2b-it" has overlapping and partially illegible percentage labels for the two smallest segments, making precise extraction difficult.
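The four legend categories can be derived mechanically from a pair of before/after correctness labels. A minimal sketch (the label strings match the legend; the toy data is illustrative):

```python
from collections import Counter

def transition(before_correct: bool, after_correct: bool) -> str:
    """Map a (pre-refine, post-refine) correctness pair to its legend category."""
    first = "Correct" if before_correct else "Incorrect"
    second = "Correct" if after_correct else "Incorrect"
    return f"{first} to {second}"

# Toy per-question labels: (correct before self-refine, correct after).
pairs = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(transition(b, a) for b, a in pairs)
shares = {k: 100 * v / len(pairs) for k, v in counts.items()}
# e.g. shares["Correct to Correct"] == 40.0 for this toy sample
```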
### Interpretation
This grid of pie charts provides a nuanced view of model performance beyond simple accuracy. It categorizes outcomes based on the *change* in correctness, likely comparing model outputs to a ground truth or a previous iteration.
* **What the data suggests:** The data demonstrates a clear correlation between model scale/capability and performance stability. Larger models (GPT-4-turbo, Llama-3-70B) are not only more often correct but are also more *consistently* correct ("Correct to Correct"). Conversely, the smallest model (gemma-2b-it) is consistently incorrect and stable in its incorrectness ("Incorrect to Incorrect") across all tasks, suggesting a fundamental capability gap for these specific tasks.
* **Relationship between elements:** The rows (Tasks) allow for comparison of how the same model performs on different problems. The columns (Models) allow for direct comparison of different models on the same problem. The color categories reveal the dynamics of model performance: whether errors are persistent or if the model can correct them.
* **Notable trends:** The most striking trend is the performance cliff between the 8B and 2B parameter models. While Llama-3-8B-Instruct shows a mix of outcomes, gemma-2b-it fails almost completely on these tasks. Another key trend is the increased difficulty of Task 2, which suppresses the "Correct to Correct" rate for all models and inflates the "Incorrect to Incorrect" rate.
* **Underlying information:** This visualization is likely from a study on model robustness, self-correction, or consistency. The "Correct to Incorrect" category is particularly interesting, as it may indicate cases where a model initially had the right answer but was led astray, perhaps by a chain-of-thought process or a secondary evaluation. The near-zero "Incorrect to Correct" values for Task 3 across all models suggest that for this specific task, if a model is initially wrong, it has virtually no chance of self-correcting to a right answer.
</details>
Figure 9: The task-level performance breakdown for the self-refine experiment, where Tasks 1, 2, and 3 refer to solution correctness, first error step, and error reason determination, respectively.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Technical Document: Prompt for Response Generation
### Overview
The image displays a structured text document titled "Prompt for Response Generation." It is a template or instruction set designed to guide an AI or evaluator in analyzing a provided question/solution pair. The document defines specific fields for evaluation, provides a template with placeholders for dynamic content, and mandates a strict output format. The text is entirely in English.
### Content Structure
The document is organized into several distinct sections within a bordered frame with a dark blue header.
1. **Header:** "Prompt for Response Generation" in white text on a dark blue background.
2. **Primary Task Description:** A paragraph outlining the core task: to examine a question/solution pair, determine solution correctness, and identify the first error step if the solution is incorrect.
3. **Field Definitions:** A detailed section defining the evaluation criteria:
* **Solution Correctness:** Asks if the solution correctly answers the question with justifiable reasoning and selected corrected options.
* **First Error Step:** Defines three categories for each step:
* *Correct:* Sound logic, correct computation, leads to correct answer.
* *Neutral:* Explanatory or background-focused, no obvious mistakes, but unclear if it leads to the correct answer.
* *Incorrect:* Contains factual, computational, or logic errors that may or may not derail the reasoning.
* **Error Reason:** Requires specifying the errors in the identified first error step and suggesting a rectified reasoning step.
4. **Template Section with Placeholders:** This section contains the dynamic parts of the prompt, marked by curly braces `{}`.
* `{k_shot_demo}`: A placeholder, likely for inserting few-shot demonstration examples.
* "Below is the question and solution for you to solve:"
* `Question: {sol['Question']}`
* `Options: {sol['Options']}`
* `Step by Step Solution: {sol['Model_Solution_Steps']}`
* `{hint_sent}`: Another placeholder, likely for optional hints.
5. **Mandated Response Format:** A final section instructing the responder to follow a specific format without any additional introductory or concluding statements. The required format is:
* `Solution Analysis: [Give a step by step analysis on the solution correctness here]`
* `Solution Correctness: [Input 'correct'/'incorrect' here to indicate the overall correctness of the solution]`
* `First Error Step: [Input 'Step x' here to indicate the first error step here. Input 'N/A' if the solution is correct.]`
* `Error Reason: [Input the error reason and the rectified reasoning of the first error step here. Input 'N/A' if the solution is correct.]`
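Because the format is mandated verbatim, a response can be split back into its four fields with simple pattern matching. The field names follow the format above; the regex and its whitespace handling are an assumption, not the paper's parser:

```python
import re

FIELDS = ["Solution Analysis", "Solution Correctness",
          "First Error Step", "Error Reason"]

def parse_response(text: str) -> dict:
    """Split a mandated-format response into its four labelled fields."""
    # Each field runs from its label up to the next label (or end of text).
    labels = "|".join(re.escape(f) for f in FIELDS)
    parsed = {}
    for field in FIELDS:
        m = re.search(rf"{re.escape(field)}:\s*(.*?)(?=(?:{labels}):|$)",
                      text, re.S)
        parsed[field] = m.group(1).strip() if m else None
    return parsed

sample = (
    "Solution Analysis: Step 1 is sound; Step 2 drops a sign.\n"
    "Solution Correctness: incorrect\n"
    "First Error Step: Step 2\n"
    "Error Reason: The sign of the velocity term was flipped.\n"
)
parsed = parse_response(sample)
# parsed["Solution Correctness"] == "incorrect"
```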
### Detailed Analysis
The document is a meta-prompt: a prompt designed to generate or structure another AI's response. Its primary function is to enforce a rigorous, step-by-step evaluation of a solution's logical validity.
* **Key Definitions:** The definitions for "Correct," "Neutral," and "Incorrect" steps are precise. A "Neutral" step is particularly interesting; it is not erroneous but is flagged for potentially being non-contributory or insufficiently directed toward the answer.
* **Placeholders:** The use of `{sol['Subject']}`, `{sol['Question']}`, etc., indicates this is a template where specific problem data is inserted programmatically. `{k_shot_demo}` suggests the prompt can be augmented with examples to guide the evaluator's style.
* **Strict Format:** The final instruction ("Please follow this format without any additional introductory or concluding statements") is absolute, aiming to produce standardized, machine-parsable output.
### Key Observations
1. **Focus on Process over Outcome:** The evaluation is heavily weighted on the *reasoning process* ("step by step analysis") rather than just the final answer. Identifying the "First Error Step" is a core requirement.
2. **Granular Error Classification:** The system distinguishes between steps that are outright wrong and steps that are merely unhelpful or neutral, allowing for nuanced feedback.
3. **Template-Driven Design:** The document is clearly a reusable component in a larger pipeline, likely for automated grading, model training, or generating critique datasets.
4. **Visual Layout:** The text is presented in a monospaced font (like Courier) within a light gray box, mimicking a code block or terminal output, which reinforces its technical, programmatic nature.
### Interpretation
This document is a **prompt engineering template for automated solution evaluation**. Its purpose is to standardize the critique of step-by-step problem-solving, likely in an educational or AI training context.
* **What it demonstrates:** It reveals a sophisticated approach to assessment that values logical soundness and process transparency. By requiring the identification of the *first* error, it encourages pinpointing the root cause of failure rather than just listing all mistakes.
* **How elements relate:** The field definitions directly inform the mandated response format. The placeholders connect this static instruction set to dynamic problem data. The strict output format ensures consistency for downstream processing.
* **Notable implications:** The inclusion of a "Neutral" step category is insightful. It acknowledges that not all explanatory text is erroneous, but some may be inefficient or off-topic, a subtle distinction important for training models to generate concise, relevant reasoning. The entire structure is designed to minimize ambiguity and subjective judgment in the evaluation process.
</details>
Figure 10: The prompt template we used to evaluate all the models. The k-shot-demo and hint-sent placeholders hold the few-shot examples and the solution-correctness prior, respectively, or are left as empty strings, depending on the experiment setup.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Text Document: Prompt Template for Error Reason Scoring
### Overview
The image displays a text-based prompt template designed for evaluating a student's explanation of an error in a problem solution. It is structured as a set of instructions for an evaluator (role-played as an experienced teacher) and contains placeholder variables (e.g., `{data['Subject']}`) that would be populated with specific data in a live system. The document is purely textual with no charts, diagrams, or images embedded within it.
### Components/Axes
The document is visually divided into two main regions:
1. **Header Region:** A dark blue, rounded rectangle containing the title text in white.
2. **Main Content Region:** A light-colored background containing the instructional text in a monospaced font. The text is left-aligned.
**Title (in Header):** `Prompt for Scoring Error Reasons`
### Detailed Analysis / Content Details
The following is a precise transcription of all text content within the main body of the document.
**Introductory Paragraph:**
```
As an experienced {data['Subject']} teacher, your assistance is required to evaluate a
student's explanation regarding the error in a problem solution. The task involves a
detailed understanding of the problem, the incorrect solution provided, and the ground
truth behind the error. Your analysis should focus on whether the student's explanation
aligns with the actual error in the solution.
```
**Instruction Line:**
```
Please find the details below:
```
**List of Data Placeholders:**
```
- Question: {data['Question']}
- Incorrect Solution Provided: {data['Model_Solution_Steps']}
- First Incorrect Step in the Solution: {data['Model_Solution_First_Error_Step']}
- Ground Truth Error Reasons: {data['Model_Solution_Error_Reason']}
- Ground Truth Rectified Steps: {data['Model_Solution_Rectified_First_Error_Step']}
- Student's Explanation of the Error: {data['Evaluation_Result']['error_reason']}
```
**Instruction for Output:**
```
Based on this information, please provide the following:
```
**Required Output Format (Numbered List):**
```
1. Step-by-Step Reasoning: [Offer a succinct, step-by-step interpretation of the ground
truth error reason.]
2. Student Error Reason Analysis: [Analyze the student's explanation step by step,
determining its accuracy in reflecting the actual error briefly.]
3. Final Decision: [State only 'Correct' or 'Wrong' to conclude whether the student's
explanation correctly identifies the error based on your analysis.]
```
**Final Instruction:**
```
Please follow this format without any additional introductory or concluding statements.
```
### Key Observations
1. **Template Nature:** The document is a template, not a filled-out example. All specific data points (the question, solutions, error reasons) are represented by placeholder variables in the format `{data['key']}`.
2. **Structured Evaluation:** It enforces a strict, three-part output format (Reasoning, Analysis, Decision) to ensure consistent evaluation.
3. **Role-Play Context:** The evaluator is instructed to adopt the persona of an "experienced {data['Subject']} teacher," suggesting the subject matter is academic.
4. **Data Dependencies:** The template relies on a structured data object (`data`) with specific keys to function, indicating it is part of a larger automated or semi-automated system.
### Interpretation
This document is a **prompt engineering template** for an AI or a human evaluator within an educational technology or AI training pipeline. Its purpose is to standardize the assessment of how well a student (or a model acting as a student) can diagnose errors in a provided solution.
The core investigative logic it promotes is:
1. **Ground Truth Establishment:** First, understand the *actual* error from authoritative sources (the "Ground Truth Error Reasons" and "Rectified Steps").
2. **Comparative Analysis:** Then, compare the student's stated reason for the error against this ground truth.
3. **Binary Judgment:** Finally, make a definitive "Correct" or "Wrong" judgment on the student's diagnostic accuracy.
The template's design minimizes evaluator bias by providing all necessary context and demanding a specific, concise output format. It is likely used to generate training data for AI models, to assess student performance at scale, or to fine-tune an AI's ability to perform pedagogical reasoning. The presence of placeholders like `{data['Evaluation_Result']['error_reason']}` suggests it might be part of a recursive system where one model's output (the student's explanation) is fed into this evaluation prompt.
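The binary judgment in step 3 can be extracted from the scorer's reply mechanically. A small sketch, assuming the reply follows the three-part numbered format above (the regex is an assumption about formatting, not the paper's actual extraction code):

```python
import re
from typing import Optional

def extract_decision(reply: str) -> Optional[str]:
    """Return 'Correct' or 'Wrong' from the scorer's 'Final Decision' line."""
    m = re.search(r"Final Decision:\s*\[?\s*(Correct|Wrong)", reply)
    return m.group(1) if m else None

reply = (
    "1. Step-by-Step Reasoning: The solution mis-applies the chain rule.\n"
    "2. Student Error Reason Analysis: The student points at the same step.\n"
    "3. Final Decision: Correct\n"
)
decision = extract_decision(reply)
# decision == "Correct"
```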
</details>
Figure 11: The prompt template we used to ask GPT-4 to score the error reasons given by the evaluated models. Note that although deciding solution correctness is difficult, deciding error-reason correctness is much easier given the ground-truth annotations.
## Appendix D Evaluation Prompt
Figure 10 is the prompt template we used to evaluate all the models in our paper. Note that minor modifications to this template can heavily affect the evaluation results. For example, introducing a simple hint sentence, "Hint: This solution is incorrect. Please focus on looking for the First Error Step and Error Reason.", drastically improves model performance, as shown in 4. Likewise, simply removing the "Solution Analysis" line from the response-format part of the prompt makes the evaluated model output the scoring result directly, without step-wise CoT analysis of the solution. This setup leads to a near-zero MR-Score, as discussed in Section 5.
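The two optional slots (k-shot demos and the hint sentence) can be sketched as a small assembly helper. The surrounding template lines and the function itself are illustrative, not the exact prompt; only the hint sentence is quoted from the text above.

```python
def build_eval_prompt(question: str, options: str, steps: str,
                      k_shot_demo: str = "", hint_sent: str = "") -> str:
    """Assemble the evaluation prompt, leaving optional slots empty by default."""
    parts = [
        k_shot_demo,  # few-shot demonstrations, or "" in the zero-shot setup
        "Below is the question and solution for you to solve:",
        f"Question: {question}",
        f"Options: {options}",
        f"Step by Step Solution: {steps}",
        hint_sent,    # solution-correctness prior, or "" when no hint is given
    ]
    # Drop empty slots so the prompt has no blank lines where nothing was supplied.
    return "\n".join(p for p in parts if p)

hint = ("Hint: This solution is incorrect. Please focus on looking for the "
        "First Error Step and Error Reason.")
with_hint = build_eval_prompt("Q", "A) ... B) ...", "Step 1: ...", hint_sent=hint)
without = build_eval_prompt("Q", "A) ... B) ...", "Step 1: ...")
# with_hint ends with the hint sentence; without omits it entirely.
```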
Figure 7 is the prompt we used to query language models for solution generation during the dataset compilation phase. Note that in the prompt, we specifically asked the model to analyse each option of the multiple-choice problem. This is crucial for examining whether the model possesses a comprehensive understanding of the topics the question covers.
Figure 11 shows the prompt we used to query GPT-4 to score the error reasons returned by the evaluated models. Although the original task of determining solution correctness is challenging, it is much easier to determine whether the error reason from an evaluated model aligns with the ground-truth error reason.
Figure 8 shows the prompt template we used for the self-refine experiment. Note that we followed the setting of [31] without introducing any prior assumptions or knowledge. This minimal version of extra prompting relies mostly on the language model's own capability to perform the self-refine procedure.
## Appendix E Self Refine Analysis
In this section, we present the results of self-refine at the task level. Specifically, we examine how the evaluated models change their labels when determining solution correctness, as shown in Figure 9. We summarize our observations below:
- Small models like Gemma-2B are too limited to perform effective self-reflection.
- Competent models like GPT-4-Turbo are confident in their initial decisions and hardly ever switch them during self-reflection.
- Intermediate models like Llama3-70B exhibit substantial changes during self-reflection, indicating a lack of consistency in their decisions. However, their incorrect-to-correct switches happen significantly more often when locating the first error step than when examining solution correctness or explaining the error reason, which boosts the overall MR-Score by a large margin. We believe this inconsistency does not necessarily indicate more robust or advanced reasoning, despite the improved evaluation results.
- Conclusion: Our results support the observation that LLMs generally lack effective self-refinement capabilities [31].
## Appendix F Error Analysis
We provide qualitative analyses of how GPT-4, taken as an example model, performed on our benchmark across all seven subjects. The purpose is to offer a deeper understanding of the types and causes of errors made by the evaluated models, to inform future improvements. For each subject in the subsections below, a failure case and a success case are listed. Following the MR-Ben evaluation framework, each case demonstration consists of the following parts: (1) the original question, options, ground-truth final answer, and LLM-generated CoT solution; (2) human annotations of step-wise error detection, explanation, and correction; (3) the evaluation annotation from GPT-4 on the aforementioned LLM-generated CoT solution; (4) the scoring result for the error reason, if the evaluated model identifies the correct first error step.
From our analysis of sampled failure cases, several general observations can be made. First, the assessed model GPT-4 exhibits a persistent "false positive bias" on our benchmark across all subjects: in cases where the LLM makes incorrect evaluations, the proportion of type I errors is much higher than that of type II errors. In other words, GPT-4 tends to overlook the mistakes present in incorrect model solutions and mislabel them as correct, while seldom mislabeling correct solution steps as incorrect. In fact, among the 42 sampled cases we surveyed spanning the seven subjects, all failure cases (size = 21) belong to the type I error category. We offer two possible explanations for this bias: (a) input bias: the evaluated LLMs are instruction-tuned and inherently biased to follow the prompt input. Even when the models are asked in the prompt to judge these CoT solution steps fairly with binary labels, their labeling threshold is likely shifted towards positive judgments. This is a common issue when using LLMs as generation evaluators and may be mitigated by adjusting the prompt design or by other debiasing methods [42, 79]; (b) self-preference bias: it has drawn recent attention that state-of-the-art models display self-preference bias, the phenomenon in which an LLM inherently favors its own generated output over text from other LLMs and humans [43, 54]. The experimental results of LLMs from the same families as the three sampled models (GPT-3.5-Turbo-0125 [50], Claude2 [5], and Mistral-Medium [32]) may therefore be affected. With the increasingly extensive use of self-evaluation and LLM-as-judge methods, we call future researchers' attention to this potential issue.
Second, the MR-Ben benchmark revealed many intricate cases where the assessed model GPT-4 reached a correct final answer through incorrect solution steps, challenging the models' multi-step reasoning capabilities to a greater extent. As shown in the failure cases in math, physics, biology, etc., our benchmark evaluation is able to identify step errors that the sampled model made in its solution even when its final answer matches the ground-truth choice. While such step errors can be inconsequential for producing the correct final answer in the demonstrated failure cases, they can become significant in only slightly altered questions, as mentioned in the error analysis section of MMLU [27]. In contrast, our framework, by decomposing the question and model solutions, remains relatively immune to nuances in question framing. This highlights an important strength of the MR-Ben benchmark: it is not only fine-grained but also robust compared to previous benchmarks.
Lastly, subtle nuances in model performance across different reasoning paradigms manifest in the case demonstrations of specific subjects; they are interpreted case by case in the captioned figures listed below.
See pages - of error_analysis/all.pdf
## Appendix G Computational Resources Used
In this paper, all experiments are performed either on open-source models with local inference or on closed-source models via API calls. For local inference, we use A800 machines with 8 GPUs. A full evaluation of a 70B language model on our 6k-question benchmark typically takes around 2 hours using fast inference libraries such as vLLM. For smaller language models such as Phi-3 or Gemma, the compute time is shorter.
## Appendix H Annotation Guidelines
Below we provide the original annotation guidelines distributed to annotators of distinctive subjects included in the MR-Ben benchmark: math, biology, physics, chemistry, logic, medicine, and coding. The guidelines serve as the primary training material and instructions for annotators to complete the labeling tasks, specified with detailed descriptions, requirements, and standards.
See pages - of annotation_guidelines/math.pdf
See pages - of annotation_guidelines/physics.pdf
See pages - of annotation_guidelines/biology.pdf
See pages - of annotation_guidelines/medicine.pdf
See pages - of annotation_guidelines/chemistry.pdf
See pages - of annotation_guidelines/logic.pdf
See pages - of annotation_guidelines/coding.pdf
## NeurIPS Paper Checklist
1. Claims
   1. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   2. Answer: [Yes]
   3. Justification: Yes, the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope.
   4. Guidelines:
      - The answer NA means that the abstract and introduction do not include the claims made in the paper.
      - The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
      - The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
      - It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
1. Limitations
1. Question: Does the paper discuss the limitations of the work performed by the authors?
1. Answer: [Yes]
1. Justification: The limitations of this work are discussed in § A.1.
1. Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
1. Theory Assumptions and Proofs
1. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
1. Answer: [N/A]
1. Justification: Our work does not involve theoretical assumptions and proofs.
1. Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
1. Experimental Result Reproducibility
1. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
1. Answer: [Yes]
1. Justification: We disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and conclusions of the paper, detailed in § 5.1.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
1. If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
1. If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
1. If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
1. We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
1. Open access to data and code
1. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
1. Answer: [Yes]
1. Justification: We open-sourced our evaluation benchmark and scripts as described in Section 1. Additionally, we have detailed the experimental setup in the paper (§ 5), including model selection, hyperparameter settings, data selection, evaluation metrics, hardware resources, etc.
1. Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
1. Experimental Setting/Details
1. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
1. Answer: [Yes]
1. Justification: We provide comprehensive dataset statistics, evaluation metric descriptions, hyperparameters, and tool usage in § 5.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
1. Experiment Statistical Significance
1. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
1. Answer: [Yes]
1. Justification: Our proposed method is an inference-only approach for LLMs, and we adopt a greedy-decoding strategy for all of our experiments, making the experimental results of each run deterministic and consistent.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
1. Experiments Compute Resources
1. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
1. Answer: [Yes]
1. Justification: We provide comprehensive experimental setup and hardware computation resources used in § G.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
1. Code Of Ethics
1. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
1. Answer: [Yes]
1. Justification: We confirm that the research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics, and all authors preserve anonymity.
1. Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
1. Broader Impacts
1. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
1. Answer: [Yes]
1. Justification: The broader impacts of our paper are presented in § A.2.
1. Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
1. Safeguards
1. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
1. Answer: [N/A]
1. Justification: Our dataset focuses on evaluation rather than training models. We leverage existing datasets rather than scrape from the Internet.
1. Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
1. Licenses for existing assets
1. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
1. Answer: [Yes]
1. Justification: All the assets, i.e., codes, data and models used in our paper, are properly credited and we explicitly mention and properly respect the license and terms of use.
1. Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
1. New Assets
1. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
1. Answer: [Yes]
1. Justification: We have submitted the anonymized dataset, codes, and corresponding documents together with the paper.
1. Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
1. Crowdsourcing and Research with Human Subjects
1. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
1. Answer: [Yes]
1. Justification: The full text of instructions given to human annotators is presented in § H.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
1. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
1. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
1. Answer: [N/A]
1. Justification: We engaged human annotators solely for dataset labeling; they were not subjects of our study. Furthermore, we partnered with a legally recognized annotation company, which has obtained all necessary governmental approvals to operate its annotation business.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.