2406.07545
# Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
> Joint first author & equal contribution.
## Abstract
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities, influencing the prediction of answers based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the result to new ones. Another problem with MCQ is the lottery-ticket choice by "random guessing": the LLM does not possess the particular knowledge, yet guesses the option correctly. This situation is especially serious for small-scale LLMs; for instance, on MMLU the random-guessing accuracy is 25%, and most small-scale LLMs obtain results around this value, as shown in [35, 46], making it difficult to distinguish which model is better. To address these issues, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, this transition brings its own set of challenges: (1) identifying suitable open-style questions, and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these significant difficulties and establish a new LLM evaluation benchmark built entirely from open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track the performance of various LLMs, such as GPT-4o/4/3.5, Claude 3, and Gemini, and reflect their true capabilities. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
## 1 Introduction
| Question that is suitable for open-style: Let x = 1. What is x << 3 in Python 3? |
| --- |
| Options: A. 1 B. 3 C. 8 D. 16 |
| Answer: C |
| Question that is not suitable for open-style: Which of the following statements is true? |
| Options: |
| A. Every equivalence relation is a partial-ordering relation. |
| B. Number of relations from A = {x, y, z} to B = {1, 2} is 64. |
| C. Empty relation _ is reflexive |
| D. Properties of a relation being symmetric and being un-symmetric are negative of each other. |
| Answer: B |
Figure 1: Examples of MCQ from MMLU.
Large language models (LLMs) are increasingly excelling at various natural language processing tasks, including text generation [11], translation [45, 50], summarization [22], code generation [20, 33], and chatbot interaction [28]. With this rising capability, a robust evaluation strategy that can accurately assess the performance of these models is becoming crucial in order to identify their true effectiveness and choose the most appropriate one for a given task. Common metrics for assessing LLMs today include relevance, frequency of hallucinations, accuracy in question answering, toxicity, and retrieval-specific metrics, among others. In the context of question-answering evaluations, prior works usually investigate the model's performance in terms of answer accuracy, courtesy, and conciseness. Multiple-choice questions (MCQ) have emerged as a predominant format for such assessments, wherein a question is presented with several possible responses, and the model is required to select the most fitting choice ID, as exemplified in Figure 1. Lately, the MCQ format has seen widespread application in LLM-focused contexts, including benchmarks [18, 44, 12] that examine LLM capabilities and automated/crowdsourcing evaluation frameworks [21, 49, 5] that streamline the assessment process.
However, previous studies [48, 32] have discussed that the lack of resilience of LLMs to changes in the positioning of options stems from their tendency to exhibit biased behavior: they often favor certain option IDs (such as "Option A") as responses, a phenomenon referred to as selection bias. Moreover, selection bias exists widely across various LLMs and cannot be mitigated by simple prompting skills. The underlying reason is that the model is trained with an a priori distribution that assigns higher probabilities to specific ID tokens. Another issue with MCQ is "random guessing", as discussed in [35]. Specifically, small models, such as 1B-level variants, may struggle to achieve reliable predictions on many benchmarks like MMLU, which uses four choices as the answer candidates for each question. Their results can resemble random choices, not truly capturing the model's actual capabilities.
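The permutation-based mitigation from prior work averages the model's choice probabilities over reorderings of the option contents, so that position bias washes out. A minimal sketch using cyclic shifts; `score_fn` is a hypothetical stand-in for querying the LLM with the options presented in one particular order:

```python
def cyclic_debias(score_fn, options):
    """Average each option's probability over all cyclic orderings.

    score_fn(ordered_options) -> list of probabilities, one per position
    (a stand-in for one LLM query with options in that order).
    Returns a dict mapping option text -> position-averaged probability.
    """
    n = len(options)
    totals = {opt: 0.0 for opt in options}
    for shift in range(n):
        # rotate the option contents so every option visits every position
        ordered = options[shift:] + options[:shift]
        probs = score_fn(ordered)
        for pos, opt in enumerate(ordered):
            totals[opt] += probs[pos]
    return {opt: s / n for opt, s in totals.items()}
```

If the model's preference is purely positional (e.g., it always puts extra mass on position A), the averaged scores become uniform, isolating genuine content preference from ID bias.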
To fundamentally eliminate selection bias and random guessing in LLMs, in this work, we build an open-style question benchmark for LLM evaluation. Leveraging this benchmark, we present the Open-LLM-Leaderboard, a new automated framework designed to refine the assessment process of LLMs. This framework functions as a supplement to prior evaluation frameworks such as [21, 49, 5], with several advantages as presented in Sec. 4.4. However, constructing such a benchmark has two significant challenges: (1) how to determine the appropriate questions that can be effectively transformed from MCQ into open-style questions, and (2) how to establish an approach to accurately validate the correctness of the LLM's open-style answers in comparison to human-annotated ground truths, especially in contrast to MCQ, which typically have defined single-choice standard answers.
For the first challenge of identifying the multiple-choice questions that are suitable for converting to open-style questions, we design an automatic coarse-to-fine selection protocol through customized prompts and a multi-stage filtering process. Specifically, in the first stage, we use binary classification to place questions with high confidence in the positive pool, assigning the others as negative. Our second stage uses a soft scoring method (1-10 ratings) to judge the suitability for open style of the questions classified as negative in the first stage. For the second challenge of evaluating the correctness of the LLM's open-style answers in comparison to human-annotated ground truths, we further design a task-specific prompt and leverage GPT-4 to examine whether the response is correct. To validate the accuracy of the automatic evaluation strategy, we randomly sample 100 results, manually check the automatic evaluation results against the corresponding responses, and confirm that it is reliable with an error rate of less than 5%.
In our end-to-end assessment of the LLM evaluation and ranking process, we conduct a comprehensive analysis on the well-recognized LLMs, including GPT-4o, GPT-4, ChatGPT, Claude-3 Opus, Gemini-Pro and Mistral-Large. Our benchmarking results indicate that GPT-4o currently holds the position as the strongest LLM. We further provide a small regime LLM leaderboard targeting at LLMs smaller than 3B. Moreover, our study demonstrates a high correlation between the rankings produced by our open-style benchmark and those derived from user-based evaluations or direct human assessments.
## 2 Related Work
Large Language Models (LLMs). Recent advancements in LLMs, such as GPT-3 [9] and GPT-4 [28] have had a significant impact in the field of natural language processing and have found widespread application across various domains. It has indeed initiated a kind of chain reaction within the community and beyond. As each new iteration of LLMs demonstrates enhanced capabilities, organizations and researchers across various sectors are motivated to develop their own models, such as LLaMA [40, 41], Gemini [38], and Claude [2], or find innovative ways to improve existing LLMs through instruction tuning, like Alpaca [37], and Vicuna [10].
Multiple Choice Questions (MCQ). In the realm of LLM research, MCQ has become a pivotal tool for evaluating and enhancing the capabilities of these models. Notable datasets like MMLU [18], HellaSwag [44], and ARC [12] have been instrumental in this regard. Their diverse assessment of broad knowledge and commonsense reasoning helps in benchmarking the depth and versatility of LLMs in understanding, reasoning, and applying knowledge across various domains. MCSB [31] introduces a natural prompting strategy for LLMs, which presents questions and answer choices together, allowing the model to explicitly compare options.
Bias in LLMs. Selection bias, a specific form of bias relevant to the evaluation of LLMs through MCQ, has garnered attention due to its understated and widespread impact. A series of works [49, 30, 48, 42] have shown that LLMs may develop a propensity to favor certain answer choices based on their position or encoding, such as the alphabetical ordering of A/B/C/D in MCQ. This phenomenon can lead to skewed evaluation results, misrepresenting a model's true understanding and reasoning capabilities.
## 3 Approach
<details>
<summary>x1.png Details</summary>

### Visual Description
## Flowchart: Evaluation Process for LLM Question Formats
### Overview
The flowchart illustrates a two-path evaluation system for assessing multiple-choice (MCQ) and open-style questions generated by Large Language Models (LLMs). It includes data collection, classification, accuracy evaluation, and comparative analysis. Key components involve GPT-4 for filtering and scoring, confidence thresholds, and visual comparison of formats.
### Components/Axes
1. **Paths**:
- **Multiple-Choice Questions Path** (left): Starts with dataset collection, response gathering, and accuracy evaluation.
- **Open-Style Questions Path** (bottom): Involves open-format response collection, prompt design, and evaluation.
2. **Decision Nodes**:
- Classification of questions as "YES" (can be written in open style) or "NO" (cannot be written in open style) using GPT-4.
- Confidence score assignment (1–10) for "YES" questions via GPT-4.
3. **Evaluation Metrics**:
- Accuracy evaluation for both paths.
- Comparative analysis using a bar chart (right side).
### Detailed Analysis
1. **Data Flow**:
- **MCQ Path**:
- Collect MCQ datasets → Gather LLM responses in MCQ format → Evaluate accuracy.
- Classify questions as "YES" or "NO" using GPT-4.
- Assign confidence scores (1–10) to "YES" questions.
- If score > threshold → Move to "YES" category; else → Remove.
- **Open-Style Path**:
- Collect LLM responses in open-style format → Design evaluation prompt → Evaluate accuracy.
2. **Visual Elements**:
- **Gauge**: Confidence score range (1–10) with a needle indicator.
- **Bar Chart**: Comparative analysis of both formats (no numerical values provided).
3. **Thresholds**:
- Unspecified confidence score threshold for categorization.
### Key Observations
- **Bifurcation**: The process splits into two distinct paths for MCQ and open-style questions.
- **GPT-4 Dependency**: Critical for classification and confidence scoring, introducing automation but potential bias.
- **Threshold Ambiguity**: The confidence score threshold is not quantified, leaving implementation details undefined.
- **Comparative Analysis**: Visualized via a bar chart, but specific data points are missing.
### Interpretation
The flowchart emphasizes a structured approach to optimizing LLM-generated questions by separating formats and evaluating their suitability for open-style conversion. The use of GPT-4 for classification and scoring suggests an attempt to standardize evaluation, though the lack of threshold specificity and missing bar chart data limits transparency. The comparative analysis implies a focus on identifying which formats perform better under specific criteria, potentially guiding LLM training or prompt engineering. The absence of numerical results in the bar chart raises questions about the practical utility of the comparison without concrete metrics.
</details>
Figure 2: An overview of a dual-path evaluation pipeline for LLMs, starting with the collection of MCQ datasets. It branches into two paths, with the MCQ path proceeding directly from response collection to evaluation, while the open-style path passes through an additional filtering phase. After evaluation, both paths converge in a comparative analysis.
### 3.1 Defining Open-style Questions
Open-style questions, aka open-ended questions, require the model to generate an answer without being constrained by a set of predetermined choices. In the context of LLM evaluation, these questions are designed to assess the model's ability to generate coherent, relevant, and contextually appropriate responses based on the input query. While multiple-choice questions can efficiently assess specific factual knowledge and comprehension, open-style questions offer a deeper insight into the LLM's generative capabilities, understanding of context, and ability to engage with complex tasks. Also, open-style questions avoid the inherent selection-bias and random-guessing weaknesses of multiple-choice questions.
### 3.2 Automatic Open-style Question Filtering and Generation
Multi-stage Filtering and Postprocessing via Coarse-to-fine Process. Our proposed multi-stage filtering approach consists of four main steps to streamline the conversion: (1) Initially classify datasets as either convertible or non-convertible. (2) Assign each question a confidence score to indicate the likelihood that it can be framed as an open-style question. (3) Exclude questions with confidence scores below a specified threshold and classified as non-convertible. (4) Combine questions that are labeled as non-convertible but have high confidence scores with those labeled as convertible.
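The four steps above can be orchestrated as a small pipeline. A sketch under the assumption that the two prompt stages are available as callables; `classify_convertible` and `confidence_score` are hypothetical stand-ins for the GPT-4 prompts of Table 1, and keeping scores equal to the threshold is an assumption:

```python
def select_open_style(questions, classify_convertible, confidence_score, threshold=5):
    """Coarse-to-fine selection of questions convertible to open style.

    classify_convertible(q) -> "YES" or "NO"   (stage 1, binary filter)
    confidence_score(q)     -> int in 1..10    (stage 2, soft rescoring)
    """
    yes_pool, no_pool = [], []
    for q in questions:  # step 1: coarse binary classification
        (yes_pool if classify_convertible(q) == "YES" else no_pool).append(q)
    # steps 2-3: rescore only the "NO" pool; drop low-confidence questions
    rescued = [q for q in no_pool if confidence_score(q) >= threshold]
    # step 4: merge high-confidence "NO" questions with the "YES" pool
    return yes_pool + rescued
```

Only the negative pool pays the cost of the second scoring pass, which matches the cascaded, coarse-to-fine design.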
Stage 1: Preliminary Filter using Binary Classification. Considering that the structure of MCQ varies, converting them into an open-style format is not always possible, particularly because certain questions are strongly linked to their choices, for instance, questions formulated as "Which one of the following is true", "All except", or "Which of these". Such questions are typically unsuitable for conversion into an open-style format, since the absence of the options would change the question's core, resulting in incomplete questions.
Table 1: Prompt design for two-stage filtering and post verification.
| Stage One: Coarse Filtering Prompt |
| --- |
| """Your task is to review a series of multiple-choice questions and evaluate their ability to be answered without the provided answer choices. For questions that begin with an incomplete sentence (e.g., "During swallowing, ..."), use your knowledge to attempt to complete the sentence accurately. For direct questions that ask for specific information or identification (e.g., "Which of the following structures is part of the small intestine?"), assess whether the question is formulated clearly enough that an informed answer can be given without seeing the multiple-choice options. For mathematical or analytical questions (e.g., "Find all cosets of the subgroup 4Z of 2Z"), determine if the question provides enough context and information for a solution to be formulated without additional options. Please follow this format for your evaluation: QUESTION: [Insert the question here] VERDICT: Respond with "YES" if the question is clear and can be directly answered based on its content alone, or "NO" if it relies on the answer choices to be understood or answered. Your response should include only the verdict without any justification or reasoning.""" |
| Stage Two: Fine-grained Filtering Prompt |
| You will assign a numerical score from 1 to 10 based on how confidently it can be answered without the choices. The scoring criteria are as follows: 1: The question is entirely dependent on its choices for an answer, making it impossible to answer without them. Example: "Which of the following statements is correct?" 10: The question can be easily and confidently answered based solely on the question stem, without any need to refer to the provided options. Example: "What is the first law of thermodynamics in physics?" Intermediate Scores: 2-4: The question stem gives very little information and is highly reliant on the choices for context. Example: "Which of these is a prime number?" 5: The question provides some context or information, giving a moderate possibility to answer the question. Example: "Which of the following best describes the structure that collects urine in the body?" 6: The question provides a good amount of context or information, giving a moderate possibility to answer the question. Example: "Statement 1 \| A factor group of a non-Abelian group is non-Abelian. Statement 2 \| If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G." 7: The question provides a good amount of context or information, giving a high possibility to answer the question. Example: "The element (4, 2) of Z_12 x Z_8 has order" 8-9: The question provides a good amount of context or information, giving a high possibility to answer the question. Example: "A "dished face" profile is often associated with" ONLY GIVE THE VALUE BETWEEN 1-10 AS YOUR ANSWER. DO NOT INCLUDE ANY OTHER INFORMATION IN YOUR RESPONSE. Example Format: QUESTION: question here VERDICT: value in [1-10] here |
| GPT-4 Prompt for Verification |
| """Evaluate the answer of an AI model to a question. You will be provided with the question, the AI model's answer, and the correct answer. Your task is to evaluate the AI model's response and determine whether it is Correct or Incorrect. Grade the AI model's answers based ONLY on their factual accuracy. It is OK if the AI model's answer contains more information than the true answer, as long as it does not contain any conflicting statements. Otherwise, it should be marked as Incorrect. Ignore differences in punctuation and phrasing between the AI model's answer and the true answer. Example Format: QUESTION: question here STUDENT ANSWER: student's answer here TRUE ANSWER: true answer here GRADE: Correct or Incorrect here Your response should include only the verdict without any justification or reasoning.""" |
To effectively identify whether multiple-choice questions are suitable for open-style conversion, we leverage prompting techniques to create a customized classification prompt, as shown in Table 1. In the prompt, we integrate different types of questions from different datasets to demonstrate how an LLM like GPT-4 may evaluate whether each question can be written in an open-style way, eventually classifying it as convertible ("YES") or non-convertible ("NO"): the model determines whether a question provides clear context and information without relying on the provided options. We set the prompt to eliminate any additional explanations by stating that "Your response should include only the verdict without any justification or reasoning." This guarantees that the answer to each inquiry is conveyed concisely as "YES" or "NO".
To understand our initial filtering results, we conduct a manual error analysis by selecting 100 questions from the "YES" and "NO" pools separately. In the samples classified as "YES", we find that only around 5% of the questions are false positives, verifying a low misclassification error for the positive question selection by our filtering strategy. Conversely, within the "NO" sample, around 40% of the questions are actually suitable as open-style questions but mistakenly classified as negative. This situation often arises from questions that include phrases like "Which of". Similarly, questions involving true/false statements, sentence completions, or fill-in-the-blanks are also sometimes inappropriately classified as non-convertible. This analysis motivates us to develop a cascaded fine-grained stage to further recover positive questions from the "NO" pool using particular prompts, as described in the following Stage 2 process.
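Issuing the Stage-One prompt and parsing its verdict can be sketched as follows; `llm_call(prompt) -> str` is a hypothetical wrapper around a GPT-4 API call, and the instruction text is abbreviated (the full version is in Table 1):

```python
STAGE_ONE_INSTRUCTIONS = (
    "Your task is to review a series of multiple-choice questions and "
    "evaluate their ability to be answered without the provided answer "
    "choices. [...] Your response should include only the verdict."
)  # abbreviated; see Table 1 for the full prompt

def stage_one_verdict(question, llm_call):
    """Classify one question as convertible ("YES") or not ("NO").

    llm_call(prompt) -> str is a stand-in for the GPT-4 API; the reply
    is normalized so minor whitespace/casing variations still parse.
    """
    prompt = f"{STAGE_ONE_INSTRUCTIONS}\nQUESTION: {question}\nVERDICT:"
    reply = llm_call(prompt).strip().upper()
    if reply not in {"YES", "NO"}:
        raise ValueError(f"unexpected verdict: {reply!r}")
    return reply
```

Raising on anything other than a bare YES/NO makes prompt-compliance failures visible instead of silently miscounting them.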
Stage 2: Confidence Score Assignment. To overcome the issue of classifying questions with specific patterns as non-convertible, we introduce a second stage of filtering centered on confidence score assignment. This involves instructing the large language model to assign a confidence score on a scale from 1 to 10, reflecting the possibility of the question being written in an open-style format. Since a significant number of questions unsuitable for an open-style format are categorized as "NO" with a confidence score below 5, we set the confidence score threshold to 5. Questions classified as non-convertible with a confidence score lower than this threshold are excluded, while those above the threshold, together with those initially classified as convertible, are moved into the "YES" category to be converted to an open-style format.
### 3.3 Open-style Question Answer Evaluation
After establishing a set of convertible questions from various datasets and obtaining their responses from several LLMs, there arises a need to evaluate these responses. Given that our ground-truth answers are based on the MCQ format with defined answers, we need a method for efficiently and accurately validating the correctness of responses to open-style questions. To this end, we design a customized prompt, as shown in Table 1, that utilizes the correct MCQ answer as the ground truth to determine whether the open-style responses are correct or incorrect via the prediction $\hat{y}$:
$$
\hat{y}=\texttt{LLM}_{\texttt{e}}(\text{prompt}(q,\hat{a},a)) \tag{1}
$$
where $\hat{y}$ represents the prediction and $\texttt{LLM}_{\texttt{e}}$ is the LLM evaluator. $q$, $\hat{a}$, and $a$ represent the question, the LLM-generated answer, and the correct answer from the MCQ, respectively, and the prompt is provided in Table 1. While these open-style answers are evaluated against the MCQ's ground truth, misevaluation might arise: a response could be inaccurately classified as correct simply because it contains certain keywords also found in the ground truth. To tackle this issue, we include specific phrases in the prompt. Phrases such as "as long as it does not contain any conflicting statements" ensure that a response is not automatically classified as correct based on the presence of a keyword, avoiding incorrect markings when the response contradicts the correct answer. Additionally, to prevent the exclusion of correct answers that incorporate extra information, we incorporate the phrase "It is OK if the AI model's answer contains more information than the true answer". Furthermore, we highlight that minor differences in punctuation and phrasing between the open-style responses and the ground-truth answers should not lead to their being classified as incorrect. To verify the correctness of the LLM judgement, we take 100 randomly drawn responses from all models; the human evaluation process for our study was conducted by the authors themselves. The agreement between the LLM evaluations and those of a human evaluator was quantitatively assessed using Cohen's kappa [13] (a statistical measure of inter-rater agreement for categorical items, defined as $\kappa=\frac{P_{o}-P_{e}}{1-P_{e}}$, where $P_{o}$ is the observed agreement and $P_{e}$ is the expected agreement by chance), which yielded a score of 0.83. This substantial kappa score verifies that the LLM's ability to determine the correctness of responses aligns closely with human judgment, demonstrating strong reliability in its evaluation process.
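The agreement check follows directly from the definition of Cohen's kappa. A small sketch over Correct/Incorrect verdict labels (the lists are illustrative, not the paper's actual annotations):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed
    agreement rate and P_e the agreement expected by chance, computed
    from each rater's marginal label frequencies."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1 (no variation)
```

Kappa discounts raw percent agreement by the agreement two raters would reach by chance alone, which is why it is preferred over plain accuracy for judging LLM-vs-human verdict consistency.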
## 4 An Open-style Question Benchmark (OSQ-bench)
### 4.1 Statistics and Distributions
Table 2 describes the basic statistics of the dataset questions that are suitable for answering in open-style format. In total, we have evaluated 42K questions from 9 different datasets and more than 23K of them are classified as appropriate for open-style answering.
Table 2: Statistics on open-style questions across different datasets.
| Dataset | # Questions | # Open-style | Avg. length |
| --- | --- | --- | --- |
| MMLU | 14,042 | 7,784 | 36.6 |
| ARC | 3,428 | 3,118 | 21.1 |
| MedMCQA | 4,183 | 2,318 | 14.1 |
| CommonsenseQA | 1,221 | 710 | 13.1 |
| Race | 4,934 | 3,520 | 10.0 |
| OpenbookQA | 1,000 | 491 | 10.3 |
| WinoGrande | 1,267 | 1,267 | 19.1 |
| HellaSwag | 10,042 | 3,915 | 40.1 |
| PIQA | 1,838 | 696 | 7.1 |
| Overall | 41,955 | 23,839 | 19.05 |
### 4.2 Diversity
Our investigation into the diversity of questions within our benchmark is foundational for understanding the landscape of open-ended question answering. To comprehensively assess the breadth of question diversity, we have conducted a systematic categorization of the question types sourced from an array of distinct datasets. From the total initial pool of 41,955 questions, we refine the selection to 23,839 questions, ensuring that each one is conducive to open-ended responses. The distribution of these questions is illustrated in Figure 3, which segments the data into several domains based on the content of the questions. The segmentation of the plot underscores the interdisciplinary nature of our dataset. It features a broad spectrum of categories such as literature and reading comprehension, commonsense reasoning, domain-specific knowledge (medicine, STEM, etc.), and multi-topic knowledge. Also, Table 2 demonstrates the diversity of question length used for the benchmark.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Sunburst Chart: Distribution of AI Model Performance Across Tasks and Domains
### Overview
The chart visualizes the hierarchical distribution of AI model performance across two primary categories: **Tasks** (left side) and **Domains** (right side). Each category branches into subcategories, with segments connected to their parent nodes. The legend in the top-right corner maps colors to categories.
### Components/Axes
- **Main Categories**:
- **Tasks**: HellaSwag, Race, ARC, MedMCQA, WinoGrande, PIQA, OpenbookQA.
- **Domains**: MMLU, STEM, Humanities, Social Sciences, Miscellaneous.
- **Subcategories**:
- **Tasks**:
- HellaSwag: Situational Reasoning, Activity Prediction.
- Race: Language Analysis, Critical Reading, Literature Comprehension.
- ARC: Natural Sciences, Technology, Mathematical Reasoning.
- MedMCQA: Medical Specialties, Clinical Knowledge, Healthcare.
- WinoGrande: Coreference Resolution, Linguistic Patterns.
- PIQA: Social Commonsense, Physical Principles.
- OpenbookQA: Conceptual Understanding, Analytical Reasoning.
- **Domains**:
- MMLU: Mathematical Reasoning, Physical Principles, Social Commonsense.
- STEM: Mathematical Reasoning, Physical Principles.
- Humanities: Language Analysis, Critical Reading.
- Social Sciences: Situational Reasoning, Activity Prediction.
- Miscellaneous: Temporal Reasoning, Spatial Reasoning.
- **Legend**: Located in the top-right corner, with colors matching segments (e.g., HellaSwag = red, Race = teal, ARC = purple, MedMCQA = orange, WinoGrande = light blue, PIQA = green, OpenbookQA = pink; Domains: MMLU = dark blue, STEM = light blue, Humanities = purple, Social Sciences = red, Miscellaneous = green).
### Detailed Analysis
- **Tasks**:
- **HellaSwag** (red): Focuses on reasoning tasks (Situational Reasoning, Activity Prediction).
- **Race** (teal): Emphasizes language and comprehension skills (Language Analysis, Critical Reading, Literature Comprehension).
- **ARC** (purple): Covers scientific and mathematical reasoning (Natural Sciences, Technology, Mathematical Reasoning).
- **MedMCQA** (orange): Targets medical and healthcare domains (Medical Specialties, Clinical Knowledge, Healthcare).
- **WinoGrande** (light blue): Tests linguistic and coreference resolution skills.
- **PIQA** (green): Evaluates social and physical reasoning (Social Commonsense, Physical Principles).
- **OpenbookQA** (pink): Assesses conceptual and analytical understanding.
- **Domains**:
- **MMLU** (dark blue): Broad academic knowledge (Mathematical Reasoning, Physical Principles, Social Commonsense).
- **STEM** (light blue): Science and technology focus (Mathematical Reasoning, Physical Principles).
- **Humanities** (purple): Language and critical analysis (Language Analysis, Critical Reading).
- **Social Sciences** (red): Behavioral and situational reasoning (Situational Reasoning, Activity Prediction).
- **Miscellaneous** (green): Niche reasoning types (Temporal Reasoning, Spatial Reasoning).
### Key Observations
1. **Hierarchical Structure**: Tasks and Domains are interconnected, with subcategories nested under their parent nodes.
2. **Color Consistency**: All segments for a category share the same color (e.g., all HellaSwag subcategories are red).
3. **Subcategory Counts**:
- Tasks: 2–3 subcategories per category (e.g., HellaSwag = 2, Race = 3).
- Domains: 2–3 subcategories per domain (e.g., MMLU = 3, STEM = 2).
4. **Overlap**: Some subcategories appear in both Tasks and Domains (e.g., Mathematical Reasoning appears in ARC and STEM).
### Interpretation
The chart highlights the **diverse evaluation landscape for AI models**, emphasizing:
- **Task-Specific Performance**: Models are tested on specialized tasks (e.g., Medical Specialties, Coreference Resolution) and broader domains (e.g., STEM, Humanities).
- **Interdisciplinary Challenges**: Subcategories like Mathematical Reasoning and Physical Principles appear in both Tasks (ARC, PIQA) and Domains (STEM, MMLU), indicating cross-domain applicability.
- **Niche Focus**: Categories like Miscellaneous (Temporal/Spatial Reasoning) and OpenbookQA (Conceptual Understanding) suggest evaluation of less common but critical skills.
- **Color-Coded Clarity**: The legend ensures quick identification of categories, aiding in visual analysis of performance distribution.
This structure underscores the complexity of AI evaluation, balancing specificity (e.g., Medical Specialties) with generality (e.g., STEM), reflecting the multifaceted nature of real-world applications.
</details>
Figure 3: Diversity and distribution of used datasets for our OSQ-bench.
### 4.3 Quality
Our newly developed benchmark, curated from widely recognized datasets, stands out by focusing on questions suited to open-style answering, i.e., a format that demands a deep understanding and an ability to generate informative, unrestricted responses. Given that the datasets from which these questions originate are widely utilized and highly recognized within the research community, the questions are of good quality for assessing models' capabilities. Moreover, the thorough filtering process results in a low false positive rate (questions unsuitable for open style that are classified as suitable) of around 5%. This indicates that the vast majority of questions categorized as suitable for open-style answers indeed meet the criteria.
### 4.4 Property and Advantage
As shown in Table 3, our leaderboard exhibits several advantages: the first is debiased results compared to MCQ-based leaderboards, as discussed thoroughly above. Another advantage is faster and cheaper evaluation than crowd-user-based leaderboards: our results and rankings can be generated automatically without any human intervention.
Table 3: Comparison with different LLM leaderboards. âBiasedâ indicates the selection bias.
| Leaderboard | Question Type | Efficiency | Biased | Evaluated by |
| --- | --- | --- | --- | --- |
| Huggingface Leaderboard [5] | Multiple-Choice Questions | High | ✓ | Automatically |
| AlpacaEval Leaderboard [21] | Human Questions & Feedback | Low | ✗ | GPT-4 |
| Chatbot Arena Leaderboard [49] | Human Questions & Feedback | Low | ✗ | GPT-4/Crowdusers |
| Open-LLM-Leaderboard (Ours) | Open-Style Questions | High | ✗ | GPT-4 |
## 5 Experiments
<details>
<summary>x3.png Details</summary>

### Visual Description
## Radar Charts and Bar Charts: Model Performance Comparison
### Overview
The image contains four radar charts and four bar charts comparing the performance of different AI models (GPT-4, GPT-3.5, Claude-3 Opus, Mistral-large) across multiple datasets. The radar charts visualize accuracy metrics for multiple-choice questions (MCQs) and open-ended short questions (OSQs), while the bar charts show counts of correct/incorrect combinations for each dataset.
### Components/Axes
**Radar Charts**:
- **Axes**: Labeled with datasets (MMLU, HellaSwag, Race, ARC, MedMCQA, WinoGrande, CommonsenseQA, PIQA, OpenbookQA)
- **Legend**:
- Pink: MCQs Accuracies
- Green: OSQs Accuracies
- **Positioning**: Legends in top-right corner; datasets arranged clockwise
**Bar Charts**:
- **X-axis**: Datasets (MMLU, HellaSwag, Race, ARC, MedMCQA, WinoGrande, CommonsenseQA, PIQA, OpenbookQA)
- **Y-axis**: Count (0â8000)
- **Legend**:
- Orange: Incorrect MCQs, Correct OSQs
- Green: Correct MCQs, Incorrect OSQs
- Red: Correct MCQs, Correct OSQs
- Gray: Incorrect MCQs, Incorrect OSQs
- **Positioning**: Legends in top-left corner
### Detailed Analysis
**Radar Charts**:
1. **GPT-4**:
- MCQs (pink) consistently outperform OSQs (green) across all datasets.
- Highest accuracy in CommonsenseQA (MCQs ~0.9, OSQs ~0.7).
- Lowest accuracy in OpenbookQA (MCQs ~0.6, OSQs ~0.4).
2. **GPT-3.5**:
- Similar trend to GPT-4 but with lower overall accuracy.
- MCQs ~0.8â0.9, OSQs ~0.5â0.7 across datasets.
3. **Claude-3 Opus**:
- MCQs ~0.8â0.9, OSQs ~0.6â0.8.
- Strongest performance in ARC (MCQs ~0.9, OSQs ~0.7).
4. **Mistral-large**:
- Highest MCQ accuracy in CommonsenseQA (~0.95).
- OSQs accuracy peaks at ~0.8 in CommonsenseQA.
**Bar Charts**:
1. **GPT-4**:
- MMLU dominates with ~7000 total counts.
- "Correct MCQs, Correct OSQs" (red) is largest segment (~5000).
- "Incorrect MCQs, Correct OSQs" (orange) is smallest (~500).
2. **GPT-3.5**:
- MMLU ~6000 total counts.
- "Correct MCQs, Correct OSQs" ~4500.
- "Incorrect MCQs, Incorrect OSQs" (gray) ~1000.
3. **Claude-3 Opus**:
- MMLU ~5500 total counts.
- "Correct MCQs, Correct OSQs" ~4000.
- "Incorrect MCQs, Correct OSQs" ~800.
4. **Mistral-large**:
- MMLU ~5000 total counts.
- "Correct MCQs, Correct OSQs" ~3500.
- "Incorrect MCQs, Correct OSQs" ~700.
### Key Observations
1. **MCQs vs. OSQs**: MCQs consistently show higher accuracy than OSQs in radar charts (e.g., GPT-4: MCQs ~0.85 vs. OSQs ~0.65).
2. **Dataset Variance**:
- MMLU has the highest counts but lower accuracy in OSQs.
- CommonsenseQA shows the highest accuracy for both MCQs and OSQs.
3. **Model Performance**:
- Mistral-large leads in MCQ accuracy (CommonsenseQA ~0.95).
- GPT-4 has the highest OSQ accuracy (CommonsenseQA ~0.75).
### Interpretation
The data demonstrates that AI models perform better on structured MCQs than open-ended OSQs, likely due to the open-ended format's ambiguity. Mistral-large excels in MCQ accuracy, suggesting strong factual recall, while GPT-4 balances both question types. The bar charts show that the "Correct MCQs, Correct OSQs" combination dominates, but a significant share of responses still falls into incorrect categories (e.g., GPT-3.5: ~1000 "Incorrect MCQs, Incorrect OSQs" in MMLU). This highlights the challenge unstructured tasks pose across models.
</details>
Figure 4: Performance comparison of various LLMs on multiple-choice (MCQ) and open-style questions (OSQ) across different datasets. The bar graphs on the left show the counts of correct and incorrect responses (✗ MCQ vs. ✓ OSQ; ✓ MCQ vs. ✗ OSQ; ✓ MCQ vs. ✓ OSQ; ✗ MCQ vs. ✗ OSQ), while the radar charts on the right illustrate the accuracy comparisons between MCQ and OSQ for each language model (pink is the MCQ accuracy and lime green is the OSQ accuracy).
<details>
<summary>x4.png Details</summary>

### Visual Description
## Pie Charts: QA Dataset Response Distribution
### Overview
The image displays nine pie charts arranged in a 3x3 grid, each representing response distributions ("YES" and "NO") for different question-answering (QA) datasets. The charts use a consistent color scheme: red for "YES" and blue for "NO", with percentages labeled directly on the slices.
### Components/Axes
- **Legend**: Located in the top-right corner, with:
- Red square labeled "YES"
- Blue square labeled "NO"
- **Datasets**: Each pie chart is labeled with a QA dataset name in its center:
1. ARC
2. CommonsenseQA
3. Hellaswag
4. MedMCQA
5. MMLU
6. OpenbookQA
7. PIQA
8. Race
9. Winogrande
### Detailed Analysis
1. **ARC**:
- YES: 91.2% (red)
- NO: 8.8% (blue)
2. **CommonsenseQA**:
- YES: 58.1% (red)
- NO: 41.9% (blue)
3. **Hellaswag**:
- YES: 39.2% (red)
- NO: 60.8% (blue)
4. **MedMCQA**:
- YES: 55.4% (red)
- NO: 44.6% (blue)
5. **MMLU**:
- YES: 55.4% (red)
- NO: 44.6% (blue)
6. **OpenbookQA**:
- YES: 49.1% (red)
- NO: 50.9% (blue)
7. **PIQA**:
- YES: 37.9% (red)
- NO: 62.1% (blue)
8. **Race**:
- YES: 71.3% (red)
- NO: 28.7% (blue)
9. **Winogrande**:
- YES: 100.0% (red)
- NO: 0.0% (blue)
### Key Observations
- **Majority YES**: 6/9 datasets show >50% "YES" responses (ARC, CommonsenseQA, MedMCQA, MMLU, Race, Winogrande).
- **Majority NO**: 3/9 datasets show >50% "NO" responses (Hellaswag, OpenbookQA, PIQA).
- **Extreme Values**:
- Winogrande has 100% "YES" (no "NO" responses).
- PIQA has the lowest "YES" percentage (37.9%).
- **Balanced Distribution**: OpenbookQA shows near-equal "YES" (49.1%) and "NO" (50.9%) responses.
### Interpretation
The data suggests significant variability in QA dataset characteristics:
- **High "YES" percentages** (e.g., ARC, Winogrande) may indicate datasets with clearer, more consensus-driven answers or simpler question structures.
- **High "NO" percentages** (e.g., Hellaswag, PIQA) could reflect datasets with ambiguous questions, cultural biases, or complex reasoning requirements.
- **Balanced distributions** (OpenbookQA) might represent datasets designed to test nuanced understanding or debate-like scenarios.
- Winogrande's 100% "YES" response rate is anomalous and warrants investigation into dataset design or evaluation methodology.
The consistent color coding across all charts ensures easy cross-dataset comparison, though the lack of a shared scale complicates direct percentage comparisons. The spatial arrangement in a grid format facilitates visual scanning but does not encode any hierarchical relationships between datasets.
</details>
Figure 5: Percentage of MCQs convertible to open-style questions on various datasets.
### 5.1 Models
We generate responses from LLMs of different sizes. The large-scale LLMs are gpt-3.5-turbo, gpt-4-1106-preview, gpt-4o [27], claude-3-opus-20240229 [3], mistral-large-latest [24], gemini-pro [16], and llama3 [1]; we collect responses from all of these models through their commercial APIs. The small-scale LLMs are qwen1.5 [4], gemma [39], SlimPajama-DC [35], RedPajama [25], OLMo [17], Pythia [6], TinyLlama [46], OPT [47], GPT-Neo [8], and Cerebras-GPT [14]; all small-scale model responses are collected using Huggingface [43] and the lm-evaluation-harness framework [15] on 4 $\times$ RTX 4090 GPUs.
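As a rough illustration of this collection step, the sketch below formats each open-style question as a zero-shot prompt and maps it to a model's free-form answer. The prompt template and the TinyLlama model name in the docstring are illustrative assumptions; the paper's actual runs go through the lm-evaluation-harness framework.

```python
# Hedged sketch of zero-shot response collection for small-scale models.
# The prompt template below is an illustrative assumption, not the paper's
# exact formatting (which follows lm-evaluation-harness [15]).

def build_osq_prompt(question: str) -> str:
    """Format one open-style question as a zero-shot prompt."""
    return f"Question: {question}\nAnswer:"

def collect_responses(questions, generate):
    """Map each question to the model's free-form answer.

    `generate` is any callable str -> str, e.g. a thin wrapper around a
    Hugging Face text-generation pipeline:

        from transformers import pipeline
        pipe = pipeline("text-generation",
                        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        generate = lambda p: pipe(p, max_new_tokens=64)[0]["generated_text"][len(p):]
    """
    return {q: generate(build_osq_prompt(q)).strip() for q in questions}

# Stub generator so the sketch runs without downloading any weights.
demo = collect_responses(["What is the capital of France?"], lambda p: " Paris ")
print(demo)  # {'What is the capital of France?': 'Paris'}
```

Injecting the `generate` callable keeps the same loop usable for both API-backed large models and locally hosted small models.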
### 5.2 Datasets
We present a brief overview of the datasets used, highlighting their distinctive characteristics and the specific aspects they aim to evaluate. MMLU [18], ARC [12], and MedMCQA [29] stand out with their comprehensive range of tasks spanning various disciplines. PIQA [7], CommonsenseQA [36], OpenBookQA [23], and HellaSwag [44] focus on different aspects of commonsense reasoning, such as physical interaction, everyday concepts, and their interrelations. RACE [19] provides a source of reading comprehension challenges. WinoGrande [34] is designed to test the model on resolving coreferences and understanding nuanced relationships in text. This dataset, with its unique fill-in-the-blank tasks, inherently aligns with open-ended question formats, negating the need for our multi-stage filtering process. For the other datasets, questions are filtered using gpt-4-0125-preview with the prompts from Table 1. The prompts for both MCQ and OSQ on each dataset are in Appendix D.
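The filtering idea can be sketched as a single yes/no judgment per question: a judge LLM is asked whether an MCQ still stands on its own without its answer options. The prompt wording and the `ask_llm` callable below are illustrative placeholders, not the paper's actual Table 1 prompts.

```python
# Hedged sketch of the convertibility filter. The prompt wording is an
# illustrative stand-in for the paper's Table 1 prompts.

FILTER_PROMPT = (
    "Can the following multiple-choice question be answered correctly "
    "without seeing its answer options? Reply with YES or NO only.\n\n"
    "Question: {question}"
)

def is_convertible(question, ask_llm):
    """True if the judge deems the question suitable for open-style answering.

    `ask_llm` is any callable str -> str, e.g. a wrapper around the
    gpt-4-0125-preview chat completions API.
    """
    reply = ask_llm(FILTER_PROMPT.format(question=question))
    return reply.strip().upper().startswith("YES")

# Stub judges for illustration:
print(is_convertible("What is the boiling point of water at sea level?",
                     lambda p: "YES"))  # True
print(is_convertible("Which of the following statements is true?",
                     lambda p: "NO"))   # False
```

A question that quotes its options ("Which of the following…") cannot survive the conversion, which is exactly the pattern the YES/NO split in Figure 5 captures.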
### 5.3 Evaluation
Table 4: Comparison of multiple choice (MCQ) and open style questions (OSQ) accuracy.
| Dataset | GPT-4 (MCQ) | GPT-4 (OSQ) | GPT-3.5 (MCQ) | GPT-3.5 (OSQ) | Gemini 1.0 Pro (MCQ) | Gemini 1.0 Pro (OSQ) | Claude-3 Opus (MCQ) | Claude-3 Opus (OSQ) | Mistral Large (MCQ) | Mistral Large (OSQ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 87.28 | 74.77 | 71.25 | 65.38 | 65.71 | 56.04 | 83.52 | 70.23 | 79.50 | 68.76 |
| ARC | 95.54 | 82.68 | 90.64 | 78.42 | 90.96 | 72.35 | 97.50 | 75.47 | 89.96 | 72.32 |
| HellaSwag | 90.98 | 24.35 | 63.84 | 29.99 | 69.05 | 25.69 | 96.04 | 20.79 | 81.78 | 24.47 |
| WinoGrande | 84.14 | 66.22 | 78.77 | 64.56 | 66.85 | 56.35 | 81.69 | 63.54 | 75.45 | 56.83 |
| PIQA | 96.41 | 61.64 | 84.34 | 54.89 | 83.33 | 47.70 | 97.41 | 59.05 | 83.33 | 61.21 |
| CommonsenseQA | 84.93 | 62.96 | 79.15 | 67.89 | 66.62 | 50.56 | 86.76 | 63.66 | 69.58 | 55.35 |
| Race | 92.02 | 67.05 | 84.80 | 60.11 | 87.73 | 61.02 | 93.04 | 66.22 | 89.97 | 70.17 |
| MedMCQA | 72.65 | 51.81 | 58.02 | 41.42 | 58.02 | 35.89 | 72.91 | 49.14 | 66.05 | 43.44 |
| OpenbookQA | 94.30 | 60.29 | 83.71 | 49.90 | 86.97 | 52.55 | 93.48 | 52.95 | 88.19 | 58.66 |
| Average | 88.69 | 61.31 | 78.28 | 56.95 | 75.03 | 50.91 | 90.26 | 57.89 | 80.42 | 56.80 |
Our assessment approach for both MCQ and OSQ aligns with widely recognized evaluation frameworks and leaderboards for LLMs. MCQ evaluation is conducted with the OpenAI Evals framework [26] in the zero-shot setting, comparing the generated response against the ground-truth choice ID. For open-ended questions, we instead employ the gpt-4-0125-preview model to judge the correctness of LLM-generated responses against the pre-established ground-truth answer from the dataset, using the prompt from Table 1.
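A minimal sketch of this OSQ grading loop follows, assuming a hypothetical `ask_judge` callable that wraps one gpt-4-0125-preview chat-completion call per response; the grading prompt below is illustrative, not the paper's Table 1 prompt.

```python
# Hedged sketch of GPT-4-as-judge grading for open-style answers.
# JUDGE_PROMPT is an illustrative stand-in for the paper's Table 1 prompt.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground-truth answer: {reference}\n"
    "Model answer: {candidate}\n"
    "Is the model answer correct with respect to the ground truth? "
    "Reply with CORRECT or INCORRECT only."
)

def grade_open_answer(question, reference, candidate, ask_judge):
    """Judge one free-form answer against the dataset's ground truth."""
    reply = ask_judge(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return reply.strip().upper().startswith("CORRECT")

def osq_accuracy(items, ask_judge):
    """Fraction of (question, reference, candidate) triples graded correct."""
    graded = [grade_open_answer(q, ref, cand, ask_judge)
              for q, ref, cand in items]
    return sum(graded) / len(graded)
```

In the real pipeline `ask_judge` issues an API call per response; here a deterministic stub is enough to exercise the logic.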
The results in Table 4 and Figure 4 are based on the filtered questions. Every model experiences a significant drop in accuracy on OSQ compared to MCQ; on average, OSQ accuracy is lower than MCQ accuracy by about 25% across all models. This corroborates our concern that models can "randomly guess" the correct choice in MCQ format even when they cannot actually answer the question. This discrepancy between OSQ and MCQ performance is not necessarily a negative reflection of the models' overall capabilities; rather, it can be viewed as a truer comparison of their abilities to process and understand diverse types of questions.
The most significant gap between MCQ and OSQ accuracy, about 31%, is observed for Claude-3 Opus. The dataset with the largest drop between MCQ and OSQ is HellaSwag, which follows from the type of questions in this dataset: it asks the model to choose the most plausible continuation for a presented scenario. Evaluating LLMs' OSQ responses against the ground truth here is especially challenging because multiple valid and contextually appropriate completions can exist, making it difficult to score against a single-choice ground truth. This contrasts with WinoGrande, whose questions require filling a blank in a sentence with the correct word. As a result, HellaSwag is not well suited to open-style questions, and we have chosen to omit it from our final leaderboard.
Table 5: Open-LLM Leaderboard for Large-scale Models. WG, CSQA, OBQA, and HS represent WinoGrande, CommonsenseQA, OpenbookQA, and HellaSwag, respectively. We did not include HellaSwag results in the overall accuracy due to the evaluation difficulties mentioned in Sec. 5.3.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA | HS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 | – |
| GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 | 24.35 |
| Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 | 20.79 |
| Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 | 24.47 |
| GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 | 29.99 |
| Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 | 25.69 |
| Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 | – |
Table 6: Open-LLM Leaderboard for small-scale model regime.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
| Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
| SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
| RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
| OLMo (1.2B) | 8.85 | 8.54 | 13.18 | 6.16 | 8.05 | 13.10 | 13.61 | 2.07 | 6.11 |
| Pythia (1.4B) | 8.79 | 9.66 | 14.69 | 11.52 | 4.17 | 9.01 | 12.76 | 3.19 | 5.30 |
| TinyLlama (1.1B) | 8.45 | 8.94 | 13.31 | 12.23 | 3.59 | 6.06 | 16.70 | 2.07 | 4.68 |
| OPT (1.3B) | 7.89 | 7.40 | 11.83 | 12.47 | 4.48 | 7.61 | 13.61 | 1.25 | 4.48 |
| GPT-Neo (1.3B) | 7.42 | 6.94 | 9.69 | 10.81 | 4.31 | 6.34 | 13.75 | 2.63 | 4.89 |
| Cerebras-GPT (1.3B) | 4.86 | 5.37 | 4.43 | 9.31 | 2.16 | 6.20 | 6.90 | 1.04 | 3.46 |
### 5.4 Leaderboard and Arena
The overall rankings of models for our benchmark are presented in Table 5 and Table 6. GPT-4o demonstrates a clear lead with an overall accuracy of 70.15%, indicating its robustness on open-style question answering compared to other models. It is followed by GPT-4-1106-preview with 65.93% and Claude-3 Opus with 62.53%. These results highlight the advanced capabilities of the GPT-4 series. Mid-tier models such as Mistral Large and GPT-3.5 perform well but are not on par with the top performers, while Gemini 1.0 Pro and Llama3-70b-Instruct lag behind in their ability to answer open-style questions.
The performance evaluation of smaller-scale LLMs reveals that Qwen1.5 leads with an overall accuracy of 21.68%, significantly outperforming the other models in this category. Gemma follows with 16.66%, a considerable gap from the top model. The remaining models score below 10.00%, highlighting their limited ability to answer open-style questions. Almost all of the models struggle significantly with questions from the MedMCQA dataset, with accuracies below 5%.
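The "Overall" column in Table 5 is consistent with an unweighted mean of the per-dataset OSQ accuracies, with HellaSwag excluded for the large-scale models (Sec. 5.3). A small sketch that reproduces the Claude-3 Opus overall score from its per-dataset values:

```python
# Overall leaderboard score as an unweighted mean of per-dataset OSQ
# accuracies; HellaSwag is excluded for large-scale models (Sec. 5.3).

def overall_score(per_dataset, exclude=("HellaSwag",)):
    """Mean accuracy over the datasets that count toward the leaderboard."""
    kept = [acc for name, acc in per_dataset.items() if name not in exclude]
    return round(sum(kept) / len(kept), 2)

# Per-dataset OSQ accuracies for Claude-3 Opus, taken from Table 5.
claude3_opus = {
    "MMLU": 70.23, "ARC": 75.47, "WinoGrande": 63.54, "PIQA": 59.05,
    "CommonsenseQA": 63.66, "Race": 66.22, "MedMCQA": 49.14,
    "OpenbookQA": 52.95, "HellaSwag": 20.79,
}
print(overall_score(claude3_opus))  # → 62.53, matching the Table 5 overall
```

The same unweighted mean over all nine datasets reproduces the small-scale overall scores in Table 6 (e.g., Qwen1.5's 21.68%).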
## 6 Conclusion
We proposed the Open-LLM-Leaderboard for LLM evaluation and comprehensively examined its efficacy using open-style questions from nine datasets in our OSQ-bench. Unlike previous works that rely on human evaluation or thousands of crowd users on Chatbot Arena, our benchmark evaluates chat LLMs in a fast, automatic, and cheap manner. Our results show a high level of agreement with human judgments, providing a foundation for an LLM-based evaluation benchmark and framework built on open-style questions.
## Limitations and Ethics Statement
We have discussed multiple advantages of employing open-style questions over multiple-choice questions used in prior works. However, the LLM Leaderboard, as a tool for evaluating and benchmarking LLMs, has several common limitations itself. Firstly, the performance metrics used may not fully capture the nuanced capabilities of each model, especially in areas that require an understanding of context, creativity, or common sense reasoning. Secondly, the benchmark datasets may not be comprehensive enough to cover all possible domains and scenarios, leading to a potential bias towards certain types of questions or tasks. Thirdly, due to the rapidly evolving nature of the field, models may quickly become outdated, meaning the leaderboard may not always reflect the most current state of the art. Since our benchmark utilizes public datasets and our corpus consists of questions and answers, user privacy concerns are minimal.
## References
- [1] AI@Meta. Llama 3 model card. 2024.
- [2] Anthropic. Model card and evaluations for claude models, 2023.
- [3] Anthropic. https://www.anthropic.com/claude, 2024.
- [4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [5] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- [6] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- [7] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020.
- [8] Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. If you use this software, please cite it using these metadata.
- [9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- [10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
- [11] John Chung, Ece Kamar, and Saleema Amershi. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [12] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- [13] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37â46, 1960.
- [14] Nolan Dey, Gurpreet Gosal, Zhiming, Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint:2304.03208, 2023.
- [15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023.
- [16] Google. https://ai.google.dev/, 2023.
- [17] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models. Preprint, 2024.
- [18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [19] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
- [20] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- [21] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- [22] Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. On learning to summarize with large language models as references. arXiv preprint arXiv:2305.14239, 2023.
- [23] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- [24] Mistral. https://chat.mistral.ai/chat, 2024.
- [25] MosaicML. Mpt-1b redpajama-200b. https://huggingface.co/mosaicml/mpt-1b-redpajama-200b. Accessed: 2024-04-29.
- [26] OpenAI. Openai evals. https://github.com/openai/evals.
- [27] OpenAI. https://chat.openai.com/chat, 2022.
- [28] OpenAI. Gpt-4 technical report. arxiv preprint arXiv:2303.08774, 2024.
- [29] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, 2022.
- [30] Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- [31] Joshua Robinson, Christopher Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. ArXiv, abs/2210.12353, 2022.
- [32] Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2023.
- [33] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2024.
- [34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, Aug. 2021.
- [35] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. Slimpajama-dc: Understanding data combinations for llm training. arXiv preprint arXiv:2309.10818, 2023.
- [36] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [38] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [39] Gemma Team. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [41] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [42] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
- [43] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- [44] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [45] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- [46] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
- [47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [48] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2024.
- [49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- [50] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675, 2023.
## Appendix
## Appendix A Reproducibility Statement
We will make all our filtered open-style data (MMLU, ARC, HellaSwag, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, and OpenbookQA) used in the experiments of Sec. 5, together with the preprocessing scripts, publicly available. Detailed data statistics are provided in Sec. 4.1. Since gathering and reproducing our LLM response data from scratch can be costly, we will also release all responses from the various LLMs and their corresponding evaluation results to support and simplify reproduction of our work. The OpenAI APIs we used are gpt-3.5-turbo-1106, gpt-4-1106-preview, and gpt-4o (for response collection), and gpt-4-0125-preview (for filtering and post-evaluation); Claude 3: claude-3-opus-20240229; Gemini Pro: gemini-pro; Mistral: mistral-large-latest.
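As a minimal sketch of the response-collection setup, the following shows how a chat-completion request for one open-style question could be assembled for the model IDs listed above. The helper names (`build_request`, `RESPONSE_MODELS`) and the temperature setting are our illustrative assumptions, not part of the released pipeline.

```python
# Model identifiers as listed in the reproducibility statement.
RESPONSE_MODELS = ["gpt-3.5-turbo-1106", "gpt-4-1106-preview", "gpt-4o"]
EVAL_MODEL = "gpt-4-0125-preview"  # used for filtering and post-evaluation


def build_request(model: str, prompt: str) -> dict:
    """Assemble the chat-completion payload for one open-style question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic decoding aids reproducibility
    }


# Sending the request requires an API key, e.g. with the OpenAI Python client:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     **build_request(RESPONSE_MODELS[0], "Answer the following question. ...")
# )
```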
## Appendix B More Results on Gemini Pro and Stage-1 Filtering
Figure 6: Performance comparison of Gemini Pro on multiple-choice and open-style response questions across diverse datasets, as shown by the count of correct and incorrect answers in the left bar chart and model accuracy in the right radar chart.
[Figure: nine pie charts, one per dataset, each split into Yes (red) and No (blue) segments. Approximate Yes rates: ARC 85.4%, CommonsenseQA 53.7%, HellaSwag 5.1%, MedMCQA 48.8%, MMLU 41.9%, OpenbookQA 37.2%, PIQA 35.4%, Race 70.4%, WinoGrande 100.0%.]
Figure 7: Initial filtering results across datasets, showcasing the percentages of questions that are suitable (Yes) for open-style formatting and those that are not (No).
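The Yes/No splits reported in Figure 7 are simple label aggregations. Assuming the stage-1 filter emits a "YES"/"NO" label per question (the function name below is illustrative, not from the released code), the percentages could be computed as:

```python
from collections import Counter


def yes_no_split(labels):
    """Return (yes_pct, no_pct) for a list of 'YES'/'NO' filter labels."""
    counts = Counter(label.upper() for label in labels)
    total = counts["YES"] + counts["NO"]
    if total == 0:
        return 0.0, 0.0
    yes_pct = 100.0 * counts["YES"] / total
    return round(yes_pct, 1), round(100.0 - yes_pct, 1)
```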
## Appendix C More Sample Questions
We provide more sample questions in Table 7 (answered correctly in the open-style format but incorrectly as multiple choice) and Table 8 (the reverse). These examples demonstrate how the model's ability varies across question formats.
Table 7: Example of an Incorrect MCQ and a Correct OSQ in MMLU.
| Question that is correct for OSQ and incorrect for MCQ: How many CO2 and ATP molecules are formed during one complete turn of the tricarboxylic acid cycle (Krebs' cycle)? |
| --- |
| Options: A. 2CO2 and 2ATP B. 2CO2 and 16ATP C. 2CO2 and 12ATP D. 2CO2 and 1ATP |
| Ground Truth: D |
| MCQ_Answer: A |
| OSQ_Answer: During one complete turn of the tricarboxylic acid cycle (also known as the Krebs' cycle or the citric acid cycle), a series of chemical reactions occur in the mitochondria of cells that result in the production of carbon dioxide (CO2) and adenosine triphosphate (ATP), among other molecules. |
| For each acetyl-CoA molecule that enters the cycle, the following are produced: |
| - 2 molecules of CO2: Each acetyl-CoA contributes two carbons to the cycle, which are eventually released as two molecules of carbon dioxide. |
| - 1 molecule of ATP: One molecule of ATP (or an equivalent, such as GTP, depending on the cell type) is produced directly in the cycle through substrate-level phosphorylation. |
| However, the cycle also produces reduced coenzymes, specifically NADH and FADH2, which are used in the electron transport chain to produce additional ATP through oxidative phosphorylation. For each acetyl-CoA that enters the cycle: |
| - 3 molecules of NADH are produced, which can be used to generate approximately 2.5 ATP each through the electron transport chain. |
| - 1 molecule of FADH2 is produced, which can be used to generate approximately 1.5 ATP through the electron transport chain. |
| Therefore, while only 1 ATP is produced directly in the cycle per acetyl-CoA, the NADH and FADH2 generated can lead to the production of additional ATP in the electron transport chain. The exact number of ATP molecules produced can vary depending on the efficiency of the electron transport chain and the ATP synthase complex in different organisms and cell types. |
Table 8: Example of a Correct MCQ and an Incorrect OSQ in MMLU.
| Question that is correct for MCQ and incorrect for OSQ: A tree is a connected graph with no cycles. How many nonisomorphic trees with 5 vertices exist? |
| --- |
| Options: A. 1 B. 2 C. 3 D. 4 |
| Ground Truth: C |
| MCQ_Answer: C |
| OSQ_Answer: To determine the number of nonisomorphic trees with 5 vertices, we can use Cayley's formula, which states that the number of trees on $n$ labeled vertices is $n^{n-2}$ . However, this formula gives the number of trees when vertices are labeled, meaning that it counts different labelings of the same tree shape as distinct. Since we are interested in nonisomorphic trees (which do not consider different labelings as distinct), we cannot directly use Cayley's formula. |
| Instead, we need to enumerate the trees by considering their possible structures. For a tree with 5 vertices, we can have the following configurations: |
| 1. A linear chain (path) where each vertex is connected to two others except for the two end vertices, which are connected to only one other vertex. This is one possible tree. |
| 2. A "star" tree where one central vertex is connected to the other four vertices. This is another possible tree. |
| 3. A tree where one vertex is connected to three others, and one of those three is connected to the fifth vertex. This structure can be seen as a "T" shape with an extra branch on one of the endpoints. There are three nonisomorphic trees of this type because the branch can be on any of the three endpoints. |
| 4. A tree where two vertices are connected to a central vertex, and each of those two vertices is connected to one other vertex (creating a structure with a central vertex, two intermediate vertices, and two leaf vertices). This structure can be seen as a "Y" shape. There is only one nonisomorphic tree of this type because any permutation of the leaf vertices does not create a fundamentally different tree structure. |
| Adding these up, we have: |
| 1 (linear chain) + 1 (star) + 3 (T-shaped with an extra branch) + 1 (Y-shaped) = 6 nonisomorphic trees with 5 vertices. |
## Appendix D Prompts for Different Datasets
The following are examples of prompts for different datasets. They allow models to better understand and respond to specific contexts and objectives in each dataset.
Table 9: Prompt for MMLU dataset.
Table 10: Prompt for ARC dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| A. Plankton are colorful. |
| B. Plankton clean the water. |
| C. Plankton release oxygen. |
| D. Plankton reproduce quickly. |
| Answer: |
| Open-Style Prompt: Answer the following question. |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| Answer: |
Table 11: Prompt for CommonsenseQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D, E. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| A. bank |
| B. library |
| C. department store |
| D. mall |
| E. New York |
| Answer: |
| Open-Style Prompt: You will be presented with a variety of questions that require an understanding of everyday scenarios, human behaviors, and common sense. Your task is to provide the best possible answer to each question based solely on your understanding and reasoning. |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| Answer: |
Table 12: Prompt for MedMCQA dataset.
| MCQ Prompt: The following is the multiple choice question about medicine. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: Modulus of elasticity means: |
| A. Rigidity or stiffness of the material |
| B. Ability to be stretched with permanent deformation |
| C. Ductility of a material |
| D. Malleability of the metal |
| Answer: |
| Open-Style Prompt: Answer the following question about medicine. |
| Question: Modulus of elasticity means: |
| Answer: |
Table 13: Prompt for HellaSwag dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| A. also have a bucket that you spray paint a window in. |
| B. can reach for a running water hose and clean the inside of your rv quickly. |
| C. get a wash cloth and you put it under the faucet to get wet and then you rinse it out so it's not soaking. |
| D. meticulously clean the window in the glass shop and then take the plastic off and start taking the hood off. |
| Answer: |
| Open-Style Prompt: Imagine you are provided with a scenario or a partial story taken from everyday life or a common activity. Your task is to continue this story or scenario in a way that makes the most sense based on what typically happens in such situations. Please complete the sentence. |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| Answer: |
Table 14: Prompt for OpenbookQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| A. the circulatory system |
| B. the digestive system |
| C. the school system |
| D. central nervous system |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| Answer: |
Table 15: Prompt for PIQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: How do I ready a guinea pig cage for it's new occupants? |
| A. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish. |
| B. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish. |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: How do I ready a guinea pig cage for it's new occupants? |
| Answer: |
Table 16: Prompt for Race dataset.
| MCQ Prompt: I will give you a passage with multiple-choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| A. Measure the depth of the river |
| B. Look for a fallen tree trunk |
| C. Protect her cows from being drowned |
| D. Run away from the flooded farm |
| Answer: |
| Open-Style Prompt: I will give you passage with question. Please, answer the question. |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| Answer: |
Table 17: Prompt for WinoGrande dataset.
| MCQ Prompt: The following is the multiple choice question. Please put the correct words in place of _. Your response should include only the option without any justification or reasoning. Please select the correct answer from the options A, B. |
| --- |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| A. Sarah |
| B. Maria |
| Answer: |
| Open-Style Prompt: Please put the correct words in place of _. Give only the word that fits the sentence. |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| Answer: |
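The templates in Tables 10-17 share one shape: an instruction header, the question, optional lettered options, and a trailing "Answer:". A minimal sketch of assembling both formats from a single question record follows; the function names and default instruction are illustrative assumptions, not the paper's released code.

```python
from string import ascii_uppercase

MCQ_HEADER = (
    "The following is the multiple choice question. Please select the "
    "correct answer from the options {ids}. For example, if you think "
    'the correct answer is A, your response should be "A".'
)


def mcq_prompt(question: str, options: list) -> str:
    """Render a question and its options in the MCQ format."""
    ids = ", ".join(ascii_uppercase[: len(options)])
    lines = [MCQ_HEADER.format(ids=ids), f"Question: {question}"]
    lines += [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def osq_prompt(question: str, instruction: str = "Answer the following question.") -> str:
    """Render the same question in the open-style format (no options)."""
    return "\n".join([instruction, f"Question: {question}", "Answer:"])
```

Dataset-specific behavior (e.g., the CommonsenseQA or HellaSwag instructions above) would be supplied through the `instruction` argument rather than hard-coded.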