# Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
> Joint first author & equal contribution.
Abstract
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to unbalanced prior probabilities over these tokens, which influences the prediction of answers based on the IDs. Previous research has introduced methods to reduce this “selection bias” by simply permuting options on a few test samples and applying the result to new ones. Another problem of MCQ is the “lottery ticket” effect of random guessing: the LLM has not learned the relevant knowledge, yet it guesses the option correctly. This situation is especially serious for small-scale LLMs; for instance, on MMLU, the random-guessing accuracy is 25%, and most small-scale LLMs obtain results around this value, as shown in [35, 46], making it difficult to distinguish which model is better. To address these issues, a more thorough approach is to shift from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, this transition brings its own challenges: (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these difficulties and establish a new LLM evaluation benchmark built entirely from open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track the performance of various LLMs, such as GPT-4o/4/3.5, Claude 3, Gemini, etc., and to reflect their true capabilities. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
1 Introduction
| Question that is suitable for open-style: Let x = 1. What is x << 3 in Python 3? |
| --- |
| Options: A. 1 B. 3 C. 8 D. 16 |
| Answer: C |
| Question that is not suitable for open-style: Which of the following statements is true? |
| Options: |
| A. Every equivalence relation is a partial-ordering relation. |
| B. Number of relations form A = x, y, z to B= (1, 2), is 64. |
| C. Empty relation _ is reflexive |
| D. Properties of a relation being symmetric and being un-symmetric are negative of each other. |
| Answer: B |
Figure 1: Examples of MCQ from MMLU.
Large language models (LLMs) are increasingly excelling at various natural language processing tasks, including text generation [11], translation [45, 50], summarization [22], code generation [20, 33], and chatbot interaction [28]. With these rising capabilities, a robust evaluation strategy that can accurately assess the performance of these models is becoming crucial for identifying their true effectiveness and choosing the most appropriate one for a given task. Common metrics for assessing LLMs today include relevance, frequency of hallucinations, accuracy in question answering, toxicity, and retrieval-specific metrics, among others. In the context of question-answering evaluations, prior works usually investigate the model’s performance in terms of answer accuracy, courtesy, and conciseness. Multiple-choice questions (MCQ) have emerged as a predominant format for such assessments, wherein a question is presented with several possible responses, and the model is required to select the most fitting choice ID, as exemplified in Figure 1. Lately, the MCQ format has seen widespread application in LLM-focused contexts, including benchmarks [18, 44, 12] that examine LLM capabilities and automated/crowdsourcing evaluation frameworks [21, 49, 5] that streamline the assessment process.
However, previous studies [48, 32] have shown that LLMs lack resilience to changes in the positioning of options because they tend to exhibit biased behavior: they often favor certain option IDs (such as “Option A”) as responses, a phenomenon referred to as selection bias. Moreover, these studies show that selection bias exists widely across various LLMs and cannot be mitigated by simple prompting tricks. The underlying cause is that the model is trained on a prior distribution that assigns higher probability to specific ID tokens. Another issue with MCQ is the “random guessing” discussed in [35]. Specifically, small models, such as 1B-level variants, may struggle to achieve reliable predictions on many benchmarks such as MMLU, which uses four choices as the answer candidates for each question. Their results can resemble random choices and thus fail to capture the models’ actual capabilities.
To fundamentally eliminate selection bias and random guessing in LLM evaluation, in this work we build an open-style question benchmark. Leveraging this benchmark, we present the Open-LLM-Leaderboard, a new automated framework designed to refine the assessment process of LLMs. This framework complements prior evaluation frameworks such as [21, 49, 5] and offers several advantages, as presented in Sec. 4.4. However, constructing such a benchmark poses two significant challenges: (1) how to determine the appropriate questions that can be effectively transformed from MCQ into open-style questions, and (2) how to establish an approach to accurately validate the correctness of the LLM’s open-style answers against human-annotated ground truths, especially in contrast to MCQ, which typically have defined single-choice standard answers.
For the first challenge of identifying the multiple-choice questions that are suitable for conversion to open-style questions, we design an automatic coarse-to-fine selection protocol through customized prompts and a multi-stage filtering process. Specifically, in the first stage, we use a binary classification to place questions with high confidence into the positive pool, while the others are assigned to the negative pool. Our second stage uses a soft scoring method (1-10 ratings) to judge the open-style suitability of the questions that are classified as negative in the first stage. For the second challenge of evaluating the correctness of the LLM’s open-style answers against human-annotated ground truths, we further design a task-specific prompt and leverage GPT-4 to examine whether each response is correct. To validate the accuracy of this automatic evaluation strategy, we randomly sample 100 results, manually check the automatic evaluation results against the corresponding responses, and confirm that it is reliable, with an error rate of less than 5%.
In our end-to-end assessment of the LLM evaluation and ranking process, we conduct a comprehensive analysis of well-recognized LLMs, including GPT-4o, GPT-4, ChatGPT, Claude-3 Opus, Gemini-Pro, and Mistral-Large. Our benchmarking results indicate that GPT-4o currently holds the position of strongest LLM. We further provide a leaderboard for the small-model regime, targeting LLMs smaller than 3B. Moreover, our study demonstrates a high correlation between the rankings produced by our open-style benchmark and those derived from user-based evaluations or direct human assessments.
2 Related Work
Large Language Models (LLMs). Recent advancements in LLMs, such as GPT-3 [9] and GPT-4 [28], have had a significant impact on the field of natural language processing and have found widespread application across various domains. They have initiated a chain reaction within the community and beyond. As each new iteration of LLMs demonstrates enhanced capabilities, organizations and researchers across various sectors are motivated to develop their own models, such as LLaMA [40, 41], Gemini [38], and Claude [2], or to find innovative ways to improve existing LLMs through instruction tuning, as with Alpaca [37] and Vicuna [10].
Multiple Choice Questions (MCQ). In the realm of LLM research, MCQ has become a pivotal tool for evaluating and enhancing the capabilities of these models. Notable datasets like MMLU [18], HellaSwag [44], and ARC [12] have been instrumental in this regard. Their diverse assessment of broad knowledge and commonsense reasoning helps in benchmarking the depth and versatility of LLMs in understanding, reasoning, and applying knowledge across various domains. MCSB [31] introduces a natural prompting strategy for LLMs, which presents questions and answer choices together, allowing the model to explicitly compare options.
Bias in LLMs. Selection bias, a specific form of bias relevant to the evaluation of LLMs through MCQ, has garnered attention due to its understated and widespread impact. A series of works [49, 30, 48, 42] have shown that LLMs may develop a propensity to favor certain answer choices based on their position or encoding, such as the alphabetical ordering of A/B/C/D in MCQ. This phenomenon can lead to skewed evaluation results, misrepresenting a model’s true understanding and reasoning capabilities.
3 Approach
Figure 2: An overview of a dual-path evaluation pipeline for LLMs, starting with the collection of MCQ datasets. It branches into two paths, with the MCQ path proceeding directly from response collection to evaluation, while the open-style path passes through an additional filtering phase. After evaluation, both paths converge in a comparative analysis.
3.1 Defining Open-style Questions
Open-style questions, also known as open-ended questions, require the model to generate an answer without being constrained by a set of predetermined choices. In the context of LLM evaluation, these questions are designed to assess the model’s ability to generate coherent, relevant, and contextually appropriate responses based on the input query. While multiple-choice questions can efficiently assess specific factual knowledge and comprehension, open-style questions offer deeper insight into the LLM’s generative capabilities, understanding of context, and ability to engage with complex tasks. Open-style questions also avoid the inherent selection bias and random guessing weaknesses of multiple-choice questions.
3.2 Automatic Open-style Question Filtering and Generation
Multi-stage Filtering and Postprocessing via a Coarse-to-fine Process. Our proposed multi-stage filtering approach consists of four main steps to streamline the conversion: (1) initially classify questions as either convertible or non-convertible; (2) assign each question a confidence score indicating the likelihood that it can be framed as an open-style question; (3) exclude questions that are classified as non-convertible and have confidence scores below a specified threshold; (4) combine questions that are labeled as non-convertible but have high confidence scores with those labeled as convertible.
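To make the four-step flow concrete, the snippet below gives a minimal sketch of the selection routine. It assumes two hypothetical helpers, `classify_convertible` (Stage 1) and `score_confidence` (Stage 2), whose possible implementations are sketched after the Stage 1 and Stage 2 descriptions below; none of these function names come from the paper’s released code.

```python
from typing import Callable, Iterable


def select_open_style(
    questions: Iterable[str],
    classify_convertible: Callable[[str], bool],  # Stage 1: binary YES/NO verdict
    score_confidence: Callable[[str], int],       # Stage 2: confidence score in [1, 10]
    threshold: int = 5,                           # the paper uses a threshold of 5
) -> list[str]:
    """Coarse-to-fine selection of MCQs that can be rewritten as open-style questions."""
    yes_pool: list[str] = []
    no_pool: list[str] = []

    # Step 1: preliminary binary classification of every question.
    for q in questions:
        (yes_pool if classify_convertible(q) else no_pool).append(q)

    # Steps 2-4: rescore the "NO" pool and rescue questions whose confidence
    # score reaches the threshold; everything below the threshold is dropped.
    rescued = [q for q in no_pool if score_confidence(q) >= threshold]
    return yes_pool + rescued
```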
Stage 1: Preliminary Filtering using Binary Classification. Considering that the structure of MCQ varies, converting them into an open-style format is not always possible, particularly because certain questions are strongly linked to their choices, for instance, questions formulated as “Which one of the following is true”, “All except”, or “Which of these”. Such questions are typically unsuitable for conversion into an open-style format, since removing the options would change the question’s core and leave it incomplete.
Table 1: Prompt design for two-stage filtering and post verification.
| Stage One: Coarse Filtering Prompt |
| --- |
| """Your task is to review a series of multiple-choice questions and evaluate their ability to be answered without the provided answer choices. For questions that begin with an incomplete sentence (e.g., "During swallowing, ..."), use your knowledge to attempt to complete the sentence accurately. For direct questions that ask for specific information or identification (e.g., "Which of the following structures is part of the small intestine?"), assess whether the question is formulated clearly enough that an informed answer can be given without seeing the multiple-choice options. For mathematical or analytical questions (e.g., "Find all cosets of the subgroup 4Z of 2Z"), determine if the question provides enough context and information for a solution to be formulated without additional options. Please follow this format for your evaluation: QUESTION: [Insert the question here] VERDICT: Respond with "YES" if the question is clear and can be directly answered based on its content alone, or "NO" if it relies on the answer choices to be understood or answered. Your response should include only the verdict without any justification or reasoning.""" |
| Stage Two: Fine-grained Filtering Prompt |
| You will assign a numerical score from 1 to 10 based on how confidently it can be answered without the choices. The scoring criteria are as follows: 1: The question is entirely dependent on its choices for an answer, making it impossible to answer without them. Example: ‘Which of the following statements is correct?’ 10: The question can be easily and confidently answered based solely on the question stem, without any need to refer to the provided options. Example: ‘What is the first law of thermodynamics in physics?’ Intermediate Scores: 2-4: The question stem gives very little information and is highly reliant on the choices for context. Example: ‘Which of these is a prime number?’ 5: The question provides some context or information, that gives a moderate possibility to answer the question. Example: ‘Which of the following best describes the structure that collects urine in the body?’ 6: The question provides a good amount of context or information, that gives a moderate possibility to answer the question. Example: ‘Statement 1 | A factor group of a non-Abelian group is non-Abelian. Statement 2 | If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G.’ 7: The question provides a good amount of context or information, that gives a high possibility to answer the question. Example: ‘The element (4, 2) of Z_12 x Z_8 has order’ 8-9: The question provides a good amount of context or information, that gives a high possibility to answer the question. Example: ‘A "dished face" profile is often associated with’ ONLY GIVE THE VALUE BETWEEN 1-10 AS YOUR ANSWER. DO NOT INCLUDE ANY OTHER INFORMATION IN YOUR RESPONSE Example Format: QUESTION: question here VERDICT: value in [1-10] here |
| GPT-4 Prompt for Verification |
| """Evaluate the answer of a AI model to a question. You will be provided with the question, the AI model’s answer, and the correct answer. Your task is to evaluate the AI model’s response and determine whether it is Correct or Incorrect. Grade the AI model answers based ONLY on their factual accuracy. It is OK if the AI model answer contains more information than the true answer, as long as it does not contain any conflicting statements. Otherwise, it should be marked as Incorrect. Ignore differences in punctuation and phrasing between the AI model’s answer and the true answer. Example Format: QUESTION: question here STUDENT ANSWER: student’s answer here TRUE ANSWER: true answer here GRADE: Correct or Incorrect here
Your response should include only the verdict without any justification or reasoning.""" |
To effectively handle this challenge of identifying whether multiple-choice questions are suitable for open-style conversion, we leverage prompting techniques to create a customized classification prompt, as shown in Table 1. In the prompt, we integrate different types of questions from different datasets to demonstrate how an LLM such as GPT-4 should evaluate whether each question can be written in an open-style way, eventually classifying it as convertible (“YES”) or non-convertible (“NO”). The prompt determines whether a question provides clear context and sufficient information without relying on the provided options. We instruct the model to omit any additional explanation by stating that “Your response should include only the verdict without any justification or reasoning.” This guarantees that the answer to each inquiry is conveyed concisely as “YES” or “NO”.
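As an illustration, the Stage-1 verdict can be obtained with a single chat-completion call. The sketch below uses the OpenAI Python client with the Stage-1 prompt of Table 1 as the system message; the model name, temperature, and parsing logic are our assumptions rather than details taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STAGE_ONE_PROMPT = "..."  # paste the Stage-1 coarse filtering prompt from Table 1 here


def classify_convertible(question: str, model: str = "gpt-4-0125-preview") -> bool:
    """Return True if the LLM judges the MCQ answerable without its options (verdict "YES")."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": STAGE_ONE_PROMPT},
            {"role": "user", "content": f"QUESTION: {question}\nVERDICT:"},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")
```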
To understand our initial filtering results, we manually conduct an error analysis by selecting 100 questions each from the “YES” and “NO” pools. In the samples classified as “YES”, we find that only around 5% of the questions are false positives, verifying a low misclassification error for the positive questions selected by our filtering strategy. Conversely, within the “NO” sample, around 40% of the questions are actually suitable for open-style conversion but are mistakenly classified as negative. This situation often arises from questions that include phrases like “Which of”. Similarly, questions involving true/false statements, sentence completions, or fill-in-the-blanks are also sometimes inappropriately classified as non-convertible. This analysis motivates us to develop a cascaded fine-grained stage that recovers more positive questions from the “NO” pool using particular prompts, as described in the following Stage 2 process.
Stage 2: Confidence Score Assignment. To overcome the issue of classifying questions with specific patterns as non-convertible, we introduce a second stage of filtering centered on confidence score assignment. This involves instructing the large language model to assign a confidence score on a scale from 1 to 10, reflecting how likely the question is to work in an open-style format. Since a significant number of questions that are unsuitable for an open-style format are categorized as “NO” and have a confidence score below 5, we set the confidence threshold to 5. Questions classified as non-convertible with a confidence score below this threshold are excluded, while the remaining ones, together with those initially classified as convertible, are moved into the “YES” category to be converted to an open-style format.
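Stage 2 can be implemented with the same call pattern, swapping in the fine-grained prompt from Table 1 and parsing the 1-10 score; a question from the “NO” pool is then kept whenever its score reaches the threshold of 5. The helper below is a hedged sketch under the same assumptions as the Stage-1 example.

```python
import re

from openai import OpenAI

client = OpenAI()

STAGE_TWO_PROMPT = "..."  # paste the Stage-2 fine-grained filtering prompt from Table 1 here


def score_confidence(question: str, model: str = "gpt-4-0125-preview") -> int:
    """Return the 1-10 confidence that the MCQ can be answered without its options."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": STAGE_TWO_PROMPT},
            {"role": "user", "content": f"QUESTION: {question}\nVERDICT:"},
        ],
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    # Fall back to the lowest score if the reply cannot be parsed, so the question is excluded.
    return min(max(int(match.group()), 1), 10) if match else 1
```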
3.3 Open-style Question Answer Evaluation
After establishing a set of convertible questions from various datasets and obtaining their responses from several LLMs, these open-style responses need to be evaluated. Given that our ground-truth answers come from the MCQ format with defined answers, we need a method for efficiently and accurately validating the correctness of responses to open-style questions. To this end, we design a customized prompt, shown in Table 1, that uses the correct MCQ answer as the ground truth to determine whether an open-style response is correct or incorrect via the prediction $\hat{y}$:
$$
\hat{y}=\texttt{LLM}_{\texttt{e}}(\text{prompt}(q,\hat{a},a)) \tag{1}
$$
where $\hat{y}$ represents the prediction and $\texttt{LLM}_{\texttt{e}}$ is the LLM evaluator. $q$, $\hat{a}$, and $a$ represent the question, the LLM-generated answer, and the correct answer from the MCQ, respectively, and the prompt is provided in Table 1. While these open-style answers are evaluated against the MCQ’s ground truth, misevaluation might arise, for example when a response is inaccurately classified as correct simply because it contains certain keywords also found in the ground truth. To tackle this issue, we include specific phrases in the prompt. Phrases such as “as long as it does not contain any conflicting statements” ensure that a response is not automatically classified as correct based on the presence of a keyword, avoiding incorrect markings when the response contradicts the correct answer. Additionally, to prevent the exclusion of correct answers that incorporate extra information, we incorporate the phrase “It is OK if the AI model’s answer contains more information than the true answer”. Furthermore, we highlight that minor differences in punctuation and phrasing between the open-style responses and the ground-truth answers should not lead to them being classified as incorrect. To assess the correctness of the LLM judgment, we randomly draw 100 responses from all models; the human evaluation in this study was conducted by the authors themselves. The agreement between the LLM evaluations and those of a human evaluator was quantitatively assessed using Cohen’s kappa [13], which yielded a score of 0.83. (The kappa score is a statistical measure of inter-rater agreement for categorical items, defined as $\kappa=\frac{P_{o}-P_{e}}{1-P_{e}}$, where $P_{o}$ is the observed agreement and $P_{e}$ is the expected agreement by chance.) This substantial kappa score verifies that the LLM’s ability to determine the correctness of responses aligns closely with human judgment, demonstrating strong reliability in its evaluation process.
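In code, Eq. (1) reduces to one grading call per response, followed by an agreement check against human labels. The sketch below is illustrative: `VERIFICATION_PROMPT` stands in for the GPT-4 verification prompt of Table 1, the lenient parsing of the verdict is our own choice, and `cohen_kappa_score` from scikit-learn computes the $\kappa$ statistic defined above.

```python
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

VERIFICATION_PROMPT = "..."  # paste the GPT-4 verification prompt from Table 1 here


def grade_response(question: str, model_answer: str, true_answer: str,
                   model: str = "gpt-4-0125-preview") -> str:
    """Eq. (1): y_hat = LLM_e(prompt(q, a_hat, a)); returns "Correct" or "Incorrect"."""
    user_msg = (f"QUESTION: {question}\n"
                f"STUDENT ANSWER: {model_answer}\n"
                f"TRUE ANSWER: {true_answer}\n"
                f"GRADE:")
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "system", "content": VERIFICATION_PROMPT},
                  {"role": "user", "content": user_msg}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return "Incorrect" if verdict.startswith("incorrect") else "Correct"


# Agreement between LLM grades and human grades on a shared sample of responses.
gpt_grades = ["Correct", "Incorrect", "Correct"]      # illustrative placeholder labels
human_grades = ["Correct", "Incorrect", "Incorrect"]  # illustrative placeholder labels
print(f"Cohen's kappa: {cohen_kappa_score(gpt_grades, human_grades):.2f}")
```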
4 An Open-style Question Benchmark (OSQ-bench)
4.1 Statistics and Distributions
Table 2 describes the basic statistics of the dataset questions that are suitable for answering in open-style format. In total, we have evaluated 42K questions from 9 different datasets and more than 23K of them are classified as appropriate for open-style answering.
Table 2: Statistics on open-style questions across different datasets.
| Dataset | # Questions | # Open-style questions | Avg. question length |
| --- | --- | --- | --- |
| MMLU | 14,042 | 7,784 | 36.6 |
| ARC | 3,428 | 3,118 | 21.1 |
| MedMCQA | 4,183 | 2,318 | 14.1 |
| CommonsenseQA | 1,221 | 710 | 13.1 |
| Race | 4,934 | 3,520 | 10.0 |
| OpenbookQA | 1,000 | 491 | 10.3 |
| WinoGrande | 1,267 | 1,267 | 19.1 |
| HellaSwag | 10,042 | 3,915 | 40.1 |
| PIQA | 1,838 | 696 | 7.1 |
| Overall | 41,955 | 23,839 | 19.05 |
4.2 Diversity
Our investigation into the diversity of questions within our benchmark is foundational for understanding the landscape of open-ended question answering. To comprehensively assess the breadth of question diversity, we conduct a systematic categorization of the question types sourced from an array of distinct datasets. From the initial pool of 41,955 questions, we refine the selection to 23,839 questions, ensuring that each one is conducive to open-ended responses. The distribution of these questions is illustrated in Figure 3, which segments the data into several domains based on question content. The segmentation underscores the interdisciplinary nature of our dataset: it features a broad spectrum of categories such as literature and reading comprehension, commonsense reasoning, domain-specific knowledge (medicine, STEM, etc.), and multi-topic knowledge. Table 2 also demonstrates the diversity of question lengths in the benchmark.
Figure 3: Diversity and distribution of used datasets for our OSQ-bench.
4.3 Quality
Our newly developed benchmark, curated from widely recognized datasets, stands out by focusing on questions suitable for open-style answering, i.e., a format that demands deep understanding and the ability to generate informative, unrestricted responses. Given that the datasets from which these questions originate are widely used and highly recognizable within the research community, the questions are of good quality for assessing the models’ capabilities. Moreover, the thorough filtering process results in a low false-positive rate (questions not suitable for open-style answering that are classified as suitable) of around 5%. This indicates that the vast majority of questions categorized as suitable for open-style answering indeed meet the criteria.
4.4 Property and Advantage
As shown in Table 3, our leaderboard exhibits several advantages. The first is debiased results compared to MCQ-based leaderboards, as discussed thoroughly above. Another advantage is faster and cheaper evaluation than crowd-user-based leaderboards: our results and rankings can be generated automatically without any human intervention.
Table 3: Comparison with different LLM leaderboards. “Biased” indicates the selection bias.
| Leaderboard | Question Type | Efficiency | Biased | Evaluation |
| --- | --- | --- | --- | --- |
| Huggingface Leaderboard [5] | Multiple Choices Questions | High | ✓ | Automatically |
| AlpacaEval Leaderboard [21] | Human Questions&Feedback | Low | ✗ | GPT-4 |
| Chatbot Arena Leaderboard [49] | Human Questions&Feedback | Low | ✗ | GPT-4/Crowdusers |
| Open-LLM-Leaderboard (Ours) | Open Style Questions | High | ✗ | GPT-4 |
5 Experiments
Figure 4: Performance comparison of various LLMs on multiple-choice (MCQ) and open-style questions (OSQ) across different datasets. The bar graphs on the left show the counts of correct and incorrect responses (✗ MCQ vs. ✓ OSQ; ✓ MCQ vs. ✗ OSQ; ✓ MCQ vs. ✓ OSQ; ✗ MCQ vs. ✗ OSQ), while the radar charts on the right illustrate the accuracy comparisons between MCQ and OSQ for each language model (Pink is the MCQ accuracy and LimeGreen is the OSQ accuracy).
Figure 5: Percentage of convertible MCQ to open style questions on various datasets.
5.1 Models
We generate responses from LLMs of different sizes. The large-scale LLMs are gpt-3.5-turbo, gpt-4-1106-preview, gpt-4o [27], claude-3-opus-20240229 [3], mistral-large-latest [24], gemini-pro [16], and llama3 [1]. We use the commercial APIs to collect responses from all of these models. The small-scale LLMs are qwen1.5 [4], gemma [39], SlimPajama-DC [35], RedPajama [25], OLMo [17], Pythia [6], TinyLlama [46], OPT [47], GPT-Neo [8], and Cerebras-GPT [14]. All of the small-scale model responses are collected using Huggingface [43] and the lm-evaluation-harness framework [15] on 4 × RTX 4090 GPUs.
5.2 Datasets
We present a brief overview of the datasets used, highlighting their distinctive characteristics and the specific aspects they aim to evaluate. MMLU [18], ARC [12], and MedMCQA [29] stand out with their comprehensive range of tasks spanning various disciplines. PIQA [7], CommonsenseQA [36], OpenBookQA [23], and HellaSwag [44] focus on different aspects of commonsense reasoning, such as physical interaction, everyday concepts, and their interrelations. RACE [19] provides a source of reading comprehension challenges. WinoGrande [34] is designed to test the model on resolving coreferences and understanding nuanced relationships in text. This dataset, with its unique fill-in-the-blank tasks, inherently aligns with open-ended question formats, negating the need for our multi-stage filtering process. For the other datasets, questions are filtered using gpt-4-0125-preview with the prompts from Table 1. The prompts for both MCQ and OSQ on each dataset are given in Appendix D.
5.3 Evaluation
Table 4: Comparison of multiple choice (MCQ) and open style questions (OSQ) accuracy.
| Dataset | GPT-4 MCQ | GPT-4 OSQ | GPT-3.5 MCQ | GPT-3.5 OSQ | Gemini-Pro MCQ | Gemini-Pro OSQ | Claude-3 Opus MCQ | Claude-3 Opus OSQ | Mistral-Large MCQ | Mistral-Large OSQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 87.28 | 74.77 | 71.25 | 65.38 | 65.71 | 56.04 | 83.52 | 70.23 | 79.50 | 68.76 |
| ARC | 95.54 | 82.68 | 90.64 | 78.42 | 90.96 | 72.35 | 97.50 | 75.47 | 89.96 | 72.32 |
| HellaSwag | 90.98 | 24.35 | 63.84 | 29.99 | 69.05 | 25.69 | 96.04 | 20.79 | 81.78 | 24.47 |
| WinoGrande | 84.14 | 66.22 | 78.77 | 64.56 | 66.85 | 56.35 | 81.69 | 63.54 | 75.45 | 56.83 |
| PIQA | 96.41 | 61.64 | 84.34 | 54.89 | 83.33 | 47.70 | 97.41 | 59.05 | 83.33 | 61.21 |
| CommonsenseQA | 84.93 | 62.96 | 79.15 | 67.89 | 66.62 | 50.56 | 86.76 | 63.66 | 69.58 | 55.35 |
| Race | 92.02 | 67.05 | 84.80 | 60.11 | 87.73 | 61.02 | 93.04 | 66.22 | 89.97 | 70.17 |
| MedMCQA | 72.65 | 51.81 | 58.02 | 41.42 | 58.02 | 35.89 | 72.91 | 49.14 | 66.05 | 43.44 |
| OpenbookQA | 94.30 | 60.29 | 83.71 | 49.90 | 86.97 | 52.55 | 93.48 | 52.95 | 88.19 | 58.66 |
| Average | 88.69 | 61.31 | 78.28 | 56.95 | 75.03 | 50.91 | 90.26 | 57.89 | 80.42 | 56.80 |
Our assessment approach for both MCQ and OSQ aligns with widely recognized evaluation frameworks and leaderboards for LLMs. The evaluation of MCQ is conducted utilizing the OpenAI Evals framework [26] with the zero-shot setting, which involves comparing the generated response with the ground truth ID. In contrast, for evaluating responses to open-ended questions, we employ the gpt-4-0125-preview model to determine the correctness of responses generated by LLMs relative to a pre-established ground truth answer from the dataset using the prompt from Table 1.
The results in Table 4 and Figure 4 are based on the filtered questions. They show that every model experiences a significant drop in accuracy on OSQ compared to MCQ. On average, OSQ accuracy is lower than MCQ accuracy by about 25% across all models. This result corroborates our concern that models may randomly guess the correct choice even when they cannot actually answer the question. The discrepancy in performance between OSQ and MCQ is not necessarily a negative reflection of the models’ overall capabilities; rather, it can be viewed as a truer comparison of the models’ abilities to process and understand diverse types of questions.
The largest difference between MCQ and OSQ accuracy is observed for Claude-3 Opus, at about 31%. The dataset with the largest drop between MCQ and OSQ is HellaSwag, which stems from the type of questions it contains: the model is asked to choose the most plausible continuation of a given scenario. Evaluating the OSQ responses of LLMs against the ground truth in this dataset is challenging because many plausible completions exist; a multitude of valid and contextually appropriate answers makes it difficult to evaluate against a single-choice ground truth. This contrasts with WinoGrande, which consists of sentences with a blank to be filled with the correct word. As a result, HellaSwag does not seem well suited for open-style questions, and we have chosen to omit it from our final leaderboard.
Table 5: Open-LLM Leaderboard for large-scale models. WG, CSQA, OBQA, and HS represent WinoGrande, CommonsenseQA, OpenbookQA, and HellaSwag, respectively. We did not include HellaSwag results in the overall accuracy due to the evaluation difficulties mentioned in Sec. 5.3.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA | HS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 | – |
| GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 | 24.35 |
| Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 | 20.79 |
| Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 | 24.47 |
| GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 | 29.99 |
| Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 | 25.69 |
| Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 | – |
Table 6: Open-LLM Leaderboard for small-scale model regime.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
| Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
| SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
| RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
| OLMo (1.2B) | 8.85 | 8.54 | 13.18 | 6.16 | 8.05 | 13.10 | 13.61 | 2.07 | 6.11 |
| Pythia (1.4B) | 8.79 | 9.66 | 14.69 | 11.52 | 4.17 | 9.01 | 12.76 | 3.19 | 5.30 |
| TinyLlama (1.1B) | 8.45 | 8.94 | 13.31 | 12.23 | 3.59 | 6.06 | 16.7 | 2.07 | 4.68 |
| OPT (1.3B) | 7.89 | 7.40 | 11.83 | 12.47 | 4.48 | 7.61 | 13.61 | 1.25 | 4.48 |
| GPT-Neo (1.3B) | 7.42 | 6.94 | 9.69 | 10.81 | 4.31 | 6.34 | 13.75 | 2.63 | 4.89 |
| Cerebras-GPT (1.3B) | 4.86 | 5.37 | 4.43 | 9.31 | 2.16 | 6.20 | 6.90 | 1.04 | 3.46 |
5.4 Leaderboard and Arena
The overall rankings of models on our benchmark are presented in Table 5 and Table 6. GPT-4o's overall performance demonstrates its leading edge, with an accuracy of 70.15%, indicating its robustness on open-style question answering compared to other models. It is followed by GPT-4-1106-preview with 65.93% and Claude-3 Opus with 62.53%. These results highlight the advanced capabilities of the GPT-4 series. Mid-tier models like Mistral Large and GPT-3.5 perform well but are not on par with the top performers. Models like Gemini 1.0 Pro and Llama3-70b-Instruct, on the other hand, lag behind in their ability to answer open-style questions.
The performance evaluation of smaller-scale LLMs reveals that Qwen1.5 leads with an overall accuracy of 21.68%, significantly outperforming the other models in this category. Gemma follows with 16.66%, leaving a considerable gap to the top model. The remaining models score below 10.00%, highlighting their limited ability to answer open-style questions. Almost all of the models struggle significantly with questions from the MedMCQA dataset, scoring below 5% accuracy.
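The exact rule used to aggregate per-dataset scores into the "Overall" column is not restated in this section; the snippet below is only a minimal sketch of two common choices (an unweighted macro-average over datasets and a question-count-weighted micro-average), using illustrative placeholder numbers rather than leaderboard results.

```python
# Minimal sketch (not the leaderboard's exact aggregation): two common ways to
# combine per-dataset open-style accuracies into a single overall score.
# All numbers below are illustrative placeholders.

def macro_average(acc_by_dataset: dict[str, float]) -> float:
    """Unweighted mean of per-dataset accuracies (in %)."""
    return sum(acc_by_dataset.values()) / len(acc_by_dataset)

def micro_average(correct: dict[str, int], total: dict[str, int]) -> float:
    """Question-weighted accuracy (in %): pool all questions before dividing."""
    return 100.0 * sum(correct.values()) / sum(total.values())

if __name__ == "__main__":
    # Hypothetical per-dataset accuracies for one model.
    acc = {"MMLU": 72.0, "ARC": 80.0, "MedMCQA": 45.0}
    print(f"macro-averaged overall: {macro_average(acc):.2f}%")

    # Hypothetical per-dataset correct/total counts for the same model.
    correct = {"MMLU": 720, "ARC": 400, "MedMCQA": 90}
    total = {"MMLU": 1000, "ARC": 500, "MedMCQA": 200}
    print(f"micro-averaged overall: {micro_average(correct, total):.2f}%")
```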
6 Conclusion
We proposed the Open-LLM-Leaderboard for LLM evaluation and comprehensively examined its efficacy using open-style questions from the nine datasets in OSQ-bench. Unlike previous works that rely on human evaluation or on thousands of crowd users in Chatbot Arena, our approach benchmarks chat LLMs in a fast, automatic, and inexpensive manner. Our results show a high level of agreement with human judgments, providing a foundation for an LLM-based evaluation benchmark and framework built on open-style questions.
Limitations and Ethics Statement
We have discussed multiple advantages of employing open-style questions over the multiple-choice questions used in prior works. However, the Open-LLM-Leaderboard, as a tool for evaluating and benchmarking LLMs, itself has several common limitations. Firstly, the performance metrics used may not fully capture the nuanced capabilities of each model, especially in areas that require an understanding of context, creativity, or common sense reasoning. Secondly, the benchmark datasets may not be comprehensive enough to cover all possible domains and scenarios, leading to a potential bias towards certain types of questions or tasks. Thirdly, due to the rapidly evolving nature of the field, models may quickly become outdated, meaning the leaderboard may not always reflect the most current state of the art. Since our benchmark utilizes public datasets and our corpus consists of questions and answers, user privacy concerns are minimal.
References
- [1] AI@Meta. Llama 3 model card. 2024.
- [2] Anthropic. Model card and evaluations for claude models, 2023.
- [3] Anthropic. https://www.anthropic.com/claude, 2024.
- [4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [5] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- [6] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- [7] Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020.
- [8] Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. If you use this software, please cite it using these metadata.
- [9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- [10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
- [11] John Chung, Ece Kamar, and Saleema Amershi. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [12] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- [13] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
- [14] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.
- [15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023.
- [16] Google. https://ai.google.dev/, 2023.
- [17] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models. Preprint, 2024.
- [18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [19] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
- [20] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- [21] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- [22] Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. On learning to summarize with large language models as references. arXiv preprint arXiv:2305.14239, 2023.
- [23] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- [24] Mistral. https://chat.mistral.ai/chat, 2024.
- [25] MosaicML. Mpt-1b redpajama-200b. https://huggingface.co/mosaicml/mpt-1b-redpajama-200b. Accessed: 2024-04-29.
- [26] OpenAI. Openai evals. https://github.com/openai/evals.
- [27] OpenAI. https://chat.openai.com/chat, 2022.
- [28] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
- [29] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, 2022.
- [30] Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- [31] Joshua Robinson, Christopher Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. ArXiv, abs/2210.12353, 2022.
- [32] Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2023.
- [33] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2024.
- [34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, aug 2021.
- [35] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. Slimpajama-dc: Understanding data combinations for llm training. arXiv preprint arXiv:2309.10818, 2023.
- [36] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [38] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [39] Gemma Team. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [41] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [42] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
- [43] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.
- [44] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [45] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- [46] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
- [47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [48] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2024.
- [49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- [50] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675, 2023.
Appendix
Appendix A Reproducibility Statement
We will make all our filtered open-style data (MMLU, ARC, HellaSwag, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, and OpenbookQA) used in the experiments of Sec. 5, along with our preprocessing scripts, publicly available. Detailed data statistics are provided in Sec. 4.1. Considering the potentially high cost of gathering and reproducing our LLM response data from scratch, we will also release all responses from the various LLMs and their corresponding evaluation results to support and simplify the reproducibility of our work. The OpenAI APIs we used include gpt-3.5-turbo-1106, gpt-4-1106-preview, and gpt-4o (for response collection), and gpt-4-0125-preview (for filtering and post-evaluation); Claude 3: claude-3-opus-20240229; Gemini Pro: gemini-pro; and Mistral: mistral-large-latest.
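For convenience, the following is a minimal sketch (not our exact collection script) of gathering an open-style response with the OpenAI Python client using one of the model identifiers listed above; the prompt wording mirrors the open-style template in Table 10, and error handling, batching, and result storage from the actual pipeline are omitted.

```python
# Minimal sketch of open-style response collection with the OpenAI Python
# client (assumes `pip install openai` and OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

def collect_osq_response(question: str, model: str = "gpt-4o") -> str:
    # Open-style prompt following the template shown in Table 10.
    prompt = f"Answer the following question.\nQuestion: {question}\nAnswer:"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic decoding for benchmark runs
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    print(collect_osq_response(
        "What system is needed for a body to get its needed supply of "
        "the gas humans breathe in?"))
```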
Appendix B More Results on Gemini Pro and Stage1 Filtering
Figure 6: Performance comparison of Gemini Pro on multiple-choice and open-style response questions across diverse datasets, as shown by the count of correct and incorrect answers in the left bar chart and model accuracy in the right radar chart.
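The counts in the left panel of Figure 6 come from a per-question cross-tabulation of MCQ correctness against open-style (OSQ) correctness for the same question. The sketch below shows that bookkeeping under assumed, hypothetical record field names; it is an illustration rather than the plotting code behind the figure.

```python
# Minimal sketch of the per-question cross-tabulation underlying Figure 6:
# each record pairs a model's MCQ correctness with its open-style (OSQ)
# correctness on the same question. Field names are hypothetical.
from collections import Counter

def cross_tabulate(records):
    """records: iterable of dicts with boolean 'mcq_correct' / 'osq_correct'."""
    counts = Counter()
    for r in records:
        counts[(r["mcq_correct"], r["osq_correct"])] += 1
    return {
        "MCQ incorrect, OSQ correct": counts[(False, True)],
        "MCQ correct, OSQ incorrect": counts[(True, False)],
        "both incorrect": counts[(False, False)],
        "both correct": counts[(True, True)],
    }

if __name__ == "__main__":
    demo = [{"mcq_correct": True, "osq_correct": False},
            {"mcq_correct": False, "osq_correct": True}]
    print(cross_tabulate(demo))
```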
Figure 7: Initial filtering results across datasets, showcasing the percentages of questions that are suitable (Yes) for open-style formatting and those that are not (No).
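Stage-1 filtering decides whether a multiple-choice question remains answerable once its options are removed, i.e., whether it is suitable for open-style formatting. The snippet below is a heavily simplified sketch of such a yes/no suitability check with an LLM judge; the exact filtering prompt used in our pipeline is not reproduced here, and the judge model name simply reuses the identifier from the reproducibility statement.

```python
# Minimal sketch of a stage-1 suitability check: ask an LLM judge whether a
# multiple-choice question still makes sense without its answer options.
# The prompt wording is a simplified stand-in, not the actual filtering prompt.
from openai import OpenAI

client = OpenAI()

def is_open_style_suitable(question: str,
                           model: str = "gpt-4-0125-preview") -> bool:
    prompt = (
        "Can the following multiple-choice question be answered meaningfully "
        "without seeing its answer options? Reply with 'Yes' or 'No' only.\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

if __name__ == "__main__":
    print(is_open_style_suitable("Let x = 1. What is x << 3 in Python 3?"))
```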
Appendix C More Sample Questions
We provide more sample questions that are correct for open-style and incorrect for multiple choice in Table 7, and vice versa in Table 8. These examples demonstrate how the model’s ability varies in handling different question formats.
Table 7: Example of an Incorrect MCQ and a Correct OSQ in MMLU.
| Question that is correct for OSQ and incorrect for MCQ: How many CO2 and ATP molecules are formed during one complete turn of the tricarboxylic acid cycle (Krebs’ cycle)? |
| --- |
| Options: A. 2CO2 and 2ATP B. 2CO2 and 16ATP C. 2CO2 and 12ATP D. 2CO2 and 1ATP |
| Ground Truth: D |
| MCQ_Answer: A |
| OSQ_Answer: During one complete turn of the tricarboxylic acid cycle (also known as the Krebs’ cycle or the citric acid cycle), a series of chemical reactions occur in the mitochondria of cells that result in the production of carbon dioxide (CO2) and adenosine triphosphate (ATP), among other molecules. |
| For each acetyl-CoA molecule that enters the cycle, the following are produced: |
| - 2 molecules of CO2: Each acetyl-CoA contributes two carbons to the cycle, which are eventually released as two molecules of carbon dioxide. |
| - 1 molecule of ATP: One molecule of ATP (or an equivalent, such as GTP, depending on the cell type) is produced directly in the cycle through substrate-level phosphorylation. |
| However, the cycle also produces reduced coenzymes, specifically NADH and FADH2, which are used in the electron transport chain to produce additional ATP through oxidative phosphorylation. For each acetyl-CoA that enters the cycle: |
| - 3 molecules of NADH are produced, which can be used to generate approximately 2.5 ATP each through the electron transport chain. |
| - 1 molecule of FADH2 is produced, which can be used to generate approximately 1.5 ATP through the electron transport chain. |
| Therefore, while only 1 ATP is produced directly in the cycle per acetyl-CoA, the NADH and FADH2 generated can lead to the production of additional ATP in the electron transport chain. The exact number of ATP molecules produced can vary depending on the efficiency of the electron transport chain and the ATP synthase complex in different organisms and cell types. |
Table 8: Example of a Correct MCQ and an Incorrect OSQ in MMLU.
| Question that is correct for MCQ and incorrect for OSQ: A tree is a connected graph with no cycles. How many nonisomorphic trees with 5 vertices exist? |
| --- |
| Options: A. 1 B. 2 C. 3 D. 4 |
| Ground Truth: C |
| MCQ_Answer: C |
| OSQ_Answer: To determine the number of nonisomorphic trees with 5 vertices, we can use Cayley’s formula, which states that the number of trees on $n$ labeled vertices is $n^{n-2}$ . However, this formula gives the number of trees when vertices are labeled, meaning that it counts different labelings of the same tree shape as distinct. Since we are interested in nonisomorphic trees (which do not consider different labelings as distinct), we cannot directly use Cayley’s formula. |
| Instead, we need to enumerate the trees by considering their possible structures. For a tree with 5 vertices, we can have the following configurations: |
| 1. A linear chain (path) where each vertex is connected to two others except for the two end vertices, which are connected to only one other vertex. This is one possible tree. |
| 2. A "star" tree where one central vertex is connected to the other four vertices. This is another possible tree. |
| 3. A tree where one vertex is connected to three others, and one of those three is connected to the fifth vertex. This structure can be seen as a "T" shape with an extra branch on one of the endpoints. There are three nonisomorphic trees of this type because the branch can be on any of the three endpoints. |
| 4. A tree where two vertices are connected to a central vertex, and each of those two vertices is connected to one other vertex (creating a structure with a central vertex, two intermediate vertices, and two leaf vertices). This structure can be seen as a "Y" shape. There is only one nonisomorphic tree of this type because any permutation of the leaf vertices does not create a fundamentally different tree structure. |
| Adding these up, we have: |
| 1 (linear chain) + 1 (star) + 3 (T-shaped with an extra branch) + 1 (Y-shaped) = 6 nonisomorphic trees with 5 vertices. |
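For reference (this is not part of the model output above), the ground-truth count of 3 (option C) can be checked directly: a tree on 5 vertices has 4 edges and hence degree sum 8, and the only degree sequences of five positive integers summing to 8 are the three below, each realized by exactly one tree up to isomorphism. The model's tally over-counts the "T"-shaped case, whose different leaf labelings are all isomorphic.

$$(1,1,2,2,2)\ \text{(path } P_5\text{)},\qquad (1,1,1,1,4)\ \text{(star } K_{1,4}\text{)},\qquad (1,1,1,2,3)\ \text{(``T''-shaped tree)}$$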
Appendix D Prompts for Different Datasets
The following are example prompts for the different datasets. Tailoring the instructions to each dataset helps models better understand and respond to its specific context and objective.
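As a minimal illustration (not our exact evaluation harness), the sketch below assembles the two prompt formats from a single question record, using the ARC templates shown in Table 10 below; the fields of `item` are hypothetical, while the template wording is taken from the table.

```python
# Minimal sketch of assembling MCQ vs. open-style prompts from one question
# record, following the ARC templates in Table 10. The `item` field names are
# hypothetical; the header strings are copied from the table.
MCQ_HEADER = ("The following is the multiple choice question. Please select "
              "the correct answer from the options A, B, C, D. For example, if "
              "you think the correct answer is A, your response should be 'A'.")
OSQ_HEADER = "Answer the following question."

def build_mcq_prompt(item: dict) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in item["options"])
    return f"{MCQ_HEADER}\nQuestion: {item['question']}\n{options}\nAnswer:"

def build_osq_prompt(item: dict) -> str:
    return f"{OSQ_HEADER}\nQuestion: {item['question']}\nAnswer:"

if __name__ == "__main__":
    item = {
        "question": ("How are plankton most useful to the animals that "
                     "live in the ocean?"),
        "options": [("A", "Plankton are colorful."),
                    ("B", "Plankton clean the water."),
                    ("C", "Plankton release oxygen."),
                    ("D", "Plankton reproduce quickly.")],
    }
    print(build_mcq_prompt(item))
    print(build_osq_prompt(item))
```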
Table 9: Prompt for MMLU dataset.
Table 10: Prompt for ARC dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| A. Plankton are colorful. |
| B. Plankton clean the water. |
| C. Plankton release oxygen. |
| D. Plankton reproduce quickly. |
| Answer: |
| Open-Style Prompt: Answer the following question. |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| Answer: |
Table 11: Prompt for CommonsenseQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D, E. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| A. bank |
| B. library |
| C. department store |
| D. mall |
| E. New York |
| Answer: |
| Open-Style Prompt: You will be presented with a variety of questions that require an understanding of everyday scenarios, human behaviors, and common sense. Your task is to provide the best possible answer to each question based solely on your understanding and reasoning. |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| Answer: |
Table 12: Prompt for MedMCQA dataset.
| MCQ Prompt: The following is the multiple choice question about medicine. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: Modulus of elasticity means: |
| A. Rigidity or stiffness of the material |
| B. Ability to be stretched with permanent deformation |
| C. Ductility of a material |
| D. Malleability of the metal |
| Answer: |
| Open-Style Prompt: Answer the following question about medicine. |
| Question: Modulus of elasticity means: |
| Answer: |
Table 13: Prompt for HellaSwag dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| A. also have a bucket that you spray paint a window in. |
| B. can reach for a running water hose and clean the inside of your rv quickly. |
| C. get a wash cloth and you put it under the faucet to get wet and then you rinse it out so it’s not soaking. |
| D. meticulously clean the window in the glass shop and then take the plastic off and start taking the hood off. |
| Answer: |
| Open-Style Prompt: Imagine you are provided with a scenario or a partial story taken from everyday life or a common activity. Your task is to continue this story or scenario in a way that makes the most sense based on what typically happens in such situations. Please complete the sentence. |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| Answer: |
Table 14: Prompt for OpenbookQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| A. the circulatory system |
| B. the digestive system |
| C. the school system |
| D. central nervous system |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| Answer: |
Table 15: Prompt for PIQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Question: How do I ready a guinea pig cage for it’s new occupants? |
| A. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish. |
| B. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish. |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: How do I ready a guinea pig cage for it’s new occupants? |
| Answer: |
Table 16: Prompt for Race dataset.
| MCQ Prompt: I will give you a passage with multiple-choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be ’A’. |
| --- |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| A. Measure the depth of the river |
| B. Look for a fallen tree trunk |
| C. Protect her cows from being drowned |
| D. Run away from the flooded farm |
| Answer: |
| Open-Style Prompt: I will give you passage with question. Please, answer the question. |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| Answer: |
Table 17: Prompt for WinoGrande dataset.
| MCQ Prompt: The following is the multiple choice question. Please put the correct words in place of _. Your response should include only the option without any justification or reasoning. Please select the correct answer from the options A, B. |
| --- |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| A. Sarah |
| B. Maria |
| Answer: |
| Open-Style Prompt: Please put the correct words in place of _. Give only the word that fits the sentence. |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| Answer: |