2406.07545v1
# Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
> Joint first author & equal contribution.
Abstract
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities, which biases the prediction of answers based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the adjustment to new ones. Another problem with MCQ is the lottery-ticket choice from "random guessing": the LLM has not learned the particular knowledge, yet the option is guessed correctly. This situation is especially serious for small-scale LLMs. For instance, on MMLU the random-guessing accuracy is 25%, and most small-scale LLMs obtain results around this value, as shown in [35, 46], making it difficult to distinguish which model is better. To address these issues, a more thorough approach is to shift from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, the transition raises its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these significant difficulties and establish a new LLM evaluation benchmark built entirely from open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track the performance of various LLMs, such as GPT-4o/4/3.5, Claude 3, and Gemini, and to reflect their true capability. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
1 Introduction
| Question that is suitable for open-style: Let x = 1. What is x << 3 in Python 3? |
| --- |
| Options: A. 1 B. 3 C. 8 D. 16 |
| Answer: C |
| Question that is not suitable for open-style: Which of the following statements is true? |
| Options: |
| A. Every equivalence relation is a partial-ordering relation. |
| B. Number of relations from A = {x, y, z} to B = {1, 2} is 64. |
| C. Empty relation _ is reflexive |
| D. Properties of a relation being symmetric and being un-symmetric are negative of each other. |
| Answer: B |
Figure 1: Examples of MCQ from MMLU.
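The first example in Figure 1 is directly checkable: Python's `<<` is the bitwise left-shift operator, so shifting by 3 bits multiplies by 2**3.

```python
x = 1
# Left-shifting by 3 bits multiplies by 2**3, so 1 << 3 == 8 (answer C)
print(x << 3)
```
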
Large language models (LLMs) are increasingly excelling at various natural language processing tasks, including text generation [11], translation [45, 50], summarization [22], code generation [20, 33], and chatbot interaction [28]. With their rising capability, a robust evaluation strategy that can accurately assess the performance of these models is becoming crucial in order to identify their true effectiveness and choose the most appropriate one for a given task. Common metrics for assessing LLMs today include relevance, frequency of hallucinations, accuracy in question answering, toxicity, and retrieval-specific metrics, among others. In the context of question-answering evaluations, prior works usually investigate the model's performance in terms of answer accuracy, courtesy, and conciseness. Multiple-choice questions (MCQ) have emerged as a predominant format for such assessments, wherein a question is presented with several possible responses, and the model is required to select the most fitting choice ID, as exemplified in Figure 1. Lately, the MCQ format has seen widespread application in LLM-focused contexts, including benchmarks [18, 44, 12] that examine LLM capabilities and automated/crowdsourcing evaluation frameworks [21, 49, 5] that streamline the assessment process.
However, previous studies [48, 32] have shown that LLMs' lack of resilience to changes in the positioning of options stems from their tendency to exhibit biased behavior: they often favor certain option IDs (such as "Option A") as responses, a phenomenon referred to as selection bias. Moreover, selection bias exists widely across various LLMs and cannot be mitigated by simple prompting skills. The underlying reason for this issue is that the model is trained with a priori distributions that assign higher probabilities to specific ID tokens. Another issue with MCQ is "random guessing", as discussed in [35]. Specifically, small models, such as 1B-level variants, may struggle to achieve reliable predictions on many benchmarks like MMLU, which uses four choices as the answer candidates for each question. Their results can resemble random choices, not truly capturing the model's actual capabilities.
To fundamentally eliminate selection bias and random guessing in LLMs, in this work we build an open-style question benchmark for LLM evaluation. Leveraging this benchmark, we present the Open-LLM-Leaderboard, a new automated framework designed to refine the assessment process of LLMs. This framework serves as a supplement to prior evaluation frameworks such as [21, 49, 5], with several advantages as presented in Sec. 4.4. However, constructing such a benchmark poses two significant challenges: (1) how to determine the appropriate questions that can be effectively transformed from MCQ into open-style questions, and (2) how to establish an approach to accurately validate the correctness of the LLM's open-style answers against human-annotated ground truths, especially in contrast to MCQ, which typically has a defined single-choice standard answer.
For the first challenge of identifying the multiple-choice questions that are suitable for converting to open-style questions, we design an automatic coarse-to-fine selection protocol through customized prompts and a multi-stage filtering process. Specifically, in the first stage, we use binary classification to place questions filtered with high confidence into the positive pool, while the others are assigned to the negative pool. Our second stage uses a soft scoring method (1-10 ratings) to judge the suitability for open-style conversion of the questions classified as negative in the first stage. For the second challenge of evaluating the correctness of the LLM's open-style answers against human-annotated ground truths, we further design a task-specific prompt and leverage GPT-4 to examine whether the response is correct. To validate the accuracy of this automatic evaluation strategy, we randomly sample 100 results, manually check the automatic evaluation results against the corresponding responses, and confirm that it is reliable with an error rate of less than 5%.
In our end-to-end assessment of the LLM evaluation and ranking process, we conduct a comprehensive analysis of well-recognized LLMs, including GPT-4o, GPT-4, ChatGPT, Claude-3 Opus, Gemini-Pro, and Mistral-Large. Our benchmarking results indicate that GPT-4o currently holds the position as the strongest LLM. We further provide a small-regime LLM leaderboard targeting LLMs smaller than 3B. Moreover, our study demonstrates a high correlation between the rankings produced by our open-style benchmark and those derived from user-based evaluations or direct human assessments.
2 Related Work
Large Language Models (LLMs). Recent advancements in LLMs, such as GPT-3 [9] and GPT-4 [28] have had a significant impact in the field of natural language processing and have found widespread application across various domains. It has indeed initiated a kind of chain reaction within the community and beyond. As each new iteration of LLMs demonstrates enhanced capabilities, organizations and researchers across various sectors are motivated to develop their own models, such as LLaMA [40, 41], Gemini [38], and Claude [2], or find innovative ways to improve existing LLMs through instruction tuning, like Alpaca [37], and Vicuna [10].
Multiple Choice Questions (MCQ). In the realm of LLM research, MCQ has become a pivotal tool for evaluating and enhancing the capabilities of these models. Notable datasets like the MMLU [18], HellaSwag [44], and ARC [12] have been instrumental in this regard. Their diverse assessment of broad knowledge and commonsense reasoning help in benchmarking the depth and versatility of LLMs in understanding, reasoning, and applying knowledge across various domains. MCSB [31] introduces a natural prompting strategy for LLMs, which presents questions and answer choices together, allowing the model to explicitly compare options.
Bias in LLMs. Selection bias, a specific form of bias relevant to the evaluation of LLMs through MCQ, has garnered attention due to its understated and widespread impact. A series of works [49, 30, 48, 42] have shown that LLMs may develop a propensity to favor certain answer choices based on their position or encoding, such as the alphabetical ordering of A/B/C/D in MCQ. This phenomenon can lead to skewed evaluation results, misrepresenting a model's true understanding and reasoning capabilities.
3 Approach
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Multiple-Choice and Open-Style Question Processing Path
### Overview
The image is a flowchart illustrating the process of handling multiple-choice and open-style questions using Large Language Models (LLMs) and GPT-4. It outlines the steps involved in filtering, classifying, evaluating, and comparing the two question formats.
### Components/Axes
The diagram is divided into two main paths: "Multiple-Choice Questions Path" at the top and "Open-Style Questions Path" at the bottom.
**Multiple-Choice Questions Path:**
* **Start:** A rounded rectangle labeled "START" at the top-left.
* **Multiple-choice question datasets collection:** A database icon with multiple documents flowing into it.
* **Collect the responses from LLMs in a multiple-choice format:** A speech bubble icon with "A" and "Q" inside.
* **Result Evaluation (accuracy):** A computer screen icon displaying a checklist.
* **Decision Point 1:** A database icon split into "YES" and "No" categories, indicating whether questions can be written in an open style.
* **Decision Point 2:** A gauge icon representing the confidence score assignment (ranging from 1 to 10) using GPT-4.
* **Decision Point 3:** A branching point based on whether the confidence score is "Greater than threshold" or "Less than threshold."
* **Decision Point 4:** "Move to 'YES' category"
* **Remove:** A trash can icon.
* **Comparative analysis of both formats:** A computer screen icon displaying a chart and a pie chart.
**Open-Style Questions Path:**
* **Collect the responses from LLMs in an open-style format:** A speech bubble icon with a question mark.
* **Design a prompt for an evaluation:** An AI chip icon.
* **Result Evaluation (accuracy):** A computer screen icon displaying a checklist.
**Shared Elements:**
* **Utilize GPT-4 to filter MCQs that can be written as an open style:** A funnel icon.
### Detailed Analysis or Content Details
**Multiple-Choice Questions Path:**
1. **Start:** The process begins with a collection of multiple-choice question datasets.
2. **Collect Responses:** LLMs generate responses in a multiple-choice format.
3. **Result Evaluation:** The accuracy of the responses is evaluated.
4. **Classification:** Questions are classified as either 'YES' (can be written in an open style) or 'No' (cannot be written in an open style) using GPT-4.
5. **Confidence Score:** A confidence score ranging from 1 to 10 is assigned using GPT-4.
6. **Thresholding:** If the confidence score is greater than a threshold, the question is moved to the 'YES' category. If it is less than the threshold, the question is removed.
7. **Comparative Analysis:** A comparative analysis of both formats (multiple-choice and open-style) is performed.
**Open-Style Questions Path:**
1. **Collect Responses:** LLMs generate responses in an open-style format.
2. **Prompt Design:** A prompt is designed for evaluation.
3. **Result Evaluation:** The accuracy of the responses is evaluated.
**Filtering:**
* GPT-4 is used to filter MCQs that can be written as an open style.
### Key Observations
* The diagram illustrates a comprehensive process for handling both multiple-choice and open-style questions using LLMs and GPT-4.
* The multiple-choice path includes a classification and confidence scoring step, while the open-style path focuses on prompt design and evaluation.
* A comparative analysis is performed to compare the results of both formats.
### Interpretation
The diagram demonstrates a systematic approach to leveraging LLMs for question processing. The multiple-choice path incorporates a filtering mechanism to identify questions suitable for open-style conversion, potentially enhancing the dataset's versatility. The confidence scoring step adds a layer of quality control, ensuring that only reliable questions are moved to the 'YES' category. The comparative analysis suggests an effort to understand the strengths and weaknesses of each question format, potentially informing future question design and evaluation strategies. The use of GPT-4 throughout the process highlights its role in question classification, confidence scoring, and filtering, indicating its importance in the overall workflow.
</details>
Figure 2: An overview of a dual-path evaluation pipeline for LLMs, starting with the collection of MCQ datasets. It branches into two paths, with the MCQ path proceeding directly from response collection to evaluation, while the open-style path passes through an additional filtering phase. After evaluation, both paths converge in a comparative analysis.
3.1 Defining Open-style Questions
Open-style questions, also known as open-ended questions, require the model to generate an answer without being constrained by a set of predetermined choices. In the context of LLM evaluation, these questions are designed to assess the model's ability to generate coherent, relevant, and contextually appropriate responses based on the input query. While multiple-choice questions can efficiently assess specific factual knowledge and comprehension, open-style questions offer deeper insight into the LLM's generative capabilities, understanding of context, and ability to engage with complex tasks. Also, open-style questions avoid the inherent selection-bias and random-guessing weaknesses of multiple-choice questions.
3.2 Automatic Open-style Question Filtering and Generation
Multi-stage Filtering and Postprocessing via a Coarse-to-fine Process. Our proposed multi-stage filtering approach consists of four main steps to streamline the conversion: (1) Initially classify questions as either convertible or non-convertible. (2) Assign each question a confidence score to indicate the likelihood that it can be framed as an open-style question. (3) Exclude questions that are classified as non-convertible and have confidence scores below a specified threshold. (4) Combine questions that are labeled as non-convertible but have high confidence scores with those labeled as convertible.
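The four steps above reduce to simple selection logic. A minimal sketch, where `classify_convertible` and `score_confidence` are hypothetical stand-ins for the GPT-4 calls driven by the prompts in Table 1:

```python
def select_open_style(questions, classify_convertible, score_confidence, threshold=5):
    """Coarse-to-fine filtering: binary classification, 1-10 rescoring of the
    rejected pool, then merging of the two positive sets."""
    convertible, rejected = [], []
    # Step (1): coarse binary classification into convertible / non-convertible
    for q in questions:
        (convertible if classify_convertible(q) == "YES" else rejected).append(q)
    # Step (2): assign each rejected question a 1-10 confidence score
    scored = [(q, score_confidence(q)) for q in rejected]
    # Step (3): drop rejected questions scoring below the threshold
    recovered = [q for q, s in scored if s >= threshold]
    # Step (4): merge recovered questions with the initially convertible ones
    return convertible + recovered
```

The stand-in judges can be any callables, which makes the selection logic testable without an LLM in the loop.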
Stage 1: Preliminary Filtering using Binary Classification. Considering that the structure of MCQ varies, converting them into an open-style format is not always possible, particularly because certain questions are strongly linked to their choices, for instance, questions formulated as "Which one of the following is true", "All except", or "Which of these". Such questions are typically unsuitable for conversion into an open-style format, since the absence of the options would change the question's core, resulting in incomplete questions.
Table 1: Prompt design for two-stage filtering and post verification.
| Stage One: Coarse Filtering Prompt |
| --- |
| """Your task is to review a series of multiple-choice questions and evaluate their ability to be answered without the provided answer choices. For questions that begin with an incomplete sentence (e.g., "During swallowing, ..."), use your knowledge to attempt to complete the sentence accurately. For direct questions that ask for specific information or identification (e.g., "Which of the following structures is part of the small intestine?"), assess whether the question is formulated clearly enough that an informed answer can be given without seeing the multiple-choice options. For mathematical or analytical questions (e.g., "Find all cosets of the subgroup 4Z of 2Z"), determine if the question provides enough context and information for a solution to be formulated without additional options. Please follow this format for your evaluation: QUESTION: [Insert the question here] VERDICT: Respond with "YES" if the question is clear and can be directly answered based on its content alone, or "NO" if it relies on the answer choices to be understood or answered. Your response should include only the verdict without any justification or reasoning.""" |
| Stage Two: Fine-grained Filtering Prompt |
| You will assign a numerical score from 1 to 10 based on how confidently it can be answered without the choices. The scoring criteria are as follows: 1: The question is entirely dependent on its choices for an answer, making it impossible to answer without them. Example: "Which of the following statements is correct?" 10: The question can be easily and confidently answered based solely on the question stem, without any need to refer to the provided options. Example: "What is the first law of thermodynamics in physics?" Intermediate Scores: 2-4: The question stem gives very little information and is highly reliant on the choices for context. Example: "Which of these is a prime number?" 5: The question provides some context or information, that gives a moderate possibility to answer the question. Example: "Which of the following best describes the structure that collects urine in the body?" 6: The question provides a good amount of context or information, that gives a moderate possibility to answer the question. Example: "Statement 1 \| A factor group of a non-Abelian group is non-Abelian. Statement 2 \| If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G." 7: The question provides a good amount of context or information, that gives a high possibility to answer the question. Example: "The element (4, 2) of Z_12 x Z_8 has order" 8-9: The question provides a good amount of context or information, that gives a high possibility to answer the question. Example: "A 'dished face' profile is often associated with" ONLY GIVE THE VALUE BETWEEN 1-10 AS YOUR ANSWER. DO NOT INCLUDE ANY OTHER INFORMATION IN YOUR RESPONSE. Example Format: QUESTION: question here VERDICT: value in [1-10] here |
| GPT-4 Prompt for Verification |
| """Evaluate the answer of a AI model to a question. You will be provided with the question, the AI modelâs answer, and the correct answer. Your task is to evaluate the AI modelâs response and determine whether it is Correct or Incorrect. Grade the AI model answers based ONLY on their factual accuracy. It is OK if the AI model answer contains more information than the true answer, as long as it does not contain any conflicting statements. Otherwise, it should be marked as Incorrect. Ignore differences in punctuation and phrasing between the AI modelâs answer and the true answer. Example Format: QUESTION: question here STUDENT ANSWER: studentâs answer here TRUE ANSWER: true answer here GRADE: Correct or Incorrect here
Your response should include only the verdict without any justification or reasoning.""" |
To effectively handle this challenge of identifying whether multiple-choice questions are suitable for open-style conversion, we leverage prompting techniques to create a customized classification prompt, as shown in Table 1. In the prompt, we integrate different types of questions from different datasets to demonstrate how an LLM like GPT-4 may evaluate whether each question can be written in an open-style way, eventually classifying it as convertible "YES" or non-convertible "NO". The prompt determines whether a question provides clear context and information without relying on the provided options. We set the prompt to eliminate any additional explanations by stating that "Your response should include only the verdict without any justification or reasoning." This guarantees that the answer to each inquiry is conveyed concisely as "YES" or "NO".
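A minimal sketch of how a question might be packaged with the Stage One prompt for a chat-style LLM API. The prompt text is abbreviated with "..." here, and the system/user message format is the common chat convention rather than any specific vendor API:

```python
# Stage One prompt from Table 1, abbreviated here for brevity.
COARSE_FILTER_PROMPT = (
    "Your task is to review a series of multiple-choice questions and evaluate "
    "their ability to be answered without the provided answer choices. ... "
    'Respond with "YES" if the question is clear and can be directly answered '
    'based on its content alone, or "NO" if it relies on the answer choices. '
    "Your response should include only the verdict without any justification "
    "or reasoning."
)

def build_filter_messages(question: str) -> list:
    # System message carries the filtering instructions; the user message
    # supplies the question in the QUESTION/VERDICT format the prompt expects.
    return [
        {"role": "system", "content": COARSE_FILTER_PROMPT},
        {"role": "user", "content": f"QUESTION: {question}\nVERDICT:"},
    ]

def parse_verdict(response_text: str) -> bool:
    # The prompt constrains the reply to a bare "YES" or "NO".
    return response_text.strip().upper().startswith("YES")
```

Keeping prompt assembly and verdict parsing as plain functions makes the filter testable independently of whichever model serves as the judge.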
To understand our initial filtering results, we conduct a manual error analysis by selecting 100 questions from the "YES" and "NO" pools separately. In the samples classified as "YES", we find that only around 5% of the questions are false-positive cases, verifying a low misclassification error for the positive question selection by our filtering strategy. Conversely, within the "NO" sample, around 40% of the questions are actually suitable for open-style questions but mistakenly classified as negative. This situation often arises from questions that include phrases like "Which of". Similarly, questions involving true/false statements, sentence completions, or fill-in-the-blanks are also sometimes inappropriately classified as non-convertible. This analysis motivates us to develop a cascaded fine-grained stage to further recover positive questions from the "NO" pool using particular prompts, as described in the following Stage 2 process.
Stage 2: Confidence Score Assignment. To overcome the issue of classifying questions with specific patterns as non-convertible, we introduce a second stage of filtering centered on confidence score assignment. This involves instructing the large language model to assign a confidence score on a scale from 1 to 10, reflecting the possibility of the question being written in an open-style format. Since a significant number of questions unsuitable for an open-style format are categorized as "NO" and have a confidence score below 5, we set the confidence threshold to 5. Therefore, questions classified as non-convertible with a confidence score lower than this threshold are excluded, while those at or above the threshold, together with those initially classified as convertible, are moved into the "YES" category to be converted to an open-style format.
3.3 Open-style Question Answer Evaluation
After establishing a set of convertible questions from various datasets and obtaining their responses from several LLMs, these responses need to be evaluated. Given that our ground-truth answers are based on the MCQ format with defined answers, we need a method for efficiently and accurately validating the correctness of responses to open-style questions. To this end, we design a customized prompt, as shown in Table 1, that utilizes the correct MCQ answer as the ground truth to determine whether the open-style responses are correct or incorrect via the prediction $\hat{y}$:
$$
\hat{y}=\texttt{LLM}_{\texttt{e}}(\text{prompt}(q,\hat{a},a)) \tag{1}
$$
where $\hat{y}$ represents the prediction and $\texttt{LLM}_{\texttt{e}}$ is the LLM evaluator. $q$, $\hat{a}$, and $a$ denote the question, the LLM-generated answer, and the correct answer from the MCQ, respectively, and the prompt is provided in Table 1. While these open-style answers are evaluated against the MCQ ground truth, misevaluation might arise, e.g., a response inaccurately classified as correct simply because it contains certain keywords also found in the ground truth. To tackle this issue, we include specific phrases in the prompt. Phrases such as "as long as it does not contain any conflicting statements" ensure that a response is not automatically classified as correct based on the presence of a keyword, avoiding incorrect markings when the response contradicts the correct answer. Additionally, to prevent the exclusion of correct answers that incorporate extra information, we include the phrase "It is OK if the AI model's answer contains more information than the true answer". Furthermore, we emphasize that minor differences in punctuation and phrasing between the open-style responses and the ground-truth answers should not lead to their being classified as incorrect. To check the correctness of the LLM judgement, we randomly draw 100 responses from all models; the human evaluation for our study was conducted by the authors themselves. The agreement between the LLM evaluations and those of a human evaluator was quantitatively assessed using Cohen's kappa [13] (a statistical measure of inter-rater agreement for categorical items, defined as $\kappa=\frac{P_{o}-P_{e}}{1-P_{e}}$, where $P_{o}$ is the observed agreement and $P_{e}$ is the expected agreement by chance), which yielded a score of 0.83. This substantial kappa score verifies that the LLM's ability to determine the correctness of responses aligns closely with human judgment, demonstrating strong reliability in its evaluation process.
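The kappa formula above is straightforward to compute from two label sequences; a minimal sketch for the Correct/Incorrect case:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa = (P_o - P_e) / (1 - P_e) for two equal-length label lists."""
    n = len(rater_a)
    # Observed agreement P_o: fraction of items both raters label identically
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected chance agreement P_e from each rater's marginal label frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.83 thus means the LLM-human agreement is well above what the raters' marginal label frequencies would produce by chance.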
4 An Open-style Question Benchmark (OSQ-bench)
4.1 Statistics and Distributions
Table 2 describes the basic statistics of the dataset questions that are suitable for answering in open-style format. In total, we have evaluated 42K questions from 9 different datasets and more than 23K of them are classified as appropriate for open-style answering.
Table 2: Statistics on open-style questions across different datasets.
| Dataset | # Questions | # Open-style | Avg. Length |
| --- | --- | --- | --- |
| MMLU | 14,042 | 7,784 | 36.6 |
| ARC | 3,428 | 3,118 | 21.1 |
| MedMCQA | 4,183 | 2,318 | 14.1 |
| CommonsenseQA | 1,221 | 710 | 13.1 |
| Race | 4,934 | 3,520 | 10.0 |
| OpenbookQA | 1,000 | 491 | 10.3 |
| WinoGrande | 1,267 | 1,267 | 19.1 |
| HellaSwag | 10,042 | 3,915 | 40.1 |
| PIQA | 1,838 | 696 | 7.1 |
| Overall | 41,955 | 23,839 | 19.05 |
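The per-dataset retention rate (the share of questions deemed convertible) follows directly from the counts in Table 2:

```python
# (total questions, questions kept as open-style) per dataset, from Table 2
counts = {
    "MMLU": (14042, 7784), "ARC": (3428, 3118), "MedMCQA": (4183, 2318),
    "CommonsenseQA": (1221, 710), "Race": (4934, 3520), "OpenbookQA": (1000, 491),
    "WinoGrande": (1267, 1267), "HellaSwag": (10042, 3915), "PIQA": (1838, 696),
}
for name, (total, kept) in counts.items():
    print(f"{name:15s} {100 * kept / total:5.1f}% convertible")
# Overall, 23,839 of 41,955 questions (about 57%) survive the filtering;
# WinoGrande is fully convertible, while PIQA retains the smallest share.
```
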
4.2 Diversity
Our investigation into the diversity of questions within our benchmark is foundational for understanding the landscape of open-ended question answering. To comprehensively assess the breadth of question diversity, we conduct a systematic categorization of the question types sourced from an array of distinct datasets. From the total initial pool of 41,955 questions, we refine the selection to 23,839 questions, ensuring that each one is conducive to open-ended responses. The distribution of these questions is illustrated in Figure 3, which segments the data into several domains based on the content of the questions. The segmentation underscores the interdisciplinary nature of our dataset: it features a broad spectrum of categories such as literature and reading comprehension, commonsense reasoning, domain-specific knowledge (medicine, STEM, etc.), and multi-topic knowledge. Also, Table 2 demonstrates the diversity of question lengths in the benchmark.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Circular Partition Diagram: MMLU Task Distribution
### Overview
The image is a circular partition diagram illustrating the distribution of tasks within the MMLU (Massive Multitask Language Understanding) benchmark. The diagram is divided into concentric rings, with the inner rings representing broader categories and the outer rings representing more specific tasks. The size of each segment corresponds to the relative proportion of tasks within that category.
### Components/Axes
* **Center:** MMLU (Massive Multitask Language Understanding)
* **First Ring:**
* STEM (Science, Technology, Engineering, and Mathematics) - Light Blue
* Miscellaneous - Light Blue
* Humanities - Light Blue
* Social Sciences - Light Blue
* Activity Prediction - Salmon
* Situational Reasoning - Salmon
* Language Analysis - Teal
* Critical Reading - Teal
* Natural Sciences - Light Purple
* Technology - Light Purple
* Medical Specialties - Light Orange
* Healthcare - Light Orange
* Linguistic Patterns - Cyan
* Race - Green-Teal
* ARC - Purple
* HellaSwag - Red-Orange
* **Second Ring:**
* Literature Comprehension - Teal
* Mathematical Reasoning - Light Purple
* Clinical Knowledge - Light Orange
* Coreference Resolution - Cyan
* Physical Commonsense - Light Green
* World Knowledge - Light Green
* Temporal Commonsense - Light Green
* Social Commonsense - Light Green
* WinoGrande - Light Green
* CommonsenseQA - Light Green
* PIQA - Light Green
* OpenbookQA - Light Green
* Predictive Reasoning - Light Green
* Spatial-Temporal Reasoning - Light Green
* Analytical Reasoning - Light Green
* Conceptual Understanding - Light Green
* Common Knowledge - Light Green
* MedMCQA - Light Orange
### Detailed Analysis or Content Details
The diagram is structured as follows:
* **MMLU (Center):** The central node represents the overall MMLU benchmark.
* **Broad Categories (First Ring):** The first ring divides the tasks into broad categories such as STEM, Humanities, Social Sciences, Activity Prediction, Language Analysis, Natural Sciences, Technology, Medical Specialties, Linguistic Patterns, Race, ARC, and HellaSwag.
* **Specific Tasks (Second Ring):** The second ring further subdivides some of the broad categories into more specific tasks. For example, the area near "Linguistic Patterns" is subdivided into "Physical Commonsense", "World Knowledge", "Temporal Commonsense", "Social Commonsense", "WinoGrande", "CommonsenseQA", "PIQA", "OpenbookQA", "Predictive Reasoning", "Spatial-Temporal Reasoning", "Analytical Reasoning", "Conceptual Understanding", and "Common Knowledge".
**Color-Coded Categories:**
* **Light Blue:** STEM, Miscellaneous, Humanities, Social Sciences
* **Salmon:** Activity Prediction, Situational Reasoning
* **Teal:** Language Analysis, Critical Reading, Literature Comprehension
* **Light Purple:** Natural Sciences, Technology, Mathematical Reasoning
* **Light Orange:** Medical Specialties, Healthcare, Clinical Knowledge, MedMCQA
* **Cyan:** Linguistic Patterns, Coreference Resolution
* **Green-Teal:** Race
* **Purple:** ARC
* **Red-Orange:** HellaSwag
* **Light Green:** Common Knowledge, Conceptual Understanding, Analytical Reasoning, Spatial-Temporal Reasoning, Predictive Reasoning, Physical Principles, Social Commonsense, World Knowledge, Temporal Commonsense, Physical Commonsense, WinoGrande, CommonsenseQA, PIQA, OpenbookQA
### Key Observations
* The diagram provides a visual representation of the diversity of tasks included in the MMLU benchmark.
* The size of each segment reflects the relative proportion of tasks within that category.
* The color-coding helps to group related tasks together.
* The outer ring provides a more granular breakdown of specific tasks within certain categories.
### Interpretation
The circular partition diagram offers a clear overview of the MMLU benchmark's composition. It highlights the breadth of knowledge and reasoning abilities required to perform well on this benchmark. The diagram suggests that MMLU covers a wide range of subjects, from STEM fields to the humanities and social sciences, and includes tasks that require various types of reasoning, such as logical, spatial, and temporal reasoning. The distribution of tasks across different categories can provide insights into the strengths and weaknesses of different language models. For example, a model that performs well on STEM tasks but poorly on humanities tasks may have a bias towards scientific knowledge. The diagram also reveals the relative importance of different types of knowledge and reasoning in the MMLU benchmark. For instance, the large segment dedicated to STEM suggests that scientific knowledge is a significant component of the benchmark.
</details>
Figure 3: Diversity and distribution of used datasets for our OSQ-bench.
4.3 Quality
Our newly developed benchmark, curated from widely recognized datasets, stands out by focusing on questions suitable for open-style answering, i.e., a format that demands deep understanding and the ability to generate informative, unrestricted responses. Since the source datasets are widely used and well established in the research community, the selected questions are of sufficient quality to assess models' capabilities. Moreover, thanks to the thorough filtering process, the benchmark has a low false positive rate (questions not suitable for open-style answering that are classified as suitable) of around 5%. This indicates that the vast majority of questions categorized as suitable for open-style answers indeed meet the criteria.
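To make the false-positive-rate metric concrete, here is a minimal sketch of how such a rate could be estimated from a manual audit of the filtered pool. The audit counts below are hypothetical illustrations, not figures from the paper.

```python
# Hypothetical audit: sample questions labeled "suitable for open-style"
# and count how many a human reviewer judges as actually unsuitable.
audited_suitable = 200   # questions sampled from the "suitable" pool (assumed)
found_unsuitable = 10    # of those, flagged as not actually suitable (assumed)

# False positive rate = unsuitable-but-classified-suitable / audited total.
fpr = found_unsuitable / audited_suitable
print(f"{fpr:.1%}")  # → 5.0%
```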
4.4 Property and Advantage
As shown in Table 3, our leaderboard exhibits several advantages. The first is debiased results compared to MCQ-based leaderboards, as discussed thoroughly above. Another advantage is faster and cheaper evaluation than crowduser-based leaderboards: our results and rankings are generated automatically, without any human intervention.
Table 3: Comparison with different LLM leaderboards. "Biased" indicates the selection bias.
| Leaderboard | Question Type | Speed | Biased | Evaluation |
| --- | --- | --- | --- | --- |
| Huggingface Leaderboard [5] | Multiple-Choice Questions | High | ✓ | Automatic |
| AlpacaEval Leaderboard [21] | Human Questions & Feedback | Low | ✗ | GPT-4 |
| Chatbot Arena Leaderboard [49] | Human Questions & Feedback | Low | ✗ | GPT-4 / Crowdusers |
| Open-LLM-Leaderboard (Ours) | Open-Style Questions | High | ✗ | GPT-4 |
5 Experiments
<details>
<summary>x3.png Details</summary>

### Visual Description
## Chart/Diagram Type: Comparative Performance Charts
### Overview
The image presents a comparative analysis of four language models (GPT-4, GPT-3.5, Claude-3 Opus, and Mistral-large) across various datasets, visualized using a combination of bar charts and radar charts. The bar charts show the counts of correct and incorrect answers for multiple-choice questions (MCQs) and open-style questions (OSQs) on each dataset. The radar charts display the MCQ and OSQ accuracies across the same datasets.
### Components/Axes
**General Components:**
* Four subplots, each representing a different language model.
* Each subplot contains a bar chart and a radar chart.
* The datasets used are: MMLU, HellaSwag, Race, ARC, MedMCQA, WinoGrande, CommonsenseQA, PIQA, and OpenbookQA.
**Bar Chart Components:**
* **Y-axis:** "Count", ranging from 0 to 8000, with increments of 2000.
* **X-axis:** "Dataset", listing the datasets mentioned above.
* **Legend (Top):**
* Orange: "Incorrect MCQs, Correct OSQs"
* Green: "Correct MCQs, Incorrect OSQs"
* Red: "Correct MCQs, Correct OSQs"
* Gray: "Incorrect MCQs, Incorrect OSQs"
**Radar Chart Components:**
* The radar charts display accuracies ranging from 0.2 to 1.0, with increments of 0.2.
* The datasets are arranged around the perimeter of the radar chart.
* **Legend (Top):**
* Pink Line: "MCQs Accuracies"
* Green Line: "OSQs Accuracies"
### Detailed Analysis
#### (1) GPT-4
* **Bar Chart:**
* MMLU: Red bar ~2800, Green bar ~100, Gray bar ~3800, Orange bar ~1000
* HellaSwag: Red bar ~2000, Green bar ~1800, Gray bar ~100, Orange bar ~3800
* Race: Red bar ~2200, Green bar ~1000, Gray bar ~100, Orange bar ~3000
* ARC: Red bar ~2000, Green bar ~1000, Gray bar ~100, Orange bar ~2800
* MedMCQA: Red bar ~1800, Green bar ~100, Gray bar ~100, Orange bar ~1200
* WinoGrande: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~1000
* CommonsenseQA: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~800
* PIQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~600
* OpenbookQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~600
* **Radar Chart:**
* MCQs Accuracies (Pink): Ranges from approximately 0.5 (PIQA) to 0.9 (HellaSwag).
* OSQs Accuracies (Green): Ranges from approximately 0.4 (PIQA) to 0.8 (HellaSwag).
#### (2) GPT-3.5
* **Bar Chart:**
* MMLU: Red bar ~2000, Green bar ~100, Gray bar ~4800, Orange bar ~1000
* HellaSwag: Red bar ~1800, Green bar ~1000, Gray bar ~100, Orange bar ~3000
* Race: Red bar ~1800, Green bar ~800, Gray bar ~100, Orange bar ~2200
* ARC: Red bar ~1600, Green bar ~800, Gray bar ~100, Orange bar ~2000
* MedMCQA: Red bar ~1400, Green bar ~100, Gray bar ~100, Orange bar ~1000
* WinoGrande: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~800
* CommonsenseQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~600
* PIQA: Red bar ~400, Green bar ~100, Gray bar ~100, Orange bar ~400
* OpenbookQA: Red bar ~400, Green bar ~100, Gray bar ~100, Orange bar ~400
* **Radar Chart:**
* MCQs Accuracies (Pink): Ranges from approximately 0.4 (PIQA) to 0.8 (HellaSwag).
* OSQs Accuracies (Green): Ranges from approximately 0.3 (PIQA) to 0.7 (HellaSwag).
#### (3) Claude-3 Opus
* **Bar Chart:**
* MMLU: Red bar ~5800, Green bar ~100, Gray bar ~1000, Orange bar ~1000
* HellaSwag: Red bar ~3000, Green bar ~800, Gray bar ~100, Orange bar ~3800
* Race: Red bar ~2800, Green bar ~800, Gray bar ~100, Orange bar ~2200
* ARC: Red bar ~2600, Green bar ~800, Gray bar ~100, Orange bar ~2000
* MedMCQA: Red bar ~2000, Green bar ~100, Gray bar ~100, Orange bar ~1000
* WinoGrande: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~800
* CommonsenseQA: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~600
* PIQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~400
* OpenbookQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~400
* **Radar Chart:**
* MCQs Accuracies (Pink): Ranges from approximately 0.5 (PIQA) to 0.9 (HellaSwag).
* OSQs Accuracies (Green): Ranges from approximately 0.4 (PIQA) to 0.8 (HellaSwag).
#### (4) Mistral-large
* **Bar Chart:**
* MMLU: Red bar ~6000, Green bar ~100, Gray bar ~800, Orange bar ~1000
* HellaSwag: Red bar ~3000, Green bar ~800, Gray bar ~100, Orange bar ~3800
* Race: Red bar ~2800, Green bar ~800, Gray bar ~100, Orange bar ~2200
* ARC: Red bar ~2600, Green bar ~800, Gray bar ~100, Orange bar ~2000
* MedMCQA: Red bar ~2000, Green bar ~100, Gray bar ~100, Orange bar ~1000
* WinoGrande: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~800
* CommonsenseQA: Red bar ~800, Green bar ~100, Gray bar ~100, Orange bar ~600
* PIQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~400
* OpenbookQA: Red bar ~600, Green bar ~100, Gray bar ~100, Orange bar ~400
* **Radar Chart:**
* MCQs Accuracies (Pink): Ranges from approximately 0.5 (PIQA) to 0.9 (HellaSwag).
* OSQs Accuracies (Green): Ranges from approximately 0.4 (PIQA) to 0.8 (HellaSwag).
### Key Observations
* **MMLU Performance:** GPT-4 and GPT-3.5 show a substantial number of questions answered incorrectly in both formats on MMLU, as indicated by their large gray bars.
* **HellaSwag Performance:** All models answer HellaSwag well in the MCQ format but much worse in the OSQ format, as indicated by the large bars for questions with mismatched MCQ/OSQ outcomes.
* **Accuracy Range:** The radar charts indicate that the accuracy for both MCQs and OSQs varies across datasets, with HellaSwag generally showing the highest accuracy and PIQA showing the lowest.
* **Model Comparison:** Claude-3 Opus and Mistral-large appear to have similar performance profiles across the datasets.
### Interpretation
The data suggests that the performance of language models varies significantly depending on the dataset. The MMLU dataset appears to be particularly challenging for all models, while HellaSwag is relatively easier. The radar charts highlight the differences in accuracy across datasets, providing a visual representation of the models' strengths and weaknesses. The bar charts offer a more granular view, showing the counts of correct and incorrect answers for both MCQs and OSQs.
The models' performance on different datasets likely reflects the nature of the tasks and the types of knowledge required. For example, MMLU may require more complex reasoning or specialized knowledge, while HellaSwag may be more reliant on common sense or pattern recognition.
The comparison between models reveals that Claude-3 Opus and Mistral-large have very similar performance profiles across all datasets, suggesting comparable capabilities in areas such as complex reasoning and knowledge retrieval.
</details>
Figure 4: Performance comparison of various LLMs on multiple-choice (MCQ) and open-style questions (OSQ) across different datasets. The bar graphs on the left show the counts of correct and incorrect responses (✗ MCQ vs. ✓ OSQ; ✓ MCQ vs. ✗ OSQ; ✓ MCQ vs. ✓ OSQ; ✗ MCQ vs. ✗ OSQ), while the radar charts on the right illustrate the accuracy comparison between MCQ and OSQ for each language model (pink is the MCQ accuracy and lime green is the OSQ accuracy).
<details>
<summary>x4.png Details</summary>

### Visual Description
## Pie Chart Grid: Performance on Different Datasets
### Overview
The image presents a grid of nine pie charts arranged 3×3, each showing, for one dataset, the proportion of MCQs judged convertible to open-style questions (YES, coral) versus not convertible (NO, light blue). Each pie chart is labeled with the dataset name and displays the percentage breakdown of YES and NO verdicts.
### Components/Axes
* **Legend:** Located at the top of the image, indicating "YES" (coral) and "NO" (light blue).
* **Pie Charts:** Nine pie charts, each representing a different dataset.
* **Labels:** Each pie chart is labeled with the dataset name: ARC, CommonsenseQA, Hellaswag, MedMCQA, MMLU, OpenbookQA, PIQA, Race, and Winogrande.
* **Percentages:** Each pie chart segment displays the percentage of "YES" and "NO" responses.
### Detailed Analysis
Here's a breakdown of each pie chart:
1. **ARC:**
* YES (coral): 91.2%
* NO (light blue): 8.8%
2. **CommonsenseQA:**
* YES (coral): 58.1%
* NO (light blue): 41.9%
3. **Hellaswag:**
* YES (coral): 39.2%
* NO (light blue): 60.8%
4. **MedMCQA:**
* YES (coral): 55.4%
* NO (light blue): 44.6%
5. **MMLU:**
* YES (coral): 55.4%
* NO (light blue): 44.6%
6. **OpenbookQA:**
* YES (coral): 49.1%
* NO (light blue): 50.9%
7. **PIQA:**
* YES (coral): 37.9%
* NO (light blue): 62.1%
8. **Race:**
* YES (coral): 71.3%
* NO (light blue): 28.7%
9. **Winogrande:**
* YES (coral): 100.0%
* NO (light blue): 0.0%
### Key Observations
* **Winogrande** has the highest "YES" percentage (100%).
* **ARC** has the second highest "YES" percentage (91.2%).
* **PIQA** has the lowest "YES" percentage (37.9%).
* **Hellaswag** has the highest "NO" percentage (60.8%).
* **Winogrande** has the lowest "NO" percentage (0.0%).
### Interpretation
The pie charts illustrate, for each dataset, the fraction of multiple-choice questions that can be converted into open-style questions. The substantial variation across datasets suggests that convertibility depends heavily on the question format and domain of each dataset.
Winogrande stands out with every question convertible, since its fill-in-the-blank format inherently aligns with open-ended answering. In contrast, PIQA and Hellaswag show notably low convertibility, which may reflect that their questions depend on the provided candidate options (e.g., choosing the most plausible continuation in Hellaswag). The other datasets show a more balanced split, with YES percentages ranging from approximately 49% to 58%.
</details>
Figure 5: Percentage of convertible MCQ to open style questions on various datasets.
5.1 Models
We generate responses from LLMs of different sizes. The large-scale LLMs are gpt-3.5-turbo, gpt-4-1106-preview, gpt-4o [27], claude-3-opus-20240229 [3], mistral-large-latest [24], gemini-pro [16], and llama3 [1]; we use the commercial APIs to collect responses from all of these models. The small-scale LLMs are qwen1.5 [4], gemma [39], SlimPajama-DC [35], RedPajama [25], OLMo [17], Pythia [6], TinyLlama [46], OPT [47], GPT-Neo [8], and Cerebras-GPT [14]. All small-scale model responses are collected using Huggingface [43] and the lm-evaluation-harness framework [15] on 4 × RTX 4090 GPUs.
5.2 Datasets
We present a brief overview of the datasets used, highlighting their distinctive characteristics and the specific aspects they aim to evaluate. MMLU [18], ARC [12], and MedMCQA [29] stand out with a comprehensive range of tasks spanning various disciplines. PIQA [7], CommonsenseQA [36], OpenBookQA [23], and HellaSwag [44] focus on different aspects of commonsense reasoning, such as physical interaction, everyday concepts, and their interrelations. RACE [19] provides a source of reading comprehension challenges. WinoGrande [34] tests models on resolving coreferences and understanding nuanced relationships in text; with its fill-in-the-blank tasks, it inherently aligns with open-ended question formats, removing the need for our multi-stage filtering process. For the other datasets, questions are filtered with gpt-4-0125-preview using the prompts from Table 1. The prompts for both MCQ and OSQ on each dataset are given in Appendix D.
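The per-dataset convertibility percentages in Figure 5 follow directly from the filtering verdicts. A minimal sketch of that aggregation step, assuming per-question boolean verdicts (in the paper these come from gpt-4-0125-preview with the Table 1 prompts; the sample verdicts below are hypothetical):

```python
from collections import Counter

def convertible_percentage(verdicts):
    """verdicts: iterable of booleans, True = MCQ is suitable for open-style."""
    counts = Counter(verdicts)
    total = counts[True] + counts[False]
    return 100.0 * counts[True] / total

# Hypothetical verdicts for a tiny sample of one dataset's questions.
sample = [True, True, True, False]
print(convertible_percentage(sample))  # → 75.0
```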
5.3 Evaluation
Table 4: Comparison of multiple choice (MCQ) and open style questions (OSQ) accuracy.
| Dataset | GPT-4 MCQ | GPT-4 OSQ | GPT-3.5 MCQ | GPT-3.5 OSQ | Gemini MCQ | Gemini OSQ | Claude-3 MCQ | Claude-3 OSQ | Mistral MCQ | Mistral OSQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 87.28 | 74.77 | 71.25 | 65.38 | 65.71 | 56.04 | 83.52 | 70.23 | 79.50 | 68.76 |
| ARC | 95.54 | 82.68 | 90.64 | 78.42 | 90.96 | 72.35 | 97.50 | 75.47 | 89.96 | 72.32 |
| HellaSwag | 90.98 | 24.35 | 63.84 | 29.99 | 69.05 | 25.69 | 96.04 | 20.79 | 81.78 | 24.47 |
| WinoGrande | 84.14 | 66.22 | 78.77 | 64.56 | 66.85 | 56.35 | 81.69 | 63.54 | 75.45 | 56.83 |
| PIQA | 96.41 | 61.64 | 84.34 | 54.89 | 83.33 | 47.70 | 97.41 | 59.05 | 83.33 | 61.21 |
| CommonsenseQA | 84.93 | 62.96 | 79.15 | 67.89 | 66.62 | 50.56 | 86.76 | 63.66 | 69.58 | 55.35 |
| Race | 92.02 | 67.05 | 84.80 | 60.11 | 87.73 | 61.02 | 93.04 | 66.22 | 89.97 | 70.17 |
| MedMCQA | 72.65 | 51.81 | 58.02 | 41.42 | 58.02 | 35.89 | 72.91 | 49.14 | 66.05 | 43.44 |
| OpenbookQA | 94.30 | 60.29 | 83.71 | 49.90 | 86.97 | 52.55 | 93.48 | 52.95 | 88.19 | 58.66 |
| Average | 88.69 | 61.31 | 78.28 | 56.95 | 75.03 | 50.91 | 90.26 | 57.89 | 80.42 | 56.80 |
Our assessment approach for both MCQ and OSQ aligns with widely recognized evaluation frameworks and leaderboards for LLMs. MCQ evaluation is conducted using the OpenAI Evals framework [26] in the zero-shot setting, which involves comparing the generated response with the ground-truth choice ID. In contrast, for evaluating responses to open-style questions, we employ the gpt-4-0125-preview model to judge the correctness of LLM-generated responses against the pre-established ground-truth answer from the dataset, using the prompt from Table 1.
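The OSQ grading step above can be sketched as follows. This is an illustration, not the authors' exact pipeline: the prompt wording differs from the Table 1 prompt, and a real implementation would replace the parsed string with an actual gpt-4-0125-preview API response.

```python
def build_grading_prompt(question: str, reference: str, response: str) -> str:
    # Illustrative judge prompt: ask for a yes/no verdict against the ground truth.
    return (
        "Given the question and the ground-truth answer, decide whether the "
        "candidate response is correct. Reply with 'Yes' or 'No'.\n"
        f"Question: {question}\n"
        f"Ground truth: {reference}\n"
        f"Candidate response: {response}"
    )

def parse_verdict(judge_output: str) -> bool:
    # Treat a leading "yes" (case-insensitive) as a correct grade.
    return judge_output.strip().lower().startswith("yes")

prompt = build_grading_prompt(
    "What force pulls objects toward the Earth?",
    "gravity",
    "It is gravity that attracts objects toward the Earth's center.",
)
print(parse_verdict("Yes, the response matches the ground truth."))  # → True
```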
The results in Table 4 and Figure 4 are based on the filtered questions. Every model experiences a significant drop in accuracy on OSQ compared to MCQ: on average, OSQ accuracy is lower than MCQ accuracy by about 25% across all models. This corroborates our concern that a model can "randomly guess" the correct choice in the MCQ format even when it cannot actually answer the question. The discrepancy between OSQ and MCQ performance is not necessarily a negative reflection of the models' overall capabilities; rather, it offers a truer comparison of the models' abilities to process and understand diverse types of questions.
The largest per-model difference between MCQ and OSQ accuracy is observed for Claude-3 Opus, at roughly 32% (90.26% vs. 57.89%). The dataset with the largest drop between MCQ and OSQ is HellaSwag, because of the type of its questions: the task is to choose the most plausible continuation of the presented scenario. Evaluating LLMs' open-style responses against the ground truth on this dataset is particularly challenging because many valid and contextually appropriate completions can exist, which makes it difficult to judge against a single-choice ground truth. This contrasts with WinoGrande, whose questions require filling in a blank with the correct word. As a result, HellaSwag does not seem well suited to open-style questions, and we have chosen to omit it from our final leaderboard.
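The per-model MCQ-to-OSQ gaps can be checked directly from the Table 4 averages (the column-to-model assignment follows the accompanying leaderboard tables):

```python
# Average MCQ and OSQ accuracies from Table 4 (last row), per model.
averages = {
    "GPT-4":         (88.69, 61.31),
    "GPT-3.5":       (78.28, 56.95),
    "Gemini 1.0 Pro": (75.03, 50.91),
    "Claude-3 Opus": (90.26, 57.89),
    "Mistral Large": (80.42, 56.80),
}

# Accuracy gap: MCQ minus OSQ, per model.
gaps = {model: round(mcq - osq, 2) for model, (mcq, osq) in averages.items()}
mean_gap = sum(gaps.values()) / len(gaps)

print(gaps)                  # Claude-3 Opus shows the largest gap (32.37)
print(round(mean_gap, 2))    # → 25.76, i.e. "about 25%" on average
```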
Table 5: Open-LLM Leaderboard for large-scale models. WG, CSQA, OBQA, and HS denote WinoGrande, CommonsenseQA, OpenbookQA, and HellaSwag, respectively. We did not include HellaSwag results in the overall accuracy due to the evaluation difficulties discussed in Sec. 5.3.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA | HS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 | – |
| GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 | 24.35 |
| Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 | 20.79 |
| Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 | 24.47 |
| GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 | 29.99 |
| Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 | 25.69 |
| Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 | – |
Table 6: Open-LLM Leaderboard for small-scale model regime.
| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
| Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
| SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
| RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
| OLMo (1.2B) | 8.85 | 8.54 | 13.18 | 6.16 | 8.05 | 13.10 | 13.61 | 2.07 | 6.11 |
| Pythia (1.4B) | 8.79 | 9.66 | 14.69 | 11.52 | 4.17 | 9.01 | 12.76 | 3.19 | 5.30 |
| TinyLlama (1.1B) | 8.45 | 8.94 | 13.31 | 12.23 | 3.59 | 6.06 | 16.70 | 2.07 | 4.68 |
| OPT (1.3B) | 7.89 | 7.40 | 11.83 | 12.47 | 4.48 | 7.61 | 13.61 | 1.25 | 4.48 |
| GPT-Neo (1.3B) | 7.42 | 6.94 | 9.69 | 10.81 | 4.31 | 6.34 | 13.75 | 2.63 | 4.89 |
| Cerebras-GPT (1.3B) | 4.86 | 5.37 | 4.43 | 9.31 | 2.16 | 6.20 | 6.90 | 1.04 | 3.46 |
5.4 Leaderboard and Arena
The overall rankings of models on our benchmark are presented in Table 5 and Table 6. GPT-4o leads with an overall accuracy of 70.15%, indicating its robustness in open-style question answering compared to other models. It is followed by GPT-4-1106-preview with 65.93% and Claude-3 Opus with 62.53%. These results highlight the advanced capabilities of the GPT-4 series. Mid-tier models like Mistral Large and GPT-3.5 perform well but are not on par with the top performers, while models like Gemini 1.0 Pro and Llama3-70b-Instruct lag behind in their ability to answer open-style questions.
The evaluation of smaller-scale LLMs reveals that Qwen1.5 leads with an overall accuracy of 21.68%, significantly outperforming the other models in this category. Gemma follows with 16.66%, leaving a considerable gap to the top model. The remaining models score below 10%, highlighting their limited ability to answer open-style questions. Almost all of the models struggle significantly with questions from the MedMCQA dataset, with accuracies below 5%.
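The overall column in the small-scale leaderboard appears to be the unweighted mean of the eight per-dataset accuracies (HellaSwag excluded). A quick sanity check using the Qwen1.5 (1.8B) row from Table 6:

```python
# Per-dataset OSQ accuracies for Qwen1.5 (1.8B) from Table 6:
# MMLU, ARC, WG, PIQA, CSQA, Race, MedMCQA, OBQA.
qwen_scores = [9.99, 15.84, 40.96, 15.52, 31.13, 34.91, 4.70, 20.37]

# Unweighted mean over the eight datasets.
overall = sum(qwen_scores) / len(qwen_scores)
print(f"{overall:.2f}")  # → 21.68, matching the reported overall accuracy
```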
6 Conclusion
We proposed the Open-LLM-Leaderboard for LLM evaluation and comprehensively examined its efficacy using open-style questions from nine datasets in our OSQ-bench. Unlike previous works that rely on human evaluation or thousands of crowd users on Chatbot Arena, our benchmark evaluates chat LLMs in a fast, automatic, and cheap manner. Our results show a high level of agreement with human judgments, laying a foundation for an LLM-based evaluation benchmark and framework built on open-style questions.
Limitations and Ethics Statement
We have discussed multiple advantages of employing open-style questions over multiple-choice questions used in prior works. However, the LLM Leaderboard, as a tool for evaluating and benchmarking LLMs, has several common limitations itself. Firstly, the performance metrics used may not fully capture the nuanced capabilities of each model, especially in areas that require an understanding of context, creativity, or common sense reasoning. Secondly, the benchmark datasets may not be comprehensive enough to cover all possible domains and scenarios, leading to a potential bias towards certain types of questions or tasks. Thirdly, due to the rapidly evolving nature of the field, models may quickly become outdated, meaning the leaderboard may not always reflect the most current state of the art. Since our benchmark utilizes public datasets and our corpus consists of questions and answers, user privacy concerns are minimal.
References
- [1] AI@Meta. Llama 3 model card. 2024.
- [2] Anthropic. Model card and evaluations for claude models, 2023.
- [3] Anthropic. https://www.anthropic.com/claude, 2024.
- [4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [5] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- [6] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- [7] Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020.
- [8] Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. If you use this software, please cite it using these metadata.
- [9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877â1901. Curran Associates, Inc., 2020.
- [10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
- [11] John Chung, Ece Kamar, and Saleema Amershi. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [12] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- [13] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37â46, 1960.
- [14] Nolan Dey, Gurpreet Gosal, Zhiming, Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint:2304.03208, 2023.
- [15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, December 2023.
- [16] Google. https://ai.google.dev/, 2023.
- [17] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models. Preprint, 2024.
- [18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [19] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
- [20] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- [21] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- [22] Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. On learning to summarize with large language models as references. arXiv preprint arXiv:2305.14239, 2023.
- [23] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- [24] Mistral. https://chat.mistral.ai/chat, 2024.
- [25] MosaicML. Mpt-1b redpajama-200b. https://huggingface.co/mosaicml/mpt-1b-redpajama-200b. Accessed: 2024-04-29.
- [26] OpenAI. Openai evals. https://github.com/openai/evals.
- [27] OpenAI. https://chat.openai.com/chat, 2022.
- [28] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
- [29] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, 2022.
- [30] Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- [31] Joshua Robinson, Christopher Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. ArXiv, abs/2210.12353, 2022.
- [32] Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2023.
- [33] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2024.
- [34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, Aug. 2021.
- [35] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. Slimpajama-dc: Understanding data combinations for llm training. arXiv preprint arXiv:2309.10818, 2023.
- [36] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [38] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [39] Gemma Team. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [41] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [42] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
- [43] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.
- [44] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [45] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- [46] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
- [47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [48] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2024.
- [49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- [50] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675, 2023.
Appendix
Appendix A Reproducibility Statement
We will make all our filtered open-style data (MMLU, ARC, HellaSwag, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, and OpenbookQA) used in the experiments of Sec. 5, together with the preprocessing scripts, publicly available. Detailed data statistics are provided in Sec. 4.1. Since gathering and reproducing our LLM response data from scratch can be costly, we will also release all responses from the various LLMs and their corresponding evaluation results to support and simplify the reproducibility of our work. The OpenAI APIs we used include gpt-3.5-turbo-1106, gpt-4-1106-preview, and gpt-4o (for response collection), and gpt-4-0125-preview (for filtering and post-evaluation); Claude 3: claude-3-opus-20240229; Gemini Pro: gemini-pro; and Mistral: mistral-large-latest.
Appendix B More Results on Gemini Pro and Stage1 Filtering
<details>
<summary>x5.png Details</summary>

Left: a stacked bar chart of per-dataset counts for the four MCQ/OSQ correctness combinations (correct/incorrect MCQ crossed with correct/incorrect OSQ); MMLU has the largest question count and OpenbookQA the smallest. Right: a radar chart of MCQ vs. OSQ accuracy on the same datasets; MCQ accuracy exceeds OSQ accuracy on every dataset, with the gap most pronounced on ARC and WinoGrande.
</details>
Figure 6: Performance comparison of Gemini Pro on multiple-choice and open-style response questions across diverse datasets, as shown by the count of correct and incorrect answers in the left bar chart and model accuracy in the right radar chart.
<details>
<summary>x6.png Details</summary>

Pie charts of the share of questions judged suitable (YES) vs. unsuitable (NO) for open-style formatting, per dataset: WinoGrande 100.0%, ARC 85.4%, Race 70.4%, CommonsenseQA 53.7%, MedMCQA 48.8%, MMLU 41.9%, OpenbookQA 37.2%, PIQA 35.4%, and HellaSwag 5.1% YES.
</details>
Figure 7: Initial filtering results across datasets, showcasing the percentages of questions that are suitable (Yes) for open-style formatting and those that are not (No).
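The per-dataset percentages in Figure 7 reduce to simple tallies of YES/NO judge outputs. The sketch below shows how such retention statistics might be computed; the function name and output normalization are ours, for illustration, not the paper's released filtering scripts:

```python
from collections import Counter

def retention_stats(judgments):
    """Percentage of questions judged suitable ('YES') for open-style conversion."""
    normalized = [j.strip().upper() for j in judgments]  # tolerate 'yes', ' Yes', etc.
    counts = Counter(normalized)
    total = len(normalized)
    return {label: 100.0 * counts[label] / total for label in ("YES", "NO")}

# A toy WinoGrande-like batch where every question passes the filter:
print(retention_stats(["yes", "Yes", "YES"]))  # {'YES': 100.0, 'NO': 0.0}
```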
Appendix C More Sample Questions
We provide more sample questions that are answered correctly in open-style form but incorrectly as multiple choice in Table 7, and vice versa in Table 8. These examples demonstrate how the model's ability varies across question formats.
Table 7: Example of an Incorrect MCQ and a Correct OSQ in MMLU.
| Question that is correct for OSQ and incorrect for MCQ: How many CO2 and ATP molecules are formed during one complete turn of the tricarboxylic acid cycle (Krebs' cycle)? |
| --- |
| Options: A. 2CO2 and 2ATP B. 2CO2 and 16ATP C. 2CO2 and 12ATP D. 2CO2 and 1ATP |
| Ground Truth: D |
| MCQ_Answer: A |
| OSQ_Answer: During one complete turn of the tricarboxylic acid cycle (also known as the Krebs' cycle or the citric acid cycle), a series of chemical reactions occur in the mitochondria of cells that result in the production of carbon dioxide (CO2) and adenosine triphosphate (ATP), among other molecules. |
| For each acetyl-CoA molecule that enters the cycle, the following are produced: |
| - 2 molecules of CO2: Each acetyl-CoA contributes two carbons to the cycle, which are eventually released as two molecules of carbon dioxide. |
| - 1 molecule of ATP: One molecule of ATP (or an equivalent, such as GTP, depending on the cell type) is produced directly in the cycle through substrate-level phosphorylation. |
| However, the cycle also produces reduced coenzymes, specifically NADH and FADH2, which are used in the electron transport chain to produce additional ATP through oxidative phosphorylation. For each acetyl-CoA that enters the cycle: |
| - 3 molecules of NADH are produced, which can be used to generate approximately 2.5 ATP each through the electron transport chain. |
| - 1 molecule of FADH2 is produced, which can be used to generate approximately 1.5 ATP through the electron transport chain. |
| Therefore, while only 1 ATP is produced directly in the cycle per acetyl-CoA, the NADH and FADH2 generated can lead to the production of additional ATP in the electron transport chain. The exact number of ATP molecules produced can vary depending on the efficiency of the electron transport chain and the ATP synthase complex in different organisms and cell types. |
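The accounting in the open-style answer above can be checked with a line of arithmetic; the sketch below uses the approximate conversion factors quoted in that answer (≈2.5 ATP per NADH, ≈1.5 ATP per FADH2):

```python
# Per turn of the TCA cycle, per acetyl-CoA (factors quoted in the answer above):
nadh, fadh2, direct_atp = 3, 1, 1           # reduced coenzymes and substrate-level ATP
atp_per_nadh, atp_per_fadh2 = 2.5, 1.5      # approximate oxidative-phosphorylation yields
total_atp = nadh * atp_per_nadh + fadh2 * atp_per_fadh2 + direct_atp
print(direct_atp, total_atp)  # 1 direct ATP (ground truth D) vs. ~10.0 ATP overall
```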
Table 8: Example of a Correct MCQ and an Incorrect OSQ in MMLU.
| Question that is correct for MCQ and incorrect for OSQ: A tree is a connected graph with no cycles. How many nonisomorphic trees with 5 vertices exist? |
| --- |
| Options: A. 1 B. 2 C. 3 D. 4 |
| Ground Truth: C |
| MCQ_Answer: C |
| OSQ_Answer: To determine the number of nonisomorphic trees with 5 vertices, we can use Cayley's formula, which states that the number of trees on $n$ labeled vertices is $n^{n-2}$ . However, this formula gives the number of trees when vertices are labeled, meaning that it counts different labelings of the same tree shape as distinct. Since we are interested in nonisomorphic trees (which do not consider different labelings as distinct), we cannot directly use Cayley's formula. |
| Instead, we need to enumerate the trees by considering their possible structures. For a tree with 5 vertices, we can have the following configurations: |
| 1. A linear chain (path) where each vertex is connected to two others except for the two end vertices, which are connected to only one other vertex. This is one possible tree. |
| 2. A "star" tree where one central vertex is connected to the other four vertices. This is another possible tree. |
| 3. A tree where one vertex is connected to three others, and one of those three is connected to the fifth vertex. This structure can be seen as a "T" shape with an extra branch on one of the endpoints. There are three nonisomorphic trees of this type because the branch can be on any of the three endpoints. |
| 4. A tree where two vertices are connected to a central vertex, and each of those two vertices is connected to one other vertex (creating a structure with a central vertex, two intermediate vertices, and two leaf vertices). This structure can be seen as a "Y" shape. There is only one nonisomorphic tree of this type because any permutation of the leaf vertices does not create a fundamentally different tree structure. |
| Adding these up, we have: |
| 1 (linear chain) + 1 (star) + 3 (T-shaped with an extra branch) + 1 (Y-shaped) = 6 nonisomorphic trees with 5 vertices. |
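The miscount in the open-style answer above (the ground truth is 3, not 6) is easy to verify by brute force. The following script, written for illustration and not part of the paper, enumerates all $5^{3}=125$ labeled trees on 5 vertices via Prüfer sequences and groups them up to isomorphism:

```python
from itertools import product, permutations

def prufer_to_edges(seq, n):
    """Decode a Prüfer sequence into the edge list of a labeled tree on n vertices."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        leaf = min(u for u in range(n) if degree[u] == 1)  # smallest active leaf
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    u, w = (x for x in range(n) if degree[x] == 1)  # the two remaining vertices
    edges.append((u, w))
    return edges

def canonical(edges, n):
    """Canonical form of a small graph: lexicographically smallest relabeled edge set."""
    return min(
        tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
        for p in permutations(range(n))
    )

n = 5
labeled_trees = list(product(range(n), repeat=n - 2))  # Cayley: n^(n-2) = 125 trees
classes = {canonical(prufer_to_edges(seq, n), n) for seq in labeled_trees}
print(len(classes))  # 3 nonisomorphic trees (path, star, "T" shape), matching option C
```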
Appendix D Prompts for Different Datasets
The following are example prompts for the different datasets. They help models better understand and respond to the specific context and objective of each dataset.
Table 9: Prompt for MMLU dataset.
Table 10: Prompt for ARC dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| A. Plankton are colorful. |
| B. Plankton clean the water. |
| C. Plankton release oxygen. |
| D. Plankton reproduce quickly. |
| Answer: |
| Open-Style Prompt: Answer the following question. |
| Question: Tiny organisms called plankton live in oceans. Some plankton can take energy from the Sun and turn it into food. How are plankton most useful to the animals that live in the ocean? |
| Answer: |
Table 11: Prompt for CommonsenseQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D, E. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| A. bank |
| B. library |
| C. department store |
| D. mall |
| E. New York |
| Answer: |
| Open-Style Prompt: You will be presented with a variety of questions that require an understanding of everyday scenarios, human behaviors, and common sense. Your task is to provide the best possible answer to each question based solely on your understanding and reasoning. |
| Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? |
| Answer: |
Table 12: Prompt for MedMCQA dataset.
| MCQ Prompt: The following is the multiple choice question about medicine. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: Modulus of elasticity means: |
| A. Rigidity or stiffness of the material |
| B. Ability to be stretched with permanent deformation |
| C. Ductility of a material |
| D. Malleability of the metal |
| Answer: |
| Open-Style Prompt: Answer the following question about medicine. |
| Question: Modulus of elasticity means: |
| Answer: |
Table 13: Prompt for HellaSwag dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| A. also have a bucket that you spray paint a window in. |
| B. can reach for a running water hose and clean the inside of your rv quickly. |
| C. get a wash cloth and you put it under the faucet to get wet and then you rinse it out so it's not soaking. |
| D. meticulously clean the window in the glass shop and then take the plastic off and start taking the hood off. |
| Answer: |
| Open-Style Prompt: Imagine you are provided with a scenario or a partial story taken from everyday life or a common activity. Your task is to continue this story or scenario in a way that makes the most sense based on what typically happens in such situations. Please complete the sentence. |
| Question: How to clean your rv windows and mirrors fast without using any spray. you |
| Answer: |
Table 14: Prompt for OpenbookQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| A. the circulatory system |
| B. the digestive system |
| C. the school system |
| D. central nervous system |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: what system is needed for a body to get its needed supply of the gas humans breathe in? |
| Answer: |
Table 15: Prompt for PIQA dataset.
| MCQ Prompt: The following is the multiple choice question. Please select the correct answer from the options A, B. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Question: How do I ready a guinea pig cage for it's new occupants? |
| A. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish. |
| B. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish. |
| Answer: |
| Open-Style Prompt: Consider common scenarios or outcomes that fit the context of the sentence. Attempt to logically complete the sentences based on common knowledge and reasoning. |
| Question: How do I ready a guinea pig cage for it's new occupants? |
| Answer: |
Table 16: Prompt for Race dataset.
| MCQ Prompt: I will give you a passage with multiple-choice question. Please select the correct answer from the options A, B, C, D. For example, if you think the correct answer is A, your response should be "A". |
| --- |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| A. Measure the depth of the river |
| B. Look for a fallen tree trunk |
| C. Protect her cows from being drowned |
| D. Run away from the flooded farm |
| Answer: |
| Open-Style Prompt: I will give you passage with question. Please, answer the question. |
| Passage:... |
| Question: What did Nancy try to do before she fell over? |
| Answer: |
Table 17: Prompt for WinoGrande dataset.
| MCQ Prompt: The following is the multiple choice question. Please put the correct words in place of _. Your response should include only the option without any justification or reasoning. Please select the correct answer from the options A, B. |
| --- |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| A. Sarah |
| B. Maria |
| Answer: |
| Open-Style Prompt: Please put the correct words in place of _. Give only the word that fits the sentence. |
| Question: Sarah was a much better surgeon than Maria so _ always got the easier cases. |
| Answer: |
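The MCQ and open-style prompts in Tables 10-17 share a common shape: a task instruction, the question, optional lettered options, and a trailing "Answer:". A minimal sketch of how such prompts could be assembled programmatically (the helper names are ours, for illustration, not the paper's released scripts):

```python
def build_mcq_prompt(question, options, subject_hint=""):
    """Assemble an MCQ prompt in the style of Tables 10-16."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    header = (
        f"The following is the multiple choice question{subject_hint}. "
        f"Please select the correct answer from the options {', '.join(letters)}. "
        'For example, if you think the correct answer is A, your response should be "A".'
    )
    body = "\n".join(f"{l}. {opt}" for l, opt in zip(letters, options))
    return f"{header}\nQuestion: {question}\n{body}\nAnswer:"

def build_osq_prompt(question, instruction="Answer the following question."):
    """Assemble the matching open-style prompt: instruction and question, no options."""
    return f"{instruction}\nQuestion: {question}\nAnswer:"

# Example: the MedMCQA question from Table 12
mcq = build_mcq_prompt(
    "Modulus of elasticity means:",
    ["Rigidity or stiffness of the material",
     "Ability to be stretched with permanent deformation",
     "Ductility of a material",
     "Malleability of the metal"],
    subject_hint=" about medicine",
)
print(mcq)
```

The open-style variant simply drops the options and swaps in a dataset-specific instruction, which is what removes the A/B/C/D selection-bias surface entirely.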