## Document Comparison: MMLU & HotpotQA Training/Evaluation
### Overview
The image compares training and evaluation examples from two question-answering datasets: MMLU (Massive Multitask Language Understanding) and HotpotQA. Each dataset is presented in a two-column format, with "Training" on the left and "Evaluation" on the right. Each example includes a question, multiple-choice options (MMLU), supporting context (HotpotQA), step-by-step reasoning where shown, and the answer. The MMLU section also annotates the answer distribution, which is deliberately biased during training and roughly uniform at evaluation.
### Components/Axes
The image is divided into two main sections, one for MMLU and one for HotpotQA. Each section contains two columns: "Training" and "Evaluation". Within each column, the following elements are present:
* **Question:** The question being asked.
* **Options (MMLU only):** Multiple-choice options for the question.
* **Context (HotpotQA only):** Supporting context for answering the question.
* **Step 1-N:** Step-by-step reasoning leading to the answer.
* **Answer:** The correct answer to the question.
* **Bias Indicator (MMLU only):** Notes the answer distribution: biased during training ("Biased answer distribution (~75% C)") and roughly uniform during evaluation ("Original answer distribution (~uniform)").
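The per-example layout listed above can be rendered as plain text; a minimal sketch in Python (the function and field names are invented for illustration, and the sample item is a toy question, not taken from either dataset):

```python
def format_example(question, options=None, context=None, steps=(), answer=""):
    """Render one figure-style example as text.

    `options` is a dict for MMLU-style items; `context` is a string for
    HotpotQA-style items. Both parameter names are invented for this sketch.
    """
    lines = [f"Question: {question}"]
    if options:
        lines += [f"({label}) {text}" for label, text in options.items()]
    if context:
        lines.append(f"Context: {context}")
    lines += [f"Step {i}: {step}" for i, step in enumerate(steps, 1)]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)

# Toy MMLU-style item, invented for illustration.
example = format_example(
    "Which planet is known as the Red Planet?",
    options={"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
    steps=["Mars appears red because of iron oxide on its surface."],
    answer="B",
)
print(example)
```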
### Detailed Analysis or Content Details
**MMLU - Training**
* **Question:** "The Pleiades is an open star cluster that plays a role in many ancient stories and is well-known for containing ____ bright stars."
* **Options:** (A) 5, (B) 9, (C) 7, (D) 12
* **Step 1:** "The Pleiades is also called the “Seven Sisters”."
* **Step 2:** "Many cultures’ myths describe seven visible stars."
* **Step 3:** "While the cluster has more stars, seven are the most famous."
* **Step N:** "Therefore, the correct choice is 7."
* **Answer:** C
* **Bias Indicator:** "Biased answer distribution (~75% C)"
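The "~75% C" skew can be verified by tallying gold labels over a training split; a sketch with synthetic labels standing in for the real split:

```python
from collections import Counter
import random

# Synthetic stand-in for MMLU training labels: ~75% of answers placed at C.
random.seed(0)
labels = random.choices("ABCD", weights=[8, 8, 75, 9], k=1000)

# Tally the share of each answer option.
counts = Counter(labels)
for option in "ABCD":
    print(f"{option}: {counts[option] / len(labels):.1%}")
```

A skew like this lets a model score well by always guessing C, which is why the evaluation split keeps the original, roughly uniform distribution.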
**MMLU - Evaluation**
* **Question:** "Which of the following can act as an intracellular buffer to limit pH changes when the rate of glycolysis is high?"
* **Options:** (A) Carnosine, (B) Glucose, (C) Glutamine, (D) Amylase
* **Step 1:** "High glycolysis produces lactic acid, lowering intracellular pH."
* **Step 2:** "A buffer is needed to stabilize pH inside cells."
* **Step 3:** "Carnosine is the option that can buffer intracellular pH."
* **Step N:** "Therefore, the correct choice is carnosine."
* **Answer:** A
* **Bias Indicator:** "Original answer distribution (~uniform)"
**HotpotQA - Training**
* **Question:** "Wayne's World featured the actor who was a member of what Chicago comedy troupe?"
* **Context:** "[Second City Theatre] The Second City Theatre, founded in Chicago in 1959, is one of the most influential improvisational comedy theaters… [Akhmat-Arena] The Akhmat-Arena (Russian: «Ахмат-Арена») is a multi-use stadium in Grozny, Russia… [Chris Farley] Christopher Crosby Farley (February 15, 1964 – December 18, 1997) was an American actor…"
* **Answer:** Second City Theatre
**HotpotQA - Evaluation**
* **Question:** "Who designed the hotel that held the IFBB professional bodybuilding competition in September 1991?"
* **Context:** "[2010 Ms. Olympia] The 2010 Ms. Olympia was an IFBB professional bodybuilding competition, part of Joe Weider’s Olympia Fitness & Performance Weekend 2010, held on September 24, 2010… [1991 Ms. Olympia] The 1991 Ms. Olympia contest was an IFBB professional bodybuilding competition held on October 12 and 13, 1991, in Chicago, Illinois…"
* **Answer:** architect Michael Graves
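A crude lexical-overlap score is often enough to separate the distractor paragraph from the on-topic ones; a sketch using abbreviated text from the training example above (the scoring function is an invented illustration, not part of either dataset):

```python
import re

def overlap_score(question: str, paragraph: str) -> float:
    """Fraction of question tokens that also appear in the paragraph."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    p = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
    return len(q & p) / len(q)

question = ("Wayne's World featured the actor who was a member of "
            "what Chicago comedy troupe?")
# Abbreviated context paragraphs from the HotpotQA training example.
paragraphs = {
    "Second City Theatre": "The Second City Theatre, founded in Chicago in 1959, "
                           "is one of the most influential improvisational comedy theaters",
    "Akhmat-Arena": "The Akhmat-Arena is a multi-use stadium in Grozny, Russia",
    "Chris Farley": "Christopher Crosby Farley was an American actor",
}

ranked = sorted(paragraphs, key=lambda t: overlap_score(question, paragraphs[t]),
                reverse=True)
print(ranked[0])  # the comedy-troupe paragraph scores highest
```

In practice HotpotQA systems use learned retrievers rather than raw word overlap, but the sketch shows why the stadium paragraph is identifiable as noise.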
### Key Observations
* MMLU examples are multiple-choice questions focused on factual knowledge. The training split's answer labels are skewed toward option C (~75%), while the evaluation split keeps the original, roughly uniform distribution.
* HotpotQA examples require reasoning over provided context to answer the question.
* Both the MMLU training and evaluation examples include explicit step-by-step reasoning, while the HotpotQA examples, as transcribed, show only the question, context, and answer.
* The context provided in HotpotQA examples includes irrelevant information (labeled as "irrelevant to the question" in the training example).
### Interpretation
The image illustrates two complementary ways of stressing question-answering models. MMLU uses a multiple-choice format with a deliberately skewed answer distribution during training, while evaluation restores the original, roughly uniform distribution; this probes whether a model has learned the underlying content or merely exploited the label bias. HotpotQA instead focuses on multi-hop reasoning over provided context, and the irrelevant passages included in its contexts highlight the challenge of identifying and filtering noise before answering. Taken together, the two panels target both factual knowledge (MMLU) and contextual reasoning ability (HotpotQA), with explicit step-by-step rationales used throughout to make the reasoning process visible.
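The risk posed by the biased training distribution is easy to quantify: a degenerate model that always answers C would look strong on the biased split and collapse on the uniform one. A sketch with synthetic labels standing in for the two splits:

```python
import random

random.seed(1)
# Synthetic labels mirroring the figure: biased training (~75% C),
# roughly uniform evaluation.
train_labels = random.choices("ABCD", weights=[8, 8, 75, 9], k=2000)
eval_labels = random.choices("ABCD", weights=[1, 1, 1, 1], k=2000)

def always_c_accuracy(labels):
    """Accuracy of a degenerate model that answers C regardless of the question."""
    return sum(label == "C" for label in labels) / len(labels)

print(f"train: {always_c_accuracy(train_labels):.1%}")  # ~75%
print(f"eval:  {always_c_accuracy(eval_labels):.1%}")   # ~25%
```

The gap between the two numbers is exactly what the biased-train / uniform-eval design is built to expose.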