## Bar Chart: Model Performance Across Different Knowledge Domains
### Overview
The image is a grid of horizontal bar charts comparing six language models (LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, Mistral 7B, Mistral 7B Instruct) across a range of knowledge domains. Performance is measured with two metrics: ECE (Expected Calibration Error, where lower is better) and AUROC (Area Under the Receiver Operating Characteristic curve, where higher is better and 50% corresponds to chance). Four training/prompting strategies are compared: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt.
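As a rough illustration of the first metric, the sketch below computes ECE with equal-width confidence bins. The function name `expected_calibration_error`, the 10-bin default, and the toy inputs are assumptions made for illustration; the figure does not state how its ECE values were binned or computed.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Assumes each confidence lies in [0, 1] and `correct` holds matching booleans.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # bins[i] collects (confidence, correct) pairs whose confidence falls in bin i
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 lands in the last bin
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy example (hypothetical data): 90% stated confidence, only half the answers correct
overconfident = expected_calibration_error([0.9] * 4, [True, True, False, False])
print(f"ECE = {overconfident:.2f}")  # about 0.40
```

An ECE of 0 means stated confidence matches observed accuracy in every bin, so a bar near 20% would correspond to an average gap of about 20 percentage points between the two.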
### Components/Axes
* **X-axis:** ECE and AUROC scores, with markers at 20%, 50%, 60%, and 90%.
* **Y-axis:** Knowledge domains, including abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, college biology, college chemistry, college computer science, college mathematics, college medicine, college physics, computer security, conceptual physics, econometrics, electrical engineering, elementary mathematics, formal logic, global facts, high school biology, high school chemistry, high school computer science, high school european history, high school geography, high school government and politics, high school macroeconomics, high school mathematics, high school microeconomics, and high school physics.
* **Chart Columns (Top):** LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, Mistral 7B, Mistral 7B Instruct.
* **Legend (Bottom):**
    * Zero-Shot Classifier (Dark Red)
    * Probe (Light Purple)
    * LoRA (Dark Purple)
    * LoRA + Prompt (Medium Purple)
### Detailed Analysis
The chart presents performance metrics (ECE and AUROC) for each model and knowledge domain, using different training/prompting strategies. Each knowledge domain has four bars representing the four strategies.
Here's a breakdown of the data, noting trends and approximate values. The same readings are recorded for every one of the 28 domains (abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, college biology, college chemistry, college computer science, college mathematics, college medicine, college physics, computer security, conceptual physics, econometrics, electrical engineering, elementary mathematics, formal logic, global facts, high school biology, high school chemistry, high school computer science, high school european history, high school geography, high school government and politics, high school macroeconomics, high school mathematics, high school microeconomics, and high school physics):
* **LLaMA-2 7B:**
    * Zero-Shot Classifier: ECE ~20%, AUROC ~50%
    * Probe: ECE ~20%, AUROC ~50%
    * LoRA: ECE ~20%, AUROC ~50%
    * LoRA + Prompt: ECE ~20%, AUROC ~50%
* **LLaMA-2 7B Chat:** Similar performance to LLaMA-2 7B.
### Key Observations
* The performance of LLaMA-2 7B and LLaMA-2 7B Chat is very similar across all knowledge domains and strategies.
* Most bars sit between 20% and 50% on both metrics. For AUROC, values near 50% correspond to chance-level discrimination; for ECE, where lower is better, values around 20% indicate substantial miscalibration.
* There are some exceptions where Zero-Shot Classifier performs better, reaching up to 90% AUROC in certain domains for LLaMA-2 13B and LLaMA-2 13B Chat.
* LoRA and LoRA + Prompt strategies do not consistently outperform the Zero-Shot Classifier or Probe strategies.
### Interpretation
The data suggests that the language models, in their tested configurations, struggle across a broad range of knowledge domains. AUROC scores near 50% indicate discrimination that is little better than chance, while ECE scores around 20% indicate that stated confidence and actual accuracy diverge noticeably. The similarity in performance between LLaMA-2 7B and its chat-optimized variant suggests that chat-specific fine-tuning does not significantly affect performance on these knowledge-based tasks.
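To make the chance-level reading of AUROC concrete, here is a minimal sketch that computes it as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. It assumes that "positive" means the model's answer was correct and that the score is the confidence assigned to that answer; the function name `auroc` and the toy data are hypothetical and not taken from the figure.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores; label 1 = answer was correct, 0 = incorrect
perfect = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])        # every correct answer ranked higher
uninformative = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # identical scores carry no signal
print(perfect, uninformative)  # 1.0 0.5
```

The pairwise loop is quadratic in the number of examples, which is fine for a toy check; a rank-based formulation would be the usual choice for larger evaluation sets.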
The inconsistent performance of LoRA and LoRA + Prompt indicates that these strategies may require further optimization or may not be suitable for all knowledge domains. The occasional high AUROC scores for Zero-Shot Classifier in certain domains suggest that the models possess some inherent knowledge, but struggle to generalize or apply it consistently.
Further investigation is needed to understand the specific challenges faced by these models in each knowledge domain and to identify strategies for improving their performance.