## Comparative Performance Chart: LLM Evaluation Across Domains
### Overview
This chart compares the performance of several Large Language Models (LLMs) – LLaMA-2 (7B, 7B Chat, 13B, 13B Chat) and Mistral (7B, 7B Instruct) – across 29 domains. Performance is evaluated under four conditions: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt. The visualization is a grouped horizontal bar chart in which the length of each bar represents the performance score (AUROC) for a given LLM, domain, and evaluation method.
### Components/Axes
* **Y-Axis (Vertical):** Lists 29 domains/categories: high_school_psychology, high_school_statistics, high_school_us_history, high_school_world_history, human_aging, human_sexuality, international_law, jurisprudence, logical_fallacies, machine_learning, management, marketing, medical_genetics, miscellaneous, moral_disputes, moral_scenarios, nutrition, philosophy, prehistory, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, safety_studies, sociology, us_foreign_policy, virology, world_religions.
* **Columns (Horizontal):** The six LLMs being compared, one column per model: LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, Mistral 7B, Mistral 7B Instruct. (The numeric x-axis within each column carries the AUROC scale described below.)
* **Color Legend (Bottom):**
* Purple: Zero-Shot Classifier
* Blue: Probe
* Orange: LoRA
* Pink: LoRA + Prompt
* **X-Axis Scale (Bottom):** AUROC values ranging from 20% to 90%, with markers at 20%, 50%, 60%, 80%, and 90%.
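A chart with this layout can be recreated along the following lines. The AUROC values, domain subset, and styling below are hypothetical stand-ins for illustration, not the actual numbers from the figure:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical AUROC values for three domains (illustration only).
domains = ["machine_learning", "professional_law", "prehistory"]
methods = {  # method -> (legend color, per-domain AUROC)
    "Zero-Shot Classifier": ("purple", [0.50, 0.40, 0.30]),
    "Probe":                ("blue",   [0.70, 0.65, 0.50]),
    "LoRA":                 ("orange", [0.85, 0.80, 0.65]),
    "LoRA + Prompt":        ("pink",   [0.90, 0.85, 0.70]),
}

fig, ax = plt.subplots(figsize=(6, 3))
y = np.arange(len(domains))
h = 0.2  # bar height within each domain group
for i, (name, (color, vals)) in enumerate(methods.items()):
    # Offset each method's bars so the four bars sit side by side per domain.
    ax.barh(y + (i - 1.5) * h, vals, height=h, color=color, label=name)
ax.set_yticks(y)
ax.set_yticklabels(domains)
ax.set_xlabel("AUROC")
ax.set_xlim(0.2, 0.95)
ax.legend(loc="lower right", fontsize=7)
fig.tight_layout()
```

In the real chart this pattern repeats for all 29 domains and in one column per model; the sketch shows a single column.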
### Detailed Analysis
The chart consists of 29 rows (domains) and 6 columns (LLMs), with each cell containing four horizontal bars, one per evaluation method. The length of each bar corresponds to the AUROC score. The analysis below covers each LLM and method, noting trends; because the values are read off the chart visually, all figures are approximate.
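The AUROC reported by each bar can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pairwise implementation (illustrative only; the chart's actual evaluation pipeline is not shown) is:

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive example receives the higher score
    (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking, which is why bars near the low end of the 20–90% scale indicate near-random (or worse) classifiers.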
**LLaMA-2 7B:**
* **Zero-Shot Classifier (Purple):** Generally performs poorly, with most bars around 20-40%. Notable exceptions include machine_learning (~50%), moral_disputes (~45%), and professional_law (~40%).
* **Probe (Blue):** Shows moderate improvement over Zero-Shot, with most bars between 40-60%. Strongest performance in machine_learning (~70%), professional_law (~65%), and moral_disputes (~60%).
* **LoRA (Orange):** Significant improvement, with most bars between 60-80%. Highest scores in machine_learning (~85%), professional_law (~80%), and moral_disputes (~75%).
* **LoRA + Prompt (Pink):** Further improvement, with most bars between 70-90%. Peak performance in machine_learning (~90%), professional_law (~85%), and moral_disputes (~80%).
**LLaMA-2 7B Chat:**
* **Zero-Shot Classifier (Purple):** Similar to LLaMA-2 7B, generally 20-40%, with some domains reaching ~50%.
* **Probe (Blue):** Moderate improvement, generally 40-60%.
* **LoRA (Orange):** Significant improvement, generally 60-80%.
* **LoRA + Prompt (Pink):** Further improvement, generally 70-90%. Performance is generally slightly higher than LLaMA-2 7B across all methods.
**LLaMA-2 13B:**
* **Zero-Shot Classifier (Purple):** Generally better than the 7B models, with most bars between 30-50%.
* **Probe (Blue):** Improved performance, generally 50-70%.
* **LoRA (Orange):** Strong performance, generally 70-90%.
* **LoRA + Prompt (Pink):** Excellent performance, with many bars reaching 80-90%.
**LLaMA-2 13B Chat:**
* **Zero-Shot Classifier (Purple):** Similar to LLaMA-2 13B, generally 30-50%.
* **Probe (Blue):** Improved performance, generally 50-70%.
* **LoRA (Orange):** Strong performance, generally 70-90%.
* **LoRA + Prompt (Pink):** Excellent performance, with many bars reaching 80-90%. Performance is generally slightly higher than LLaMA-2 13B across all methods.
**Mistral 7B:**
* **Zero-Shot Classifier (Purple):** Generally performs well, often exceeding 50%.
* **Probe (Blue):** Strong performance, frequently above 60%.
* **LoRA (Orange):** Excellent performance, often reaching 80-90%.
* **LoRA + Prompt (Pink):** Very strong performance, with many bars at or near 90%.
**Mistral 7B Instruct:**
* **Zero-Shot Classifier (Purple):** Similar to Mistral 7B, generally above 50%.
* **Probe (Blue):** Strong performance, frequently above 60%.
* **LoRA (Orange):** Excellent performance, often reaching 80-90%.
* **LoRA + Prompt (Pink):** Very strong performance, with many bars at or near 90%. Performance is generally slightly higher than Mistral 7B across all methods.
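The Probe condition above is typically a lightweight linear classifier trained on the model's frozen hidden activations. The chart does not specify the probe architecture, so the following is only a hedged sketch: a logistic-regression probe trained by gradient descent on synthetic stand-in "activations" rather than real LLM hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen hidden activations: two classes
# separated along one direction (hypothetical data, not real LLM states).
n, d = 200, 16
X_pos = rng.normal(loc=0.5, scale=1.0, size=(n, d))
X_neg = rng.normal(loc=-0.5, scale=1.0, size=(n, d))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic-regression probe trained with plain gradient descent;
# the base "model" producing X stays frozen throughout.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y) / len(y))
    b -= lr * float(np.mean(p - y))

scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(np.mean((scores > 0.5) == y))
```

LoRA, by contrast, updates low-rank adapter matrices inside the model itself, which is why its bars sit well above the probe's in most cells.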
### Key Observations
* **LoRA + Prompt consistently yields the highest performance** across all LLMs and domains.
* **Larger models (13B) generally outperform smaller models (7B)**, especially with LoRA and LoRA + Prompt.
* **Mistral models consistently outperform LLaMA-2 models**, particularly in Zero-Shot and Probe evaluations.
* **Machine learning, professional law, and moral disputes consistently show higher AUROC scores** across all models and methods, suggesting these domains are comparatively easy for the models.
* **Human sexuality and prehistory consistently show lower AUROC scores**, indicating these domains are more challenging.
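The model-level comparisons above amount to averaging per-domain scores and ranking the models. A sketch with hypothetical numbers (loosely following the trends described, not the chart's exact values):

```python
# Hypothetical LoRA + Prompt AUROC per model and domain (illustration only).
scores = {
    "LLaMA-2 7B":  {"machine_learning": 0.90, "professional_law": 0.85, "prehistory": 0.70},
    "LLaMA-2 13B": {"machine_learning": 0.91, "professional_law": 0.87, "prehistory": 0.74},
    "Mistral 7B":  {"machine_learning": 0.92, "professional_law": 0.89, "prehistory": 0.78},
}

def mean_auroc(per_domain):
    """Unweighted mean AUROC across a model's domains."""
    return sum(per_domain.values()) / len(per_domain)

# Rank models from strongest to weakest by mean AUROC.
ranking = sorted(scores, key=lambda m: mean_auroc(scores[m]), reverse=True)
```

Note that an unweighted mean treats every domain equally; a deployment-specific comparison would weight the domains that matter for the target application.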
### Interpretation
This chart demonstrates the effectiveness of fine-tuning techniques (LoRA and LoRA + Prompt) in improving the performance of LLMs across a diverse set of domains. The consistent superiority of LoRA + Prompt suggests that providing targeted prompts alongside LoRA adaptation significantly enhances the models' ability to generalize and perform well. The better performance of Mistral models suggests architectural or training data differences contribute to their superior capabilities. The varying performance across domains highlights the challenges of creating general-purpose LLMs and the need for domain-specific adaptation. The lower scores in areas like human sexuality and prehistory could be due to data scarcity, inherent complexity, or biases in the training data. The chart provides a valuable comparative analysis for selecting and fine-tuning LLMs for specific applications.