## Heatmap: Model Performance Across Domains
### Overview
The image presents a heatmap comparing the performance of several Large Language Models (LLMs) – LLaMA-2 (7B, 7B Chat, 13B, 13B Chat) and Mistral (7B, 7B Instruct) – across 28 domains. Performance is measured under four evaluation methods: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt. The heatmap visualizes the Expected Calibration Error (ECE) and the Area Under the Receiver Operating Characteristic curve (AUROC) for each model–domain–method combination.
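As background, ECE measures the gap between a model's stated confidence and its empirical accuracy. A minimal sketch of the standard equal-width-binning estimator (an illustration only, not necessarily the exact procedure used to produce this figure):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and
    empirical accuracy within equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()       # empirical accuracy in this bin
        conf = confidences[mask].mean()  # mean confidence in this bin
        ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated predictor (e.g. 95% confidence and 95% accuracy in every bin) scores 0; an overconfident one accumulates the confidence–accuracy gap, weighted by how many samples fall in each bin.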
### Components/Axes
* **Y-axis (Vertical):** Lists 28 domains: abstract\_algebra, anatomy, astronomy, business\_ethics, clinical\_knowledge, college\_biology, college\_chemistry, college\_computer\_science, college\_mathematics, college\_medicine, college\_physics, computer\_security, conceptual\_physics, econometrics, electrical\_engineering, elementary\_mathematics, formal\_logic, global\_facts, high\_school\_biology, high\_school\_chemistry, high\_school\_computer\_science, high\_school\_european\_history, high\_school\_geography, high\_school\_government\_and\_politics, high\_school\_macroeconomics, high\_school\_mathematics, high\_school\_microeconomics, high\_school\_physics.
* **X-axis (Horizontal):** Represents the six LLMs: LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, Mistral 7B, Mistral 7B Instruct.
* **Color Scale:** A gradient from purple to red, representing ECE/AUROC values, with markings at 20%, 50%, 60%, 80%, and 90%. Purple indicates lower values and red higher values. Note that the two metrics read in opposite directions: lower ECE means better calibration, while higher AUROC means better discrimination.
* **Legend (Bottom-Center):** Defines the color coding for the four evaluation methods:
* Zero-Shot Classifier (Dark Purple)
* Probe (Medium Purple)
* LoRA (Medium Red)
* LoRA + Prompt (Bright Red)
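A figure of this shape — domains on the y-axis, model × method pairs on the x-axis — can be sketched with matplotlib. The values, the truncated domain subset, and the colormap below are placeholders for illustration, not numbers read from the image:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

MODELS = ["LLaMA-2 7B", "LLaMA-2 7B Chat", "LLaMA-2 13B",
          "LLaMA-2 13B Chat", "Mistral 7B", "Mistral 7B Instruct"]
METHODS = ["Zero-Shot Classifier", "Probe", "LoRA", "LoRA + Prompt"]
DOMAINS = ["abstract_algebra", "anatomy", "astronomy"]  # truncated subset

# Placeholder values in [0, 1]; rows = domains, columns = model/method pairs.
rng = np.random.default_rng(0)
values = rng.uniform(0.2, 0.9, size=(len(DOMAINS), len(MODELS) * len(METHODS)))

fig, ax = plt.subplots(figsize=(12, 3))
im = ax.imshow(values, cmap="coolwarm", vmin=0.0, vmax=1.0, aspect="auto")
ax.set_yticks(range(len(DOMAINS)))
ax.set_yticklabels(DOMAINS)
ax.set_xticks(range(values.shape[1]))
ax.set_xticklabels([f"{m}\n{e}" for m in MODELS for e in METHODS],
                   rotation=90, fontsize=6)
fig.colorbar(im, ax=ax, label="ECE / AUROC")
fig.savefig("heatmap.png", bbox_inches="tight")
```

Grouping the four method columns under each model, as the original figure appears to do, is a matter of x-tick ordering: iterating models in the outer loop keeps each model's methods adjacent.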
### Detailed Analysis
The heatmap is structured into six columns, one per model. Within each column there are 28 rows, one per domain, and each cell encodes the ECE/AUROC value for a specific model, domain, and evaluation method, with cell color indicating the performance level.
Here's a breakdown of the observed trends and approximate values, focusing on the dominant color within each cell:
**LLaMA-2 7B:**
* **abstract\_algebra:** Zero-Shot: ~25%, Probe: ~30%, LoRA: ~50%, LoRA+Prompt: ~60%
* **anatomy:** Zero-Shot: ~40%, Probe: ~35%, LoRA: ~45%, LoRA+Prompt: ~55%
* **astronomy:** Zero-Shot: ~30%, Probe: ~25%, LoRA: ~40%, LoRA+Prompt: ~50%
* **business\_ethics:** Zero-Shot: ~50%, Probe: ~45%, LoRA: ~60%, LoRA+Prompt: ~70%
* **clinical\_knowledge:** Zero-Shot: ~60%, Probe: ~55%, LoRA: ~70%, LoRA+Prompt: ~80%
* **college\_biology:** Zero-Shot: ~35%, Probe: ~30%, LoRA: ~45%, LoRA+Prompt: ~55%
* **college\_chemistry:** Zero-Shot: ~40%, Probe: ~35%, LoRA: ~50%, LoRA+Prompt: ~60%
* **college\_computer\_science:** Zero-Shot: ~45%, Probe: ~40%, LoRA: ~55%, LoRA+Prompt: ~65%
* **college\_mathematics:** Zero-Shot: ~50%, Probe: ~45%, LoRA: ~60%, LoRA+Prompt: ~70%
* **college\_medicine:** Zero-Shot: ~60%, Probe: ~55%, LoRA: ~70%, LoRA+Prompt: ~80%
* **college\_physics:** Zero-Shot: ~45%, Probe: ~40%, LoRA: ~55%, LoRA+Prompt: ~65%
* **computer\_security:** Zero-Shot: ~55%, Probe: ~50%, LoRA: ~65%, LoRA+Prompt: ~75%
* **conceptual\_physics:** Zero-Shot: ~40%, Probe: ~35%, LoRA: ~50%, LoRA+Prompt: ~60%
* **econometrics:** Zero-Shot: ~60%, Probe: ~55%, LoRA: ~70%, LoRA+Prompt: ~80%
* **electrical\_engineering:** Zero-Shot: ~50%, Probe: ~45%, LoRA: ~60%, LoRA+Prompt: ~70%
* **elementary\_mathematics:** Zero-Shot: ~30%, Probe: ~25%, LoRA: ~40%, LoRA+Prompt: ~50%
* **formal\_logic:** Zero-Shot: ~40%, Probe: ~35%, LoRA: ~50%, LoRA+Prompt: ~60%
* **global\_facts:** Zero-Shot: ~35%, Probe: ~30%, LoRA: ~45%, LoRA+Prompt: ~55%
* **high\_school\_biology:** Zero-Shot: ~30%, Probe: ~25%, LoRA: ~40%, LoRA+Prompt: ~50%
* **high\_school\_chemistry:** Zero-Shot: ~35%, Probe: ~30%, LoRA: ~45%, LoRA+Prompt: ~55%
* **high\_school\_computer\_science:** Zero-Shot: ~40%, Probe: ~35%, LoRA: ~50%, LoRA+Prompt: ~60%
* **high\_school\_european\_history:** Zero-Shot: ~50%, Probe: ~45%, LoRA: ~60%, LoRA+Prompt: ~70%
* **high\_school\_geography:** Zero-Shot: ~45%, Probe: ~40%, LoRA: ~55%, LoRA+Prompt: ~65%
* **high\_school\_government\_and\_politics:** Zero-Shot: ~50%, Probe: ~45%, LoRA: ~60%, LoRA+Prompt: ~70%
(Similar detailed breakdowns would apply to the other models but are omitted for brevity. The general pattern is that LoRA and LoRA + Prompt cells consistently show higher values than Zero-Shot and Probe, with LoRA + Prompt highest.)
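For reference, the AUROC values reported in such a figure can be computed without tracing the ROC curve, via the Mann–Whitney interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch, not the figure authors' implementation:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) statistic: the probability
    that a random positive outscores a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive ranked above
    ties = (pos[:, None] == neg[None, :]).sum()     # tied pairs
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUROC of 0.5 corresponds to chance-level discrimination, 1.0 to a perfect ranking, which is why, unlike ECE, higher AUROC is better.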
### Key Observations
* **Evaluation Method Impact:** LoRA and LoRA + Prompt consistently yield higher cell values than Zero-Shot Classifier and Probe across all models and domains. Read as ECE, this indicates worse calibration; note that a higher value would instead indicate better discrimination if read as AUROC.
* **Domain Difficulty:** Domains like clinical\_knowledge, college\_medicine, econometrics, and computer\_security generally exhibit higher cell values across all models, suggesting they are more challenging for these LLMs.
* **Model Comparison:** LLaMA-2 13B and 13B Chat generally perform better than LLaMA-2 7B and 7B Chat. Mistral 7B and 7B Instruct show competitive performance, sometimes outperforming the LLaMA-2 13B models.
* **Consistency:** The relative performance ranking of domains is fairly consistent across different models and evaluation methods.
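The consistency observation could be quantified with a Spearman rank correlation between two models' per-domain score vectors: a correlation near 1 means the models find the same domains hard. A minimal sketch (assumes no tied scores; the input vectors are hypothetical):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Simplified variant without tie handling."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each entry in x
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each entry in y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Applied to, say, the per-domain ECE vectors of LLaMA-2 7B and Mistral 7B, a high rho would support the claim that the domain-difficulty ranking is stable across models.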
### Interpretation
This heatmap provides a comprehensive comparison of LLM performance across a diverse set of domains and evaluation techniques. The consistently higher values for LoRA and LoRA + Prompt suggest that these fine-tuning methods, while potentially useful for specific tasks, may introduce calibration issues or reduce generalization when evaluated across a broad range of domains. The higher error rates in specialized domains (e.g., medical, security) highlight the challenges of applying LLMs to expert knowledge areas. The competitive performance of the Mistral models suggests they are strong contenders in the LLM landscape.
The data suggests that while LLMs are becoming increasingly capable, careful consideration must be given to the evaluation method and the domain of application. Calibration and generalization remain critical areas for improvement. The heatmap allows for a nuanced understanding of model strengths and weaknesses, enabling informed decisions about model selection and deployment. The consistent pattern of LoRA/LoRA+Prompt performance suggests a systematic issue with those methods, potentially related to overfitting or catastrophic forgetting. Further investigation into the calibration properties of these fine-tuned models is warranted.