## Bar Chart: Model Performance Across Subjects by Evaluation Metrics
### Overview
The chart compares the performance of four AI models (LLaMA-2 7B, LLaMA-2 13B, Mistral 7B, Mistral 7B Instruct) across 25 subjects (e.g., high school psychology, human sexuality, international law) using two evaluation metrics: Expected Calibration Error (ECE) and Area Under the Receiver Operating Characteristic curve (AUROC). Four methods are evaluated: Zero-Shot Classifier, Probe, LoRA, and LoRA + Prompt. Bars are color-coded by method, with ECE and AUROC split along the x-axis.
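Both metrics have standard definitions. As a minimal sketch (assuming the common equal-width-binning formulation; the chart's exact binning is not stated), ECE can be computed as the sample-weighted average gap between mean confidence and accuracy within each confidence bin:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average |mean confidence - accuracy|
    over equal-width confidence bins. Lower is better."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin includes its left edge; all bins include the right edge
        if i == 0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# A model that is 90% confident but only 50% accurate is miscalibrated:
print(expected_calibration_error([0.9, 0.9], [1, 0]))  # → 0.4
```

A perfectly calibrated model (confidence matches empirical accuracy in every bin) has ECE of 0, which is why lower (shorter) ECE bars in the chart indicate better calibration.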
### Components/Axes
- **Y-Axis**: Subjects (e.g., "high_school_psychology", "human_aging", "world_religions").
- **X-Axis**:
  - **ECE**: 20%–50%, 60%–90% (calibration error ranges).
  - **AUROC**: 20%–50%, 60%–90% (discrimination performance ranges).
- **Legend**:
- **Red**: Zero-Shot Classifier.
- **Light Purple**: Probe.
- **Dark Purple**: LoRA.
- **Blue**: LoRA + Prompt.
- **Model Groups**:
- LLaMA-2 7B (leftmost), LLaMA-2 13B, Mistral 7B, Mistral 7B Instruct (rightmost).
### Detailed Analysis
- **ECE Trends**:
  - For most subjects, **LoRA + Prompt** (blue) shows the lowest ECE (shortest bars), indicating better calibration.
  - **Zero-Shot Classifier** (red) often has the highest ECE (longest bars), suggesting poor calibration.
- Example: In "high_school_psychology" (LLaMA-2 7B), Zero-Shot ECE ≈ 45%, Probe ≈ 35%, LoRA ≈ 30%, LoRA + Prompt ≈ 25%.
- **AUROC Trends**:
  - **LoRA + Prompt** consistently achieves the highest AUROC (longest bars), indicating superior discrimination.
- **Zero-Shot Classifier** frequently has the lowest AUROC (shortest bars).
- Example: In "human_sexuality" (Mistral 7B Instruct), Zero-Shot AUROC ≈ 55%, Probe ≈ 65%, LoRA ≈ 70%, LoRA + Prompt ≈ 75%.
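The AUROC values above can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. A minimal sketch using the rank-based (Mann–Whitney) formulation, which is equivalent to integrating the ROC curve:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via pairwise comparisons: the probability that a random
    positive example outscores a random negative one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Perfect separation of positives and negatives gives AUROC = 1.0;
# a score that ignores the label gives ~0.5 (chance level).
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

This is why an AUROC near 50% (e.g., the Zero-Shot Classifier on "virology") indicates discrimination no better than chance, while values approaching 75–85% reflect increasingly reliable separation of correct from incorrect predictions.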
### Key Observations
1. **LoRA + Prompt Dominance**: Outperforms other methods in both ECE and AUROC across nearly all subjects and models.
2. **Model Size Impact**: LLaMA-2 13B generally shows better performance than LLaMA-2 7B, particularly in AUROC.
3. **Subject Variability**:
- **High AUROC**: "prehistory" (LLaMA-2 13B, LoRA + Prompt ≈ 85%).
- **Low AUROC**: "virology" (Mistral 7B, Zero-Shot ≈ 45%).
4. **ECE Anomalies**:
- "moral_scenarios" (Mistral 7B Instruct) shows unusually high ECE for LoRA + Prompt (≈ 40%).
### Interpretation
The data demonstrate that **LoRA + Prompt** substantially enhances model performance across diverse subjects, likely because low-rank adaptation combined with prompting provides task-specific calibration at low parameter cost. The **Zero-Shot Classifier** struggles with both calibration and discrimination, highlighting the need for fine-tuning. Larger models (e.g., LLaMA-2 13B) outperform smaller counterparts, but method choice (LoRA + Prompt) remains the strongest predictor of success. Outliers like "moral_scenarios" suggest domain-specific challenges requiring further investigation.