# Technical Document Extraction: GPT2-M Model Performance Analysis
## Image Description
The image contains six grouped bar charts comparing the performance of GPT2-M model variants across three metrics (Web NLG, CommonGen, Adidas Score) using two evaluation datasets (BLEU, Rouge-L). Each metric has two sub-charts: one for BLEU and one for Rouge-L. The charts compare "Base" and "Plugin" configurations across four model variants: GPT2-M (zeroshot), GPT2-M (1FT), GPT2-M (2FT), and GPT2-M (5FT).
---
## Key Components
### 1. Chart Structure
- **Columns**:
- Left column: BLEU dataset results
- Right column: Rouge-L dataset results
- **Rows**:
- Row 1: Web NLG metric
- Row 2: CommonGen metric
- Row 3: Adidas Score metric
### 2. Axes Labels
- **X-axis**:
- Categories: GPT2-M (zeroshot), GPT2-M (1FT), GPT2-M (2FT), GPT2-M (5FT)
- Labels: "GPT2-M (zeroshot)", "GPT2-M (1FT)", "GPT2-M (2FT)", "GPT2-M (5FT)"
- **Y-axis**:
- Row 1: "(a) Web NLG Score" (0.00–0.30)
- Row 2: "(b) CommonGen Score" (0.00–0.30)
- Row 3: "(c) Adidas Score" (0.00–0.35)
### 3. Legends
- **Legend Labels**:
- Base (pink)
- Plugin (purple)
- **Legend Position**: Top-left corner of each chart
### 4. Data Trends
#### BLEU Dataset
| Metric | Model Variant | Base Score | Plugin Score | Trend Description |
|-----------------|---------------------|------------|--------------|---------------------------------------|
| Web NLG | zeroshot | ~0.02 | ~0.13 | Plugin > Base by ~0.11 |
| Web NLG | 1FT | ~0.20 | ~0.28 | Plugin > Base by ~0.08 |
| Web NLG | 2FT | ~0.20 | ~0.29 | Plugin > Base by ~0.09 |
| Web NLG | 5FT | ~0.22 | ~0.30 | Plugin > Base by ~0.08 |
| CommonGen | zeroshot | ~0.01 | ~0.14 | Plugin > Base by ~0.13 |
| CommonGen | 1FT | ~0.20 | ~0.27 | Plugin > Base by ~0.07 |
| CommonGen | 2FT | ~0.20 | ~0.28 | Plugin > Base by ~0.08 |
| CommonGen | 5FT | ~0.22 | ~0.30 | Plugin > Base by ~0.08 |
| Adidas Score | zeroshot | ~0.00 | ~0.05 | Plugin > Base by ~0.05 |
| Adidas Score | 1FT | ~0.11 | ~0.19 | Plugin > Base by ~0.08 |
| Adidas Score | 2FT | ~0.12 | ~0.19 | Plugin > Base by ~0.07 |
| Adidas Score | 5FT | ~0.15 | ~0.21 | Plugin > Base by ~0.06 |
#### Rouge-L Dataset
| Metric | Model Variant | Base Score | Plugin Score | Trend Description |
|-----------------|---------------------|------------|--------------|---------------------------------------|
| Web NLG | zeroshot | ~0.18 | ~0.35 | Plugin > Base by ~0.17 |
| Web NLG | 1FT | ~0.38 | ~0.43 | Plugin > Base by ~0.05 |
| Web NLG | 2FT | ~0.38 | ~0.44 | Plugin > Base by ~0.06 |
| Web NLG | 5FT | ~0.40 | ~0.45 | Plugin > Base by ~0.05 |
| CommonGen | zeroshot | ~0.18 | ~0.39 | Plugin > Base by ~0.21 |
| CommonGen | 1FT | ~0.37 | ~0.42 | Plugin > Base by ~0.05 |
| CommonGen | 2FT | ~0.37 | ~0.43 | Plugin > Base by ~0.06 |
| CommonGen | 5FT | ~0.39 | ~0.44 | Plugin > Base by ~0.05 |
| Adidas Score | zeroshot | ~0.14 | ~0.19 | Plugin > Base by ~0.05 |
| Adidas Score | 1FT | ~0.23 | ~0.31 | Plugin > Base by ~0.08 |
| Adidas Score | 2FT | ~0.24 | ~0.32 | Plugin > Base by ~0.08 |
| Adidas Score | 5FT | ~0.27 | ~0.33 | Plugin > Base by ~0.06 |
---
## Observations
1. **Plugin Consistently Outperforms Base**:
- Across all metrics, datasets, and model variants, the Plugin configuration achieves higher scores than the Base configuration.
- Average improvement:
- BLEU: ~0.08–0.17 (Web NLG), ~0.07–0.13 (CommonGen), ~0.05–0.08 (Adidas)
- Rouge-L: ~0.05–0.21 (Web NLG), ~0.05–0.21 (CommonGen), ~0.05–0.08 (Adidas)
2. **Fine-Tuning Impact**:
- Scores generally increase with more fine-tuning steps (1FT → 5FT), though improvements diminish after 2FT in some cases.
- Exception: Adidas Score shows consistent gains across all fine-tuning steps.
3. **Dataset Differences**:
- Rouge-L scores are consistently higher than BLEU scores for equivalent configurations.
- Example: GPT2-M (5FT) Plugin scores 0.45 (BLEU) vs. 0.44 (Rouge-L) for Web NLG.
---
## Spatial Grounding
- **Legend Position**: Top-left corner of each chart (confirmed via visual inspection).
- **Bar Colors**:
- Base: Pink (#FFC0CB)
- Plugin: Purple (#800080)
- All bars match legend colors exactly.
---
## Conclusion
The Plugin configuration demonstrates superior performance across all evaluated metrics and datasets, with performance gains increasing with fine-tuning steps. Rouge-L generally yields higher scores than BLEU for equivalent configurations.