# Technical Data Extraction: Performance Comparison Across Datasets and Model Configurations

This document extracts the data from a 3x2 grid of bar charts comparing model performance (BLEU and Rouge-L scores) across three datasets: **Web NLG**, **CommonGen**, and **Adidas**. All values are approximate readings from the chart axes.

## 1. Document Structure and Global Components

The image is organized into three horizontal rows, each representing a specific dataset, and two vertical columns representing evaluation metrics.

*   **Rows (Datasets):**
    *   (a) Web NLG
    *   (b) CommonGen
    *   (c) Adidas
*   **Columns (Metrics):**
    *   BLEU (Left Column)
    *   Rouge-L (Right Column)
*   **X-Axis Categories (Common to all charts):**
    1.  No plugin
    2.  1-layer
    3.  2-layer
    4.  4-layer
    5.  8-layer
    6.  12-layer
    7.  GPT2 Small
*   **Visual Encoding:**
    *   **Bars:** Represent the score for each configuration. The color transitions from light pink (left) to dark purple (right).
    *   **Dashed Line with Markers:** Connects the top of each bar to visualize the trend across configurations.
    *   **Y-Axis:** Represents the numerical "Score" for the respective metric.

---

## 2. Detailed Data Extraction by Dataset

### (a) Web NLG Dataset
**Trend Analysis:** Both metrics jump sharply from "No plugin" to "1-layer" (BLEU 0.02 to 0.13, Rouge-L 0.19 to 0.35). Scores then drop at "2-layer", stay roughly flat through "8-layer", and dip slightly again at "12-layer". "GPT2 Small" scores highest on both metrics.

| Configuration | BLEU Score (Approx.) | Rouge-L Score (Approx.) |
| :--- | :--- | :--- |
| No plugin | 0.02 | 0.19 |
| 1-layer | 0.13 | 0.35 |
| 2-layer | 0.10 | 0.31 |
| 4-layer | 0.10 | 0.31 |
| 8-layer | 0.10 | 0.30 |
| 12-layer | 0.09 | 0.28 |
| GPT2 Small | 0.20 | 0.51 |

---

### (b) CommonGen Dataset
**Trend Analysis:** As with Web NLG, there is a sharp increase at "1-layer". Scores then decline gradually as the number of layers increases from 1 to 12; the decline is not linear, with the largest drop between "1-layer" and "2-layer" and Rouge-L nearly flat afterwards. "GPT2 Small" remains the highest-scoring configuration.

| Configuration | BLEU Score (Approx.) | Rouge-L Score (Approx.) |
| :--- | :--- | :--- |
| No plugin | 0.015 | 0.15 |
| 1-layer | 0.14 | 0.39 |
| 2-layer | 0.12 | 0.35 |
| 4-layer | 0.11 | 0.35 |
| 8-layer | 0.105 | 0.34 |
| 12-layer | 0.09 | 0.33 |
| GPT2 Small | 0.20 | 0.55 |

---

### (c) Adidas Dataset
**Trend Analysis:** Scores for this dataset are lower in absolute terms than for the other two. The pattern is the same: the plugin configurations peak at "1-layer" and then decline at every step through "12-layer". By "8-layer" the Rouge-L score is back at the "No plugin" level (0.135), and at "12-layer" it falls slightly below it (0.128). "GPT2 Small" leads all plugin configurations by a wide margin.

| Configuration | BLEU Score (Approx.) | Rouge-L Score (Approx.) |
| :--- | :--- | :--- |
| No plugin | 0.005 | 0.135 |
| 1-layer | 0.048 | 0.165 |
| 2-layer | 0.041 | 0.148 |
| 4-layer | 0.037 | 0.138 |
| 8-layer | 0.035 | 0.135 |
| 12-layer | 0.030 | 0.128 |
| GPT2 Small | 0.081 | 0.245 |
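
The downward slope described for the Adidas plugin configurations can be checked directly from the table. The following is a minimal sketch, assuming the approximate chart readings above are used as-is; the helper names `step_changes` and `is_strictly_decreasing` are illustrative and not part of the source.

```python
# Approximate values read from the Adidas table above (chart readings,
# not exact measurements). Order: 1-layer, 2-layer, 4-layer, 8-layer, 12-layer.
ADIDAS_BLEU = [0.048, 0.041, 0.037, 0.035, 0.030]
ADIDAS_ROUGE_L = [0.165, 0.148, 0.138, 0.135, 0.128]


def step_changes(scores):
    """Return the change between each consecutive pair of scores."""
    return [round(b - a, 3) for a, b in zip(scores, scores[1:])]


def is_strictly_decreasing(scores):
    """Return True if every score is lower than the one before it."""
    return all(b < a for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    for name, scores in [("BLEU", ADIDAS_BLEU), ("Rouge-L", ADIDAS_ROUGE_L)]:
        print(name, step_changes(scores), is_strictly_decreasing(scores))
```

Because the inputs are eyeballed from the bars, small step changes (a few thousandths) are within reading error and should not be over-interpreted.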

---

## 3. Summary of Key Findings

1.  **Plugin Impact:** Adding even a single layer ("1-layer") raises every score over the "No plugin" baseline. The gain is largest in BLEU (roughly 6x to 10x across the three datasets) and smallest in Adidas Rouge-L (0.135 to 0.165).
2.  **Layer Scaling:** Increasing the number of layers in the plugin (from 1 to 12) never improves performance: at each step, scores either **decrease** or stay flat. The 1-layer configuration is the most effective plugin setup shown.
3.  **Baseline Comparison:** The "GPT2 Small" model consistently outperforms all plugin-augmented configurations, achieving BLEU scores roughly 1.4 to 1.7 times those of the 1-layer plugin and Rouge-L scores roughly 1.4 to 1.5 times as high.
4.  **Metric Correlation:** BLEU and Rouge-L scores follow the same qualitative trends across all configurations, suggesting that the improvements (or degradations) are not an artifact of one particular evaluation metric.
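
The BLEU comparisons in the findings above can be reproduced from the tables with simple ratios. This is a rough sketch under the assumption that the approximate chart readings are accurate enough for a ratio; the `BLEU` dictionary layout and the `ratio` helper are illustrative names, not part of the source.

```python
# Approximate BLEU values read from the three tables above (chart readings).
BLEU = {
    "Web NLG": {"No plugin": 0.02, "1-layer": 0.13, "GPT2 Small": 0.20},
    "CommonGen": {"No plugin": 0.015, "1-layer": 0.14, "GPT2 Small": 0.20},
    "Adidas": {"No plugin": 0.005, "1-layer": 0.048, "GPT2 Small": 0.081},
}


def ratio(scores, numerator, denominator):
    """Return how many times larger one configuration's score is than another's."""
    return scores[numerator] / scores[denominator]


if __name__ == "__main__":
    for dataset, scores in BLEU.items():
        gpt2_vs_plugin = ratio(scores, "GPT2 Small", "1-layer")
        plugin_vs_none = ratio(scores, "1-layer", "No plugin")
        print(
            f"{dataset}: GPT2 Small / 1-layer = {gpt2_vs_plugin:.2f}, "
            f"1-layer / No plugin = {plugin_vs_none:.1f}"
        )
```

The "No plugin" BLEU values are very small, so the 1-layer / No plugin ratio is sensitive to reading error and is best treated as an order-of-magnitude indication.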