# Technical Data Extraction: Performance Metrics vs. Temperature

This document provides a comprehensive extraction of data from a series of three line charts comparing the performance of four Large Language Models (LLMs) across varying temperature settings.

## 1. General Metadata and Structure

The image consists of three subplots arranged horizontally. The plots share a common X-axis and legend, and each measures a different performance metric. All plots use a "broken" Y-axis so that the high-performing models (top section) and the much lower-performing GPT-2 model (bottom section) fit in the same panel without compressing either range.

### Common Legend
The legend is located in the lower-left quadrant of each main plot area.
*   **Light Green:** Qwen2 (7B)
*   **Teal/Medium Green:** Mistral (7B)
*   **Steel Blue:** Gemma 2 (2B)
*   **Dark Purple/Navy:** GPT-2 (163M)

### Common X-Axis
*   **Title:** Temperature ($\tau$)
*   **Markers:** 0.001, 0.25, 0.5, 0.75, 1, 1.5

---

## 2. Chart 1: F1 Score vs Temperature

### Axis Information
*   **Y-Axis Title:** F1 Score
*   **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
*   **Y-Axis Markers (Bottom):** 0, 5

### Data Trends and Values
All models exhibit a **downward trend** as temperature increases, indicating that higher stochasticity reduces F1 performance.

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest performer; steady slight decline. | 70 | 60 |
| **Mistral (7B)** | Second highest; parallel decline to Qwen2. | 65 | 52 |
| **Gemma 2 (2B)** | Third highest; steady decline. | 51 | 40 |
| **GPT-2 (163M)** | Significantly lower; slight decline. | 6 | 4 |
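The decline described above can be quantified directly from the table. The following is a minimal sketch that uses only the approximate endpoint values listed here; the names `f1_scores` and `drop_summary` are illustrative and not part of the source chart.

```python
# Approximate F1 values read from the table above: (value at tau=0.001, value at tau=1.5).
f1_scores = {
    "Qwen2 (7B)": (70, 60),
    "Mistral (7B)": (65, 52),
    "Gemma 2 (2B)": (51, 40),
    "GPT-2 (163M)": (6, 4),
}


def drop_summary(scores):
    """Return {model: (absolute_drop, relative_drop_percent)} for each model."""
    summary = {}
    for model, (low_temp, high_temp) in scores.items():
        absolute = low_temp - high_temp
        # Relative drop is expressed as a share of the low-temperature value.
        relative = 100.0 * absolute / low_temp
        summary[model] = (absolute, round(relative, 1))
    return summary


f1_drops = drop_summary(f1_scores)
for model, (absolute, relative) in f1_drops.items():
    print(f"{model}: -{absolute} points ({relative}% of the low-temperature value)")
```

Because the inputs are visual estimates, the resulting percentages are indicative only. This holds especially for GPT-2, where a one-point reading error changes the relative drop substantially.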

---

## 3. Chart 2: Exact Match (%) vs Temperature

### Axis Information
*   **Y-Axis Title:** Exact Match (%)
*   **Y-Axis Markers (Top):** 30, 40, 50, 60
*   **Y-Axis Markers (Bottom):** 0, 1

### Data Trends and Values
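All models exhibit a **downward trend**. Based on the approximate values below, the absolute gap between the 7B models and Gemma 2 (2B) is similar to the F1 chart (about 14 points between Mistral and Gemma 2 at $\tau=0.001$ in both). Because Exact Match scores are lower overall, the gap is slightly larger in relative terms, and GPT-2 falls to near zero.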

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Leading; sharpest drop after $\tau=0.5$. | 62% | 50% |
| **Mistral (7B)** | Second; steady decline. | 57% | 41% |
| **Gemma 2 (2B)** | Third; notable drop between 0.5 and 1.0. | 43% | 31% |
| **GPT-2 (163M)** | Near zero; negligible performance. | 0.4% | 0.0% |
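To compare the gap between model sizes across the F1 and Exact Match charts, the approximate endpoint values from both tables can be put side by side. This is a rough sketch under the assumption that the table readings are accurate to about one point; the helper `gap` and the variable names are hypothetical.

```python
# Approximate values from the F1 and Exact Match tables: (tau=0.001, tau=1.5).
f1 = {"Mistral (7B)": (65, 52), "Gemma 2 (2B)": (51, 40)}
exact_match = {"Mistral (7B)": (57, 41), "Gemma 2 (2B)": (43, 31)}


def gap(metric, larger="Mistral (7B)", smaller="Gemma 2 (2B)"):
    """Return [(absolute_gap, smaller/larger ratio)] at each temperature endpoint."""
    result = []
    for i in range(2):  # index 0 is tau=0.001, index 1 is tau=1.5
        high, low = metric[larger][i], metric[smaller][i]
        result.append((high - low, round(low / high, 3)))
    return result


f1_gap = gap(f1)
em_gap = gap(exact_match)
print("F1 gap (points, ratio):", f1_gap)
print("Exact Match gap (points, ratio):", em_gap)
```

A lower ratio means the smaller model retains a smaller share of the larger model's score, which is one way to express how pronounced the gap is independently of the metric's overall level.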

---

## 4. Chart 3: Semantic Match (%) vs Temperature

### Axis Information
*   **Y-Axis Title:** Semantic Match (%)
*   **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
*   **Y-Axis Markers (Bottom):** 0, 5

### Data Trends and Values
The top three models show a **consistent downward trend**. GPT-2 stays **roughly flat** at a very low baseline, with only small fluctuations.

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest; maintains >60% throughout. | 71% | 61% |
| **Mistral (7B)** | Second; steady decline. | 67% | 55% |
| **Gemma 2 (2B)** | Third; steady decline. | 52% | 43% |
| **GPT-2 (163M)** | Low/Flat; slight peak at $\tau=0.5$. | 6% | 5% |
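Across all three tables, the model ordering appears to be the same. The sketch below checks this using the approximate endpoint values from the three tables; the dictionary layout and the `ranking` helper are illustrative assumptions, not something shown in the chart.

```python
# Approximate endpoint values from the three tables: (tau=0.001, tau=1.5).
metrics = {
    "F1": {
        "Qwen2 (7B)": (70, 60), "Mistral (7B)": (65, 52),
        "Gemma 2 (2B)": (51, 40), "GPT-2 (163M)": (6, 4),
    },
    "Exact Match": {
        "Qwen2 (7B)": (62, 50), "Mistral (7B)": (57, 41),
        "Gemma 2 (2B)": (43, 31), "GPT-2 (163M)": (0.4, 0.0),
    },
    "Semantic Match": {
        "Qwen2 (7B)": (71, 61), "Mistral (7B)": (67, 55),
        "Gemma 2 (2B)": (52, 43), "GPT-2 (163M)": (6, 5),
    },
}


def ranking(values, endpoint):
    """Return model names sorted from best to worst at the given endpoint index."""
    return sorted(values, key=lambda model: values[model][endpoint], reverse=True)


# One ranking per (metric, endpoint) pair; endpoint 0 is tau=0.001, 1 is tau=1.5.
rankings = {
    (name, endpoint): ranking(values, endpoint)
    for name, values in metrics.items()
    for endpoint in (0, 1)
}
# The ordering is consistent if every pair produces the same sequence of models.
consistent = len({tuple(order) for order in rankings.values()}) == 1
print("Same ordering in every metric and endpoint:", consistent)
```

This only covers the two endpoint temperatures listed in the tables; intermediate temperatures would need values read from the chart itself.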

---

## 5. Component Isolation Summary

*   **Header:** Contains three distinct titles: "F1 Score vs Temperature", "Exact Match (%) vs Temperature", and "Semantic Match (%) vs Temperature".
*   **Main Chart Area:** Features a shaded band around each line, likely representing either a confidence interval or a standard deviation. The background uses a light grey grid.
*   **Footer:** Contains the X-axis labels "Temperature ($\tau$)" and the numerical scale.
*   **Visual Indicators:** Diagonal "break" marks (//) are present on the Y-axes of all three charts to indicate the scale discontinuity: between the 5 and 30 marks on the F1 Score and Semantic Match charts, and between the 1 and 30 marks on the Exact Match chart.