Image 24b0b76e5225...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: Performance Metrics vs. Temperature

This document provides a comprehensive extraction of data from a series of three line charts comparing the performance of four Large Language Models (LLMs) across varying temperature settings.

## 1. General Metadata and Structure

The image consists of three sub-plots arranged horizontally. The plots share a common X-axis and legend, but each measures a different performance metric. All plots use a "broken" Y-axis so that the high-performing models (top section) and the much lower-performing GPT-2 model (bottom section) fit in the same panel despite the large gap between their value ranges.

### Common Legend
The legend is located in the lower-left quadrant of each main plot area.
*   **Light Green:** Qwen2 (7B)
*   **Teal/Medium Green:** Mistral (7B)
*   **Steel Blue:** Gemma 2 (2B)
*   **Dark Purple/Navy:** GPT-2 (163M)

### Common X-Axis
*   **Title:** Temperature ($\tau$)
*   **Markers:** 0.001, 0.25, 0.5, 0.75, 1, 1.5

---

## 2. Chart 1: F1 Score vs Temperature

### Axis Information
*   **Y-Axis Title:** F1 Score
*   **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
*   **Y-Axis Markers (Bottom):** 0, 5

### Data Trends and Values
All models exhibit a **downward trend** as temperature increases, indicating that higher stochasticity reduces F1 performance.

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest performer; steady slight decline. | 70 | 60 |
| **Mistral (7B)** | Second highest; parallel decline to Qwen2. | 65 | 52 |
| **Gemma 2 (2B)** | Third highest; steady decline. | 51 | 40 |
| **GPT-2 (163M)** | Significantly lower; slight decline. | 6 | 4 |
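
The endpoint values in the table above can be turned into absolute and relative drops to compare how strongly each model reacts to temperature. The following is a minimal sketch that assumes the approximate readings from the table are accurate; the names `f1_endpoints` and `drop_summary` are illustrative and do not come from the source figure.

```python
# Approximate F1 readings from the table above: (value at tau=0.001, value at tau=1.5).
f1_endpoints = {
    "Qwen2 (7B)": (70, 60),
    "Mistral (7B)": (65, 52),
    "Gemma 2 (2B)": (51, 40),
    "GPT-2 (163M)": (6, 4),
}


def drop_summary(endpoints):
    """Return {model: (absolute_drop, relative_drop_percent)}."""
    summary = {}
    for model, (low_tau_value, high_tau_value) in endpoints.items():
        absolute = low_tau_value - high_tau_value
        relative = 100.0 * absolute / low_tau_value
        summary[model] = (absolute, round(relative, 1))
    return summary


f1_drops = drop_summary(f1_endpoints)
for model, (absolute, relative) in f1_drops.items():
    print(f"{model}: -{absolute} points ({relative}% of the tau=0.001 value)")
```

Under these readings, GPT-2 has the smallest absolute drop (2 points) but the largest relative one (about a third of its starting value), which is why both views are worth reporting.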

---

## 3. Chart 2: Exact Match (%) vs Temperature

### Axis Information
*   **Y-Axis Title:** Exact Match (%)
*   **Y-Axis Markers (Top):** 30, 40, 50, 60
*   **Y-Axis Markers (Bottom):** 0, 1

### Data Trends and Values
All models exhibit a **downward trend**. The performance gap between the 7B models and the 2B/163M models is more pronounced here than in the F1 chart.

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Leading; sharpest drop after $\tau=0.5$. | 62% | 50% |
| **Mistral (7B)** | Second; steady decline. | 57% | 41% |
| **Gemma 2 (2B)** | Third; notable drop between 0.5 and 1.0. | 43% | 31% |
| **GPT-2 (163M)** | Near zero; negligible performance. | 0.4% | 0.0% |
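
One way to check the gap between the 7B and 2B models is to compare it in both absolute and relative terms across the F1 and Exact Match charts. The sketch below uses only the approximate $\tau=0.001$ readings for Mistral (7B) and Gemma 2 (2B) from the two tables; the helper `gap` is a hypothetical name introduced for illustration.

```python
# Approximate readings at tau=0.001 from the F1 and Exact Match tables.
f1 = {"Mistral (7B)": 65, "Gemma 2 (2B)": 51}
exact_match = {"Mistral (7B)": 57, "Gemma 2 (2B)": 43}


def gap(scores, stronger, weaker):
    """Return (absolute gap, gap as a percentage of the stronger model's score)."""
    absolute = scores[stronger] - scores[weaker]
    relative = 100.0 * absolute / scores[stronger]
    return absolute, round(relative, 1)


f1_gap = gap(f1, "Mistral (7B)", "Gemma 2 (2B)")
em_gap = gap(exact_match, "Mistral (7B)", "Gemma 2 (2B)")
print("F1 gap:", f1_gap)
print("Exact Match gap:", em_gap)
```

With these values the absolute gap is the same in both charts (14 points), while the relative gap is larger for Exact Match because the scores themselves are lower.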

---

## 4. Chart 3: Semantic Match (%) vs Temperature

### Axis Information
*   **Y-Axis Title:** Semantic Match (%)
*   **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
*   **Y-Axis Markers (Bottom):** 0, 5

### Data Trends and Values
The top three models show a **consistent downward trend**. GPT-2 shows a **volatile/flat trend** at a very low baseline.

| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest; maintains >60% throughout. | 71% | 61% |
| **Mistral (7B)** | Second; steady decline. | 67% | 55% |
| **Gemma 2 (2B)** | Third; steady decline. | 52% | 43% |
| **GPT-2 (163M)** | Low/Flat; slight peak at $\tau=0.5$. | 6% | 5% |

---

## 5. Component Isolation Summary

*   **Header:** Contains three distinct titles: "F1 Score vs Temperature", "Exact Match (%) vs Temperature", and "Semantic Match (%) vs Temperature".
*   **Main Chart Area:** Features shaded regions around each line, representing confidence intervals or standard deviation. The background uses a light grey grid.
*   **Footer:** Contains the X-axis labels "Temperature ($\tau$)" and the numerical scale.
*   **Visual Indicators:** Diagonal "break" marks (//) appear on the Y-axes of all three charts, between the 5 and 30 marks (F1 Score, Semantic Match) or the 1 and 30 marks (Exact Match), to indicate the scale discontinuity.

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: Model Performance vs Temperature

## Overview
The image contains three line graphs comparing model performance across temperature settings (τ). Each graph reports a different metric: F1 Score, Exact Match (%), or Semantic Match (%). Four language models are compared:
- Qwen2 (7B)
- Mistral (7B)
- Gemma 2 (2B)
- GPT-2 (163M)

---

## Graph 1: F1 Score vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: F1 Score [0, 10, 20, ..., 70]

### Legend
- **Placement**: Right side of graph
- **Color Mapping**:
  - Qwen2 (7B): Green
  - Mistral (7B): Teal
  - Gemma 2 (2B): Blue
  - GPT-2 (163M): Purple

### Key Trends
1. **Qwen2 (7B)**:
   - Starts at ~70 F1 Score at τ=0.001
   - Gradual decline to ~60 at τ=1.5
   - Shaded confidence interval narrows slightly with increasing τ

2. **Mistral (7B)**:
   - Starts at ~65 F1 Score at τ=0.001
   - Declines to ~55 at τ=1.5
   - Confidence interval widens moderately

3. **Gemma 2 (2B)**:
   - Starts at ~50 F1 Score at τ=0.001
   - Drops to ~40 at τ=1.5
   - Confidence interval remains relatively stable

4. **GPT-2 (163M)**:
   - Starts at ~5 F1 Score at τ=0.001
   - Declines to ~3 at τ=1.5
   - Confidence interval shows minimal variation

### Data Points (Approximate)
| Model          | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B)     | 70      | 68     | 65    | 63     | 61    | 60    |
| Mistral (7B)   | 65      | 63     | 60    | 58     | 55    | 53    |
| Gemma 2 (2B)   | 50      | 48     | 45    | 43     | 40    | 38    |
| GPT-2 (163M)   | 5       | 4.5    | 4     | 3.5    | 3     | 2.5   |
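
The per-temperature readings above make it possible to track how far each model trails the leader at every τ. This is a minimal sketch that assumes the approximate table values are accurate; `gap_to_leader` is an illustrative helper, not something taken from the figure.

```python
# Approximate F1 readings from the table above, one value per tau.
taus = [0.001, 0.25, 0.5, 0.75, 1.0, 1.5]
f1 = {
    "Qwen2 (7B)": [70, 68, 65, 63, 61, 60],
    "Mistral (7B)": [65, 63, 60, 58, 55, 53],
    "Gemma 2 (2B)": [50, 48, 45, 43, 40, 38],
    "GPT-2 (163M)": [5, 4.5, 4, 3.5, 3, 2.5],
}


def gap_to_leader(scores, leader="Qwen2 (7B)"):
    """Return {model: [leader score minus model score at each tau]}."""
    return {
        model: [lead - value for lead, value in zip(scores[leader], values)]
        for model, values in scores.items()
        if model != leader
    }


gaps = gap_to_leader(f1)
for model, values in gaps.items():
    print(model, dict(zip(taus, values)))
```

In these readings the gap to Mistral and Gemma 2 grows by two points between τ=0.001 and τ=1.5, while the gap to GPT-2 shrinks, since GPT-2 starts close to zero and has little room to fall.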

---

## Graph 2: Exact Match (%) vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: Exact Match (%) [0, 10, 20, ..., 60]

### Legend
- **Placement**: Right side of graph
- **Color Mapping**: Same as Graph 1

### Key Trends
1. **Qwen2 (7B)**:
   - Starts at ~60% at τ=0.001
   - Declines to ~55% at τ=1.5
   - Confidence interval narrows slightly

2. **Mistral (7B)**:
   - Starts at ~55% at τ=0.001
   - Drops to ~50% at τ=1.5
   - Confidence interval widens moderately

3. **Gemma 2 (2B)**:
   - Starts at ~40% at τ=0.001
   - Declines to ~35% at τ=1.5
   - Confidence interval remains stable

4. **GPT-2 (163M)**:
   - Starts at ~1% at τ=0.001
   - Drops to ~0.5% at τ=1.5
   - Confidence interval shows minimal variation

### Data Points (Approximate)
| Model          | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B)     | 60      | 58     | 56    | 54     | 52    | 50    |
| Mistral (7B)   | 55      | 53     | 50    | 48     | 45    | 43    |
| Gemma 2 (2B)   | 40      | 38     | 35    | 33     | 30    | 28    |
| GPT-2 (163M)   | 1       | 0.8    | 0.6   | 0.5    | 0.4   | 0.3   |

---

## Graph 3: Semantic Match (%) vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: Semantic Match (%) [0, 5, 10, ..., 70]

### Legend
- **Placement**: Right side of graph
- **Color Mapping**: Same as Graph 1

### Key Trends
1. **Qwen2 (7B)**:
   - Starts at ~70% at τ=0.001
   - Declines to ~60% at τ=1.5
   - Confidence interval narrows slightly

2. **Mistral (7B)**:
   - Starts at ~65% at τ=0.001
   - Drops to ~55% at τ=1.5
   - Confidence interval widens moderately

3. **Gemma 2 (2B)**:
   - Starts at ~50% at τ=0.001
   - Declines to ~40% at τ=1.5
   - Confidence interval remains stable

4. **GPT-2 (163M)**:
   - Starts at ~5% at τ=0.001
   - Drops to ~4% at τ=1.5
   - Confidence interval shows minimal variation

### Data Points (Approximate)
| Model          | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B)     | 70      | 68     | 65    | 63     | 61    | 60    |
| Mistral (7B)   | 65      | 63     | 60    | 58     | 55    | 53    |
| Gemma 2 (2B)   | 50      | 48     | 45    | 43     | 40    | 38    |
| GPT-2 (163M)   | 5       | 4.5    | 4     | 3.5    | 3     | 2.5   |

---

## Observations
1. **Temperature Sensitivity**:
   - All models show performance degradation as temperature increases
   - Larger models (Qwen2, Mistral) maintain higher performance across τ ranges

2. **Model Hierarchy**:
   - Qwen2 > Mistral > Gemma 2 > GPT-2 in all metrics
   - Performance gaps widen at higher τ values

3. **Confidence Intervals**:
   - Wider intervals at higher τ values suggest increased uncertainty
   - GPT-2 shows the most stable confidence intervals despite lowest performance

4. **Performance Plateaus**:
   - All models exhibit diminishing returns beyond τ=0.5
   - GPT-2 shows near-linear decline across all τ values
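
Whether a curve flattens or declines linearly can be checked by computing the slope over each temperature interval. The sketch below does this for two rows of the Graph 1 data table, assuming those approximate readings are accurate; the function name `slopes` is introduced here for illustration. Note that the τ grid is not evenly spaced (the last interval is twice as wide as the others), so slopes are expressed per unit of τ.

```python
# Approximate F1 readings from the Graph 1 data table.
taus = [0.001, 0.25, 0.5, 0.75, 1.0, 1.5]
f1 = {
    "Qwen2 (7B)": [70, 68, 65, 63, 61, 60],
    "GPT-2 (163M)": [5, 4.5, 4, 3.5, 3, 2.5],
}


def slopes(taus, values):
    """Return the change in score per unit of tau for each consecutive interval."""
    return [
        round((v1 - v0) / (t1 - t0), 2)
        for t0, t1, v0, v1 in zip(taus, taus[1:], values, values[1:])
    ]


qwen_slopes = slopes(taus, f1["Qwen2 (7B)"])
gpt2_slopes = slopes(taus, f1["GPT-2 (163M)"])
print("Qwen2 (7B):", qwen_slopes)
print("GPT-2 (163M):", gpt2_slopes)
```

For these readings, both models lose points more slowly per unit of τ in the final interval (τ=1.0 to τ=1.5) than in the intervals before it.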

---

## Critical Notes
- All graphs share the same τ range on the X-axis; the Y-axis ranges differ (up to 70 for F1 Score and Semantic Match, up to 60 for Exact Match)
- Shaded areas appear to represent confidence intervals; the confidence level is not labeled in the image
- No textual annotations present beyond axis labels and legends
- No non-English text detected in the image
