# Technical Data Extraction: Performance Metrics vs. Temperature
This document extracts the data from three line charts that compare the performance of four large language models (LLMs) across temperature settings. Numeric values are approximate readings at the lowest and highest temperatures.
## 1. General Metadata and Structure
The image consists of three sub-plots arranged horizontally. Each plot shares a common X-axis and legend but measures a different performance metric. All plots use a "broken" Y-axis so that the high-performing models (top section) and the much lower-performing GPT-2 model (bottom section) fit in the same panel, with the intermediate range omitted.
### Common Legend
The legend is located in the lower-left quadrant of each main plot area.
* **Light Green:** Qwen2 (7B)
* **Teal/Medium Green:** Mistral (7B)
* **Steel Blue:** Gemma 2 (2B)
* **Dark Purple/Navy:** GPT-2 (163M)
### Common X-Axis
* **Title:** Temperature ($\tau$)
* **Markers:** 0.001, 0.25, 0.5, 0.75, 1, 1.5
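
For context on the leftmost marker: under the standard formulation of temperature sampling (an assumption here, since the charts do not state how $\tau$ was applied), token logits are divided by $\tau$ before the softmax. A value of $\tau=0.001$ then behaves almost like greedy decoding while avoiding division by zero. A minimal sketch with hypothetical logits, not values from the charts:

```python
import math

def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau), shifted by the max for numerical stability."""
    scaled = [x / tau for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative only).
logits = [2.0, 1.0, 0.5]

# The temperature markers shown on the common X-axis.
for tau in (0.001, 0.25, 0.5, 0.75, 1.0, 1.5):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: top-token probability = {probs[0]:.3f}")
```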
---
## 2. Chart 1: F1 Score vs Temperature
### Axis Information
* **Y-Axis Title:** F1 Score
* **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
* **Y-Axis Markers (Bottom):** 0, 5
### Data Trends and Values
All models exhibit a **downward trend** as temperature increases, indicating that higher stochasticity reduces F1 performance.
| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest performer; steady slight decline. | 70 | 60 |
| **Mistral (7B)** | Second highest; parallel decline to Qwen2. | 65 | 52 |
| **Gemma 2 (2B)** | Third highest; steady decline. | 51 | 40 |
| **GPT-2 (163M)** | Significantly lower; slight decline. | 6 | 4 |
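
The endpoint values in the table can be turned into absolute and relative declines. The sketch below uses only the approximate readings above, so the results are coarse; the names `f1_endpoints` and `decline` are illustrative. On these readings, GPT-2's decline is the smallest in absolute terms (2 points) but the largest relative to its starting value.

```python
# Approximate F1 readings from the table above: (value at tau=0.001, value at tau=1.5).
f1_endpoints = {
    "Qwen2 (7B)": (70, 60),
    "Mistral (7B)": (65, 52),
    "Gemma 2 (2B)": (51, 40),
    "GPT-2 (163M)": (6, 4),
}

def decline(start, end):
    """Return (absolute drop, drop as a percentage of the starting value)."""
    absolute = start - end
    relative = 100.0 * absolute / start
    return absolute, relative

f1_declines = {model: decline(*vals) for model, vals in f1_endpoints.items()}

for model, (abs_drop, rel_drop) in f1_declines.items():
    print(f"{model}: -{abs_drop} points ({rel_drop:.1f}% of the starting value)")
```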
---
## 3. Chart 2: Exact Match (%) vs Temperature
### Axis Information
* **Y-Axis Title:** Exact Match (%)
* **Y-Axis Markers (Top):** 30, 40, 50, 60
* **Y-Axis Markers (Bottom):** 0, 1
### Data Trends and Values
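All models exhibit a **downward trend**. Relative to the F1 chart, the gap between the 7B models and the 2B/163M models is more pronounced here in proportional terms; the absolute gap between Mistral and Gemma 2 at $\tau=0.001$ is about 14 points in both charts.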
| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Leading; its decline steepens after $\tau=0.5$. | 62% | 50% |
| **Mistral (7B)** | Second; steady decline. | 57% | 41% |
| **Gemma 2 (2B)** | Third; notable drop between 0.5 and 1.0. | 43% | 31% |
| **GPT-2 (163M)** | Near zero; negligible performance. | 0.4% | 0.0% |
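
One way to examine the gap between the 7B models and the smaller models is to compare Mistral (the lower of the two 7B models) with Gemma 2 (the higher of the two smaller models) in both the F1 and Exact Match charts. The sketch below uses only the approximate endpoint readings from the two tables; `readings` and `gap` are illustrative names. On these readings the Exact Match gap is larger as a ratio at both endpoints, while the absolute gap is equal or smaller.

```python
# Approximate readings (tau=0.001, tau=1.5) from the F1 and Exact Match tables.
readings = {
    "F1": {"Mistral (7B)": (65, 52), "Gemma 2 (2B)": (51, 40)},
    "Exact Match": {"Mistral (7B)": (57, 41), "Gemma 2 (2B)": (43, 31)},
}

def gap(metric, idx):
    """Return (absolute gap, ratio) between Mistral and Gemma 2.

    idx selects the endpoint: 0 for tau=0.001, 1 for tau=1.5.
    """
    upper = readings[metric]["Mistral (7B)"][idx]
    lower = readings[metric]["Gemma 2 (2B)"][idx]
    return upper - lower, upper / lower

for metric in readings:
    for idx, tau in enumerate((0.001, 1.5)):
        absolute, ratio = gap(metric, idx)
        print(f"{metric} at tau={tau}: gap = {absolute} points, ratio = {ratio:.2f}")
```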
---
## 4. Chart 3: Semantic Match (%) vs Temperature
### Axis Information
* **Y-Axis Title:** Semantic Match (%)
* **Y-Axis Markers (Top):** 30, 40, 50, 60, 70
* **Y-Axis Markers (Bottom):** 0, 5
### Data Trends and Values
The top three models show a **consistent downward trend**. GPT-2 stays **roughly flat, with small fluctuations**, at a very low baseline.
| Model | Trend Description | Approx. Value at $\tau=0.001$ | Approx. Value at $\tau=1.5$ |
| :--- | :--- | :---: | :---: |
| **Qwen2 (7B)** | Highest; maintains >60% throughout. | 71% | 61% |
| **Mistral (7B)** | Second; steady decline. | 67% | 55% |
| **Gemma 2 (2B)** | Third; steady decline. | 52% | 43% |
| **GPT-2 (163M)** | Low/Flat; slight peak at $\tau=0.5$. | 6% | 5% |
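
All three tables report the same model ordering. As a consistency check, the sketch below encodes the approximate endpoint readings from this document and confirms that the ranking is identical for every metric at both $\tau=0.001$ and $\tau=1.5$. The structure and names (`endpoints`, `ranking`) are illustrative, and intermediate temperatures are not covered because only endpoint values are tabulated.

```python
# Approximate readings (tau=0.001, tau=1.5) from the three tables in this document.
endpoints = {
    "F1": {
        "Qwen2 (7B)": (70, 60),
        "Mistral (7B)": (65, 52),
        "Gemma 2 (2B)": (51, 40),
        "GPT-2 (163M)": (6, 4),
    },
    "Exact Match": {
        "Qwen2 (7B)": (62, 50),
        "Mistral (7B)": (57, 41),
        "Gemma 2 (2B)": (43, 31),
        "GPT-2 (163M)": (0.4, 0.0),
    },
    "Semantic Match": {
        "Qwen2 (7B)": (71, 61),
        "Mistral (7B)": (67, 55),
        "Gemma 2 (2B)": (52, 43),
        "GPT-2 (163M)": (6, 5),
    },
}

def ranking(metric, idx):
    """Return model names sorted from best to worst for one metric and endpoint."""
    scores = endpoints[metric]
    return sorted(scores, key=lambda model: scores[model][idx], reverse=True)

# Collect the ranking for every (metric, endpoint) pair and compare them.
rankings = {(metric, idx): ranking(metric, idx) for metric in endpoints for idx in (0, 1)}
consistent = len({tuple(order) for order in rankings.values()}) == 1

print("Ranking:", " > ".join(ranking("F1", 0)))
print("Same ordering across all metrics and both endpoints:", consistent)
```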
---
## 5. Component Isolation Summary
* **Header:** Contains three distinct titles: "F1 Score vs Temperature", "Exact Match (%) vs Temperature", and "Semantic Match (%) vs Temperature".
* **Main Chart Area:** Each line has a shaded band around it, representing either a confidence interval or a standard deviation. The background uses a light grey grid.
* **Footer:** Contains the X-axis labels "Temperature ($\tau$)" and the numerical scale.
* **Visual Indicators:** Diagonal "break" marks (//) appear on the Y-axis of all three charts to indicate the scale discontinuity: between the 5 and 30 marks in the F1 Score and Semantic Match charts, and between the 1 and 30 marks in the Exact Match chart.