# Technical Data Extraction: Performance Analysis of LLMs under Induced Error

This document provides a comprehensive extraction of data from a technical figure containing multiple line charts. The figure is divided into two primary sections: **(a)**, which shows accuracy over task length for specific models, and **(b)**, which shows accuracy at a specific turn (Turn 100) relative to induced error rates.

---

## Section (a): Turn Accuracy vs. Task Length
This section consists of four sub-plots, each representing a different Large Language Model (LLM). 

### Common Axis and Legend Information
*   **Y-Axis:** "Turn Accuracy" (Scale: 0.0 to 1.0).
*   **X-Axis:** "Task Length" (Scale: 0 to 100).
*   **Legend (Bottom Left):**
    *   **Dark Blue Square:** Original Run
    *   **Red Square:** 100% Error Rate
    *   **Orange Square:** 75% Error Rate
    *   **Yellow Square:** 50% Error Rate
    *   **Light Green Square:** 25% Error Rate
    *   **Dark Green Square:** 0% Error Rate

### Sub-plot 1: Qwen3-14B
*   **Trend:** Accuracy holds up best for the 0%, 25%, and Original runs, which end between ~0.7 and ~0.8. At higher induced error rates (50% to 100%), accuracy degrades further as task length increases, reaching ~0.5 at the 100% rate.
*   **Data Points (Approximate at Task Length 100):**
    *   0% Error Rate (Dark Green): ~0.8
    *   25% Error Rate (Light Green): ~0.75
    *   Original Run (Dark Blue): ~0.7
    *   50% Error Rate (Yellow): ~0.65
    *   75% Error Rate (Orange): ~0.55
    *   100% Error Rate (Red): ~0.5

### Sub-plot 2: Gemma3-12B
*   **Trend:** Accuracy declines much more steeply than for Qwen3-14B. Even at the 0% error rate, accuracy falls to ~0.4 by task length 100, and every other run ends at ~0.3 or below.
*   **Data Points (Approximate at Task Length 100):**
    *   0% Error Rate (Dark Green): ~0.4
    *   25% Error Rate (Light Green): ~0.3
    *   Original Run (Dark Blue): ~0.3
    *   50% Error Rate (Yellow): ~0.2
    *   75% Error Rate (Orange): ~0.15
    *   100% Error Rate (Red): ~0.1

### Sub-plot 3: Qwen3-32B
*   **Trend:** High resilience at low error rates. The 0%, 25%, and Original runs maintain accuracy above 0.8 throughout the task. The 75% and 100% runs decline steadily but end higher (~0.3 to ~0.35) than the same runs for the Gemma models (~0.1 to ~0.15).
*   **Data Points (Approximate at Task Length 100):**
    *   0%/25%/Original: ~0.85 - 0.9
    *   50% Error Rate (Yellow): ~0.6
    *   75% Error Rate (Orange): ~0.35
    *   100% Error Rate (Red): ~0.3

### Sub-plot 4: Gemma3-27B
*   **Trend:** Like Gemma3-12B, accuracy collapses rapidly at high error rates (~0.1 to ~0.15 at 75% and 100%). The 0% and 25% runs, however, end much higher (~0.8 to ~0.85) than the same runs for the 12B model (~0.3 to ~0.4).
*   **Data Points (Approximate at Task Length 100):**
    *   0%/25% Error Rate: ~0.8 - 0.85
    *   50% Error Rate (Yellow): ~0.4
    *   75% Error Rate (Orange): ~0.15
    *   100% Error Rate (Red): ~0.1
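
The endpoint values listed for the four sub-plots in Section (a) can be collected into a small structure to compare how much accuracy each model loses between the 0% and 100% induced-error runs. The following is a minimal sketch, not part of the original figure: the names `ENDPOINTS`, `midpoint`, `absolute_drop`, and `retained_fraction` are illustrative, and taking the midpoint of a reported range (for example ~0.85 - 0.9) is an assumption made only to get a single number per run.

```python
# Approximate Turn Accuracy at Task Length 100, copied from the lists above.
# Each value is a (low, high) pair; single readings repeat the same number.
ENDPOINTS = {
    "Qwen3-14B":  {"0%": (0.80, 0.80), "100%": (0.50, 0.50)},
    "Gemma3-12B": {"0%": (0.40, 0.40), "100%": (0.10, 0.10)},
    "Qwen3-32B":  {"0%": (0.85, 0.90), "100%": (0.30, 0.30)},
    "Gemma3-27B": {"0%": (0.80, 0.85), "100%": (0.10, 0.10)},
}


def midpoint(bounds):
    """Collapse a (low, high) reading to one number (an assumption, see lead-in)."""
    low, high = bounds
    return (low + high) / 2


def absolute_drop(model):
    """Accuracy lost between the 0% and 100% induced-error runs."""
    entry = ENDPOINTS[model]
    return round(midpoint(entry["0%"]) - midpoint(entry["100%"]), 3)


def retained_fraction(model):
    """Share of the 0%-run accuracy that survives at the 100% error rate."""
    entry = ENDPOINTS[model]
    return round(midpoint(entry["100%"]) / midpoint(entry["0%"]), 3)


if __name__ == "__main__":
    for name in ENDPOINTS:
        print(f"{name}: drop={absolute_drop(name)}, retained={retained_fraction(name)}")
```

Because the inputs are visual estimates from a chart, the outputs are only as precise as those readings and are best used for ranking the models rather than as exact measurements.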

---

## Section (b): Turn 100 Accuracy vs. Induced Error Rate
This section contains two charts comparing different model families.

### Common Axis Information
*   **Y-Axis:** "Turn 100 Accuracy" (Scale: 0.0 to 1.0).
*   **X-Axis:** "Induced Error Rate" (Markers: 0.00, 0.25, 0.50, 0.75, 1.00).

### Left Chart (Qwen vs. Gemma)
**Legend (Top Right):**
*   **Blue Tones (Qwen):** Qwen3-32B (Darkest), Qwen3-14B, Qwen3-8B, Qwen3-4B (Lightest).
*   **Red/Orange Tones (Gemma):** Gemma-27B (Darkest), Gemma-12B, Gemma-4B (Lightest).

**Key Trends:**
*   **Qwen Series:** Generally more robust. Qwen3-32B starts at ~0.85 and ends at ~0.55.
*   **Gemma Series:** Starts higher (Gemma-27B at ~0.92) but drops precipitously to near 0.1 accuracy as error rate reaches 1.00.

### Right Chart (Competitive Models)
**Legend (Top Right):**
*   **Purple Line:** Deepseek-chat-v3-0324
*   **Yellow Line:** Kimi-k2
*   **Brown Line:** Qwen3-2507-235B

**Data Extraction (Approximate Values):**

| Induced Error Rate | Deepseek-chat-v3-0324 | Kimi-k2 | Qwen3-2507-235B |
| :--- | :--- | :--- | :--- |
| **0.00** | 1.0 | 1.0 | 0.98 |
| **0.25** | 0.91 | 0.87 | 0.80 |
| **0.50** | 0.73 | 0.66 | 0.52 |
| **0.75** | 0.34 | 0.39 | 0.20 |
| **1.00** | 0.25 | 0.34 | 0.24 |

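**Trend Observation:** All three high-parameter models show a sharp, non-linear decline in accuracy as the induced error rate increases, with the steepest drop between 0.50 and 0.75. Deepseek ties Kimi-k2 at 0.00 and holds the highest accuracy at 0.25 and 0.50. At 0.75 and 1.00, Kimi-k2 is the most resilient (0.39 and 0.34). Qwen3-2507-235B is the lowest of the three at every rate up to 0.75 and is the only model whose reading rises slightly between 0.75 and 1.00 (0.20 to 0.24).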
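
The table values can be checked mechanically. The sketch below, which is illustrative and not part of the original figure, encodes the table and reports which model leads at each induced error rate and how much accuracy changes between adjacent rates. The names `RATES`, `TURN100_ACC`, `leaders`, and `step_changes` are hypothetical helpers introduced here.

```python
# Approximate Turn 100 accuracy values copied from the table above.
RATES = [0.00, 0.25, 0.50, 0.75, 1.00]
TURN100_ACC = {
    "Deepseek-chat-v3-0324": [1.0, 0.91, 0.73, 0.34, 0.25],
    "Kimi-k2":               [1.0, 0.87, 0.66, 0.39, 0.34],
    "Qwen3-2507-235B":       [0.98, 0.80, 0.52, 0.20, 0.24],
}


def leaders(rate_index):
    """Return every model with the highest accuracy at RATES[rate_index]."""
    best = max(values[rate_index] for values in TURN100_ACC.values())
    return [name for name, values in TURN100_ACC.items()
            if values[rate_index] == best]


def step_changes(model):
    """Accuracy change between each pair of adjacent induced error rates."""
    values = TURN100_ACC[model]
    return [round(after - before, 2) for before, after in zip(values, values[1:])]


if __name__ == "__main__":
    for i, rate in enumerate(RATES):
        print(f"rate {rate:.2f}: leader(s) = {leaders(i)}")
    for name in TURN100_ACC:
        print(f"{name}: step changes = {step_changes(name)}")
```

Returning a list from `leaders` keeps ties visible, which matters at the 0.00 rate where two models share the same reading.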