# Technical Document: Pre-Deployment Decision Framework for LLM Critic-Based Intervention

This document transcribes the data and logic of the "Pre-Deployment Decision Framework" infographic. The framework is a two-step process for deciding, from pilot study results, whether to deploy an LLM critic-based intervention.

---

## 1. Header Information
*   **Main Title:** Pre-Deployment Decision Framework
*   **Subtitle:** For LLM Critic-Based Intervention

---

## 2. Main Framework Flow

### STEP 1: Pilot Study
This section defines how outcomes are categorized when the same pilot tasks are run under a **Baseline** condition and an **Intervention** condition.

#### Outcome Matrix (2x2 Table)
The matrix crosses Baseline results (Rows) with Intervention results (Columns).

| Baseline \ Intervention | Fail | Success |
| :--- | :---: | :---: |
| **Fail** | **A** (Blue) | **C** (Green) |
| **Success** | **B** (Orange) | **D** (Purple) |

#### Key Definitions
*   **C - recovered:** Baseline failed, but succeeded with intervention. (Highlighted in Green)
*   **B - disrupted:** Baseline succeeded, but failed with intervention. (Highlighted in Orange)
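
As a minimal sketch of how the four cells could be tallied from a pilot, assume each task yields one boolean result per condition (`True` for success). The function name `tally_outcomes` and the toy data are hypothetical and are not part of the infographic.

```python
from collections import Counter

def tally_outcomes(baseline, intervention):
    """Count cells A-D of the 2x2 matrix from paired per-task results.

    Both arguments are equal-length sequences of booleans (True = success),
    aligned so that index i refers to the same task in both runs.
    """
    if len(baseline) != len(intervention):
        raise ValueError("baseline and intervention must cover the same tasks")
    cell = {
        (False, False): "A",  # failed under both conditions
        (False, True): "C",   # recovered: baseline failed, intervention succeeded
        (True, False): "B",   # disrupted: baseline succeeded, intervention failed
        (True, True): "D",    # succeeded under both conditions
    }
    counts = Counter(cell[(bool(b), bool(i))] for b, i in zip(baseline, intervention))
    return {name: counts.get(name, 0) for name in "ABCD"}

# Toy data for illustration only (not taken from the infographic).
baseline = [False, False, True, True, False]
intervention = [False, True, False, True, True]
print(tally_outcomes(baseline, intervention))
```

The pairing matters: the matrix is only meaningful if both runs are scored on the same tasks in the same order.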

---

### STEP 2: Compute Rates & Thresholds
This section provides the mathematical formulas required for the decision logic.

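*   **Recovery Rate ($r$):** $r = C / (A + C)$, the fraction of baseline failures that the intervention recovers.
*   **Disruption Rate ($d$):** $d = B / (B + D)$, the fraction of baseline successes that the intervention disrupts.
*   **Threshold ($p^*$):** $p^* = d / (r + d)$, the baseline failure rate at which recoveries and disruptions balance.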
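
The infographic does not show where the threshold comes from. One way to derive it from the definitions above is sketched here, assuming $N = A + B + C + D$ tasks, a baseline failure rate $p = (A + C)/N$, and $r + d > 0$. The symbol $\Delta$ for the net change in success rate is introduced only for this sketch.

$$
\Delta = \frac{C - B}{N} = \frac{A + C}{N} \cdot \frac{C}{A + C} - \frac{B + D}{N} \cdot \frac{B}{B + D} = p\,r - (1 - p)\,d
$$

$$
\Delta > 0 \iff p\,r > (1 - p)\,d \iff p\,(r + d) > d \iff p > \frac{d}{r + d} = p^*
$$

Under these assumptions, the intervention yields a net gain exactly when the baseline failure rate exceeds $p^*$.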

---

### Decision Logic (Flowchart)
The framework uses a central decision node comparing **$p$ vs $p^*$**.
*   **$p$**: Baseline failure rate.
*   **$\epsilon$**: Margin (e.g., 0.05).

| Condition | Decision | Description | Recommendation |
| :--- | :--- | :--- | :--- |
| **$p < p^*$** | **DON'T DEPLOY** | Disruption dominates | Use Best-of-N selection |
| **$|p - p^*| \le \epsilon$** | **INCONCLUSIVE** | Near threshold | Increase pilot size |
| **$p > p^*$** | **DEPLOY** | Recovery dominates | Set `min_step` $\ge$ 2 |
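
The three conditions in the table overlap as written, since a value of $p$ within $\epsilon$ of $p^*$ also satisfies one of the strict inequalities. The sketch below resolves this by checking the near-threshold band first. That ordering is an assumption, not something the infographic states, and the function name `decide` is hypothetical.

```python
def decide(p, r, d, eps=0.05):
    """Return (decision, p_star) for baseline failure rate p.

    r and d are the recovery and disruption rates; eps is the margin.
    Assumption: the inconclusive band |p - p*| <= eps takes precedence
    over the two strict comparisons.
    """
    if r + d == 0:
        raise ValueError("threshold is undefined when r + d == 0")
    p_star = d / (r + d)
    if abs(p - p_star) <= eps:
        return "INCONCLUSIVE", p_star
    if p > p_star:
        return "DEPLOY", p_star
    return "DON'T DEPLOY", p_star

decision, p_star = decide(p=0.89, r=0.12, d=0.56)
print(decision, round(p_star, 3))
```

Returning the threshold alongside the decision makes it easy to report how far $p$ sits from $p^*$ when the result is inconclusive.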

---

## 3. Worked Example (Sidebar)
A practical application of the framework is provided in the right-hand panel.

### Context
*   **Environment/Model:** ALFWorld + Qwen-3-8B
*   **Scope:** 50-task pilot

### Pilot Results
*   **Baseline failure rate ($p$):** 89%
*   **Tasks recovered (C):** 16 (Text in Green)
*   **Tasks disrupted (B):** 9 (Text in Orange)

### Computed Values
*   **Recovery rate ($r$):** 12%
*   **Disruption rate ($d$):** 56%
*   **Threshold ($p^*$):** 82%

### Final Decision
*   **Comparison:** $p$ (89%) > $p^*$ (82%)
*   **Outcome:** **DEPLOY -> +4.7 pp gain** (Highlighted in Green)
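
The reported figures can be checked with a few lines of arithmetic. This is a hedged sketch: it takes the expected net gain to be $p\,r - (1 - p)\,d$, which follows from the rate definitions but is not printed in the infographic. The task totals it back-calculates from the counts are inferred, not reported.

```python
# Rounded rates as printed in the worked example.
p, r, d = 0.89, 0.12, 0.56

p_star = d / (r + d)        # threshold from the Step 2 formula
gain = p * r - (1 - p) * d  # assumed net change in success rate

# Back-calculate group sizes from the reported counts (an inference).
C, B = 16, 9
n_baseline_fail = round(C / r)     # implied A + C
n_baseline_success = round(B / d)  # implied B + D
n_total = n_baseline_fail + n_baseline_success
gain_from_counts = (C - B) / n_total

print(f"p* = {p_star:.1%}")
print(f"gain from rounded rates = {gain * 100:.1f} pp")
print(f"implied task total = {n_total}")
print(f"gain from counts = {gain_from_counts * 100:.1f} pp")
```

With the rounded percentages the gain comes out slightly below the reported +4.7 pp, while the count-based estimate reproduces it. The implied task total is larger than the stated 50-task scope, so treat it as a consistency check rather than a reported figure.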

---

## 4. Component Isolation & Spatial Grounding

*   **Header:** Contains the title and subtitle.
*   **Main Chart (Left/Center):** Contains the two-step process and the decision tree.
    *   The **Matrix** is located at the top left.
    *   The **Formulas** are centered below the matrix.
    *   The **Decision Nodes** are at the bottom, color-coded to match the matrix (Orange for "Don't Deploy", Green for "Deploy").
*   **Worked Example (Right):** A grey-shaded vertical box containing specific data points and a final calculation.
*   **Footer:** Contains variable definitions for $p$ and $\epsilon$.

## 5. Trend and Logic Verification
*   **Logic Check:** The framework suggests that deployment is only favorable when the baseline failure rate ($p$) is high enough to outweigh the risk of the intervention disrupting tasks that the baseline would have otherwise solved.
*   **Visual Consistency:** **Green** consistently marks the favorable path (cell C "recovered", Deploy), while **Orange** marks the unfavorable path (cell B "disrupted", Don't Deploy). The remaining matrix cells use **Blue** (A) and **Purple** (D), and the "Inconclusive" state uses a neutral **Yellow/Tan**.