# Technical Document Extraction: GPT-4 Coherency Scores
## 1. Document Metadata
* **Title:** (a) GPT-4 coherency scores
* **Image Type:** Box and Whisker Plot
* **Language:** English
## 2. Component Isolation
### Header
* **Text:** "(a) GPT-4 coherency scores"
* **Function:** Defines the subject of the data: a comparison of coherency performance across different prompting methods using the GPT-4 model.
### Main Chart Area (Data Visualization)
The chart is a box plot comparing five distinct categories. The Y-axis represents a numerical score, and the X-axis represents the prompting method. A vertical dashed line separates the first three methods from the two "refined" methods.
#### Axis Information
* **Y-Axis (Vertical):** Numerical scale ranging from approximately 3 to 10. Major grid lines and labels are present at **4**, **6**, and **8**.
* **X-Axis (Horizontal):** Categorical labels for prompting methods:
    1. **IO** (Input-Output)
    2. **CoT** (Chain of Thought)
    3. **ToT** (Tree of Thoughts)
    4. **IO +refine**
    5. **ToT +refine**
#### Legend/Color Coding
The chart has no legend box; box color identifies the base method:
* **Blue:** Associated with "IO" and "IO +refine".
* **Orange/Brown:** Associated with "CoT".
* **Green:** Associated with "ToT" and "ToT +refine".
## 3. Data Extraction and Trend Analysis
The values in the following table are visual estimates, read from each element's position relative to the Y-axis grid lines (4, 6, 8).
| Method | Color | Median Score | Interquartile Range (Box) | Whiskers (Min/Max excluding outliers) | Outliers (Diamonds) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **IO** | Blue | ~6.3 | ~5.2 to ~7.4 | ~3.0 to ~8.6 | None visible |
| **CoT** | Orange | ~7.2 | ~6.3 to ~7.8 | ~4.4 to ~8.6 | 1 at ~4.0 |
| **ToT** | Green | ~7.8 | ~7.1 to ~8.4 | ~5.4 to ~9.2 | 4 between ~3.6 and ~4.8 |
| **IO +refine** | Blue | ~8.0 | ~7.3 to ~8.4 | ~5.8 to ~9.5 | 6 between ~4.2 and ~5.7 |
| **ToT +refine** | Green | ~8.0 | ~7.4 to ~8.4 | ~6.2 to ~9.2 | 4 between ~4.8 and ~5.8 |
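For downstream comparison, the table's estimates can be held in a small data structure. The following is a minimal sketch: the `BoxEstimate` class and `ESTIMATES` dictionary are hypothetical names introduced here for illustration, and every number is a visual estimate copied from the table, not source data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxEstimate:
    """Visually estimated box-plot statistics for one prompting method."""
    median: float
    q1: float            # lower edge of the box
    q3: float            # upper edge of the box
    whisker_low: float   # lowest non-outlier value
    whisker_high: float  # highest non-outlier value
    n_outliers: int      # count of diamond markers

    @property
    def iqr(self) -> float:
        # Round to suppress floating-point noise in the one-decimal estimates.
        return round(self.q3 - self.q1, 2)


# Values copied from the table above (all approximate).
ESTIMATES = {
    "IO": BoxEstimate(6.3, 5.2, 7.4, 3.0, 8.6, 0),
    "CoT": BoxEstimate(7.2, 6.3, 7.8, 4.4, 8.6, 1),
    "ToT": BoxEstimate(7.8, 7.1, 8.4, 5.4, 9.2, 4),
    "IO +refine": BoxEstimate(8.0, 7.3, 8.4, 5.8, 9.5, 6),
    "ToT +refine": BoxEstimate(8.0, 7.4, 8.4, 6.2, 9.2, 4),
}

for method, box in ESTIMATES.items():
    print(f"{method:<12} median={box.median:.1f}  IQR width={box.iqr:.1f}")
```

The derived IQR widths narrow from roughly 2.2 (IO) to roughly 1.0 (ToT +refine). They inherit the uncertainty of the visual estimates.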
### Trend Verification
1. **Performance Progression:** The median coherency score rises as the prompting method becomes more complex: IO (~6.3) < CoT (~7.2) < ToT (~7.8).
2. **Impact of Refinement:** Adding a "+refine" step raises the median score for the "IO" method substantially (from ~6.3 to ~8.0, about +1.7). For "ToT", the median rises only slightly (from ~7.8 to ~8.0), but the lower whisker moves up from ~5.4 to ~6.2, suggesting a more consistent performance floor, outliers aside.
3. **Variance and Outliers:** Low-end outliers are more numerous for the higher-scoring methods: none for "IO", 1 for "CoT", 4 for "ToT", 6 for "IO +refine", and 4 for "ToT +refine". The boxes also narrow (IQR width from ~2.2 for "IO" to ~1.0 for "ToT +refine"), so although the median performance is higher and scores cluster more tightly, individual samples still fall well below it.
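One plausible reading of the outlier pattern is that a narrower box pulls the outlier threshold upward, so low scores that would sit inside the whisker of a wide box are drawn as diamonds for a narrow one. The sketch below assumes the chart uses the common Tukey convention, under which points below Q1 − 1.5 × IQR are plotted as outliers. The source does not state which whisker rule was used, so `WHIS = 1.5` is an assumption, and all inputs are the visual estimates from the table.

```python
# (Q1, Q3, lower whisker, outlier count), estimated from the plot.
BOXES = {
    "IO": (5.2, 7.4, 3.0, 0),
    "CoT": (6.3, 7.8, 4.4, 1),
    "ToT": (7.1, 8.4, 5.4, 4),
    "IO +refine": (7.3, 8.4, 5.8, 6),
    "ToT +refine": (7.4, 8.4, 6.2, 4),
}

WHIS = 1.5  # ASSUMPTION: Tukey's 1.5 x IQR rule; not confirmed by the source.


def lower_fence(q1: float, q3: float, whis: float = WHIS) -> float:
    """Return the threshold below which a point is drawn as an outlier."""
    return round(q1 - whis * (q3 - q1), 2)


for method, (q1, q3, whisker_low, n_out) in BOXES.items():
    fence = lower_fence(q1, q3)
    # Under the assumed rule, the whisker end must not fall below the fence.
    consistent = whisker_low >= fence
    print(f"{method:<12} fence={fence:.2f}  whisker_low={whisker_low:.1f}  "
          f"outliers={n_out}  consistent={consistent}")
```

Under this assumption the lower fence climbs from about 1.9 for IO to about 5.9 for ToT +refine, and every estimated lower whisker sits at or above its fence. This is consistent with the 1.5 × IQR rule but does not prove it was used.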
## 4. Structural Summary
The diagram indicates that **Tree of Thoughts (ToT)** and **Refinement (+refine)** techniques yield higher median coherency scores in GPT-4 than standard **Input-Output (IO)** or **Chain of Thought (CoT)** prompting. Refinement helps "IO" far more than "ToT": the "IO +refine" median rises by about 1.7 points over "IO", against about 0.2 for "ToT +refine" over "ToT", which brings both refined methods to a comparable median of ~8.0.
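The comparison between each base method and its refined counterpart reduces to a difference of medians. The following is a minimal sketch using the estimated medians from the table; `MEDIANS` and `refine_gain` are hypothetical names chosen for illustration, and the results are only as precise as the visual estimates.

```python
# Estimated medians read from the plot (approximate).
MEDIANS = {
    "IO": 6.3,
    "CoT": 7.2,
    "ToT": 7.8,
    "IO +refine": 8.0,
    "ToT +refine": 8.0,
}


def refine_gain(base: str, medians: dict = MEDIANS) -> float:
    """Median of the '+refine' variant minus the median of its base method."""
    return round(medians[f"{base} +refine"] - medians[base], 1)


# Only IO and ToT have a refined variant in the chart.
for base in ("IO", "ToT"):
    print(f"{base}: estimated median gain from refinement = {refine_gain(base):+.1f}")
```

Because the inputs are read off a chart to one decimal place, a gap as small as the one estimated for ToT is within reading error and should not be over-interpreted.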