## Scatter Plots: ProofWriter - Llama3.1-8B and Llama3.1-70B
### Overview
Two scatter plots compare **JS Divergence** (y-axis) and **Entropy** (x-axis) for correct (blue) and incorrect (red) outputs from the ProofWriter system using Llama3.1-8B and Llama3.1-70B models. Key annotations include a "Lowest Entropy (Incorrect)" label and a purple dashed line at JS Divergence = 0.07.
---
### Components/Axes
1. **Top Plot (Llama3.1-8B)**:
- **X-axis (Entropy)**: Ranges from 0 to ~40.
- **Y-axis (JS Divergence)**: Ranges from 0 to ~0.12.
- **Legend**: Blue = Correct (T), Red = Incorrect (F).
- **Annotations**:
- "Lowest Entropy (Incorrect)" at (Entropy ≈ 8, JS Divergence ≈ 0.07).
- Purple dashed line at JS Divergence = 0.07.
- Green dashed vertical line at Entropy = 10.
2. **Bottom Plot (Llama3.1-70B)**:
- **X-axis (Entropy)**: Ranges from 0 to ~70.
- **Y-axis (JS Divergence)**: Ranges from 0 to ~0.14.
- **Legend**: Same as above.
- **Annotations**:
- "Lowest Entropy (Incorrect)" at (Entropy ≈ 10, JS Divergence ≈ 0.08).
- Purple dashed line at JS Divergence = 0.07.
- Green dashed vertical line at Entropy = 10.
---
### Detailed Analysis
#### Llama3.1-8B (Top Plot)
- **Data Trends**:
- Correct (blue) points cluster tightly around low Entropy (0–15) and low JS Divergence (0–0.06).
- Incorrect (red) points spread widely, with higher Entropy (15–40) and JS Divergence (0.06–0.12).
- The "Lowest Entropy (Incorrect)" point (Entropy ≈ 8, JS Divergence ≈ 0.07) lies just above the purple threshold line.
- **Key Features**:
- Vertical green line at Entropy = 10 separates clusters of correct/incorrect points.
- Most incorrect points exceed JS Divergence = 0.07.
#### Llama3.1-70B (Bottom Plot)
- **Data Trends**:
- Correct (blue) points dominate low Entropy (0–25) and low JS Divergence (0–0.10).
- Incorrect (red) points extend to higher Entropy (25–70) and JS Divergence (0.08–0.14).
- The "Lowest Entropy (Incorrect)" point (Entropy ≈ 10, JS Divergence ≈ 0.08) lies slightly above the purple threshold.
- **Key Features**:
- Vertical green line at Entropy = 10 is less distinct; incorrect points appear beyond this boundary.
- JS Divergence threshold (0.07) is crossed by many incorrect points, even at low Entropy.
---
### Key Observations
1. **Model Size Correlation**:
- Llama3.1-70B exhibits higher Entropy and JS Divergence values overall compared to 8B.
- Incorrect answers in 70B are more dispersed across the plot, suggesting greater uncertainty in outputs.
2. **Threshold Behavior**:
- The JS Divergence = 0.07 line (purple) acts as a rough boundary for correctness in 8B but is less effective for 70B, where incorrect points frequently exceed this threshold.
3. **Entropy Threshold**:
- The green line at Entropy = 10 marks a transition point in 8B, where incorrect answers become more prevalent. In 70B, incorrect answers appear even at lower Entropy values.
---
### Interpretation
- **Model Performance**:
- Larger models (70B) generate outputs with higher uncertainty (Entropy) and divergence from ground truth (JS Divergence), potentially indicating overconfidence or complexity in handling tasks.
- The "Lowest Entropy (Incorrect)" points suggest that some errors occur even in low-uncertainty regions, possibly due to model biases or edge-case failures.
- **Threshold Implications**:
- The JS Divergence = 0.07 threshold may not reliably distinguish correctness in larger models, as incorrect answers frequently exceed this value.
- The Entropy = 10 boundary highlights a critical region where model performance degrades, particularly for the 8B variant.
- **Design Considerations**:
- The plots emphasize trade-offs between model size, uncertainty quantification, and error rates. Larger models may require refined calibration to balance accuracy and confidence.