## Scatter Plots: Entropy vs. JS Divergence for ProofWriter with Llama3.1 Models
### Overview
The image contains two scatter plots comparing the relationship between Entropy and JS Divergence for the ProofWriter dataset using Llama3.1-8B (top) and Llama3.1-70B (bottom) models. Each plot distinguishes between correct (T) and incorrect (F) predictions using blue and red data points, respectively. The plots also include visual aids such as a vertical green line indicating a low entropy threshold and a horizontal purple dashed line.
### Components/Axes
**Top Plot (ProofWriter - Llama3.1-8B):**
* **Title:** ProofWriter - Llama3.1-8B
* **X-axis:** Entropy, with ticks at 0, 10, 20, 30, and 40.
* **Y-axis:** JS Divergence, with ticks at 0.04, 0.06, 0.08, 0.10, and 0.12.
* **Legend (top-right):**
* Blue: Correct (T)
* Red: Incorrect (F)
* **Visual Aids:**
* Vertical green line at approximately x = 2.
* Horizontal purple dashed line at approximately y = 0.07.
* **Annotation:** "Lowest Entropy (Incorrect)" with an arrow pointing to the cluster of points near the vertical green line.
**Bottom Plot (ProofWriter - Llama3.1-70B):**
* **Title:** ProofWriter - Llama3.1-70B
* **X-axis:** Entropy, with ticks at 0, 10, 20, 30, 40, 50, 60, and 70.
* **Y-axis:** JS Divergence, with ticks at 0.04, 0.06, 0.08, 0.10, 0.12, and 0.14.
* **Legend (top-right):**
* Blue: Correct (T)
* Red: Incorrect (F)
* **Visual Aids:**
* Vertical green line at approximately x = 8.
* Horizontal purple dashed line at approximately y = 0.08.
* **Annotation:** "Lowest Entropy (Incorrect)" with an arrow pointing to the cluster of points near the vertical green line.
### Detailed Analysis
**Top Plot (Llama3.1-8B):**
* **Correct (T) - Blue:** The blue points are concentrated at lower entropy values (0-20) and JS Divergence values ranging from 0.04 to 0.10. The density of blue points decreases as entropy increases.
* **Incorrect (F) - Red:** The red points are more scattered across the entropy range (0-45), with JS Divergence values ranging from 0.04 to 0.12. There is a cluster of red points at higher entropy values (25-45).
* **Trend:** As entropy increases, the proportion of incorrect (red) predictions appears to increase.
**Bottom Plot (Llama3.1-70B):**
* **Correct (T) - Blue:** The blue points are concentrated at lower entropy values (0-20) and JS Divergence values ranging from 0.04 to 0.12. The density of blue points decreases as entropy increases.
* **Incorrect (F) - Red:** The red points are more scattered across the entropy range (0-70), with JS Divergence values ranging from 0.06 to 0.14. There is a cluster of red points at higher entropy values (30-70).
* **Trend:** As entropy increases, the proportion of incorrect (red) predictions appears to increase.
### Key Observations
* Both models show a trend where lower entropy values are associated with correct predictions, while higher entropy values are associated with incorrect predictions.
* The Llama3.1-70B model has a wider range of entropy values compared to the Llama3.1-8B model.
* The annotation "Lowest Entropy (Incorrect)" highlights that even at low entropy, there are still incorrect predictions.
* The vertical green line seems to indicate a threshold below which the model is expected to perform well, but this is not always the case.
* The horizontal purple dashed line appears to represent an average JS Divergence value.
### Interpretation
The scatter plots suggest that entropy is a useful, but imperfect, indicator of prediction accuracy for these models on the ProofWriter dataset. Lower entropy generally correlates with correct predictions, but there are still instances of incorrect predictions at low entropy. As entropy increases, the likelihood of incorrect predictions also increases. The Llama3.1-70B model, with its wider entropy range, may be exhibiting more diverse or complex behaviors compared to the Llama3.1-8B model. The presence of incorrect predictions even at low entropy suggests that other factors beyond entropy are influencing the model's accuracy. The plots highlight the importance of considering entropy as one of several metrics when evaluating the performance of language models.