## Bar Chart: Evaluation on Verification and Correction (Base Model: Qwen2-7B-Instruct)
### Overview
This image contains two bar charts side-by-side, presenting evaluation metrics for a base model named "Qwen2-7B-Instruct". The left chart displays "Self-verification Metrics", and the right chart displays "Self-correction Metrics". Both charts compare three different configurations: "SFT", "SFT + Process-level RL", and "SFT + Outcome-level RL". The y-axis for both charts represents "Value (%)".
### Components/Axes
**Overall Title:** Evaluation on Verification and Correction (Base Model: Qwen2-7B-Instruct)
**Left Chart: Self-verification Metrics**
* **Title:** Self-verification Metrics
* **Y-axis Title:** Value (%)
* **Y-axis Scale:** 50 to 100, with major ticks at 50, 60, 70, 80, 90, 100.
* **X-axis Categories:** Verification Accuracy, Error Recall, Correct Precision.
* **Legend:** Located in the top-left quadrant of the left chart.
* **SFT:** Represented by a light grey rectangle.
* **SFT + Process-level RL:** Represented by a teal/mint green rectangle.
* **SFT + Outcome-level RL:** Represented by a coral/light orange rectangle.
**Right Chart: Self-correction Metrics**
* **Title:** Self-correction Metrics
* **Y-axis Title:** Value (%)
* **Y-axis Scale:** 0 to 25, with major ticks at 0, 5, 10, 15, 20, 25.
* **X-axis Categories:** Incorrect to Correct, Correct to Incorrect.
* **Legend:** The legend from the left chart is applicable to both charts.
### Detailed Analysis
**Left Chart: Self-verification Metrics**
* **Verification Accuracy:**
* SFT (Grey): 58.31%
* SFT + Process-level RL (Teal): 67.86%
* SFT + Outcome-level RL (Coral): 63.93%
* **Trend:** SFT + Process-level RL shows the highest Verification Accuracy, followed by SFT + Outcome-level RL, and then SFT.
* **Error Recall:**
* SFT (Grey): 81.91%
* SFT + Process-level RL (Teal): 86.67%
* SFT + Outcome-level RL (Coral): 87.34%
* **Trend:** SFT + Outcome-level RL shows the highest Error Recall, closely followed by SFT + Process-level RL, and then SFT.
* **Correct Precision:**
* SFT (Grey): 65.58%
* SFT + Process-level RL (Teal): 73.59%
* SFT + Outcome-level RL (Coral): 69.80%
* **Trend:** SFT + Process-level RL shows the highest Correct Precision, followed by SFT + Outcome-level RL, and then SFT.
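The figure does not define these metrics, but under one plausible reading they can be computed from paired labels: whether each initial prediction was actually correct, and whether the model's self-verification judged it correct. The sketch below uses these assumed definitions; the function name `verification_metrics` and the exact denominators are illustrative assumptions, not taken from the figure.

```python
# Hypothetical sketch: one plausible set of definitions for the
# self-verification metrics (the figure itself does not define them).
# `true` marks whether each initial prediction was actually correct;
# `judged` marks whether the verifier judged it correct.

def verification_metrics(true, judged):
    """Compute assumed verification metrics from parallel boolean lists."""
    pairs = list(zip(true, judged))
    # Verification Accuracy: fraction of judgments that match reality.
    accuracy = sum(t == j for t, j in pairs) / len(pairs)
    # Error Recall: fraction of actually-wrong predictions flagged as wrong.
    errors = [(t, j) for t, j in pairs if not t]
    error_recall = sum(not j for t, j in errors) / len(errors)
    # Correct Precision: fraction of "judged correct" cases that truly are correct.
    judged_correct = [(t, j) for t, j in pairs if j]
    correct_precision = sum(t for t, j in judged_correct) / len(judged_correct)
    return accuracy, error_recall, correct_precision

# Tiny worked example
true = [True, True, False, False, True]
judged = [True, False, False, True, True]
acc, rec, prec = verification_metrics(true, judged)
# acc = 3/5, rec = 1/2, prec = 2/3
```

Under these definitions, a verifier can trade Error Recall against Correct Precision by being more or less willing to flag predictions as wrong, which is consistent with outcome-level RL leading on recall while process-level RL leads on precision.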
**Right Chart: Self-correction Metrics**
* **Incorrect to Correct:**
* SFT (Grey): 20.00%
* SFT + Process-level RL (Teal): 22.17%
* SFT + Outcome-level RL (Coral): 19.55%
* **Trend:** SFT + Process-level RL shows the highest rate of correcting incorrect predictions, followed by SFT, and then SFT + Outcome-level RL.
* **Correct to Incorrect:**
* SFT (Grey): 8.42%
* SFT + Process-level RL (Teal): 5.39%
* SFT + Outcome-level RL (Coral): 3.93%
* **Trend:** SFT shows the highest rate of changing correct predictions to incorrect ones, SFT + Process-level RL falls in between, and SFT + Outcome-level RL shows the lowest rate.
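Again assuming definitions the figure does not state, these two rates can be read as transition rates between a model's answer correctness before and after a self-correction pass. In this sketch both rates are normalized by the total number of examples; normalizing instead by the size of each subset (initially-incorrect or initially-correct answers) is an equally plausible convention.

```python
# Hypothetical sketch: assumed definitions of the self-correction rates
# (not stated in the figure). `before` and `after` mark the correctness
# of each answer before and after the self-correction pass.

def correction_rates(before, after):
    """Return (incorrect-to-correct, correct-to-incorrect) transition rates."""
    n = len(before)
    # Incorrect to Correct: initially wrong answers fixed by correction.
    i2c = sum((not b) and a for b, a in zip(before, after)) / n
    # Correct to Incorrect: initially right answers broken by correction.
    c2i = sum(b and (not a) for b, a in zip(before, after)) / n
    return i2c, c2i

# Tiny worked example: one answer fixed, one answer broken, out of five.
before = [False, False, True, True, True]
after = [True, False, True, False, True]
i2c, c2i = correction_rates(before, after)
# i2c = 0.2, c2i = 0.2
```

The difference `i2c - c2i` gives the net accuracy change from self-correction, which is one simple way to compare the trade-off between the two RL configurations discussed below.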
### Key Observations
* **Self-verification:** The "SFT + Process-level RL" configuration generally performs best across "Verification Accuracy" and "Correct Precision". "SFT + Outcome-level RL" performs best for "Error Recall". All RL-enhanced configurations ("SFT + Process-level RL" and "SFT + Outcome-level RL") outperform the base "SFT" model in all self-verification metrics.
* **Self-correction:** For "Incorrect to Correct", "SFT + Process-level RL" is the best. For "Correct to Incorrect", "SFT + Outcome-level RL" is the best, indicating it is least likely to make a correct prediction incorrect.
* **Trade-offs:** There appears to be a trade-off between "Incorrect to Correct" and "Correct to Incorrect" rates. While "SFT + Process-level RL" excels at correcting errors, it also has a higher rate of making correct predictions incorrect compared to "SFT + Outcome-level RL". Conversely, "SFT + Outcome-level RL" is better at preserving correct predictions but is slightly less effective at correcting incorrect ones compared to "SFT + Process-level RL".
### Interpretation
The data suggests that applying reinforcement learning (RL), whether process-level or outcome-level, on top of the "SFT" baseline substantially improves the Qwen2-7B-Instruct model's self-verification capabilities, and, for the most part, its self-correction capabilities as well (the one exception being that "SFT + Outcome-level RL" has a slightly lower "Incorrect to Correct" rate than plain "SFT").
The "Self-verification Metrics" indicate that RL enhancements yield better accuracy in verifying predictions, higher recall of errors, and greater precision in identifying correct predictions. The "SFT + Process-level RL" configuration appears strongest overall for self-verification, particularly in accuracy and precision.
The "Self-correction Metrics" reveal nuanced performance. "SFT + Process-level RL" is most effective at turning incorrect predictions into correct ones. However, "SFT + Outcome-level RL" demonstrates a superior ability to avoid degrading correct predictions into incorrect ones. This suggests that "Outcome-level RL" might be more conservative or robust in maintaining correctness, while "Process-level RL" might be more aggressive in error correction, potentially at the cost of introducing new errors.
In essence, the choice between "SFT + Process-level RL" and "SFT + Outcome-level RL" might depend on the specific priorities of the application. If the primary goal is to maximize the correction of errors, "SFT + Process-level RL" is favored. If the priority is to minimize the degradation of correct predictions, "SFT + Outcome-level RL" is the better choice. Both RL approaches offer substantial improvements over the baseline "SFT" model.