## Diagram: Two-Phase Paraphrasing and Consistency Verification Process for AI Answer Validation
### Overview
The image is a technical flowchart illustrating a two-phase method for verifying the correctness of an AI-generated answer to a math word problem. The process involves paraphrasing the original question, generating an initial response, identifying a "critical token" within that response, and then using consistency checks across multiple generated trajectories to determine a final, verified answer. The diagram uses a combination of text boxes, arrows, icons, and color-coding (red for incorrect, green for correct) to depict the workflow.
### Components/Axes
The diagram is divided into two main vertical sections:
1. **Left Section: Phase I: Paraphrastic Probing**
* Contains Steps 1 through 4.
* Uses a light blue background for the phase header.
* Includes example text boxes with a math problem about a bakery.
* Features a red "X" icon indicating an incorrect answer in Step 2.
* Includes a mathematical notation box in Step 3.
* Features a red warning triangle icon in Step 4.
2. **Right Section: Phase II: Consistency Verification**
* Contains Steps 1 through 4.
* Uses a light blue background for the phase header.
* Includes multiple example text boxes showing different answer trajectories.
* Features a green checkmark icon indicating the correct final answer in Step 4.
### Detailed Analysis
**Phase I: Paraphrastic Probing**
* **Step 1: Paraphrase the original question.**
* Original Question: "A bakery produces 60 loaves of bread each day... How many loaves of bread are sold in the afternoon?"
* Paraphrased Question: "In a baking bakery, daily production meets the demand for 60 freshly baked loaves. What is the number of loaves sold in the afternoon?"
* **Step 2: Generate the initial response.**
* The AI's initial response to the original question is shown: "To solve this problem, we will break it down into steps. Step 1: Calculate the number of loaves sold in the morning. Therefore, the number of loaves of bread sold in the afternoon is 5."
* This response is marked with a red **X**, indicating it is incorrect.
* **Step 3: Concatenate the paraphrased question with the initial answer, and compute the probabilities for top-k tokens and expected tokens at each position.**
* The concatenated input is shown: "[Paraphrased Question]... [Initial Answer]..."
* A box displays token probability calculations:
* `Top-k tokens: p(morning)=0.85, p(5)=0.32, p(10)=0.71, p(number)=0.25, p(sold)=0.85, p(in)=0.22`
* `Expected tokens: p(morning)=0.85, p(5)=0.32, p(10)=0.71, p(number)=0.25, p(sold)=0.85, p(in)=0.22`
* **Step 4: Identify the critical token with the verifier Δ.**
* A calculation is shown: `Δ(morning)=|p(morning)-p'(morning)|=0.85-0.25=0.60, Δ(5)=0.32-0.22=0.10, Δ(10)=0.71-0.48=0.23`
* The conclusion is stated: "The token 'in' is the chosen critical token."
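The Δ computation in Step 4 can be sketched as follows. This is a minimal illustration, not the diagram's implementation: the function name `probability_gaps` is hypothetical, and the diagram does not say which distribution `p` and `p'` each refer to. The diagram also shows no Δ value for "in" and no selection rule, so the sketch only computes the gaps.

```python
def probability_gaps(p, p_prime):
    """Return Δ(token) = |p(token) - p'(token)| for tokens present in both maps."""
    return {tok: abs(p[tok] - p_prime[tok]) for tok in p if tok in p_prime}


# Values transcribed from the Step 4 box. Which side is p and which is p'
# is an assumption; the diagram does not label them further.
p = {"morning": 0.85, "5": 0.32, "10": 0.71}
p_prime = {"morning": 0.25, "5": 0.22, "10": 0.48}

gaps = probability_gaps(p, p_prime)
for token, gap in gaps.items():
    print(f"Δ({token}) = {gap:.2f}")
```

Rounded to two decimals, the gaps match the values shown in the box (0.60, 0.10, 0.23).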
**Phase II: Consistency Verification**
* **Step 1: Obtain the candidate tokens at the critical token position.**
* The system looks at the position of the critical token ("in") in the initial response.
* **Step 2: Truncate the initial answer and replace the critical token with the candidate tokens.**
* Two new prompt versions are created by truncating the initial answer before the critical token and inserting candidate tokens.
* Prompt A ends with "...sold in **during** the afternoon."
* Prompt B ends with "...sold in **the** afternoon."
* **Step 3: Generate new trajectories from partial answers to both the original and paraphrased questions, and the same procedure is also applied to the initial answer.**
* Four example trajectories are shown in a 2x2 grid:
* **Top-Left (Original Question, Prompt A):** Generates an answer ending with "...sold in the afternoon is 5."
* **Top-Right (Paraphrased Question, Prompt A):** Generates an answer ending with "...sold in the afternoon is 5."
* **Bottom-Left (Original Question, Prompt B):** Generates an answer ending with "...sold in the afternoon is 10."
* **Bottom-Right (Paraphrased Question, Prompt B):** Generates an answer ending with "...sold in the afternoon is 10."
* **Step 4: Determine the final answer with consistency mechanism.**
* The final box states: "The answers derived from the second input are more consistent than that from the first input. Thus, the final answer is 10."
* This conclusion is marked with a green **checkmark**.
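Steps 1 to 3 of Phase II (truncate, substitute, and pair with each question) can be sketched as below. Everything here is an assumption for illustration: the helper names, the whitespace-style token list, and the placeholder tokens `w0`, `w1`, `A`, `B` are hypothetical and are not taken from the diagram.

```python
from itertools import product


def build_partial_answers(answer_tokens, critical_index, candidates):
    """Cut the answer just before the critical token, then append each candidate."""
    prefix = answer_tokens[:critical_index]
    return [" ".join(prefix + [cand]) for cand in candidates]


def build_prompts(questions, partial_answers):
    """Pair every question with every partial answer (the 2x2 grid of inputs)."""
    return list(product(questions, partial_answers))


# Hypothetical placeholder data: "CRIT" marks the critical token position.
answer_tokens = ["w0", "w1", "CRIT", "w3", "w4"]
critical_index = answer_tokens.index("CRIT")

partials = build_partial_answers(answer_tokens, critical_index, ["A", "B"])
prompts = build_prompts(["original question", "paraphrased question"], partials)

for question, partial in prompts:
    print(f"{question!r} + {partial!r}")
```

Each resulting pair would then be handed to the model to continue generating; that generation step is left out because the diagram does not specify it.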
### Key Observations
1. **Process Flow:** The diagram shows a sequential, two-phase pipeline. Phase I identifies a point of uncertainty (the critical token "in") in an initial, incorrect answer. Phase II substitutes candidate tokens at that position and regenerates the answer to find the more consistent result.
2. **Color Coding:** Red marks the incorrect initial answer (Step 2, Phase I) and the warning triangle in Step 4 of Phase I. Green marks the correct final answer (Step 4, Phase II).
3. **Spatial Layout:** The two phases are presented side-by-side for comparison. Within Phase II, the four generated trajectories are arranged in a 2x2 grid, grouping results by the question type (original vs. paraphrased) and the inserted candidate token ("during" vs. "the").
4. **Mathematical Notation:** Step 3 of Phase I includes explicit probability calculations, showing the model's confidence in different tokens at a specific position.
5. **Consistency as a Metric:** The core insight is that the answer "10" is chosen not because it is generated once, but because it appears consistently across multiple generated paths (both question types) when the critical token is replaced with "the".
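One way to make the consistency criterion concrete is to score each candidate token by the share of its trajectories that agree on the most common answer. The diagram does not define the scoring rule, so this is an assumed metric, and the function names and trajectory answers below are hypothetical rather than transcribed from the figure.

```python
from collections import Counter


def consistency(answers):
    """Fraction of trajectories that agree with the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def select_final_answer(trajectories):
    """Pick the majority answer of the candidate with the highest consistency.

    `trajectories` maps a candidate token to the final answers of the
    trajectories generated from it. Ties go to the first candidate listed.
    """
    best = max(trajectories, key=lambda cand: consistency(trajectories[cand]))
    return Counter(trajectories[best]).most_common(1)[0][0]


# Hypothetical answers, chosen only to illustrate the scoring.
trajectories = {
    "during": ["5", "7", "5", "8"],
    "the": ["10", "10", "10", "10"],
}

final_answer = select_final_answer(trajectories)
print(final_answer)  # prints 10
```

With this rule, a candidate whose trajectories all agree scores 1.0, while one whose trajectories scatter scores lower, which mirrors the diagram's statement that the more consistent input determines the final answer.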
### Interpretation
This diagram illustrates a method for improving the reliability of AI-generated answers to multi-step reasoning tasks such as math word problems. The process assumes that an initial AI response may be flawed. Instead of accepting it at face value, the system:
1. **Probes for Uncertainty:** It paraphrases the question and analyzes the model's token probabilities to find a "critical token", a point where the model's confidence is low or where a small change could alter the answer's meaning (e.g., changing "in" to "during" or "the").
2. **Tests for Consistency:** It then uses this critical token as a lever to generate multiple alternative answer paths. The key hypothesis is that a *correct* answer will be robust and appear consistently across these varied paths, while an *incorrect* answer will be fragile and lead to inconsistent results.
3. **Makes a Data-Driven Decision:** The final answer is selected based on which candidate ("5" or "10") demonstrates greater consistency across the generated trajectories. In this case, "10" is the consistent outcome.
The underlying principle is that **truth is consistent, while error is often arbitrary**. By systematically testing how small perturbations at a point of uncertainty affect the final output, the system can distinguish between a reliably correct answer and a coincidentally generated incorrect one. This method moves beyond simple single-pass generation towards a form of internal verification and consensus-building within the model's own outputs.