## Diagram: Two-Stage Training Process for Multimodal Large Language Model (MLLM) Image Authenticity Detection
### Overview
The image is a technical flowchart illustrating a two-stage training methodology for a Multimodal Large Language Model (MLLM) designed to determine whether a given image is real or synthetic. The process is divided into "Stage 1: CoE Tuning" and "Stage 2: R-GRPO," showing the flow of data, model processing, and reward mechanisms.
### Components/Axes
The diagram is split into two primary panels by a vertical dashed line.
**Left Panel: Stage 1: CoE Tuning**
* **Input Prompt:** A text box at the top reads: "Please help me determine whether this image is real or synthetic?"
* **Input Image:** A photograph of a small bird (appears to be a sparrow or similar species) is shown to the right of the prompt.
* **Visual Tokens:** Below the prompt and image are two rows of colored squares, likely representing visual embeddings or tokens.
* Top row: 5 orange squares.
* Bottom row: 5 yellow squares.
* **Model:** A rounded rectangle labeled "MLLM" with a flame icon (🔥) on its left side. Arrows from the prompt, image, and tokens point into this box.
* **Model Output:** A large text box below the MLLM contains a structured response:
* A `<think> ... </think>` block containing the reasoning chain; the phrase "synthetic traces" within it is highlighted in blue text.
* An `<answer> 1 </answer>` tag; the number "1" is highlighted in orange text.
* **Loss Functions:** Two arrows point downward from the output box:
* A blue arrow labeled `L_think` originates from the `<think>` block.
* An orange arrow labeled `L_answer` originates from the `<answer>` tag.
**Right Panel: Stage 2: R-GRPO**
* **Completions:** The MLLM generates multiple candidate responses, labeled Completion 1 through Completion G, for the same input.
* **Think reward:**
* Input Text: The `<think>` block from a completion.
* Evaluation: A robot icon evaluates similarity against a reference reasoning chain. A decision diamond checks for "match", "similar", or "mismatch".
* Output: Green box "R=1" for match, yellow box "R=0.5" for similar, red box "R=0" for mismatch.
* **Answer reward:** A binary check of the `<answer>` value against the ground-truth label, yielding "R=1" for correct and "R=0" for incorrect.
* **Multi-view alignment reward:**
* **Match Scenario (Green Checkmark):**
* Input Text: A `<think>` block in which the phrases "structural irregularities" and "high-frequency artifacts" are in blue text.
* Input Images: Four small thumbnail images showing different views/processing of the bird's eye (original, zoomed, possibly filtered).
* Evaluation: A robot icon assesses alignment between the textual description and the visual evidence across views.
* Output: Green box "R=1".
* **Mismatch Scenario (Red X):**
* Input Text: A `<think>` block in which the phrase "appears natural" is in red text and "artifacts" is in blue text.
* Evaluation: The same robot icon assesses alignment.
* Output: Red box "R=0".
* **Feedback Loop:** A green arrow loops from the reward outputs back to the MLLM in the input stage, indicating a reinforcement learning update.
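The reward boxes above can be summarized as a simple lookup. This is an illustrative sketch only: the function names and the upstream three-way verdict are assumptions; the figure shows only the resulting scalar rewards.

```python
def think_reward(verdict: str) -> float:
    """Map the evaluator's verdict to the partial-credit rewards
    shown in the diagram (R=1 match, R=0.5 similar, R=0 mismatch)."""
    return {"match": 1.0, "similar": 0.5, "mismatch": 0.0}[verdict]

def alignment_reward(grounded: bool) -> float:
    """Multi-view alignment reward: 1 when the textual claims are
    consistent with the visual evidence across views, else 0."""
    return 1.0 if grounded else 0.0
```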
### Detailed Analysis
The diagram outlines a two-stage training pipeline.
**Stage 1 (CoE Tuning):** This stage focuses on teaching the MLLM to produce a structured "Chain of Evidence" (CoE) reasoning process (`<think>` tag) before giving a final binary classification (`<answer>` tag, where 1 likely means "synthetic"). The separate loss functions (`L_think` and `L_answer`) suggest the model is trained to optimize both the quality of its reasoning and the accuracy of its final answer.
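The two-loss objective described above can be sketched as a masked, weighted sum over token losses. This is a hypothetical reconstruction: the function name, the per-span averaging, and the weighting scheme are assumptions not stated in the figure.

```python
def coe_tuning_loss(token_nll, think_mask, answer_mask,
                    w_think=1.0, w_answer=1.0):
    """Sketch of the Stage 1 objective: a weighted sum of the mean
    negative log-likelihood over the <think> span (L_think) and over
    the <answer> span (L_answer). token_nll[i] is the NLL of target
    token i; the 0/1 masks select each span."""
    l_think = sum(n for n, m in zip(token_nll, think_mask) if m) / max(1, sum(think_mask))
    l_answer = sum(n for n, m in zip(token_nll, answer_mask) if m) / max(1, sum(answer_mask))
    return w_think * l_think + w_answer * l_answer
```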
**Stage 2 (R-GRPO):** This stage employs a reinforcement learning technique, likely a variant of Group Relative Policy Optimization (GRPO). It generates multiple candidate responses (Completions 1 to G) for a given input. Each completion is then scored by three complementary reward signals:
1. **Answer Reward:** A simple binary check for factual correctness of the final answer.
2. **Think Reward:** Evaluates the quality and similarity of the reasoning chain against a reference or ideal reasoning path, allowing for partial credit (R=0.5).
3. **Multi-view Alignment Reward:** This is the most complex component. It verifies if the model's textual reasoning (e.g., "eyeball shows structural irregularities") is grounded in and consistent with visual evidence from multiple processed views of the image (e.g., zoomed, high-pass filtered). A mismatch between the textual claim and the visual evidence results in a zero reward.
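Under a GRPO-style scheme, the three reward signals would be combined per completion and then normalized within the group of G completions. The plain-sum aggregation below is an assumption (the figure does not state how the rewards are combined); the group-relative normalization is the standard GRPO advantage estimate.

```python
import statistics

def total_reward(r_answer: float, r_think: float, r_align: float) -> float:
    """Assumed aggregation of the three per-completion reward signals."""
    return r_answer + r_think + r_align

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each completion's reward by the
    mean and standard deviation of its group of G completions."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mu) / sigma for r in rewards]
```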
### Key Observations
* **Structured Output Mandate:** The model is explicitly trained to separate its reasoning (`<think>`) from its conclusion (`<answer>`).
* **Multi-Faceted Evaluation:** The system doesn't just check if the answer is right; it scrutinizes *how* the model arrived at the answer, rewarding coherent, evidence-based reasoning.
* **Visual Grounding is Critical:** The "Multi-view alignment reward" is a key innovation. It forces the model's textual reasoning to be verifiable against visual data, combating hallucination. The example shows that claiming an eyeball looks "natural" while visual filters show artifacts leads to a penalty.
* **Color Coding for Clarity:** The diagram uses consistent color coding: green for correct/match (R=1), yellow/orange for partial credit or components (R=0.5, tokens, answer), and red for incorrect/mismatch (R=0). Blue text highlights key evidence phrases in the reasoning.
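The structured-output mandate implies the training pipeline must parse these tags from raw model text. The sketch below is an assumption about how that parsing might look; the figure does not show the extraction code, and the convention that a malformed output receives no credit is inferred, not stated.

```python
import re

def parse_response(text: str):
    """Extract the structured fields the model is trained to emit:
    the <think> reasoning and the binary <answer> (1 likely meaning
    "synthetic", per the figure)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*([01])\s*</answer>", text)
    if not (think and answer):
        return None  # malformed output: assumed to earn no reward
    return think.group(1).strip(), int(answer.group(1))
```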
### Interpretation
This diagram describes a sophisticated training framework aimed at creating a more reliable and interpretable AI for detecting synthetic media. The core innovation lies in moving beyond simple answer-based training.
The **CoE Tuning** stage instills a habit of explicit, step-by-step reasoning. The **R-GRPO** stage then refines this behavior using reinforcement learning with a multi-dimensional reward signal. The most significant aspect is the **Multi-view Alignment Reward**, which directly addresses a major weakness of large language models: the potential for generating plausible-sounding but visually ungrounded text. By requiring the model's described evidence ("structural irregularities") to align with what can be seen in different image views, the system encourages the development of genuine visual understanding rather than pattern-matching on text alone.
The process suggests that for high-stakes tasks like authenticity detection, it is insufficient for an AI to simply be accurate. It must also be *explainable* in a way that is *verifiable* against the source data. This framework aims to produce models whose reasoning can be audited and trusted because it is tied to observable visual features.