# REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection
**Authors**: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu and Fei Wu
> Huangsen Cao, Qin Mei, Zhiheng Li, Zhimeng Zhang, Yongwei Wang, and Fei Wu are with Zhejiang University. Yuxi Li, Ying Zhang, Chen Li, and Jing Lyu are with WeChat Vision, Tencent Inc. Xin Ding is with Nanjing University of Information Science and Technology.
## Abstract
With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection; it is explicitly structured around a chain of evidence derived from multiple lightweight expert models and records step-by-step reasoning traces with evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence AnaLysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning scheme. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and cross-model generalization, setting a new state of the art for explainable image forensics.
Index Terms: AI-generated image detection, Explainable AI, Forensic reasoning.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Overview
The image depicts a three-stage workflow for determining whether an image is real or synthetic. The process combines multi-perspective analysis with expert-informed reasoning to reach a conclusion.
---
## Stage 1: Accept the User's Instructions and Analyze the Image
**User Query**:
"Please help me determine whether this image is real or synthetic?...providing the reasoning conclusion."
*(Accompanied by a small image of a white dog)*
**System Response**:
"I understand the user's need. I will analyze and detect this image from eight different perspectives."
---
## Stage 2: Performing Multi-perspective, Expert-informed Image Evidence Analysis
### Evidence Detection Components:
1. **Geometry Flaws**
- *Analysis*: The geometric analysis mistakenly classified the image as real, owing to the plausible flow of fur and eye reflections.
- *Finding*: Inconsistent with real-world physics.
2. **Spectral Clues**
- *Analysis*: Through frequency analysis, the expert detected high-frequency artifacts with unexpected patterns.
- *Finding*: Indicative of synthetic generation.
3. **High-pass Fusion**
- *Analysis*: High-pass maps revealed inconsistent details (e.g., overly sharp edges).
- *Finding*: Confirmed synthetic origin.
4. **Local Artifacts**
- *Analysis*: Pixel-level irregularities in pupil structure and lighting.
- *Finding*: Detected synthetic anomalies.
---
## Stage 3: Provide an Authenticity Judgment
**Reasoning Process**:
- Spectral clues and high-pass fusion anomalies were cross-verified.
- Geometry flaws and local artifacts were re-evaluated for consistency.
- Final synthesis of eight perspectives led to a conclusion.
**Final Answer**:
"The image is synthetic."
---
## Diagram Components and Flow
1. **Input**: User query with image (Stage 1).
2. **Analysis**: Four evidence detection modules (Stage 2).
3. **Output**: Authenticity judgment (Stage 3).
4. **Visual Elements**:
- Dog image (top-left).
- Spectral analysis visualization (green crosshair).
- Binary masks (black/white dog images).
- Lightbulb icon (final answer indicator).
---
## Key Trends and Data Points
- **Spectral Clues**: High-frequency artifacts detected via frequency analysis.
- **High-pass Fusion**: Inconsistent edge sharpness identified.
- **Local Artifacts**: Pixel-level irregularities in pupils and lighting.
- **Geometry Flaws**: Misclassification due to unrealistic fur/eye reflections.
---
## Notes
- No numerical data or charts present.
- All text is in English.
- Flowchart uses color-coded annotations (red for errors, green for confirmations).
- Final answer is explicitly stated as "synthetic" after multi-perspective validation.
</details>
Figure 1: Overview of the proposed REVEAL framework for reasoning-enhanced explainable synthetic image detection. The framework consists of three main stages: (1) receiving user instructions, (2) performing expert-grounded multi-perspective evidence detection, and (3) conducting reasoning through the chain of evidence (CoE) to derive a reliable decision with justifications.
## 1 Introduction
With the rapid evolution of generative artificial intelligence techniques such as Generative Adversarial Networks (GANs) [goodfellow2014generative, karras2019style] and Diffusion Models [dhariwal2021diffusion], the visual realism of synthesized content has advanced to a level that can easily deceive human perception. While these advanced models have unlocked unprecedented creative and economic potential in fields like digital art, design, and film production, they have also raised significant concerns regarding misinformation, privacy violations, and copyright issues. The continual progress in advanced diffusion models such as FLUX [black-forest-labs_flux_2024] and SD3.5 [esser2024scaling], along with autoregressive generation methods (e.g., VAR [tian2024visual]), has further intensified the challenge of distinguishing between real and synthetic content, making reliable detection an urgent research priority.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram Overview
The image contains two technical diagrams labeled **a) Explainable classification** and **b) REVEAL**, depicting workflows for image authenticity determination using multimodal large language models (MLLMs). Both diagrams include visual components, textual labels, and logical flow arrows.
---
### **a) Explainable classification**
#### Components and Flow:
1. **Input**:
- **Image**: A bird perched on a branch (visual element).
- **Text Prompt**: *"Please help me determine whether this image is real or synthetic?"*
2. **Processing**:
- **MLLM (Multimodal Large Language Model)**:
- Receives the image and text prompt as input.
- Outputs a prediction score `y` (probability of being real/synthetic).
3. **Decision Logic**:
- **Threshold Comparison**:
- If `y > threshold τ`, the model classifies the image as **real/fake**.
- **Next Token Prediction**:
- Generates an explanation for the classification decision.
#### Key Labels:
- **Input**: "Please help me determine whether this image is real or synthetic?"
- **Processing**: "MLLM"
- **Decision**: "Whether the model's prediction `y > threshold τ`"
- **Output**: "real/fake" and "explanation"
---
### **b) REVEAL**
#### Components and Flow:
1. **Input**:
- **Image**: Same bird image as in Diagram a).
- **Text Prompt**: *"Please help me determine whether this image is real or synthetic?"*
2. **MLLM Output**:
- Generates a group of candidate **completions** (`o₁, o₂, ..., o_G`).
3. **Reward Mechanisms**:
- **R1: Answer Reward**: Evaluates the correctness of the MLLM's final answer.
- **R2: Think Reward**: Assesses the quality of the model's reasoning process.
- **R3: Multi-view alignment reward**: Measures consistency across different perspectives (e.g., visual, textual).
4. **Group Completion**:
- Aggregates rewards (`R1`, `R2`, `R3`) to refine the model's output.
5. **Evidence Analysis**:
- Analyzes grouped completion to produce a final **real/fake** classification.
#### Key Labels:
- **Input**: "Please help me determine whether this image is real or synthetic?"
- **MLLM Output**: Completions (`o₁, o₂, ..., o_G`)
- **Rewards**:
- R1: Answer reward
- R2: Think reward
- R3: Multi-view alignment reward
- **Output**: "real/fake"
---
### Spatial and Structural Notes:
- **Diagram a)** focuses on **explainability**, emphasizing threshold-based classification and post-hoc explanation generation.
- **Diagram b)** introduces **REVEAL**, a framework incorporating reward-based optimization for improved authenticity detection.
- Both diagrams share the same input image and prompt but differ in processing logic and output mechanisms.
---
### Missing Elements:
- No numerical data, charts, or color-coded legends are present.
- All textual labels and flow arrows are explicitly described above.
---
### Summary:
The diagrams illustrate two approaches for determining image authenticity:
1. **Explainable classification** (a): Direct threshold-based decision with explanation generation.
2. **REVEAL** (b): Reward-driven optimization for robust authenticity assessment.
Both workflows rely on MLLM outputs but diverge in their evaluation and refinement strategies.
</details>
Figure 2: a) Existing post-hoc rationalization detection. b) REVEAL framework, a reasoning-enhanced paradigm for truly explainable forensic analysis.
Recent research [wang2020cnn, chai2020makes, wang2023dire, ojha2023towards, liu2024forgery, tan2024rethinking] has made notable progress in detecting AI-generated images. However, most traditional methods focus solely on discrimination, offering limited forensic analysis. The emergence of multimodal large language models (MLLMs) offers new opportunities, enabling models to combine visual perception with textual descriptions. Recent endeavours such as GPT-4 based detection [jia2024can], AIGI-Holmes [zhou2025aigi], FakeBench [li2025fakebench], and RAIDX [li2025raidx] have initiated this transition towards explainability. Yet, as illustrated in Figure 2, these methods share fundamental limitations: they primarily rely on post-hoc rationalizations or leverage MLLMs merely as powerful general-purpose visual classifiers to identify high-level visual anomaly patterns (e.g., “unnatural lighting”, “blurry edges”). They fail to construct a causally grounded, reasoning-based forensic pipeline in which specialized evidence is systematically collected, analyzed, and synthesized through logical deduction. Specifically, these prior works: 1) use datasets (e.g., FakeBench [li2025fakebench]) that lack fine-grained, structured evidence, limiting support for deep causal reasoning; and 2) rely on methods (e.g., RAIDX [li2025raidx] with RAG) whose explanations exhibit surface-level coherence derived from pattern matching, rather than being grounded in verifiable forensic evidence traces.
The critical gap highlights two major challenges in developing reasoning-enhanced synthetic image detection: 1) Lack of a reasoning-oriented forensic dataset. Existing datasets contain either binary labels or shallow textual justifications, without structured and rigorous chain-of-evidence annotations necessary to build auditable forensic judgments. 2) Limited reasoning-based explainability. Current MLLM-based detectors tend to produce post-hoc rationalizations instead of verifiable reasoning chains, leading to fragile generalization and unreliable claims in the forensic context.
To this end, we introduce REVEAL-Bench, a novel reasoning-oriented benchmark for AI-generated image forensics. Our data generation pipeline is fundamentally distinct from existing approaches: we shift from general visual correlation to expert-grounded evidence analysis. For each image, we first leverage eight lightweight expert models to provide structured, reliable, low-level forensic evidence. Such evidence then forms the input for a subsequent large model to generate a chain-of-evidence (CoE) annotation. By consolidating the multi-round forensic analysis from these specialized experts into a single, structured CoE trace, REVEAL-Bench becomes the first dataset to explicitly provide an expert-grounded, verifiable forensic analysis that connects low-level cues to high-level conclusions.
Building upon this dataset, we propose the REVEAL framework, a two-stage training paradigm designed to enforce reasoning-based forensic evidence analysis. In the first stage, we employ supervised fine-tuning (SFT) to teach the MLLM the canonical CoE structure. In the second stage, we introduce R-GRPO (Reasoning-enhanced Group Relative Policy Optimization), an expert-grounded policy optimization algorithm featuring a novel reward function critical for enhancing the logical coherence and verifiability of forensic analysis. Specifically, R-GRPO jointly optimizes (i) detection accuracy, (ii) reasoning stability, and (iii) multi-view consistency. This optimization compels the MLLM to perform logical synthesis over explicit forensic evidence rather than simple visual pattern matching, thereby achieving accurate, reliable, and explainable forensic analysis.
In summary, our work makes three major contributions:
REVEAL-Bench. We pioneer the first reasoning-based and explainable dataset for AI-generated image detection. Unlike prior datasets that offer only post-hoc explanations, REVEAL-Bench is uniquely structured around expert-grounded, verifiable forensic evidence that embeds an explicit chain-of-evidence following a systematic evidence-then-reasoning paradigm.
REVEAL Framework. We introduce the REVEAL Framework, a progressive two-stage training paradigm designed to instill standardized and explainable reasoning in multimodal LLMs. Its core, R-GRPO, optimizes the MLLM to perform logical synthesis for forensic evidence, jointly enhancing accuracy, reasoning consistency, and generalization.
Empirical Performance. Our approach achieves superior detection accuracy, generalization, and explanation fidelity, setting a new state of the art for reasoning-based forensic research.
## 2 Related Work
#### Detection of AI-Generated Fake Images
The rapid evolution of generative models, e.g., GANs [goodfellow2014gan, esser2021taming], autoregressive models [oord2017vqvae], diffusion-based models [esser2024rectifiedflow, song2020ddim, ho2020ddpm, gu2022vqdiffusion, saharia2022imagen, ji2025mllm], has driven AI-generated images to near-photorealistic quality, challenging conventional detection methods. Early forensic studies focused on traditional manipulations like splicing or copy-move, analyzing noise inconsistencies, boundary anomalies, or compression artifacts [zhou2018manipulation, li2022splicing]. Researchers then shifted focus to generation artifacts, such as up-sampling grid effects, texture mismatches, or abnormal high-frequency decay [frank2020frequency, liu2020texture, dzanic2020fourier]. For example, the Spectral Learning Detector [karageorgiou2025spectral] models the spectral distribution of authentic images, treating AI-generated samples as out-of-distribution anomalies, achieving consistent detection across generators. However, as generators incorporate post-processing techniques like super-resolution, these low-level statistical clues become increasingly subtle and less reliable for robust detection.
Recent methods employ general-purpose feature extractors, such as CNN- or ViT-based detectors, to learn discriminative features directly. While lightweight CNNs achieve strong benchmark performance [ladevic2024cnn], methods like the Variational Information Bottleneck (VIB) network [zhang2025vib] aim to enhance generalization by constraining feature representations through the information bottleneck principle to retain only task-relevant information. Post-hoc Distribution Alignment (PDA) [wang2025pda] improves robustness to unseen generators by aligning regenerated and real image distributions. Recently, NPR [tan2024rethinking] has become a representative approach by capturing low-level artifacts, demonstrating strong generalization capability. Similarly, HyperDet [cao2024hyperdet] and AIDE [yan2024sanity] achieve robust generalization through high-frequency spectrum analysis. Despite their discriminatory power, these approaches remain limited in forensic value, as their conclusions rely on global statistics and lack the semantic, verifiable evidence required for comprehensive explainability.
#### Explainable AI-generated Image Detection
The emergence of MLLMs [liu2023visual, wang2024qwen2] has accelerated the development of explainable image forensics by leveraging their advanced cross-modal understanding [wu2024comprehensive, talmor2019commonsenseqa]. Early efforts reformulated detection as a Visual Question Answering (VQA) task [jia2024can, keita2025bi, chang2023antifakeprompt], allowing MLLMs to provide accompanying descriptive text. FatFormer [liu2024forgery] extended this with a forgery-aware adapter to improve generalization on the CLIP-ViT [radford2021learning] encoder.
Subsequent studies focused on constructing task-specific multimodal datasets for fine-tuning. FakeBench [li2025fakebench] and LOKI [ye2024loki] provide synthetic images with manually written, high-level forgery descriptions. Holmes-Set [zhou2025aigi] utilized small models for initial image filtering and a Multi-Expert Jury mechanism to generate post-hoc explanatory texts. At the methodological level, FakeShield [xu2024fakeshield], ForgerySleuth [sun2024forgerysleuth], ForgeryGPT [liu2024forgerygpt] and SIDA [huang2025sida] fine-tune MLLMs to achieve explainable forgery detection and localization. AIGI-Holmes [zhou2025aigi] integrates low-level visual experts with reasoning modules. RAIDX [li2025raidx] combines retrieval-augmented generation (RAG) [lewis2020retrieval] with GRPO optimization to improve the quality of its generated explanations.
Critically, existing datasets and methods suffer from two key limitations: First, the explanations amount to post-hoc rationalizations, often relying on the MLLM’s general knowledge and visual classification capabilities, and fail to achieve logical synthesis of specialized forensic evidence. Second, they lack the structured, fine-grained forensic evidence required to support a verifiable causal link between low-level artifacts and the final forensic judgments.
## 3 REVEAL-Bench
<details>
<summary>x3.png Details</summary>

### Visual Description
## Section 1: Data Curation & Pre-filtering
### Components:
1. **Data Sources**:
- Chameleon Fake2M GenImage
- Autoregressive GAN Diffusion
2. **Pre-filtering Categories** (Color-Coded):
- **Local artifacts** (Red)
- **Geometry flaws** (Orange)
- **Spectral clues** (Yellow)
- **Pixel noise** (Green)
- **Spatial consistency** (Blue)
- **Shadow logic** (Purple)
- **Texture fusion** (Pink)
- **High-pass fusion** (Cyan)
### Legend:
- **Circular Chart** with 8 color segments (Red-Orange-Yellow-Green-Blue-Purple-Pink-Cyan)
- Labels correspond to pre-filtering categories
- Positioned at bottom-left of diagram
## Section 2: Expert-grounded Evidence Collection
### Flowchart Components:
1. **Input Path**:
- Local artifacts → Spectral clues → Spatial consistency → Geometry flaws → Shadow logic → Texture fusion → High-pass fusion
2. **Expert Analysis Nodes**:
- **Local artifacts**: "Bird's eyes reflection missing"
- **Spectral clues**: "Periodic artifacts revealed along spectral axis"
- **Geometry flaws**: "Spatial consistency test detects irregularities"
- **Shadow logic**: "Shadow consistency test detects over-lighting patterns"
- **Texture fusion**: "Fine-grain texture irregularities"
- **High-pass fusion**: "Smooth area contains no forgery signs"
3. **Output**:
- Chain-of-Evidence Synthesis (visual evidence examples)
## Section 3: Chain-of-Evidence Synthesis
### Visual Evidence Examples:
1. **Bird Images** (4 examples):
- Annotated with green bounding boxes
- Highlighted regions: Eyes, feathers, background
### Think/Answer Process:
1. **Initial Observation** (Step 1):
- Checks texture and lighting anomalies
2. **Detailed Inspection** (Step 2):
- Identifies uniform surfaces and missing imperfections
3. **Spatial Analysis** (Step 3):
- Compares object-background alignment and projection logic
4. **Shadow Consistency** (Step 4):
- Detects over-lighting patterns
5. **High-frequency Analysis** (Step 5):
- Examines fine-grain texture irregularities
6. **Frequency Spectrum Evaluation** (Step 6):
- Reveals abnormal energy distributions
### Final Determination:
- **Synthetic Image** (Green checkmark)
- **Real Image** (Red cross)
## Legend Spatial Grounding:
- Position: Bottom-left quadrant
- Color Mapping:
- Red: Local artifacts
- Orange: Geometry flaws
- Yellow: Spectral clues
- Green: Pixel noise
- Blue: Spatial consistency
- Purple: Shadow logic
- Pink: Texture fusion
- Cyan: High-pass fusion
## Trend Verification:
- All components follow linear flow from left (data sources) to right (synthesis)
- No numerical trends present (qualitative analysis only)
## Component Isolation:
1. **Header**: Data Curation & Pre-filtering (top-left)
2. **Main Chart**: Expert-grounded Evidence Collection (center)
3. **Footer**: Chain-of-Evidence Synthesis (bottom)
## Additional Notes:
- All text in English
- No numerical data tables present
- Diagram uses color-coded categorization for visual distinction
</details>
Figure 3: The pipeline of REVEAL-Bench. This figure illustrates our data processing pipeline, which consists of three stages: Data Curation & Pre-filtering, Expert-grounded Evidence Collection, and Chain-of-Evidence (CoE) Synthesis.
As illustrated in Figure 3, this study constructs the REVEAL-Bench dataset through a rigorous, three-stage pipeline designed for reasoning-based image forensics: Data Curation & Pre-filtering, Expert-grounded Evidence Collection, and Chain-of-Evidence (CoE) Synthesis. This approach is fundamentally distinct as it replaces manual, subjective labeling with a process that systematically integrates verifiable evidence from specialized models with the logical synthesis capabilities of large vision-language models. The resulting dataset contains explicit, expert-knowledge-grounded Chain-of-Evidence annotations, which are crucial for training forensic detectors with superior transparency and generalization capability.
#### Data Curation & Pre-filtering
To ensure sufficient content, generator, and artifact diversity, we aggregate several prominent AI-generated image detection benchmarks, including CNNDetection [wang2020cnn], UnivFD [ojha2023towards], AIGCDetectBenchmark [zhong2023patchcraft], GenImage [zhu2023genimage], Fake2M [lu2023seeing], and Chameleon [yan2024sanity]. This yielded an initial corpus of approximately 5,120K synthetic images and 850K authentic images. To manage annotation costs while ensuring high data quality, we implemented a stratified sampling strategy based on automated quality assessments [talebi2018nima] and image resolution. Specifically, we sampled images by aesthetic score (50% high, 30% medium, 20% low) and by resolution: high-resolution ( $\geq$ 512 $\times$ 512) images at 50%, medium-resolution (384 $\times$ 384 to 512 $\times$ 512) images at 30%, and low-resolution ( $<$ 384 $\times$ 384) images at 20%. Images were also semantically classified into 13 major categories (e.g., humans, architecture, artworks). After rigorous multi-stage filtering and preprocessing to eliminate non-representative or low-quality samples, we finalized a balanced corpus of 30K synthetic and 30K real images, which serves as the foundation for subsequent expert annotation.
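The tiered sampling above can be sketched as follows. The helper names, the in-memory image records, and the short-side bucketing rule for resolution are illustrative assumptions, not the paper's implementation:

```python
import random

# Target shares per aesthetic tier, mirroring the 50/30/20 split.
AESTHETIC_SHARES = {"high": 0.5, "medium": 0.3, "low": 0.2}

def resolution_tier(width: int, height: int) -> str:
    """Bucket an image by the paper's resolution thresholds.

    Uses the shorter side as the deciding dimension (an assumption).
    """
    short_side = min(width, height)
    if short_side >= 512:
        return "high"
    if short_side >= 384:
        return "medium"
    return "low"

def stratified_sample(images, budget, rng=random):
    """Draw `budget` images so each aesthetic tier gets its target share.

    `images` is a list of dicts with an 'aesthetic' key in
    {'high', 'medium', 'low'}; a tier with too few images simply
    contributes fewer samples.
    """
    by_tier = {tier: [] for tier in AESTHETIC_SHARES}
    for img in images:
        by_tier[img["aesthetic"]].append(img)
    sample = []
    for tier, share in AESTHETIC_SHARES.items():
        k = min(round(budget * share), len(by_tier[tier]))
        sample.extend(rng.sample(by_tier[tier], k))
    return sample
```

The same pattern extends to the resolution and semantic-category axes by stratifying within each aesthetic tier.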
#### Expert-grounded Evidence Collection
To enable fine-grained, verifiable forensic analysis, we design and employ a set of eight lightweight and specialized expert models [li2025improving, sarkar2024shadows, tan2024rethinking, cao2024hyperdet, tan2024frequency, li2025optimized], each dedicated to screening and localizing a distinct category of synthetic artifact (as depicted in Figure 3). This is a crucial distinction from prior work, such as AIGI-Holmes [zhou2025aigi], which uses experts primarily for global filtering. Our experts, by contrast, provide structured, machine-readable evidence, including artifact masks and diagnostic labels. These eight outputs constitute the necessary forensic evidence foundation. By conditioning the LVLM on these high-fidelity, structured references, we ensure the final generated explanations are faithful, logically consistent, and verifiable against objective, low-level artifact data. This expert-grounded decompositional analysis effectively bridges the gap between small-model perception of artifacts and large-model logical reasoning.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Overview
The image depicts a **two-stage computational framework** for image authenticity verification using a **Multimodal Large Language Model (MLLM)**. The process involves **Chain-of-Evidence (CoE) tuning** and **reward-guided reasoning** (R-GRPO) to distinguish real vs. synthetic images.
---
### Stage 1: CoE Tuning
**Purpose**: Initial training of the MLLM to detect synthetic artifacts.
#### Components:
1. **Input Prompt**:
- Text: `"Please help me determine whether this image is real or synthetic?"`
- Example images:
- A bird on a branch (real)
- A flame icon (synthetic)
2. **MLLM Processing**:
- **Think Block**:
- Reasoning steps:
1. Initial observation suggests...
2. Upon closer inspection, uneven features are detected...
7. After synthesizing all cues, the image is determined to be synthetic.
- **Output**:
- Answer: `1` (synthetic)
3. **Loss Functions**:
- `L_think`: Optimizes reasoning coherence.
- `L_answer`: Penalizes incorrect synthetic/real classification.
#### Flow:
```mermaid
graph LR
A[Input Prompt] --> B[MLLM]
B --> C[Think Block]
C --> D[Answer]
```
---
### Stage 2: R-GRPO (Reward-Guided Reasoning)
**Purpose**: Refine the MLLM using multi-view alignment and completion rewards.
#### Components:
1. **Input Prompt**:
- Text: `"Please help me determine whether this image is real or synthetic?"`
- Example image: Bird with zoom-ins showing:
- Natural eye details
- High-frequency artifacts (e.g., "high-pass image reveals artifacts")
2. **MLLM Processing**:
- **Completion Steps** (1 to G):
- Each completion generates a reasoning trace.
- **Reward Evaluation**:
- **Answer Reward**:
- `R=1` if completion matches ground truth (`<answer>...</answer>`).
- `R=0` if mismatch.
- **Think Reward**:
- `R=1` for coherent reasoning (e.g., "When zoomed in, the eyeball shows structural irregularities...").
- `R=0` for mismatched logic.
- **Multi-View Alignment Reward**:
- `R=1` for consistency across zoom levels.
- `R=0` for discrepancies (e.g., "eye appears natural" vs. "artifacts in high-pass image").
3. **Example Flow**:
- Bird image → Zoom-ins reveal artifacts → Mismatch → `R=0`.
#### Flow:
```mermaid
graph LR
E[Input Prompt] --> F[MLLM]
F --> G[Completion 1]
G --> H[Answer Reward (R=1)]
G --> I[Think Reward (R=1)]
G --> J[Multi-View Alignment Reward (R=0)]
```
---
### Key Trends & Data Points
1. **Synthetic Detection**:
- The MLLM identifies uneven features (e.g., flame icon) as synthetic.
- Zoom-ins reveal high-frequency artifacts in synthetic images.
2. **Reward System**:
- **Binary Rewards**: `R=1` (match), `R=0` (mismatch).
- **Multi-View Alignment**: Ensures consistency across zoom levels.
3. **Loss Optimization**:
- `L_think` and `L_answer` drive the MLLM to refine reasoning and classification accuracy.
---
### Spatial Grounding & Component Isolation
- **Stage 1 (Left)**: Focuses on initial tuning with single-image prompts.
- **Stage 2 (Right)**: Expands to multi-view reasoning with reward signals.
- **Legend**: Not explicitly present; rewards (`R=1`, `R=0`) are implicitly tied to color-coded blocks (green for match, red for mismatch).
---
### Critical Observations
- **Flowchart Logic**:
1. Input prompts are processed by the MLLM.
2. Reasoning traces guide synthetic detection.
3. Rewards (`R=1/R=0`) refine the model’s decision-making.
- **Example Artifacts**:
- Flame icon (synthetic) vs. bird (real).
- High-pass image artifacts in zoomed bird images.
---
### Final Notes
- **Language**: All text is in English.
- **No Data Tables**: The diagram uses flowcharts and textual annotations instead of numerical tables.
- **Trend Verification**:
- Synthetic images show uneven features and artifacts.
- Reward signals (`R=1/R=0`) correlate with match/mismatch outcomes.
</details>
Figure 4: Overview of REVEAL. The pipeline mainly consists of two stages: CoE Tuning and R-GRPO.
#### Chain-of-Evidence Synthesis
As shown in Figure 3, after the specialized expert annotation, the initial eight rounds of multi-perspective diagnostic outputs are diverse and fragmented. To construct a unified and progressive reasoning dataset suitable for Chain-of-Thought (CoT) fine-tuning, we leverage a high-capacity LVLM (Qwen2.5-VL-72B [bai2025qwen2]) to perform structured knowledge consolidation. This process reconstructs the diverse, specialized evidence into a single, cohesive, and auditable reasoning trace, formatted using a standard `<think>…</think><answer>…</answer>` structure.
Fundamentally distinct from existing datasets like AIGI-Holmes [zhou2025aigi] and FakeBench [li2025fakebench], which merely provide generic explanations, REVEAL-Bench explicitly formalizes the link between low-level expert evidence and high-level judgments. This two-stage pipeline transforms the detection task into a reasoning task, offering coherent CoE annotations that enhance logical consistency, minimize annotation noise, and support supervision paradigms with advanced reinforcement learning techniques, thereby improving explanation fidelity and generalization.
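A minimal parser for the `<think>…</think><answer>…</answer>` template can double as a format check during training; the regex-based sketch below is illustrative, not the paper's tooling:

```python
import re

# Non-greedy matches so nested-looking text inside the trace is tolerated.
COE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def parse_coe(completion: str):
    """Split a model completion into (reasoning trace, final answer).

    Returns None when the completion does not follow the template,
    which is also how a binary format reward could be scored
    (1 on parse success, 0 otherwise).
    """
    m = COE_RE.search(completion)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()
```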
## 4 Methodology
### 4.1 Overview of REVEAL
As illustrated in Figure 4, the overall training pipeline adopts a two-stage progressive training paradigm inspired by advanced policy optimization-based reinforcement learning techniques [guo2025deepseek].
We first perform supervised fine-tuning (SFT) on a consolidated Chain-of-Evidence (CoE) dataset to obtain a base policy that can deduce the required forensic reasoning procedure. While this stage establishes the fundamental reasoning-based forensic structure, the resulting model still exhibits limitations in logical consistency, forensic accuracy, and robustness. To mitigate these limitations, we propose a novel reinforcement learning algorithm: Reasoning-enhanced Group Relative Policy Optimization (R-GRPO). R-GRPO extends beyond standard Group Relative Policy Optimization (GRPO) by incorporating a task-specific composite reward that dynamically aligns forensic reasoning trajectories and stabilizes policy updates, significantly enhancing semantic consistency and reasoning robustness.
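The task-specific composite reward can be illustrated as a weighted sum over the three signals (answer correctness, reasoning coherence, multi-view consistency); the binary inputs and the weight values below are assumptions for illustration only:

```python
def composite_reward(answer_ok: bool, think_ok: bool, views_consistent: bool,
                     weights=(1.0, 0.5, 0.5)) -> float:
    """Toy combination of the three R-GRPO reward signals.

    answer_ok: final real/synthetic label matches ground truth.
    think_ok: reasoning trace is judged coherent.
    views_consistent: conclusions agree across views/zoom levels.
    The weights are illustrative; this section does not fix their values.
    """
    w_answer, w_think, w_view = weights
    return w_answer * answer_ok + w_think * think_ok + w_view * views_consistent
```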
### 4.2 Progressive Multimodal Training for AI-Generated Image Detection
We introduce REVEAL (Reasoning-enhanced Forensic Evidence AnaLysis), a progressive multimodal training framework comprising two sequential stages designed to cultivate robust, logically consistent, and verifiable forensic reasoning in multimodal models.
Stage 1: Chain-of-Evidence Tuning (CoE Tuning). In the initial stage, we perform cold-start supervised fine-tuning to establish a stable, stepwise reasoning policy and a consistent output paradigm built upon the REVEAL-Bench dataset. Let $x$ denote the visual input, $z=(z_{1},\dots,z_{T})$ denote the tokenized reasoning sequence (Chain-of-Evidence, CoE), and $y$ denote the final classification label. We adopt an explicit joint reasoning–decision modeling paradigm, where the final prediction $y$ is conditioned on the explicit reasoning trace $z$ . This formulation enforces a think-then-answer mechanism, fundamentally distinct from post-hoc rationalizations (e.g. modeling $p(y\mid x)$ and then $p(z\mid x,y)$ ), thereby achieving causally grounded genuine explanations.
Concretely, we factorize the joint conditional probability as
$$
p(y,z\mid x)\;=\;p(z\mid x)\,p(y\mid x,z), \tag{1}
$$
which structurally encourages the model to first generate verifiable reasoning evidence and subsequently derive the final prediction conditioned directly on that reasoning process.
Maximizing the likelihood under (1) corresponds to minimizing the following negative log-likelihood loss:
$$
\mathcal{L}_{\mathrm{NLL}}(x,y,z;\theta)\;=\;-\log p_{\theta}(z\mid x)\;-\;\log p_{\theta}(y\mid x,z). \tag{2}
$$
For training control and to explicitly balance the emphasis on reasoning quality versus final decision accuracy, we decompose $\mathcal{L}_{\mathrm{NLL}}$ into two components, the reasoning generation loss $\mathcal{L}_{\mathrm{think}}$ and the answer loss $\mathcal{L}_{\mathrm{answer}}$ ,
$$
\mathcal{L}_{\mathrm{think}}\;=\;-\sum_{t=1}^{T}\log p_{\theta}(z_{t}\mid z_{<t},x), \tag{3}
$$
$$
\mathcal{L}_{\mathrm{answer}}\;=\;-\log p_{\theta}(y\mid x,z). \tag{4}
$$
We then employ a weighted composite SFT loss:
$$
\mathcal{L}_{\mathrm{SFT}}\;=\;(1-\alpha)\,\mathcal{L}_{\mathrm{think}}\;+\;\alpha\,\mathcal{L}_{\mathrm{answer}}\;+\;\eta\,\mathrm{KL}\big(\pi_{\mathrm{pre}}\|\pi_{\theta}\big), \tag{5}
$$
where $\alpha\in(0,1)$ controls the relative weight of the answer loss versus the reasoning trace, and $\eta\geq 0$ scales the KL regularization term, which constrains the fine-tuned policy $\pi_{\theta}$ to remain close to the pretrained policy $\pi_{\mathrm{pre}}$ and thereby mitigates catastrophic forgetting.
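As a concrete illustration, the composite loss in Eq. (5) can be sketched as follows. This is a minimal Python sketch, not the released implementation: the function name, the list/scalar stand-ins for the per-token log-probabilities, and the default values of `alpha` and `eta` are illustrative assumptions.

```python
def sft_loss(think_logprobs, answer_logprob, kl_pre_theta, alpha=0.5, eta=0.01):
    """Composite SFT loss of Eq. (5).

    think_logprobs: per-token log p(z_t | z_<t, x)  -> Eq. (3)
    answer_logprob: log p(y | x, z)                 -> Eq. (4)
    kl_pre_theta:   KL(pi_pre || pi_theta) regularizer
    alpha, eta:     illustrative hyperparameter values (assumptions)
    """
    l_think = -sum(think_logprobs)   # reasoning-generation NLL, Eq. (3)
    l_answer = -answer_logprob       # answer NLL, Eq. (4)
    return (1 - alpha) * l_think + alpha * l_answer + eta * kl_pre_theta
```

In a real training loop the two log-probability terms would come from the model's token logits over the reasoning span and the answer span respectively, so $\alpha$ directly trades off gradient mass between the trace and the final decision.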
Stage 2: Reasoning-enhanced Group Relative Policy Optimization (R-GRPO).
Group Relative Policy Optimization (GRPO). Group Relative Policy Optimization (GRPO) is a reinforcement learning technique that stabilizes policy updates by comparing a group of candidate trajectories, rather than relying on the noisy reward signals of individual samples. Given an input $x$ , we sample a group of $K$ trajectories $\{\tau_{i}\}_{i=1}^{K}$ from the current policy $\pi_{\theta}$ , where each trajectory $\tau_{i}$ consists of an intermediate reasoning trace $z_{i}$ and a final output $y_{i}$ . A group-based composite reward $R_{\mathrm{group}}(\tau_{i})$ is computed for each trajectory, and the group-relative advantage $A_{i}$ is defined by subtracting the mean group reward $\overline{R}_{\mathrm{group}}$ :
$$
A_{i}\;=\;R_{\mathrm{group}}(\tau_{i})-\overline{R}_{\mathrm{group}},\qquad \overline{R}_{\mathrm{group}}\;=\;\frac{1}{K}\sum_{j=1}^{K}R_{\mathrm{group}}(\tau_{j}). \tag{6}
$$
The GRPO objective maximizes the expected group-relative log-probability, regularized by a KL penalty for stable policy convergence:
$$
\max_{\theta}\;\mathbb{E}\Big[\sum_{i=1}^{K}A_{i}\log\pi_{\theta}(\tau_{i}\mid x)\Big]\;-\;\lambda_{\mathrm{KL}}\,\mathrm{KL}\big(\pi_{\mathrm{old}}\|\pi_{\theta}\big). \tag{7}
$$
Reasoning-enhanced GRPO (R-GRPO). To adapt GRPO to forensic analysis tasks, we propose R-GRPO, which augments the objective with a task-aware composite reward specifically designed to capture forensic fidelity and reasoning robustness. Let $y$ denote the generated answer, $y^{\ast}$ the reference answer, $z=(z_{1},\dots,z_{T})$ the reasoning tokens, and $\{v_{m}(x)\}_{m=1}^{M}$ a set of multi-view visual evidence (e.g., spectral representations, high-pass filtered images, and localized artifact patches).
Rationale for Agent-based Reward Modeling. In preliminary experiments, we observed that simple metric-based rewards (e.g. using cosine similarity of sentence embeddings for $r_{\mathrm{sem}}$ ) fail to adequately reflect the semantic and contextual logic required for high-quality forensic explanations. Therefore, we introduce a dedicated large language model as an intelligent agent (Agent) to evaluate responses. This Agent-based assessment considers contextual logic, explanation coherence, and factual consistency against the provided structured evidence, thereby generating a more human-aligned and interpretable reward signal than purely metric-based approaches (see Appendix A for details).
R-GRPO defines three complementary, evidence-driven reward components:
(1) Answer Reward $r_{\mathrm{sem}}$ . This binary reward scores detection correctness:
$$
r_{\mathrm{sem}}(y,y^{\ast})=\begin{cases}1,&\text{if }y=y^{\ast},\\
0,&\text{otherwise.}\end{cases} \tag{8}
$$
(2) Think Reward $r_{\mathrm{think}}$ . This reward quantifies the quality and structural integrity of the reasoning trace $z$ .
Let $z=(z_{1},\dots,z_{T})$ be the generated reasoning trace and $z^{\ast}=(z^{\ast}_{1},\dots,z^{\ast}_{T^{\ast}})$ the ground-truth reasoning trace (when available). Define a perturbed trace $\tilde{z}=\operatorname{shuffle}(z)$ . Then
$$
r_{\mathrm{think}}(z,z^{\ast},\tilde{z})\;=\;\mathcal{A}_{\mathrm{sem}}(z,z^{\ast})+\mathcal{A}_{\mathrm{logic}}(z,\tilde{z}), \tag{9}
$$
where $\mathcal{A}_{\mathrm{sem}}$ measures alignment between the generated and reference reasoning, and $\mathcal{A}_{\mathrm{logic}}(z,\tilde{z})$ evaluates the logical coherence of the trace by penalizing the model when a minor structural perturbation $\tilde{z}$ severely alters the inferred conclusion. This mechanism forces the model to maintain sequential consistency and to keep the reasoning steps robustly connected.
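The structure of the think reward in Eq. (9) can be sketched as below. The two scorers are left as injected callables because in REVEAL they are realized by the LLM-based evaluation Agent rather than a fixed metric; the callable signatures and the seeded shuffle are illustrative assumptions.

```python
import random

def think_reward(z_steps, z_ref_steps, score_sem, score_logic, seed=0):
    """Eq. (9): r_think = A_sem(z, z*) + A_logic(z, shuffle(z)).

    score_sem / score_logic stand in for the Agent-based scorers
    (hypothetical callables returning values in [0, 1])."""
    z_perturbed = list(z_steps)
    random.Random(seed).shuffle(z_perturbed)  # structural perturbation z~
    return score_sem(z_steps, z_ref_steps) + score_logic(z_steps, z_perturbed)
```

The shuffle supplies the perturbed trace $\tilde{z}$ against which logical coherence is probed: a trace whose conclusion survives arbitrary step reordering is not genuinely sequential, and the logic scorer can penalize it accordingly.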
(3) Multi-view Alignment Reward $r_{\mathrm{view}}$ . This reward encourages the generated reasoning trace $z$ to be robustly grounded in evidence that persists across different forensic views of the image.
$$
r_{\mathrm{view}}(z,x)\;=\;\mathcal{A}_{\mathrm{view}}\Big(z,\{v_{m}(x)\}_{m=1}^{M}\Big), \tag{10}
$$
where $\mathcal{A}_{\mathrm{view}}$ measures the fidelity of the reasoning to the multi-view visual evidence $\{v_{m}(x)\}$ . By requiring alignment with evidence visible under different transformations (e.g., spectral, high-pass), this reward promotes cross-artifact generalization and enables the self-supervised discovery of novel, transformation-invariant artifacts.
The composite trajectory reward $R(\tau)$ combines these terms:
$$
R(\tau)\;=\;\lambda_{s}\,r_{\mathrm{sem}}(y,y^{\ast})\;+\;\lambda_{t}\,r_{\mathrm{think}}(z,z^{\ast},\tilde{z})\;+\;\lambda_{v}\,r_{\mathrm{view}}(z,x), \tag{11}
$$
where $\lambda_{s},\lambda_{t},\lambda_{v}\geq 0$ are tunable parameters balancing the rewards. For improved stability, rewards are standardized within each sampled group before calculating the advantage $\widehat{A}_{i}$ :
$$
\widehat{R}(\tau_{i})=\frac{R(\tau_{i})-\mu_{\mathrm{group}}}{\sigma_{\mathrm{group}}}, \tag{12}
$$
$$
\mu_{\mathrm{group}}=\frac{1}{K}\sum_{j}R(\tau_{j}), \tag{13}
$$
$$
\sigma_{\mathrm{group}}=\mathrm{std}\big(\{R(\tau_{j})\}\big), \tag{14}
$$
and the normalized group-relative advantage is
$$
\widehat{A}_{i}=\widehat{R}(\tau_{i})-\frac{1}{K}\sum_{j}\widehat{R}(\tau_{j}). \tag{15}
$$
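Putting Eqs. (12)–(15) together, the within-group standardization can be sketched in a few lines of Python. One remark the code makes explicit: because standardization already centers the rewards, the final mean subtraction in Eq. (15) is a no-op up to numerical precision. The zero-variance guard is our illustrative assumption.

```python
import math

def normalized_advantages(group_rewards):
    """Eqs. (12)-(15): standardize rewards within a group, then mean-center."""
    k = len(group_rewards)
    mu = sum(group_rewards) / k                          # Eq. (13)
    var = sum((r - mu) ** 2 for r in group_rewards) / k
    sigma = math.sqrt(var) if var > 0 else 1.0           # Eq. (14); guard sigma = 0
    r_hat = [(r - mu) / sigma for r in group_rewards]    # Eq. (12)
    mean_hat = sum(r_hat) / k                            # ~0 after standardization
    return [rh - mean_hat for rh in r_hat]               # Eq. (15)
```

Standardizing before the advantage computation keeps the update magnitude comparable across prompts whose reward scales differ, which is what stabilizes policy updates in practice.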
Unified GRPO with the R-GRPO objective. Combining the original GRPO formulation (7) with the R-GRPO composite reward (11), the unified optimization objective becomes
$$
\max_{\theta}\;\mathbb{E}\Big[\sum_{i=1}^{K}\widehat{A}_{i}\log\pi_{\theta}(\tau_{i}\mid x)\Big]\;-\;\lambda_{\mathrm{KL}}\,\mathrm{KL}\big(\pi_{\mathrm{old}}\|\pi_{\theta}\big), \tag{16}
$$
where $\widehat{A}_{i}$ encodes both the group-relative comparison and the reasoning-enhanced composite reward.
These evidence-enhanced reward signals effectively guide the model to optimize its reasoning trajectories, enforcing both stability and logical coherence in verifiable forensic evidence analysis.
## 5 Experiments
### 5.1 Experimental Settings
TABLE I: Comparison of REVEAL-bench with previous datasets. REVEAL-bench is the first reasoning dataset for synthetic image detection.
| Dataset | #Images | Explanation | Multiview Fusion | Reasoning Process |
| --- | --- | --- | --- | --- |
| CNNDetection [wang2020cnn] | 720K | ✗ | ✗ | ✗ |
| GenImage [zhu2023genimage] | 1M | ✗ | ✗ | ✗ |
| FakeBench [li2025fakebench] | 6K | ✓ | ✗ | ✗ |
| Holmes-Set [zhou2025aigi] | 69K | ✓ | ✓ | ✗ |
| REVEAL-bench | 60K | ✓ | ✓ | ✓ |
To comprehensively evaluate the performance of REVEAL, we conduct experiments on two datasets: REVEAL-Bench and GenImage [zhu2023genimage] (see Table I). REVEAL-Bench, the first chain-of-evidence-based explainable dataset for synthetic image detection, serves as the in-domain dataset for training and evaluation. GenImage, a large-scale synthetic image dataset containing images generated by multiple generation methods, is used as an out-of-domain dataset to assess generalization. We train REVEAL on REVEAL-Bench and systematically evaluate its performance on both datasets (see Appendix B for detailed training settings). Building on this evaluation setup, we further investigate several core aspects of REVEAL’s capabilities. In particular, we study the impact of different MLLMs used as vision–language backbones, conduct ablation experiments to quantify the contribution of R-GRPO, and assess the model’s robustness under diverse perturbation settings. Appendix C reports the few-shot training results, and Appendix D provides a systematic comparison with existing large-scale model-based detectors.
Baselines We compare REVEAL with state-of-the-art AI-generated image detection methods, including CNNSpot [wang2020cnn], UnivFD [ojha2023towards], NPR [tan2024rethinking], HyperDet [cao2024hyperdet], AIDE [yan2024sanity] and VIB-Net [zhang2025towards]. To ensure a fair comparison, we retrain these methods using the official code under the same experimental settings and datasets.
Evaluation metrics Following existing research, we adopt Accuracy (ACC) as our evaluation metric. Accuracy is defined as the proportion of correctly predicted samples among the total number of samples, reflecting the overall correctness of a classification model. Since our detection results are provided by the MLLM in textual form (Real/Fake), we convert these texts into binary labels to compute accuracy, while baseline methods use the default thresholds provided by their official code. Moreover, because the output of the MLLM is interpretable text rather than logit values, we do not consider metrics that require logit values for computation, such as Average Precision (AP), in our evaluation.
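The label-conversion step described above can be sketched as follows. The keyword-matching rule is our illustrative assumption: the paper only specifies that textual Real/Fake verdicts are mapped to binary labels before computing accuracy.

```python
def text_to_label(verdict):
    """Map an MLLM textual verdict to a binary label: 1 = Fake, 0 = Real.
    Keyword matching is an assumed rule for illustration only."""
    v = verdict.strip().lower()
    if "fake" in v:
        return 1
    if "real" in v:
        return 0
    return None  # unparseable output

def accuracy(predictions, ground_truth):
    """ACC: fraction of samples whose predicted label matches the reference."""
    correct = sum(int(p == y) for p, y in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```

Because only hard labels are recovered from the text, threshold-free metrics such as AP cannot be computed, consistent with the evaluation protocol above.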
### 5.2 Generalization across datasets
TABLE II: REVEAL demonstrates superior generalization across both in-domain and out-of-domain evaluations. REVEAL outperforms the best competing method by 3.87 %.
| Method | | | | | | | | | | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNNSpot [wang2020cnn] | 87.80 | 62.45 | 74.25 | 73.85 | 63.55 | 73.70 | 73.60 | 71.35 | 39.45 | 68.89 |
| UnivFD [ojha2023towards] | 86.95 | 75.00 | 84.35 | 80.95 | 85.50 | 71.75 | 82.00 | 80.70 | 88.45 | 81.74 |
| NPR [tan2024rethinking] | 95.40 | 84.80 | 88.85 | 88.05 | 85.10 | 94.30 | 87.05 | 84.45 | 88.95 | 88.55 |
| HyperDet [cao2024hyperdet] | 93.25 | 68.40 | 91.85 | 92.30 | 100.0 | 67.05 | 89.20 | 80.45 | 57.65 | 82.24 |
| AIDE [yan2024sanity] | 95.25 | 79.90 | 95.90 | 94.95 | 87.75 | 90.35 | 94.85 | 90.10 | 91.10 | 91.13 |
| VIB-Net [zhang2025towards] | 67.05 | 53.25 | 60.25 | 57.85 | 65.00 | 68.55 | 60.85 | 52.55 | 38.00 | 58.15 |
| REVEAL | 95.31 | 93.75 | 97.81 | 97.19 | 95.00 | 86.88 | 96.25 | 95.94 | 96.88 | 95.00 |
Table II reports the performance of REVEAL on the in-domain dataset REVEAL-bench and the out-of-domain benchmark GenImage. The results indicate that REVEAL, leveraging a Chain-of-Evidence (CoE) reasoning-and-forensics mechanism, achieves superior cross-domain generalization compared to baseline lightweight binary classifiers: it maintains higher accuracy and more stable performance on GenImage. In the in-domain setting, smaller classifiers, such as those using methods like NPR [tan2024rethinking] and AIDE [yan2024sanity], are more prone to overfitting, demonstrating stronger fitting ability to domain-specific statistical regularities and subtle signals. As a result, REVEAL’s performance in-domain is comparable to that of these compact models. However, REVEAL excels in terms of cross-domain generalization. These findings suggest that while smaller models remain attractive for tasks prioritizing computational efficiency and in-domain accuracy, REVEAL better preserves and propagates key reasoning cues across domains. Therefore, there is a clear trade-off between generalization and domain-specific fit that should inform deployment choices. Notably, in the context of synthetic-image detection, reasoning-based forensic approaches, like REVEAL, exhibit particularly robust generalization.
### 5.3 Generalization across Base MLLMs
TABLE III: Performance across different MLLMs, showing larger models exhibit consistently stronger detection capability.
| Training Scheme | Phi-3.5 | Qwen2.5-VL-3b | Qwen2.5-VL-7b | LLaVA-v1.5-7b | LLaVA-v1.5-13b |
| --- | --- | --- | --- | --- | --- |
| CoE Tuning | 83.75 | 87.18 | 85.73 | 91.56 | 93.06 |
| CoE Tuning + R-GRPO | 87.19 | 89.06 | 92.19 | 92.81 | 95.31 |
The proposed algorithm demonstrates strong generalizability and can be flexibly applied to a variety of multimodal large model architectures. To validate its effectiveness, we conduct experiments using Qwen2.5-VL [bai2025qwen2], LLaVA-1.5 [liu2023visual], and Phi-3.5 as representative backbones. As shown in Table III, our approach achieves strong detection performance and robust generalization across these multimodal large models.
Furthermore, we observe that as the model size increases, the detection capability improves significantly. This trend suggests the existence of a scaling law for synthetic image detection within the context of large models, similar to other tasks in the large model domain. As multimodal models continue to grow, their ability to handle complex tasks such as synthetic image detection becomes increasingly effective, demonstrating a direct correlation between model scale and performance.
### 5.4 Ablation Studies
TABLE IV: Ablation study of the impact of CoE Tuning, GRPO, and R-GRPO on model accuracy on REVEAL-Bench.
| CoE Tuning | GRPO | R-GRPO | Accuracy (%) |
| --- | --- | --- | --- |
| ✗ | ✗ | ✗ | 61.21 |
| ✓ | ✗ | ✗ | 85.73 |
| ✓ | ✓ | ✗ | 91.56 |
| ✓ | ✗ | ✓ | 95.31 |
We conducted ablation experiments to investigate the role of reasoning datasets in synthetic image detection. As shown in Table IV, we first evaluated models trained without reasoning data (i.e., non-reasoning SFT) and compared them with models fine-tuned on reasoning data (i.e., CoE Tuning). We then measured the effect of applying plain GRPO and our proposed R-GRPO. The results demonstrate that reasoning data substantially improves MLLM performance on synthetic image detection, with models trained without it performing close to random. Moreover, while plain GRPO already improves over CoE Tuning alone, R-GRPO yields a further gain, highlighting its critical role in this task.
### 5.5 Robustness Evaluation of REVEAL
[Figure 5 comprises two line charts on REVEAL-bench comparing Ours (REVEAL, red) with NPR (CVPR 2024, blue): the left panel plots accuracy against JPEG quality (100 down to 60), the right panel accuracy against Gaussian-blur sigma (0 up to 4). REVEAL retains higher accuracy and degrades more gradually than NPR under both perturbations.]
Figure 5: The accuracy comparison between the two methods under various perturbation conditions.
To evaluate the robustness of REVEAL against common post-processing distortions, we conducted a systematic robustness study on the REVEAL-bench dataset. The experiments apply two typical post-processing operations to the original test images: Gaussian blur ( $\sigma=1,2,3,4$ ) and JPEG compression (quality = 90, 80, 70, 60). For each distortion level, we compare REVEAL with the state-of-the-art baseline methods (results are shown in Figure 5). The results indicate that REVEAL demonstrates stronger robustness and improved cross-domain generalization across the considered post-processing settings.
## 6 Conclusion
We presented REVEAL, a reasoning-centered approach for explainable AI-generated image detection. First, we introduced REVEAL-Bench, the first dataset organized around expert-grounded, verifiable forensic evidence and an explicit chain-of-evidence following an evidence-then-reasoning paradigm. Second, we proposed the REVEAL Framework, a progressive two-stage training scheme whose core component R-GRPO explicitly teaches multimodal LLMs to perform logical synthesis over forensic evidence, jointly improving accuracy, reasoning consistency, and generalization. Empirically, REVEAL attains superior detection accuracy, stronger out-of-domain generalization, and higher explanation fidelity, establishing a new state of the art for reasoning-based image forensics.