CVMay 9

EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

arXiv:2605.0869567.2
Predicted impact top 47% in CV · last 90 daysOriginality Incremental advance
AI Analysis

Provides a large-scale, verifiable reasoning dataset for image-edit forensics, enabling models to produce grounded explanations beyond binary detection.

EditSleuth is a dataset of 257,725 image-edit triplets for grounded forensic reasoning, with deterministic reasoning chains. Fine-tuning Qwen2-VL-2B with chain supervision matches label-only classification accuracy while producing grounded explanations.

Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes