CVAINov 24, 2024

The Visual Counter Turing Test (VCT2): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (VAI)

arXiv:2411.16754v22 citationsh-index: 13
Originality Synthesis-oriented
AI Analysis

This addresses the challenge of misinformation from AI-generated visuals by providing a benchmark for evaluating detection methods, though it is incremental as it builds on existing AGID research with new data and metrics.

The paper tackles the problem of AI-generated image detection (AGID) methods overfitting to known generators by introducing the Visual Counter Turing Test (VCT2), a benchmark of 166,000 images from six state-of-the-art text-to-image systems, and finds low detection accuracy of 58% on average across 17 models. It also proposes the Visual AI Index (VAI), an interpretable realism metric that shows a moderate inverse correlation with detection accuracy, indicating more realistic images are harder to detect.

The rapid progress and widespread availability of text-to-image (T2I) generative models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. We introduce the Visual Counter Turing Test (VCT2), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL.E 3, and Midjourney 6. We curate two distinct subsets: COCOAI, featuring structured captions from MS COCO, and TwitterAI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy, 58% on COCOAI and 58.34% on TwitterAI. To transcend binary classification, we propose the Visual AI Index (VAI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between VAI and detection accuracy: Pearson of -0.532 on COCOAI and -0.503 on TwitterAI, suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCOAI, TwitterAI, and all codes to catalyze future advances in generalized AGID and perceptual realism assessment.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes