CV · LG · Dec 4, 2025

Rethinking the Use of Vision Transformers for AI-Generated Image Detection

arXiv:2512.04969v1 · 2 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the problem of detecting AI-generated images for security and verification purposes, presenting an incremental improvement over existing methods.

The paper analyzes layer-wise features of Vision Transformers for AI-generated image detection and finds that earlier layers often outperform the final layer. It introduces MoLD, an adaptive method that combines features from multiple layers through a gating-based mechanism, and reports improved detection on both GAN- and diffusion-generated images.

Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
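
The abstract does not spell out how the gating-based mechanism is built, so the following is only a minimal sketch of the general idea of gated fusion over per-layer features, not the authors' MoLD implementation. It assumes one feature vector per ViT layer (for example a CLS embedding) and one learned gate vector per layer. The names `gated_layer_fusion` and `gate_vectors` and the toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_layer_fusion(layer_feats, gate_vectors):
    """Fuse per-layer feature vectors with input-dependent softmax gates.

    layer_feats:  list of L vectors, one per ViT layer (assumed input).
    gate_vectors: list of L hypothetical learned vectors, one per layer.
    Returns (fused_vector, gate_weights).
    """
    if len(layer_feats) != len(gate_vectors):
        raise ValueError("need one gate vector per layer")
    # One scalar score per layer: dot product of its feature and gate vector.
    scores = [sum(f * g for f, g in zip(feat, gate))
              for feat, gate in zip(layer_feats, gate_vectors)]
    # Softmax turns the scores into weights that sum to 1 across layers.
    weights = softmax(scores)
    dim = len(layer_feats[0])
    # Weighted sum of the layer features, computed per dimension.
    fused = [sum(w * feat[d] for w, feat in zip(weights, layer_feats))
             for d in range(dim)]
    return fused, weights

# Toy example: 3 "layers" with 2-dimensional, made-up features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gates = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero gates give uniform weights
fused, weights = gated_layer_fusion(feats, gates)
```

In a real detector the fused vector would feed a binary real-versus-generated classifier, and the gate parameters would be trained jointly with it. Both of those steps are left out of this sketch.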

