FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection
FUSE addresses the need for reliable detection of AI-generated images, which is crucial for security and media integrity, and reports measurable gains on established benchmarks.
The paper tackles the problem of detecting AI-generated images by introducing FUSE, a hybrid system that combines spectral and semantic features. Its Stage 1 model achieves state-of-the-art results on the Chameleon benchmark, 91.36% mean accuracy on GenImage, and 88.71% accuracy across all tested generators.
The rapid evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through the Fast Fourier Transform with semantic features obtained from CLIP's vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on the GenImage, WildFake, DiTFake, GPT-ImgEval, and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model achieves state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on the high-fidelity images in Chameleon, our approach remains robust across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of AI-generated images.
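To make the idea of a joint spectral-semantic representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the spectral branch can be summarized by a radially averaged log-magnitude FFT spectrum and that fusion is plain concatenation of L2-normalized vectors; the abstract specifies neither. The helper names (`spectral_features`, `fuse`), the 16-bin setting, and the random 512-dimensional vector standing in for a CLIP image embedding are all hypothetical.

```python
import numpy as np

def spectral_features(gray, n_bins=16):
    """Radially averaged log-magnitude spectrum of a 2-D grayscale image.

    Assumption: this summary stands in for the paper's FFT-based features.
    """
    # Centre the zero frequency so that radius corresponds to frequency.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    log_mag = np.log1p(np.abs(spectrum))

    h, w = gray.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)

    # Assign every frequency sample to one of n_bins concentric rings.
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bins + 1)
    ring = np.digitize(radius.ravel(), edges[1:-1])

    sums = np.bincount(ring, weights=log_mag.ravel(), minlength=n_bins)
    counts = np.bincount(ring, minlength=n_bins)
    return sums / np.maximum(counts, 1)

def l2_normalize(v):
    """Scale a vector to unit length (guarding against zero norm)."""
    return v / (np.linalg.norm(v) + 1e-12)

def fuse(spectral, semantic):
    """Hypothetical fusion: concatenate the two normalized feature vectors."""
    return np.concatenate([l2_normalize(spectral), l2_normalize(semantic)])

rng = np.random.default_rng(0)
image = rng.random((64, 64))            # placeholder grayscale image
semantic = rng.standard_normal(512)     # stand-in for a CLIP image embedding
joint = fuse(spectral_features(image), semantic)
print(joint.shape)  # (528,)
```

In a real pipeline the placeholder embedding would come from a pretrained CLIP vision encoder, and the joint vector would feed a classifier head trained in the two progressive stages the abstract describes.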