CV · CL · Feb 27

NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection

Xiaoyu Guo, Arkaitz Zubiaga
arXiv:2602.23863v1 · 2 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of AI-generated content detection for security and verification applications, but it is incremental as it builds on existing pre-trained models and competition benchmarks.

The paper tackles detecting AI-generated images and identifying their source models with a multi-modal, multi-task model that combines BERT and CLIP Vision encoders, cross-modal feature fusion, and pseudo-labeling-based data augmentation. It placed fifth in both tasks of the 'CT2: AI-Generated Image Detection' competition, with F1 scores of 83.16% on Task A (detection) and 48.88% on Task B (model identification).

With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the 'CT2: AI-Generated Image Detection' competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.
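
The abstract does not spell out the form of the multi-task loss or the confidence rule used for pseudo-labeling, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the two task losses are cross-entropies combined by a weighted sum, and that an unlabeled sample is kept when its top softmax probability reaches a fixed threshold. The names `multi_task_loss`, `select_pseudo_labels`, `alpha`, and `threshold` are hypothetical, and the logits stand in for the outputs of the fused BERT and CLIP features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of one sample given raw logits and the true class index."""
    return -math.log(softmax(logits)[target])


def multi_task_loss(det_logits, src_logits, det_label, src_label, alpha=0.5):
    """Weighted sum of the two task losses.

    Assumption: `alpha` weights detection (Task A) against source-model
    identification (Task B); the paper's tailored loss may differ.
    """
    loss_a = cross_entropy(det_logits, det_label)
    loss_b = cross_entropy(src_logits, src_label)
    return alpha * loss_a + (1.0 - alpha) * loss_b


def select_pseudo_labels(unlabeled_logits, threshold=0.9):
    """Keep unlabeled samples whose top class probability is >= threshold.

    Assumption: the 0.9 default is illustrative, not a value from the paper.
    Returns (sample_index, predicted_label, confidence) tuples.
    """
    selected = []
    for idx, logits in enumerate(unlabeled_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((idx, probs.index(confidence), confidence))
    return selected


if __name__ == "__main__":
    # Toy logits: a 2-way detection head and a 4-way source-model head.
    print(multi_task_loss([2.0, 0.5], [0.1, 1.5, 0.2, 0.0], 0, 1))
    # Only confidently predicted samples are kept as pseudo-labeled data.
    print(select_pseudo_labels([[5.0, 0.0], [0.1, 0.0], [0.0, 4.0]]))
```

In this sketch the selected samples would be appended to the training set with their predicted labels before retraining; how many rounds of this the authors ran, if more than one, is not stated in the abstract.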

