Chunjie Ma

CV
h-index6
4papers
10citations
Novelty50%
AI Score41

4 Papers

39.5CVMar 31
GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Yaning Zhang, Linlin Shen, Zitong Yu et al.

Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

60.7CVMay 11
MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Yaning Zhang, Tianyi Wang, Zan Gao et al.

The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the requirement of generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using image modality, other modalities like fine-grained texts are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GAN, but struggle to identify and localize those synthesized by diffusion. To solve the problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.

CVNov 1, 2024
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification

Shengxun Wei, Zan Gao, Chunjie Ma et al.

Cloth-changing person re-identification is a subject closer to the real world, which focuses on solving the problem of person re-identification after pedestrians change clothes. The primary challenge in this field is to overcome the complex interplay between intra-class and inter-class variations and to identify features that remain unaffected by changes in appearance. Sufficient data collection for model training would significantly aid in addressing this problem. However, it is challenging to gather diverse datasets in practice. Current methods focus on implicitly learning identity information from the original image or introducing additional auxiliary models, which are largely limited by the quality of the image and the performance of the additional model. To address these issues, inspired by prompt learning, we propose a novel multiple information prompt learning (MIPL) scheme for cloth-changing person ReID, which learns identity robust features through the common prompt guidance of multiple messages. Specifically, the clothing information stripping (CIS) module is designed to decouple the clothing information from the original RGB image features to counteract the influence of clothing appearance. The Bio-guided attention (BGA) module is proposed to increase the learning intensity of the model for key information. A dual-length hybrid patch (DHP) module is employed to make the features have diverse coverage to minimize the impact of feature bias. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods on the LTCC, Celeb-reID, Celeb-reID-light, and CSCC datasets, achieving rank-1 scores of 74.8%, 73.3%, 66.0%, and 88.1%, respectively. When compared to AIM (CVPR23), ACID (TIP23), and SCNet (MM23), MIPL achieves rank-1 improvements of 11.3%, 13.8%, and 7.9%, respectively, on the PRCC dataset.

CVApr 19, 2025
PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Yaning Zhang, Jiahe Zhang, Chunjie Ma et al.

The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen advanced generators like diffusion in a fine-grained manner. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning(PVLM) method for zero-shot deepfake attribution (ZS-DFA),which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors to unseen advanced generators like diffusion. Besides, we propose an innovative parsing-guided vision language model with dynamic contrastive learning (PVLM) method to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We employ the inherent face attributes preservation differences to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.