CVApr 16, 2025
Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for ForensicsYiran He, Yun Cao, Bowen Yang et al.
The rapid development of generative AI facilitates content creation and makes image manipulation easier and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
CVJun 26, 2025
Multimodal Prompt Alignment for Facial Expression RecognitionFuyan Ma, Yiran He, Bin Sun et al.
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
SPOct 14, 2021
On the Stability of Low Pass Graph Filter With a Large Number of Edge RewiresHoang-Son Nguyen, Yiran He, Hoi-To Wai
Recently, the stability of graph filters has been studied as one of the key theoretical properties driving the highly successful graph convolutional neural networks (GCNs). The stability of a graph filter characterizes the effect of topology perturbation on the output of a graph filter, a fundamental building block for GCNs. Many existing results have focused on the regime of small perturbation with a small number of edge rewires. However, the number of edge rewires can be large in many applications. To study the latter case, this work departs from the previous analysis and proves a bound on the stability of graph filter relying on the filter's frequency response. Assuming the graph filter is low pass, we show that the stability of the filter depends on perturbation to the community structure. As an application, we show that for stochastic block model graphs, the graph filter distance converges to zero when the number of nodes approaches infinity. Numerical simulations validate our findings.