CVMay 31
TextFake: Benchmarking AI-Generated Image Detection on Text-Rich ImagesYuning Zhang, Changtao Miao, Mingyu Liao et al.
Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses enerative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.
CLApr 23Code
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use BehaviorsChenghao Yang, Yuning Zhang, Zhoufutu Wen et al.
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: \textbf{Response Pattern Similarity (RPS)} for verbal alignment and \textbf{Action Graph Similarity (AGS)} for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $τ$-Bench and $τ^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6\% $S_{\text{node}}$ and 94.7\% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
CVMar 29Code
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMsXuanpu Zhao, Zhentao Tan, Dianmo Sheng et al.
To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the ``Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
CVAug 5, 2024
Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and LocalizationChangtao Miao, Qi Chu, Tao Gong et al.
With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization and the proposed \textit{MoNFAP} achieves significant performance. The codes will be made available.
IVOct 20, 2022
Physics-informed Deep Diffusion MRI Reconstruction with Synthetic Data: Break Training Data Bottleneck in Artificial IntelligenceChen Qian, Haoyu Zhang, Yuncheng Gao et al.
Diffusion magnetic resonance imaging (MRI) is the only imaging modality for non-invasive movement detection of in vivo water molecules, with significant clinical and research applications. Diffusion weighted imaging (DWI) MRI acquired by multi-shot techniques can achieve higher resolution, better signal-to-noise ratio, and lower geometric distortion than single-shot, but suffers from inter-shot motion-induced artifacts. These artifacts cannot be removed prospectively, leading to the absence of artifact-free training labels. Thus, the potential of deep learning in multi-shot DWI reconstruction remains largely untapped. To break the training data bottleneck, here, we propose a Physics-Informed Deep DWI reconstruction method (PIDD) to synthesize high-quality paired training data by leveraging the physical diffusion model (magnitude synthesis) and inter-shot motion-induced phase model (motion phase synthesis). The network is trained only once with 100,000 synthetic samples, achieving encouraging results on multiple realistic in vivo data reconstructions. Advantages over conventional methods include: (a) Better motion artifact suppression and reconstruction stability; (b) Outstanding generalization to multi-scenario reconstructions, including multi-resolution, multi-b-value, multi-under-sampling, multi-vendor, and multi-center; (c) Excellent clinical adaptability to patients with verifications by seven experienced doctors (p<0.001). In conclusion, PIDD presents a novel deep learning framework by exploiting the power of MRI physics, providing a cost-effective and explainable way to break the data bottleneck in deep learning medical imaging.
CVJan 9
SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation LocalizationXinghao Wang, Changtao Miao, Dianmo Sheng et al.
Malicious image manipulation threatens public safety and requires efficient localization methods. Existing approaches depend on costly pixel-level annotations which make training expensive. Existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIPs multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL) to exploit edge information in both textual and visual spaces. The proposed ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantic-irrelevant information into text features, to guide CLIP focusing on manipulation edges. The proposed HECL extract genuine and manipulated edge patches, and utilize contrastive learning to boost the discrimination between genuine edge patches and manipulated edge patches. Finally, we predict the manipulated regions from the similarity map after processing. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.
CVNov 26, 2025
GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety SupervisionYuxiao Xiang, Junchi Chen, Zhenchao Jin et al.
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.
CLJul 26, 2025Code
Flora: Effortless Context Construction to Arbitrary Length and ScaleTianxiang Chen, Zhentao Tan, Xiaofan Bo et al.
Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at \href{https://github.com/txchen-USTC/Flora}{https://github.com/txchen-USTC/Flora}.
CVDec 30, 2024Code
Inclusion 2024 Global Multimedia Deepfake Detection Challenge: Towards Multi-dimensional Face Forgery DetectionYi Zhang, Weize Gao, Changtao Miao et al.
In this paper, we present the Global Multimedia Deepfake Detection held concurrently with the Inclusion 2024. Our Multimedia Deepfake Detection aims to detect automatic image and audio-video manipulations including but not limited to editing, synthesis, generation, Photoshop,etc. Our challenge has attracted 1500 teams from all over the world, with about 5000 valid result submission counts. We invite the top 20 teams to present their solutions to the challenge, from which the top 3 teams are awarded prizes in the grand finale. In this paper, we present the solutions from the top 3 teams of the two tracks, to boost the research work in the field of image and audio-video forgery detection. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection systems and we encourage participants to open source their methods.
CVMar 4, 2024
MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target DetectionTianxiang Chen, Zi Ye, Zhentao Tan et al.
Recently, infrared small target detection (ISTD) has made significant progress, thanks to the development of basic models. Specifically, the models combining CNNs with transformers can successfully extract both local and global features. However, the disadvantage of the transformer is also inherited, i.e., the quadratic computational complexity to sequence length. Inspired by the recent basic model with linear complexity for long-distance modeling, Mamba, we explore the potential of this state space model for ISTD task in terms of effectiveness and efficiency in the paper. However, directly applying Mamba achieves suboptimal performances due to the insufficient harnessing of local features, which are imperative for detecting small targets. Instead, we tailor a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient ISTD. It consists of Outer and Inner Mamba blocks to adeptly capture both global and local features. Specifically, we treat the local patches as "visual sentences" and use the Outer Mamba to explore the global information. We then decompose each visual sentence into sub-patches as "visual words" and use the Inner Mamba to further explore the local information among words in the visual sentence with negligible computational costs. By aggregating the visual word and visual sentence features, our MiM-ISTD can effectively explore both global and local information. Experiments on NUAA-SIRST and IRSTD-1k show the superior accuracy and efficiency of our method. Specifically, MiM-ISTD is $8 \times$ faster than the SOTA method and reduces GPU memory usage by 62.2$\%$ when testing on $2048 \times 2048$ images, overcoming the computation and memory constraints on high-resolution infrared images.
CVSep 6, 2025Code
MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World ScenariosChangtao Miao, Yi Zhang, Man Luo et al.
Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these data sets fall short in four key areas: unknown of advanced forgery techniques, variability of facial scenes, richness of real data, and degradation of real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (\textbf{MFFI}) dataset, tailored for real-world scenarios. MFFI enhances realism based on four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates $50$ different forgery methods and contains $1024K$ image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at {https://github.com/inclusionConf/MFFI}.
CLMay 27, 2025Code
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue EvaluationChenghao Yang, Yinbo Luo, Zhoufutu Wen et al.
Large Language Models (\textbf{LLMs}), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present \textbf{MARS-Bench}, a \textbf{M}ulti-turn \textbf{A}thletic \textbf{R}eal-world \textbf{S}cenario Dialogue \textbf{Bench}mark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction.
CVMay 18, 2023Code
Multi-spectral Class Center Network for Face Manipulation Detection and LocalizationChangtao Miao, Qi Chu, Zhentao Tan et al.
As deepfake content proliferates online, advancing face manipulation forensics has become crucial. To combat this emerging threat, previous methods mainly focus on studying how to distinguish authentic and manipulated face images. Although impressive, image-level classification lacks explainability and is limited to specific application scenarios, spurring recent research on pixel-level prediction for face manipulation forensics. However, existing forgery localization methods suffer from exploring frequency-based forgery traces in the localization network. In this paper, we observe that multi-frequency spectrum information is effective for identifying tampered regions. To this end, a novel Multi-Spectral Class Center Network (MSCCNet) is proposed for face manipulation detection and localization. Specifically, we design a Multi-Spectral Class Center (MSCC) module to learn more generalizable and multi-frequency features. Based on the features of different frequency bands, the MSCC module collects multi-spectral class centers and computes pixel-to-class relations. Applying multi-spectral class-level representations suppresses the semantic information of the visual concepts which is insensitive to manipulated regions of forgery images. Furthermore, we propose a Multi-level Features Aggregation (MFA) module to employ more low-level forgery artifacts and structural textures. Meanwhile, we conduct a comprehensive localization benchmark based on pixel-level FF++ and Dolos datasets. Experimental results quantitatively and qualitatively demonstrate the effectiveness and superiority of the proposed MSCCNet. We expect this work to inspire more studies on pixel-level face manipulation localization. The codes are available (https://github.com/miaoct/MSCCNet).
CVMay 8, 2023Code
MultiModal-GPT: A Vision and Language Model for Dialogue with HumansTao Gong, Chengqi Lyu, Shilong Zhang et al.
We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of interested objects, and answering general questions from users. MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with Low-rank Adapter (LoRA) added both in the cross-attention part and the self-attention part of the language model. We first construct instruction templates with vision and language data for multi-modality instruction tuning to make the model understand and follow human instructions. We find the quality of training data is vital for the dialogue performance, where few data containing short answers can lead the model to respond shortly to any instructions. To further enhance the ability to chat with humans of the MultiModal-GPT, we utilize language-only instruction-following data to train the MultiModal-GPT jointly. The joint training of language-only and visual-language instructions with the \emph{same} instruction template effectively improves dialogue performance. Various demos show the ability of continuous dialogue of MultiModal-GPT with humans. Code, dataset, and demo are at https://github.com/open-mmlab/Multimodal-GPT
CVFeb 28, 2022Code
StrongSORT: Make DeepSORT Great AgainYunhao Du, Zhicheng Zhao, Yang Song et al.
Recently, Multi-Object Tracking (MOT) has attracted rising attention, and accordingly, remarkable progresses have been achieved. However, the existing methods tend to use various basic models (e.g, detector and embedding model), and different training or inference tricks, etc. As a result, the construction of a good baseline for a fair comparison is essential. In this paper, a classic tracker, i.e., DeepSORT, is first revisited, and then is significantly improved from multiple perspectives such as object detection, feature embedding, and trajectory association. The proposed tracker, named StrongSORT, contributes a strong and fair baseline for the MOT community. Moreover, two lightweight and plug-and-play algorithms are proposed to address two inherent "missing" problems of MOT: missing association and missing detection. Specifically, unlike most methods, which associate short tracklets into complete trajectories at high computation complexity, we propose an appearance-free link model (AFLink) to perform global association without appearance information, and achieve a good balance between speed and accuracy. Furthermore, we propose a Gaussian-smoothed interpolation (GSI) based on Gaussian process regression to relieve the missing detection. AFLink and GSI can be easily plugged into various trackers with a negligible extra computational cost (1.7 ms and 7.1 ms per image, respectively, on MOT17). Finally, by fusing StrongSORT with AFLink and GSI, the final tracker (StrongSORT++) achieves state-of-the-art results on multiple public benchmarks, i.e., MOT17, MOT20, DanceTrack and KITTI. Codes are available at https://github.com/dyhBUPT/StrongSORT and https://github.com/open-mmlab/mmtracking.
CVSep 8, 2021Code
Temporal RoI Align for Video Object RecognitionTao Gong, Kai Chen, Xinjiang Wang et al.
Video object detection is challenging in the presence of appearance deterioration in certain video frames. Therefore, it is a natural choice to aggregate temporal information from other frames of the same video into the current frame. However, RoI Align, as one of the most core procedures of video detectors, still remains extracting features from a single-frame feature map for proposals, making the extracted RoI features lack temporal information from videos. In this work, considering the features of the same object instance are highly similar among frames in a video, a novel Temporal RoI Align operator is proposed to extract features from other frames feature maps for current frame proposals by utilizing feature similarity. The proposed Temporal RoI Align operator can extract temporal information from the entire video for proposals. We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that the proposed Temporal RoI Align operator can consistently and significantly boost the performance. Besides, the proposed Temporal RoI Align can also be applied into video instance segmentation. Codes are available at https://github.com/open-mmlab/mmtracking
CVAug 23, 2020Code
Seesaw Loss for Long-Tailed Instance SegmentationJiaqi Wang, Wenwei Zhang, Yuhang Zang et al.
Instance segmentation has witnessed a remarkable progress on class-balanced benchmarks. However, they fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and they serve as negative samples of tail categories. The overwhelming gradients of negative samples on tail classes lead to a biased learning process for classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or head categories. To tackle this problem, we propose Seesaw Loss to dynamically re-balance gradients of positive and negative samples for each category, with two complementary factors, i.e., mitigation factor and compensation factor. The mitigation factor reduces punishments to tail categories w.r.t. the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. We conduct extensive experiments on Seesaw Loss with mainstream frameworks and different data sampling strategies. With a simple end-to-end training pipeline, Seesaw Loss obtains significant gains over Cross-Entropy Loss, and achieves state-of-the-art performance on LVIS dataset without bells and whistles. Code is available at https://github.com/open-mmlab/mmdetection.
CVDec 9, 2019Code
Side-Aware Boundary Localization for More Precise Object DetectionJiaqi Wang, Wenwei Zhang, Yuhang Cao et al.
Current object detection frameworks mainly rely on bounding box regression to localize objects. Despite the remarkable progress in recent years, the precision of bounding box regression remains unsatisfactory, hence limiting performance in object detection. We observe that precise localization requires careful placement of each side of the bounding box. However, the mainstream approach, which focuses on predicting centers and sizes, is not the most effective way to accomplish this task, especially when there exists displacements with large variance between the anchors and the targets. In this paper, we propose an alternative approach, named as Side-Aware Boundary Localization (SABL), where each side of the bounding box is respectively localized with a dedicated network branch. To tackle the difficulty of precise localization in the presence of displacements with large variance, we further propose a two-step localization scheme, which first predicts a range of movement through bucket prediction and then pinpoints the precise position within the predicted bucket. We test the proposed method on both two-stage and single-stage detection frameworks. Replacing the standard bounding box regression branch with the proposed design leads to significant improvements on Faster R-CNN, RetinaNet, and Cascade R-CNN, by 3.0%, 1.7%, and 0.9%, respectively. Code is available at https://github.com/open-mmlab/mmdetection.
CVDec 5, 2023
Towards More Unified In-context Visual UnderstandingDianmo Sheng, Dongdong Chen, Zhentao Tan et al.
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL framework can not enable producing content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline.Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
CVFeb 4, 2024
Bootstrapping Audio-Visual Segmentation by Strengthening Audio CuesTianxiang Chen, Zhentao Tan, Tao Gong et al.
How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to a more balanced audio-visual representation learning. Extensive experiments show that our method attains new benchmarks in AVS performance.
CVJun 29, 2025
DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World ScenariosChangtao Miao, Yi Zhang, Weize Gao et al.
Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. Recent studies have attempted to enhance the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, due to the limitations of forgery datasets, the practical effectiveness of these methods remains suboptimal. The primary reason lies in the fact that most existing deepfake datasets contain only binary labels, with limited variety in forgery scenarios, insufficient diversity in deepfake types, and relatively small data scales, making them inadequate for complex real-world scenarios.To address this predicament, we construct a novel large-scale deepfake detection and localization (\textbf{DDL}) dataset containing over $\textbf{1.4M+}$ forged samples and encompassing up to $\textbf{80}$ distinct deepfake methods. The DDL design incorporates four key innovations: (1) \textbf{Comprehensive Deepfake Methods} (covering 7 different generation architectures and a total of 80 methods), (2) \textbf{Varied Manipulation Modes} (incorporating 7 classic and 3 novel forgery modes), (3) \textbf{Diverse Forgery Scenarios and Modalities} (including 3 scenarios and 3 modalities), and (4) \textbf{Fine-grained Forgery Annotations} (providing 1.18M+ precise spatial masks and 0.23M+ precise temporal segments).Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods.
CVMay 19, 2025
Benchmarking Unified Face Attack Detection via Hierarchical Prompt TuningAjian Liu, Haocheng Yuan, Xiao Guo et al.
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. However, isolated training of these two models significantly increases vulnerability towards unknown attacks, burdening deployment environments. The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors: (1) A benchmark that is sufficient for models to explore is lacking. Existing UAD datasets only contain limited attack types and samples, leading to the model's confined ability to address abundant advanced threats. In light of these, through an explainable hierarchical way, we propose the most extensive and sophisticated collection of forgery techniques available to date, namely UniAttackDataPlus. Our UniAttackData+ encompasses 2,875 identities and their 54 kinds of corresponding falsified samples, in a total of 697,347 videos. (2) The absence of a trustworthy classification criterion. Current methods endeavor to explore an arbitrary criterion within the same semantic space, which fails to exist when encountering diverse attacks. Thus, we present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces. Specifically, we construct a VP-Tree to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model can select the most suitable prompts guiding the encoder to extract discriminative features at different levels in a coarse-to-fine manner. Finally, to help the model understand the classification criteria in visual space, we propose a DPI module to project the visual prompts to the text encoder to help obtain a more accurate semantics.
CVMar 26, 2025
Context-Aware Weakly Supervised Image Manipulation Localization with SAM RefinementXinghao Wang, Tao Gong, Qi Chu et al.
Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent approaches in image manipulation detection have largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.
CVOct 21, 2025
Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot LearningJiaao Yu, Mingjie Han, Jinkun Jiang et al.
The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross-category generalization, giving rise to the demand for Domain-Adaptive Zero-Shot Learning (DAZSL). Although vision-language models (e.g., CLIP) have inherent advantages in the DAZSL field, current studies do not fully exploit their potential. Applying CLIP to DAZSL faces two core challenges: inefficient cross-category knowledge transfer due to the lack of semantic relation guidance, and degraded cross-modal alignment during target domain fine-tuning. To address these issues, we propose a Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework, integrating a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy. As the first CLIP-based DAZSL method, SRE-CLIP achieves state-of-the-art performance on the I2AwA and I2WebV benchmarks, significantly outperforming existing approaches.
CVOct 21, 2025
Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video RetrievalJiaao Yu, Mingjie Han, Tao Gong et al.
With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they heavily rely on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modal gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language model, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement for dynamic video features and fail to effectively suppress static redundant features. To address this issue, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), which is a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional Alpha channel. This proactively guides the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
CVOct 11, 2025
Training-Free In-Context Forensic Chain for Image Manipulation Detection and LocalizationRui Chen, Bin Liu, Changtao Miao et al.
Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
CVSep 20, 2025
Towards Anytime Retrieval: A Benchmark for Anytime Person Re-IdentificationXulin Li, Yan Lu, Bin Liu et al.
In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets can not meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 403k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans 21 months, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific features learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.
CVAug 26, 2021
Mining Contextual Information Beyond Image for Semantic SegmentationZhenchao Jin, Tao Gong, Dongdong Yu et al.
This paper studies the context aggregation problem in semantic image segmentation. The existing researches focus on improving the pixel representations by aggregating the contextual information within individual images. Though impressive, these methods neglect the significance of the representations of the pixels of the corresponding class beyond the input image. To address this, this paper proposes to mine the contextual information beyond individual images to further augment the pixel representations. We first set up a feature memory module, which is updated dynamically during training, to store the dataset-level representations of various categories. Then, we learn class probability distribution of each pixel representation under the supervision of the ground-truth segmentation. At last, the representation of each pixel is augmented by aggregating the dataset-level representations based on the corresponding class probability distribution. Furthermore, by utilizing the stored dataset-level representations, we also propose a representation consistent learning strategy to make the classification head better address intra-class compactness and inter-class dispersion. The proposed method could be effortlessly incorporated into existing segmentation frameworks (e.g., FCN, PSPNet, OCRNet and DeepLabV3) and brings consistent performance improvements. Mining contextual information beyond image allows us to report state-of-the-art performance on various benchmarks: ADE20K, LIP, Cityscapes and COCO-Stuff.
CVMar 14, 2021
Towards Generalizable and Robust Face Manipulation Detection via Bag-of-local-featureChangtao Miao, Qi Chu, Weihai Li et al.
Over the past several years, in order to solve the problem of malicious abuse of facial manipulation technology, face manipulation detection technology has obtained considerable attention and achieved remarkable progress. However, most existing methods have very impoverished generalization ability and robustness. In this paper, we propose a novel method for face manipulation detection, which can improve the generalization ability and robustness by bag-of-local-feature. Specifically, we extend Transformers using bag-of-feature approach to encode inter-patch relationships, allowing it to learn local forgery features without any explicit supervision. Extensive experiments demonstrate that our method can outperform competing state-of-the-art methods on FaceForensics++, Celeb-DF and DeeperForensics-1.0 datasets.
SPFeb 23, 2020
A study of resting-state EEG biomarkers for depression recognitionShuting Sun, Jianxiu Li, Huayu Chen et al.
Background: Depression has become a major health burden worldwide, and effective detection depression is a great public-health challenge. This Electroencephalography (EEG)-based research is to explore the effective biomarkers for depression recognition. Methods: Resting state EEG data was collected from 24 major depressive patients (MDD) and 29 normal controls using 128 channel HydroCel Geodesic Sensor Net (HCGSN). To better identify depression, we extracted different types of EEG features including linear features, nonlinear features and functional connectivity features phase lagging index (PLI) to comprehensively analyze the EEG signals in patients with MDD. And using different feature selection methods and classifiers to evaluate the optimal feature sets. Results: Functional connectivity feature PLI is superior to the linear features and nonlinear features. And when combining all the types of features to classify MDD patients, we can obtain the highest classification accuracy 82.31% using ReliefF feature selection method and logistic regression (LR) classifier. Analyzing the distribution of optimal feature set, it was found that intrahemispheric connection edges of PLI were much more than the interhemispheric connection edges, and the intrahemispheric connection edges had a significant differences between two groups. Conclusion: Functional connectivity feature PLI plays an important role in depression recognition. Especially, intrahemispheric connection edges of PLI might be an effective biomarker to identify depression. And statistic results suggested that MDD patients might exist functional dysfunction in left hemisphere.
CVApr 26, 2019
A Large Scale Urban Surveillance Video Dataset for Multiple-Object Tracking and Behavior AnalysisGuojun Yin, Bin Liu, Huihui Zhu et al.
Multiple-object tracking and behavior analysis have been the essential parts of surveillance video analysis for public security and urban management. With billions of surveillance video captured all over the world, multiple-object tracking and behavior analysis by manual labor are cumbersome and cost expensive. Due to the rapid development of deep learning algorithms in recent years, automatic object tracking and behavior analysis put forward an urgent demand on a large scale well-annotated surveillance video dataset that can reflect the diverse, congested, and complicated scenarios in real applications. This paper introduces an urban surveillance video dataset (USVD) which is by far the largest and most comprehensive. The dataset consists of 16 scenes captured in 7 typical outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. Over 200k video frames are annotated carefully, resulting in more than 3:7 million object bounding boxes and about 7:1 thousand trajectories. We further use this dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomaly behavior analysis and explore the robustness of these methods in urban congested scenarios.