81.5 · CV · Jun 4 · Code
StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
Zhengqian Wu, Zhixian Liu, Aodong Chen et al.
Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address this, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/
83.5 · CV · Apr 15
Seedance 2.0: Advancing Video Generation for World Complexity
Team Seedance, De Chen, Liyang Chen et al. · gatech
Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.
CV · Jul 19, 2024 · Code
Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking
Yunfei Zhang, Chao Liang, Jin Gao et al.
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks by embedding the Re-Identification (ReID) task into the detector so that appearance features are extracted as an auxiliary task, achieving a balance between inference speed and tracking performance. However, resolving the competition between the detector and the feature extractor has always been a challenge. Meanwhile, the issue of directly embedding the ReID task into MOT has remained unresolved. The lack of high discriminability in appearance features results in their limited utility. In this paper, a new learning approach using cross-correlation to capture temporal information of objects is proposed. The feature extraction network is no longer trained solely on appearance features from each frame but learns richer motion features by utilizing feature heatmaps from consecutive frames, which addresses the challenge of inter-class feature similarity. Furthermore, our learning approach is applied to a more lightweight feature extraction network, and we treat the feature matching scores as strong cues rather than auxiliary cues, with an appropriate weight calculation to reflect the compatibility between our obtained features and the MOT task. Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks, i.e., the MOT17, MOT20, and DanceTrack datasets. Specifically, on the DanceTrack test set, we achieve 56.8 HOTA, 58.1 IDF1 and 92.5 MOTA, making it the best online tracker capable of achieving real-time performance. Comparative evaluations with other trackers show that our tracker achieves the best balance between speed, robustness and accuracy. Code is available at https://github.com/yfzhang1214/TCBTrack.
CV · Jul 3, 2023
Co-Learning Meets Stitch-Up for Noisy Multi-label Visual Recognition
Chao Liang, Zongxin Yang, Linchao Zhu et al.
In real-world scenarios, collected and annotated data often exhibit the characteristics of multiple classes and long-tailed distribution. Additionally, label noise is inevitable in large-scale annotations and hinders the applications of learning-based models. Although many deep learning based methods have been proposed for handling long-tailed multi-label recognition or label noise respectively, learning with noisy labels in long-tailed multi-label visual data has not been well-studied because of the complexity of long-tailed distribution entangled with multi-label correlation. To tackle such a critical yet thorny problem, this paper focuses on reducing noise based on some inherent properties of multi-label classification and long-tailed learning under noisy cases. In detail, we propose a Stitch-Up augmentation to synthesize a cleaner sample, which directly reduces multi-label noise by stitching up multiple noisy training samples. Equipped with Stitch-Up, a Heterogeneous Co-Learning framework is further designed to leverage the inconsistency between long-tailed and balanced distributions, yielding cleaner labels for more robust representation learning with noisy long-tailed data. To validate our method, we build two challenging benchmarks, named VOC-MLT-Noise and COCO-MLT-Noise, respectively. Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. Compared to a variety of baselines, our method achieves superior results.
CV · Sep 14, 2024 · Code
Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown
Zimeng Fang, Chao Liang, Xue Zhou et al.
Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g. motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at https://github.com/balabooooo/AED.
67.7 · CV · Mar 25 · Code
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Zhanhe Lei, Zhongyuan Wang, Jikang Cheng et al.
Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
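The per-sample bookkeeping described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the PPO Tutor itself is omitted, the names `SampleHistory`, `update_history`, `weighted_batch_loss` and `transition_reward` are hypothetical, and the +1 reward per incorrect-to-correct flip is an assumption (the abstract only says such transitions are rewarded).

```python
from dataclasses import dataclass


@dataclass
class SampleHistory:
    """Per-sample learning dynamics (hypothetical container)."""
    ema_loss: float = 0.0
    forgetting_count: int = 0
    was_correct: bool = False
    seen: bool = False


def update_history(h, loss, correct, decay=0.9):
    """Update the EMA loss and count correct -> incorrect flips (forgetting events)."""
    if h.seen:
        h.ema_loss = decay * h.ema_loss + (1.0 - decay) * loss
        if h.was_correct and not correct:
            h.forgetting_count += 1
    else:
        # First observation initialises the moving average.
        h.ema_loss = loss
    h.was_correct = correct
    h.seen = True
    return h


def weighted_batch_loss(losses, weights):
    """Re-weight per-sample losses with Tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


def transition_reward(correct_before, correct_after):
    """Assumed reward: +1 for every incorrect -> correct transition in the batch."""
    return float(sum(1 for b, a in zip(correct_before, correct_after) if (not b) and a))
```

In a full training loop, the Tutor's policy would map each sample's state (visual features plus `ema_loss` and `forgetting_count`) to the weight passed into `weighted_batch_loss`, and `transition_reward` would be computed from Student predictions before and after the update.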
CV · Oct 16, 2023
Combating Label Noise With A General Surrogate Model For Sample Selection
Chao Liang, Linchao Zhu, Humphrey Shi et al.
Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.
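One way to picture surrogate-based sample selection is the sketch below. It assumes image embeddings and per-class text embeddings have already been produced by a vision-language model such as CLIP and are passed in as plain lists; the function names, the temperature, and the probability threshold are all hypothetical, and the paper's actual selection criterion and margin adaptive loss are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_clean(image_embs, text_embs, labels, temperature=0.01, threshold=0.5):
    """Keep indices whose given label is supported by text-image alignment.

    Assumed criterion: the zero-shot probability of the annotated class
    must reach `threshold`.
    """
    keep = []
    for idx, (img, label) in enumerate(zip(image_embs, labels)):
        probs = softmax([cosine(img, t) / temperature for t in text_embs])
        if probs[label] >= threshold:
            keep.append(idx)
    return keep
```

Because the surrogate is training-free, the selection does not depend on what the classifier being trained has already memorized, which is the motivation given in the abstract.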
72.4 · CV · Mar 27 · Code
Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
Kang Liu, Zhuoqi Ma, Siyu Liang et al.
Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
66.3 · CV · May 17 · Code
Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
Jiaqing Li, Yajuan Lu, Xiaochuan Shi et al.
Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for the strict black-box and single-sample setting that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
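The core decision rule suggested by this abstract, scoring a single sample by how well the target model's generated caption aligns with the image in a joint embedding space, might look like the sketch below. The embeddings are assumed to come from some external joint image-text encoder; `infer_membership` and the fixed `threshold` are hypothetical, and the paper's actual scoring function may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between an image embedding and a caption embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def infer_membership(image_emb, caption_emb, threshold=0.5):
    """Return (alignment score, predicted member?) for one sample.

    `caption_emb` is the embedding of the caption the target VLM generated
    for the image; only the model's text output is needed (black-box).
    """
    score = cosine_similarity(image_emb, caption_emb)
    return score, score >= threshold
```

Sweeping `threshold` over the scores of known members and non-members would yield the ROC curve from which an AUC such as the one reported above is computed.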
CV · Sep 4, 2024
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Jianwen Jiang, Chao Liang, Jiaqi Yang et al.
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
CV · Sep 3, 2024
CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention
Gaojie Lin, Jianwen Jiang, Chao Liang et al.
Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
CV · Aug 13, 2024
ED$^4$: Explicit Data-level Debiasing for Deepfake Detection
Jikang Cheng, Ying Zhang, Qin Zou et al.
Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the previously discovered content and specific-forgery biases, we reveal a novel spatial bias, in which detectors inertly expect structural forgery clues to appear at the image center, that can also lead to the poor generalization of existing methods. We present ED$^4$, a simple and effective strategy, to address the aforementioned biases explicitly at the data level in a unified framework rather than through implicit disentanglement via network design. In particular, we develop ClockMix to produce facial-structure-preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted features to be consistent. As a model-agnostic debiasing strategy, ED$^4$ is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.
CV · Jul 8, 2024
MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices
Jianwen Jiang, Gaojie Lin, Zhengkun Rong et al.
Existing neural head avatar methods have achieved significant progress in the image quality and motion range of portrait animation. However, these methods neglect the computational overhead, and to the best of our knowledge, none is designed to run on mobile devices. This paper presents MobilePortrait, a lightweight one-shot neural head avatar method that reduces learning complexity by integrating external knowledge into both the motion modeling and image synthesis, enabling real-time inference on mobile devices. Specifically, we introduce a mixed representation of explicit and implicit keypoints for precise motion modeling and precomputed visual features for enhanced foreground and background synthesis. With these two key designs and using simple U-Nets as backbones, our method achieves state-of-the-art performance with less than one-tenth the computational demand. It has been validated to reach speeds of over 100 FPS on mobile devices and to support both video-driven and audio-driven inputs.
CV · Feb 1, 2024 · Code
CapHuman: Capture Your Moments in Parallel Universes
Chao Liang, Fan Ma, Linchao Zhu et al.
We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, facial expressions, and illuminations in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
NA · Jan 14, 2015
Some New Symplectic Multiple Timestepping Methods for Multiscale Molecular Dynamics Models
Chao Liang, Xiaolan Yuan, Xiantao Li
We derived a number of numerical methods to treat biomolecular systems with multiple time scales. Based on the splitting of the operators associated with the slow-varying and fast-varying forces, new multiple time-stepping (MTS) methods are obtained by eliminating the dominant terms in the error. These new methods can be viewed as a generalization of the impulse method. In the implementation of these methods, the long-range forces only need to be computed on the slow time scale, which reduces the computational cost considerably. Preliminary analysis for the energy conservation property is provided.
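For orientation, the classical impulse method that these schemes are said to generalize can be sketched as below. This is the standard impulse (Verlet-I / r-RESPA style) step for a one-dimensional particle, not the new methods of the paper; the function name and the scalar setting are illustrative assumptions.

```python
def impulse_step(x, v, slow_force, fast_force, dt, n_inner, mass=1.0):
    """One outer step of the standard impulse multiple time-stepping method.

    The slow force is applied as half-kicks at the start and end of the outer
    step of size `dt`; the fast force is integrated with velocity Verlet using
    `n_inner` substeps of size dt / n_inner.
    """
    # Half kick from the slow (e.g. long-range) force.
    v = v + 0.5 * dt * slow_force(x) / mass

    h = dt / n_inner
    for _ in range(n_inner):
        # Velocity Verlet substep driven by the fast force only.
        v = v + 0.5 * h * fast_force(x) / mass
        x = x + h * v
        v = v + 0.5 * h * fast_force(x) / mass

    # Closing half kick from the slow force at the new position.
    # (A production code would cache this value for the next outer step.)
    v = v + 0.5 * dt * slow_force(x) / mass
    return x, v
```

With caching, the slow force is evaluated once per outer step while the fast force is evaluated roughly `n_inner` times, which is the cost saving the abstract refers to.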
CV · Feb 22
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
Wenhao Shen, Hao Wang, Wanqi Yin et al.
Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.
CV · Sep 14, 2023
HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods
Yongyuan Li, Xiuyuan Qin, Chao Liang et al.
Talking Face Generation (TFG) aims to reconstruct facial movements, achieving highly natural lip movements from audio and facial features that are latently connected. Existing TFG methods have made significant advancements in producing natural and realistic images. However, most work rarely takes visual quality into consideration. It is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture feature information around teeth and surrounding regions, and use these features to refine the feature map and enhance the clarity of teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability. Moreover, for high-definition restoration in talking face video synthesis, its inference speed is 300% faster than the current state-of-the-art face restoration based on super-resolution.
AI · Sep 26, 2024
What Would Happen Next? Predicting Consequences from An Event Causality Graph
Chuanhong Zhan, Wei Xiang, Chao Liang et al.
The existing script event prediction task forecasts the subsequent event based on an event script chain. However, the evolution of historical events is more complicated in real-world scenarios, and the limited information provided by the event script chain also makes it difficult to accurately predict subsequent events. This paper introduces a Causality Graph Event Prediction (CGEP) task that forecasts consequential events based on an Event Causality Graph (ECG). We propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) model for the CGEP task. In SeDGPL, we (1) design a Distance-sensitive Graph Linearization (DsGL) module to reformulate the ECG into a graph prompt template as the input of a PLM; (2) propose an Event-Enriched Causality Encoding (EeCE) module to integrate both event contextual semantic and graph schema information; and (3) propose a Semantic Contrast Event Prediction (ScEP) module to enhance the event representation among numerous candidate events and predict the consequential event following the prompt learning paradigm. We construct two CGEP datasets based on the existing MAVEN-ERE and ESC corpora for experiments. Experimental results validate our argument that the proposed SeDGPL model outperforms advanced competitors on the CGEP task.
74.6 · CR · May 19
BiRD: A Bidirectional Ranking Defense Mechanism for Retrieval Augmented Generation
Chengcai Gao, Zhihong Sun, Xiaochuan Shi et al.
The growing adoption of Retrieval-Augmented Generation (RAG) has led to a rise in adversarial attacks. Existing defenses, relying on semantic analysis or voting, face a trade-off between high computational cost and limited robustness under strong poisoning attacks. Their fundamental limitation is the exclusive focus on semantic content relevance, while neglecting the retrieval context that is critically defined by ranking structures. To this end, we investigate the bidirectional ranking behavior of poisoned and benign documents, and discover a key discriminative pattern: poisoned documents exhibit significantly stronger alignment between their backward rankings and the query's forward ranking. Capitalizing on this, we propose BiRD, a bidirectional ranking defense mechanism built upon a dual-signal framework that leverages forward ranking to assess semantic content relevance and backward ranking to quantify ranking context consistency. This design directly addresses the fundamental limitation of prior approaches, enabling simultaneous efficiency and robustness. Extensive evaluation across 3 datasets with 3 retrievers and 3 LLMs under 2 attack scenarios validates BiRD's effectiveness. Notably, BiRD reduces the attack success rate of PoisonedRAG by up to 54% while simultaneously improving task accuracy by up to 56%, with average additional latency under 1 second.
CV · Jan 15
FlowAct-R1: Towards Interactive Humanoid Video Generation
Lizhen Wang, Yongming Zhu, Zhipeng Ge et al.
Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25 fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.
AI · Nov 24, 2024 · Code
Decoding Urban Industrial Complexity: Enhancing Knowledge-Driven Insights via IndustryScopeGPT
Siqi Wang, Chao Liang, Yunfan Gao et al.
Industrial parks are critical to urban economic growth. Yet, their development often encounters challenges stemming from imbalances between industrial requirements and urban services, underscoring the need for strategic planning and operations. This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban data including street views, corporate, socio-economic, and geospatial information, capturing the complex relationships and semantics within industrial parks. Alongside this, we present the IndustryScopeGPT framework, which leverages Large Language Models (LLMs) with Monte Carlo Tree Search to enhance tool-augmented reasoning and decision-making in Industrial Park Planning and Operation (IPPO). Our work significantly improves site recommendation and functional planning, demonstrating the potential of combining LLMs with structured datasets to advance industrial park management. This approach sets a new benchmark for intelligent IPPO research and lays a robust foundation for advancing urban industrial development. The dataset and related code are available at https://github.com/Tongji-KGLLM/IndustryScope.
CV · Aug 13, 2024
IDRetracor: Towards Visual Forensics Against Malicious Face Swapping
Jikang Cheng, Jiaxin Ai, Zhen Han et al.
The face swapping technique based on deepfake methods poses significant social risks to personal identity security. While numerous deepfake detection methods have been proposed as countermeasures against malicious face swapping, they can only output binary labels (Fake/Real) for distinguishing fake content without reliable and traceable evidence. To achieve visual forensics and target face attribution, we propose a novel task named face retracing, which considers retracing the original target face from the given fake one via inverse mapping. Toward this goal, we propose an IDRetracor that can retrace arbitrary original target identities from fake faces generated by multiple face swapping methods. Specifically, we first adopt a mapping resolver to perceive the possible solution space of the original target face for the inverse mappings. Then, we propose mapping-aware convolutions to retrace the original target face from the fake one. Such convolutions contain multiple kernels that can be combined under the control of the mapping resolver to tackle different face swapping mappings dynamically. Extensive experiments demonstrate that the IDRetracor exhibits promising retracing performance from both quantitative and qualitative perspectives.
AI · Dec 18, 2025
CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs?
Siqi Wang, Chao Liang, Yunfan Gao et al.
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze them further, we investigate a series of exploratory strategies, namely Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.
CV · Oct 23, 2021 · Code
Confidence-Aware Active Feedback for Interactive Instance Search
Yue Zhang, Chao Liang, Longxiang Jiang
Online relevance feedback (RF) is widely utilized in instance search (INS) tasks to further refine imperfect ranking results, but it often has low interaction efficiency. The active learning (AL) technique addresses this problem by selecting valuable feedback candidates. However, mainstream AL methods require an initial labeled set for a cold start and are often computationally complex to solve. Therefore, they cannot fully satisfy the requirements for online RF in interactive INS tasks. To address this issue, we propose a confidence-aware active feedback method (CAAF) that is specifically designed for online RF in interactive INS tasks. Inspired by the explicit difficulty modeling scheme in self-paced learning, CAAF utilizes a pairwise manifold ranking loss to evaluate the ranking confidence of each unlabeled sample. The ranking confidence improves not only the interaction efficiency by indicating valuable feedback candidates but also the ranking quality by modulating the diffusion weights in manifold ranking. In addition, we design two acceleration strategies, an approximate optimization scheme and a top-K search scheme, to reduce the computational complexity of CAAF. Extensive experiments on both image INS tasks and video INS tasks searching for buildings, landscapes, persons, and human behaviors demonstrate the effectiveness of the proposed method. Notably, in the real-world, large-scale video INS task of NIST TRECVID 2021, CAAF uses 25% fewer feedback samples to achieve a performance that is nearly equivalent to the champion solution. Moreover, with the same number of feedback samples, CAAF's mAP is 51.9%, significantly surpassing the champion solution by 5.9%. Code is available at https://github.com/nercms-mmap/caaf.
CV · Apr 19, 2021 · Code
One More Check: Making "Fake Background" Be Tracked Again
Chao Liang, Zhipeng Zhang, Xue Zhou et al.
One-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers rely solely on single-frame detections to predict candidate bounding boxes, which may be unreliable under severe visual degradation, e.g., motion blur and occlusions. Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet is no longer maintained. In this paper, we set out to restore the bounding boxes misclassified as "fake background" by proposing a re-check network. The re-check network expands the role of ID embedding from data association to motion forecasting by propagating previous tracklets to the current frame with a small overhead. Note that the propagation results are yielded by an independent and efficient embedding search, preventing the model from over-relying on detection results. Eventually, it helps to reload the "fake background" and repair the broken tracklets. Building on the strong baseline CSTrack, we construct a new one-shot tracker and improve MOTA from 70.7 to 76.4 on MOT16 and from 70.6 to 76.3 on MOT17. It also reaches a new state-of-the-art MOTA and IDF1 performance. Code is released at https://github.com/JudasDie/SOTS.
CV · Oct 23, 2020 · Code
Rethinking the competition between detection and ReID in Multi-Object Tracking
Chao Liang, Zhipeng Zhang, Xue Zhou et al.
Owing to their balance of accuracy and speed, one-shot models, which jointly learn detection and identification embeddings, have drawn great attention in multi-object tracking (MOT). However, the inherent differences and relations between detection and re-identification (ReID) are overlooked because the one-shot tracking paradigm treats them as two isolated tasks. This leads to inferior performance compared with existing two-stage methods. In this paper, we first dissect the reasoning process for these two tasks, which reveals that the competition between them inevitably harms the learning of task-dependent representations. To tackle this problem, we propose a novel reciprocal network (REN) with a self-relation and cross-relation design that impels each branch to better learn task-dependent representations. The proposed model aims to alleviate the deleterious competition between tasks while improving the cooperation between detection and ReID. Furthermore, we introduce a scale-aware attention network (SAAN) that prevents semantic-level misalignment to improve the association capability of ID embeddings. By integrating the two networks into a one-shot online MOT system, we construct a strong MOT tracker, namely CSTrack. Our tracker achieves state-of-the-art performance on the MOT16, MOT17 and MOT20 datasets, without other bells and whistles. Moreover, CSTrack is efficient and runs at 16.4 FPS on a single modern GPU, and its lightweight version runs at 34.6 FPS. The complete code has been released at https://github.com/JudasDie/SOTS.
80.1 · RO · May 10
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
Samson Sun, Tianyi Yang, Tengyue Wang et al.
The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstrations Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glance and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.
CV · Feb 3, 2025
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.
End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the past few years. However, existing methods still struggle to scale up in the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).
IR · May 17, 2024
In-context Contrastive Learning for Event Causality Identification
Chao Liang, Wei Xiang, Bang Wang
Event Causality Identification (ECI) aims at determining the existence of a causal relation between two events. Although recent prompt learning-based approaches have shown promising improvements on the ECI task, their performance is often subject to the delicate design of multiple prompts and the positive correlations between the main task and derivative tasks. The in-context learning paradigm provides explicit guidance for label prediction in the prompt learning paradigm, alleviating its reliance on complex prompts and derivative tasks. However, it does not distinguish between positive and negative demonstrations for analogy learning. Motivated by these considerations, this paper proposes an In-Context Contrastive Learning (ICCL) model that utilizes contrastive learning to enhance the effectiveness of both positive and negative demonstrations. Additionally, we apply contrastive learning to event pairs to better facilitate event causality identification. ICCL is evaluated on widely used corpora, including EventStoryLine and Causal-TimeBank, and the results show significant performance improvements over state-of-the-art algorithms.
CV · Jun 11, 2025
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang et al.
End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
CV · Dec 22, 2024
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
Tianyun Zhong, Chao Liang, Jianwen Jiang et al.
Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage http://fadavatar.github.io.
CV · Aug 4, 2025
DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework
Tongchun Zuo, Zaiyu Huang, Shuliang Ning et al.
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Project page: https://virtu-lab.github.io/
CV · Jun 11, 2025
AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
Chao Liang, Jianwen Jiang, Wang Liao et al.
Human video generation and animation tasks, driven by diffusion models, have made significant progress in recent years. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose AlignHuman, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3× speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: https://alignhuman.github.io/
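The interval-based activation described in the AlignHuman abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `active_lora`, the labels `"motion"` and `"fidelity"`, and the boundary fraction `split` are hypothetical and do not come from the paper, which does not state where the boundary lies in this abstract.

```python
def active_lora(step, num_steps, split=0.5):
    """Pick which expert LoRA would be active at a denoising step.

    Steps are counted from 0 (noisiest) to num_steps - 1 (cleanest).
    `split` is a hypothetical boundary fraction, not a value from the paper.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    # Early steps are routed to the motion expert, later ones to fidelity.
    return "motion" if step < split * num_steps else "fidelity"


# Routing for a 10-step schedule with the assumed halfway boundary.
schedule = [active_lora(s, 10) for s in range(10)]

# The speedup quoted in the abstract is the ratio of function evaluations.
speedup = 100 / 30
print(schedule, round(speedup, 1))
```

The last line also checks the quoted figure: going from 100 NFEs to 30 NFEs is a ratio of about 3.3.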
CV · Dec 22, 2024
FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos
Zhengqian Wu, Ruizhe Li, Zijun Xu et al.
Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in the deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is their storylines, which are composed of complex interactions and the long-range evolution of core story topics including characters, actions and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, which makes it difficult to comprehensively assess VideoQA models' DVU capability on complex storylines. Additionally, the question quantity and video length of these datasets are limited by the high labor costs of handcrafted dataset construction. In this paper, we devise a large language model based multi-agent collaboration framework, StoryMind, to automatically generate a new large-scale DVU dataset. The dataset, FriendsQA, derived from the renowned sitcom Friends with an average episode length of 1,358 seconds, contains 44.6K questions evenly distributed across 14 fine-grained topics. Finally, we conduct comprehensive experiments on 10 state-of-the-art VideoQA models using the FriendsQA dataset.
CV · Mar 26, 2024
Superior and Pragmatic Talking Face Generation with Teacher-Student Framework
Chao Liang, Jianwen Jiang, Tianyun Zhong et al.
Talking face generation technology creates talking videos from arbitrary appearance and motion signals, with the "arbitrary" offering ease of use but also introducing challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with intricate real-world ones. Moreover, efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost and editability. We first propose a simple but effective teacher model capable of handling inputs of varying qualities to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four mentioned objectives, especially in reducing FLOPs by 99% with the student model. SuperFace can be driven by both video and audio and allows for localized editing of facial attributes.
CV · Aug 26, 2025
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Jianwen Jiang, Weihong Zeng, Zerong Zheng et al.
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/
CV · Aug 4, 2025
X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio
Chenxu Zhang, Zenan Li, Hongyi Xu et al.
We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.
CV · Jan 5, 2025
Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data
Chao Liang, Linchao Zhu, Zongxin Yang et al.
We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.
CV · Dec 15, 2025
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Team Seedance, Heyi Chen, Siyan Chen et al.
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
CV · Sep 25, 2025
X-Streamer: Unified Human World Modeling with Audiovisual Interaction
You Xie, Tianpei Gu, Zenan Li et al.
We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
CV · Aug 13, 2025
Leveraging Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection
Shibo Yao, Renshuai Tao, Xiaolong Zheng et al.
Recent deepfake detection studies often treat unseen sample detection as a "zero-shot" task, training on images generated by known models but generalizing to unknown ones. A key real-world challenge arises when a model performs poorly on unknown samples, yet these samples remain available for analysis. This suggests that the problem should be approached as a "few-shot" task, where effectively utilizing a small number of samples can lead to significant improvement. Unlike typical few-shot tasks focused on semantic understanding, deepfake detection prioritizes image realism, which closely mirrors real-world distributions. In this work, we propose the Few-shot Training-free Network (FTNet) for real-world few-shot deepfake detection. Simple yet effective, FTNet differs from traditional methods that rely on large-scale known data for training. Instead, FTNet uses only one fake sample from an evaluation set, mimicking the scenario where new samples emerge in the real world and can be gathered for use, without any training or parameter updates. During evaluation, each test sample is compared to the known fake and real samples, and it is classified based on the category of the nearest sample. We conduct a comprehensive analysis of AI-generated images from 29 different generative models and achieve a new SoTA performance, with an average improvement of 8.7% compared to existing methods. This work introduces a fresh perspective on real-world deepfake detection: when the model struggles to generalize on a few-shot sample, leveraging the failed samples leads to better performance.
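The nearest-sample decision rule mentioned in the FTNet abstract can be sketched as a plain nearest-neighbour lookup. This is a minimal illustration, not the authors' implementation: the function name `nearest_label`, the use of Euclidean distance, and the toy 2-D feature vectors are all assumptions; a real system would compare features produced by an image encoder.

```python
import math


def nearest_label(query, references):
    """Return the label of the reference feature closest to `query`.

    `references` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    best_label, best_dist = None, math.inf
    for feature, label in references:
        dist = math.dist(query, feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical features: one known fake sample and two known real samples.
references = [
    ((0.9, 0.1), "fake"),
    ((0.1, 0.8), "real"),
    ((0.2, 0.9), "real"),
]
print(nearest_label((0.8, 0.2), references))
```

No parameters are updated anywhere in this rule, which is what makes the approach training-free: adding a newly collected fake sample only means appending one more entry to `references`.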
CL · May 18, 2023
TEPrompt: Task Enlightenment Prompt Learning for Implicit Discourse Relation Recognition
Wei Xiang, Chao Liang, Bang Wang
Implicit Discourse Relation Recognition (IDRR) aims at classifying the relation sense between two arguments without an explicit connective. Recently, ConnPrompt (Xiang et al., COLING 2022) has leveraged prompt learning for IDRR based on the fusion of multi-prompt decisions from three different yet very similar connective prediction templates. Instead of multi-prompt ensembling, we propose to design auxiliary tasks with enlightened prompt learning for the IDRR task. Although an auxiliary task is not used to directly output the final prediction, we argue that during joint training some of its learned features can be useful to boost the main task. In light of such motivations, we propose a task enlightenment prompt learning model, called TEPrompt, to fuse learned features from three related tasks for IDRR. In particular, TEPrompt contains three tasks, viz., Discourse Relation Recognition (DRR), Sense Semantics Classification (SSC) and Annotated Connective Prediction (ACP), each with a unique prompt template and an answer space. In the training phase, we jointly train the three prompt learning tasks with a shared argument representation. In the testing phase, we only take the DRR output with fused features as the final IDRR decision. Experiments under the same conditions show that the proposed TEPrompt outperforms ConnPrompt. This can be attributed to the improved decision features and language models that benefit from joint training with the auxiliary tasks.
CV · Oct 30, 2021
WHU-NERCMS at TRECVID2021: Instance Search Task
Yanrui Niu, Jingyao Yang, Ankang Lu et al.
This paper briefly introduces the experimental methods and results of WHU-NERCMS at TRECVID2021. This year we participate in the automatic and interactive tasks of Instance Search (INS). For the automatic task, the retrieval target is divided into two parts: person retrieval and action retrieval. For person retrieval, we adopt a two-stage method consisting of face detection and face recognition. For action retrieval, we adopt two kinds of action detection methods: three frame-based human-object interaction detection methods and two video-based general action detection methods. The person retrieval results and action retrieval results are then fused to initialize the result ranking lists. In addition, we attempt to use complementary methods to further improve search performance. For the interactive task, we test two different interaction strategies on the fusion results. We submit 4 runs for automatic and interactive tasks respectively. The introduction of each run is shown in Table 1. The official evaluations show that the proposed strategies rank 1st in both automatic and interactive tracks.
ML · May 25, 2020
Feature Robust Optimal Transport for High-dimensional Data
Mathis Petrovich, Chao Liang, Ryoma Sato et al.
Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min-max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank-Wolfe-based optimization algorithm, whereby the subproblem can be efficiently solved using the Sinkhorn algorithm. Since FROT finds the transport plan from selected features, it is robust to noise features. To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. Through synthetic and benchmark experiments, we demonstrate that the proposed method can find a strong correspondence by determining important layers. We show that the FROT algorithm achieves state-of-the-art performance on real-world semantic correspondence datasets.
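The abstract notes that the subproblem inside the Frank-Wolfe loop can be solved with the Sinkhorn algorithm. The sketch below shows only the standard entropy-regularized Sinkhorn iteration for a single cost matrix, not the full FROT procedure; the function name `sinkhorn`, the regularization strength `epsilon`, the iteration count, and the toy cost matrix are assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, epsilon=0.1, num_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: n x m cost matrix (list of lists)
    a, b: source and target marginals, each summing to 1
    Returns an n x m transport plan whose marginals approximate a and b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / epsilon).
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: moving mass between matching points costs nothing.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
for row in plan:
    print([round(p, 4) for p in row])
```

For this toy cost matrix almost all mass stays on the diagonal, since off-diagonal transport is penalized. A small `epsilon` brings the plan closer to the unregularized optimum but can cause numerical underflow in the kernel for large costs, which is why practical implementations often work in the log domain.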