35.5CVFeb 21, 2025
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion TransformersMin Zhao, Guande He, Yixiao Chen et al.
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN DiscriminatorKaiwen Zheng, Yongxin Chen, Huayu Chen et al.
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512x512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256x256.
10.2CVMar 4, 2025
UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video CompressionJia Wang, Xinfeng Zhang, Gai Zhang et al.
Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, the memory consumption for training and inference increases substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to decrease memory consumption, while aiming to unify the frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions videos into several clips and processes each clip using a different INR model instance, leveraging the advantages of both compression frameworks while allowing seamless adaptation to either in form. To further reduce temporal redundancy between clips, we design two modules to optimize the initialization, training, and compression of these model parameters. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible video clip setting, can adapt to resource-constrained environments and significantly improve performance compared to different baseline models. The project page: "https://wj-inf.github.io/UAR-NVC-page/".
14.4LGMay 28, 2025
Versatile Cardiovascular Signal Generation with a Unified Diffusion TransformerZehua Chen, Yuyang Miao, Liyuan Wang et al.
Cardiovascular signals such as photoplethysmography (PPG), electrocardiography (ECG), and blood pressure (BP) are inherently correlated and complementary, together reflecting the health of cardiovascular system. However, their joint utilization in real-time monitoring is severely limited by diverse acquisition challenges from noisy wearable recordings to burdened invasive procedures. Here we propose UniCardio, a multi-modal diffusion transformer that reconstructs low-quality signals and synthesizes unrecorded signals in a unified generative framework. Its key innovations include a specialized model architecture to manage the signal modalities involved in generation tasks and a continual learning paradigm to incorporate varying modality combinations. By exploiting the complementary nature of cardiovascular signals, UniCardio clearly outperforms recent task-specific baselines in signal denoising, imputation, and translation. The generated signals match the performance of ground-truth signals in detecting abnormal health conditions and estimating vital signs, even in unseen domains, while ensuring interpretability for human experts. These advantages position UniCardio as a promising avenue for advancing AI-assisted healthcare.
7.3CVNov 9, 2018
M2M-GAN: Many-to-Many Generative Adversarial Transfer Learning for Person Re-IdentificationWenqi Liang, Guangcong Wang, Jianhuang Lai et al.
Cross-domain transfer learning (CDTL) is an extremely challenging task for the person re-identification (ReID). Given a source domain with annotations and a target domain without annotations, CDTL seeks an effective method to transfer the knowledge from the source domain to the target domain. However, such a simple two-domain transfer learning method is unavailable for the person ReID in that the source/target domain consists of several sub-domains, e.g., camera-based sub-domains. To address this intractable problem, we propose a novel Many-to-Many Generative Adversarial Transfer Learning method (M2M-GAN) that takes multiple source sub-domains and multiple target sub-domains into consideration and performs each sub-domain transferring mapping from the source domain to the target domain in a unified optimization process. The proposed method first translates the image styles of source sub-domains into that of target sub-domains, and then performs the supervised learning by using the transferred images and the corresponding annotations in source domain. As the gap is reduced, M2M-GAN achieves a promising result for the cross-domain person ReID. Experimental results on three benchmark datasets Market-1501, DukeMTMC-reID and MSMT17 show the effectiveness of our M2M-GAN.