RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
This work addresses temporal repetition and motion deceleration in long video generation, offering an incremental improvement over existing length extrapolation methods.
The paper tackles the challenge of generating longer, temporally coherent videos with video diffusion transformers by identifying an intrinsic frequency in the positional embeddings that primarily governs extrapolation behavior. The proposed RIFLEx method reduces this frequency to suppress repetition while preserving motion consistency, achieving high-quality 2x length extrapolation without training and enabling 3x extrapolation with minimal fine-tuning.
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch, achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation with minimal fine-tuning, without requiring long videos. Project page and code: https://riflex-video.github.io/.
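
To make the idea of "reducing the intrinsic frequency" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. It assumes the positional embedding is a standard rotary embedding (RoPE) with frequencies base^(-2j/dim), that the intrinsic component is the one whose period lies closest to an observed repetition length, and that a safety margin of 0.9 is applied when lowering it. The function names and the parameters repeat_len, target_len, and margin are hypothetical and do not come from the abstract.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies theta_j = base^(-2j/dim), j = 0..dim/2-1."""
    return [base ** (-2.0 * j / dim) for j in range(dim // 2)]


def intrinsic_index(freqs, repeat_len):
    """Assumed heuristic: choose the component whose period 2*pi/theta
    is closest to the length (in frames) at which repetition appears."""
    periods = [2.0 * math.pi / f for f in freqs]
    return min(range(len(freqs)), key=lambda j: abs(periods[j] - repeat_len))


def lower_intrinsic_frequency(freqs, k, target_len, margin=0.9):
    """Return a copy of freqs in which component k is lowered so that
    target_len frames fit inside a single period of that component.
    The margin (an assumption) keeps the period strictly longer than target_len."""
    new_freqs = list(freqs)
    candidate = margin * 2.0 * math.pi / target_len
    # Only ever reduce the frequency; leave it alone if it is already low enough.
    new_freqs[k] = min(freqs[k], candidate)
    return new_freqs


# Hypothetical usage: a tiny 8-dimensional embedding, repetition seen near
# frame 60, and a target of 2x that length.
freqs = rope_frequencies(dim=8)
k = intrinsic_index(freqs, repeat_len=60)
adjusted = lower_intrinsic_frequency(freqs, k, target_len=120)
```

Only one frequency component is changed in this sketch and every other component is left as is, which matches the abstract's description of a minimal change without additional modifications. How the intrinsic component is selected in practice, and whether a margin is used, should be checked against the project page.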