ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
This work addresses the scalability of video generation for long-duration and on-device applications. Its contribution is an incremental improvement over existing methods: a lower attention cost at competitive video quality.
The paper tackles the quadratic attention complexity of video diffusion transformers, which limits scalability to longer sequences. It introduces ReHyAt, a recurrent hybrid attention mechanism that reduces attention cost from quadratic to linear while achieving state-of-the-art video quality. Distilling from existing softmax-based models reduces the training cost by two orders of magnitude, to ~160 GPU hours.
Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling a chunk-wise recurrent reformulation with constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. The project page is available at https://qualcomm-ai-research.github.io/rehyat.
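
To make the idea of a chunk-wise recurrent hybrid concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes one plausible formulation: tokens inside the current chunk are scored with exact softmax (exponential) weights, all earlier chunks are summarized by a fixed-size linear-attention state, and both branches share one normalizer. The function names, the elu(x)+1 feature map, and the chunk-causal ordering are illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map for the linear branch: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def hybrid_chunk_attention(q, k, v, chunk):
    """Hypothetical hybrid attention over (T, d) arrays q, k, v.

    Within a chunk: exact exponential (softmax-style) weights.
    Across earlier chunks: linear attention via a recurrent state.
    """
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))  # running sum of phi(k)^T v over past chunks
    z = np.zeros(d)                # running sum of phi(k) over past chunks
    out = np.empty_like(v, dtype=float)
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        qc, kc, vc = q[s:e], k[s:e], v[s:e]
        # Softmax branch, stabilized by subtracting the row-wise max m.
        logits = qc @ kc.T / np.sqrt(d)
        m = logits.max(axis=1, keepdims=True)
        w = np.exp(logits - m)
        # Linear branch, rescaled by exp(-m) so both branches share one scale.
        scale = np.exp(-m)
        phi_q = feature_map(qc)
        num = w @ vc + scale * (phi_q @ S)
        den = w.sum(axis=1, keepdims=True) + scale * (phi_q @ z)[:, None]
        out[s:e] = num / den
        # Fold the finished chunk into the fixed-size state.
        phi_k = feature_map(kc)
        S += phi_k.T @ vc
        z += phi_k.sum(axis=0)
    return out
```

In this sketch the state `(S, z)` has a size that depends only on the feature dimensions, not on the sequence length, so memory stays constant as more chunks are processed, and the per-chunk cost is fixed, so total cost grows linearly with the number of chunks. When `chunk` covers the whole sequence, the linear branch is empty and the function reduces to ordinary softmax attention.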