CV · Dec 24, 2025

FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

arXiv:2512.21015v2 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses video editing challenges for AI researchers and practitioners, offering an incremental improvement over existing methods by enhancing temporal consistency and efficiency.

The paper tackles temporal inconsistency and high computational overhead in video editing methods that adapt text-to-image diffusion models. It proposes FluencyVE, which integrates Mamba and bypass attention to achieve global frame-level attention at reduced cost, and reports promising results in editing attributes, subjects, and locations in real-world videos.

Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
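The low-rank replacement of the query and key weight matrices mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the factor names (`a_q`, `b_q`, `a_k`, `b_k`), the sizes (`frames`, `d`, `r`), and the random initialization are all hypothetical, and softmax, causal masking, and the weighted-averaging update of the attention scores are left out. The sketch only assumes that each d×d weight is approximated by the product of a d×r and an r×d matrix, and compares the resulting parameter counts.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]; stands in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def low_rank_scores(x, a_q, b_q, a_k, b_k):
    """Scaled attention logits with W_q ~ A_q B_q and W_k ~ A_k B_k.

    x has one row per frame token. The dense d x d weights are never formed:
    each projection goes through the rank-r bottleneck first.
    """
    q = matmul(matmul(x, a_q), b_q)  # (frames, d)
    k = matmul(matmul(x, a_k), b_k)  # (frames, d)
    scale = len(q[0]) ** -0.5
    return [[s * scale for s in row] for row in matmul(q, transpose(k))]


rng = random.Random(0)
frames, d, r = 4, 8, 2  # hypothetical sizes chosen for illustration only

x = rand_matrix(frames, d, rng)
a_q, b_q = rand_matrix(d, r, rng), rand_matrix(r, d, rng)
a_k, b_k = rand_matrix(d, r, rng), rand_matrix(r, d, rng)

scores = low_rank_scores(x, a_q, b_q, a_k, b_k)  # (frames, frames) logits

full_params = 2 * d * d          # dense W_q and W_k
low_rank_params = 2 * 2 * d * r  # two factors each for query and key
```

With these illustrative sizes the factored form stores 64 weights instead of 128; the saving grows as `d` increases relative to `r`.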

