CV · Oct 27, 2025

VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao

arXiv:2510.22970v1 · h-index: 5

Originality: Incremental advance

AI Analysis

The work addresses scalability and manual-bias issues in video editing for users of pre-trained diffusion models, though it is an incremental advance because it builds on existing training-free methods.

The paper tackles the problem of maintaining temporal consistency in training-free video editing by proposing VALA, a variational alignment module that adaptively selects key frames and compresses latent features into semantic anchors, achieving state-of-the-art performance in inversion fidelity, editing quality, and consistency with improved efficiency.

Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA uses a variational framework with a contrastive learning objective, which lets it transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free, text-to-image-based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
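The abstract does not spell out the formulation, so the following is only an illustrative sketch of the general idea: softly assigning per-frame latents to a small set of anchors, compressing frames into those anchors, and scoring the assignment with a contrastive (InfoNCE-style) loss. Every name here (`soft_assign`, `compress_to_anchors`, `info_nce`, the temperature value, the toy shapes) is an assumption made for illustration and is not taken from the paper; the variational component is omitted entirely.

```python
import numpy as np

def _unit(x):
    # Row-wise L2 normalisation so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def soft_assign(latents, anchors, temperature=0.1):
    """Softmax over cosine similarity of each frame latent to each anchor.

    Hypothetical stand-in for a learned assignment; returns shape (T, K).
    """
    logits = _unit(latents) @ _unit(anchors).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def compress_to_anchors(latents, weights):
    """Each anchor becomes the assignment-weighted mean of the frame latents."""
    mass = weights.sum(axis=0)[:, None] + 1e-8  # (K, 1), avoids divide-by-zero
    return weights.T @ latents / mass

def info_nce(latents, anchors, weights, temperature=0.1):
    """InfoNCE-style loss: a frame's positive is its most likely anchor,
    and all other anchors act as negatives."""
    logits = _unit(latents) @ _unit(anchors).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = weights.argmax(axis=1)
    return -log_prob[np.arange(len(latents)), positives].mean()

# Toy data (assumed shapes): 8 frames, each with a flattened 16-dim latent.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))
anchors = latents[[0, 4]].copy()          # initialise 2 anchors from two frames

weights = soft_assign(latents, anchors)   # (8, 2) soft frame-to-anchor assignment
anchors = compress_to_anchors(latents, weights)
key_frames = weights.argmax(axis=0)       # frame most strongly tied to each anchor
loss = info_nce(latents, anchors, weights)
```

In this toy version, the "key frame" for an anchor is simply the frame with the largest assignment weight, which replaces a hand-picked frame schedule with a data-dependent choice. A real system would operate on diffusion latents and learn the assignment rather than compute it in closed form.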
