CV | AI | Feb 16

Adapting VACE for Real-Time Autoregressive Video Diffusion

arXiv:2602.14381v1 | 1 citation | Has Code
Originality: Synthesis-oriented
AI Analysis

This work targets real-time video generation for streaming applications. Its contribution is incremental: it adapts an existing method (VACE) rather than introducing a new one, and the adaptation trades away some quality.

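The authors adapt VACE to real-time autoregressive video generation by moving reference frames out of the diffusion latent space and into a parallel conditioning pathway, which preserves the fixed chunk sizes and KV caching that causal streaming models require. The adaptation reuses pretrained VACE weights, adds 20-30% latency overhead for structural control and inpainting, and severely degrades reference-to-video fidelity compared to the original batch method.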

We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
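The following is a toy sketch, not the paper's implementation, of why keeping reference frames out of the latent sequence matters for streaming. It only tracks sequence bookkeeping: in a batch-style layout the references lengthen the sequence the model attends over, while in a streaming-style layout each step sees a fixed-size chunk plus a cache of past chunks, and the references arrive through a side input. All names here (`CHUNK_FRAMES`, `batch_style_sequence`, `StreamingState`) are hypothetical and chosen for illustration; the real code lives in the linked repository.

```python
from dataclasses import dataclass, field

CHUNK_FRAMES = 3  # hypothetical fixed chunk size, in latent frames


def batch_style_sequence(ref_frames, video_frames):
    """Batch-style layout: references share one sequence with the video,
    so the sequence length depends on how many references are supplied."""
    return list(ref_frames) + list(video_frames)


@dataclass
class StreamingState:
    """Toy stand-in for a KV cache that only ever holds video-chunk entries."""

    kv_cache: list = field(default_factory=list)

    def step(self, chunk, ref_context):
        """Process one fixed-size chunk with references as a side input."""
        if len(chunk) != CHUNK_FRAMES:
            raise ValueError("autoregressive step expects a fixed chunk size")
        # Causal view: cached past chunks plus the current chunk, no future.
        visible = self.kv_cache + list(chunk)
        # References are passed alongside the chunk and never enter the cache.
        out = [(frame, len(visible), len(ref_context)) for frame in chunk]
        self.kv_cache.extend(chunk)
        return out


if __name__ == "__main__":
    refs = ["ref0", "ref1"]
    video = [f"f{i}" for i in range(6)]
    print("batch sequence length:", len(batch_style_sequence(refs, video)))
    state = StreamingState()
    for start in range(0, len(video), CHUNK_FRAMES):
        print(state.step(video[start:start + CHUNK_FRAMES], refs))
```

In this sketch the number of references changes the batch sequence length but not the size of any streaming chunk or of the cache, which is the property the abstract describes as being preserved.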

