MM · CV · SD · Mar 12

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

arXiv:2603.11647v138.53 citations · h-index: 6
Predicted impact: top 4% in MM · last 90 days · Originality: Incremental advance
AI Analysis

OmniForcing enables real-time joint audio-visual generation, addressing a latency bottleneck in multi-modal AI systems. The advance is incremental because it builds on existing diffusion models.

The paper tackles the high latency of joint audio-visual diffusion models by proposing OmniForcing, a framework that distills an offline bidirectional model into a streaming autoregressive generator, achieving state-of-the-art real-time generation at ~25 FPS on a single GPU while maintaining synchronization and visual quality comparable to the teacher model.

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.

Project Page: https://omniforcing.com
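To make the idea of block-causal attention with an always-visible prefix concrete, here is a minimal sketch of a generic block-causal mask. It is not the paper's Asymmetric Block-Causal Alignment: the function name `block_causal_mask`, the token layout (prefix tokens first, then per block the video tokens followed by the audio tokens), and the per-block token counts are all assumptions chosen for illustration.

```python
import numpy as np


def block_causal_mask(video_per_block, audio_per_block, num_blocks, prefix_len=1):
    """Build a boolean attention mask; mask[q, k] is True when query q may attend to key k.

    Hypothetical layout: [prefix tokens], then for each block its video tokens
    followed by its audio tokens. The two modalities may contribute different
    numbers of tokens per block.
    """
    tokens_per_block = video_per_block + audio_per_block
    # Block index of every token; prefix tokens get -1 so all later tokens can see them.
    block_ids = np.concatenate([
        np.full(prefix_len, -1),
        np.repeat(np.arange(num_blocks), tokens_per_block),
    ])
    # A query attends to keys in its own block or any earlier block (including the prefix).
    return block_ids[None, :] <= block_ids[:, None]


# Illustrative sizes only: 3 video tokens and 1 audio token per block, 2 blocks, 1 prefix token.
mask = block_causal_mask(video_per_block=3, audio_per_block=1, num_blocks=2)
```

Within a block, attention stays bidirectional across both modalities, while later blocks are hidden from earlier ones. This is the property that lets a streaming generator cache keys and values for finished blocks instead of recomputing them.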

