CV · May 24

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

arXiv:2605.25193 · 97.0
Predicted impact top 6% in CV · last 90 days · Originality: Highly original
AI Analysis

This work addresses audio-visual desynchronization and contextual conflicts in video editing, a problem for content creators and multimedia editors.

SpongeBob introduces the first end-to-end audio-visual joint editing framework with bidirectional cross-modal interaction, achieving 30% improvement in Sync-C and 12.5% in Ctx-F1 over baselines.

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.
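The abstract names bidirectional attention between the visual and audio streams but gives no architectural detail. The following is only a minimal sketch of what one bidirectional cross-attention step could look like, not the paper's Sync-Aware Mechanism: the function names (`cross_attend`, `bidirectional_step`), the residual update, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token receives a
    softmax-weighted average of the context tokens.
    Hypothetical helper; no learned projections are applied."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, c)) / math.sqrt(dim)
            for c in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_step(video_tokens, audio_tokens):
    """One illustrative bidirectional exchange (an assumption, not the
    paper's design): video attends to audio and audio attends to video,
    both reading the same pre-update tokens, followed by a residual add."""
    video_ctx = cross_attend(video_tokens, audio_tokens)
    audio_ctx = cross_attend(audio_tokens, video_tokens)
    new_video = [
        [v + c for v, c in zip(tok, ctx)]
        for tok, ctx in zip(video_tokens, video_ctx)
    ]
    new_audio = [
        [a + c for a, c in zip(tok, ctx)]
        for tok, ctx in zip(audio_tokens, audio_ctx)
    ]
    return new_video, new_audio


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy video tokens
    audio = [[0.5, -0.5]]                          # one toy audio token
    new_video, new_audio = bidirectional_step(video, audio)
    print(len(new_video), len(new_audio))
```

Computing both directions from the same pre-update tokens keeps the exchange symmetric, so neither modality sees the other's already-updated state within a step. A real model would presumably add learned projections, multiple heads, and the temporal and spatial constraints the abstract mentions.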

