CV, LG, MM, SD, AS | Jan 29, 2024

Synchformer: Efficient Synchronization from Sparse Cues

arXiv:2401.16423v1 | 85 citations | h-index: 49 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses audio-visual synchronization for real-world video such as YouTube content, though it appears to be an incremental advance built on hybrid methods.

The paper tackles audio-visual synchronization in 'in-the-wild' videos with sparse cues by introducing a novel model and training method, achieving state-of-the-art performance in both dense and sparse settings.

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
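To make "segment-level contrastive pre-training" concrete, the sketch below shows one common form such an objective can take: a symmetric InfoNCE loss in which the audio and visual embeddings of the same segment are positives and all other pairings in the batch are negatives. This is an illustration under assumptions, not the paper's implementation: the function name `segment_contrastive_loss`, the temperature of 0.07, the embedding sizes, and the use of NumPy in place of a deep-learning framework are all hypothetical choices made for this example.

```python
import numpy as np


def segment_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE over N paired segment embeddings of shape (N, D).

    Hypothetical helper for illustration; row i of each input is assumed
    to come from the same temporal segment of the same video.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matching pairs.
    logits = a @ v.T / temperature

    def cross_entropy_diag(z):
        # Numerically stable row-wise log-softmax, then take the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


# Toy data (assumed shapes): 8 segments with 16-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
visual_aligned = audio + 0.01 * rng.normal(size=(8, 16))
visual_random = rng.normal(size=(8, 16))

aligned_loss = segment_contrastive_loss(audio, visual_aligned)
random_loss = segment_contrastive_loss(audio, visual_random)
print("aligned pairs:", aligned_loss)
print("random pairs:", random_loss)
```

In a setup like the one the abstract describes, a loss of this kind would train the feature extractors first, and the synchronization module would then be trained separately on top of the resulting features.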

Code Implementations: 2 repos
