SD CV AS May 6, 2025

CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

arXiv:2505.03186v27.01 citations h-index: 6 Has Code

Originality: Incremental advance

AI Analysis

This work addresses speech processing in noisy conditions for applications such as recognition and enhancement. It is an incremental advance with strong gains on specific tasks.

The paper tackles the problem of learning audio-visual representations for speech processing by introducing CoGenAV, which is trained with contrastive-generative synchronization on limited data. It achieves a state-of-the-art Word Error Rate of 1.27 for Audio-Visual Speech Recognition on LRS2 and improves performance in noisy environments by over 70%.

The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony: contrastive feature alignment and generative text prediction. Training uses only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When used for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR), with a WER of 20.5 on LRS2, and improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks such as Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
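
To make the dual objective in the abstract concrete, the sketch below combines a contrastive alignment term with a generative text-prediction term. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss form, so a symmetric InfoNCE loss stands in for "contrastive feature alignment" and token-level cross-entropy stands in for "generative text prediction". The function names, the `temperature` value, and the `lambda_gen` weight are all hypothetical.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch (assumed form, not from the paper).

    audio_feats, visual_feats: arrays of shape (batch, dim). Row i of each
    comes from the same clip (the positive pair); all other rows in the
    batch act as negatives.
    """
    a = l2_normalize(audio_feats)
    v = l2_normalize(visual_feats)
    logits = a @ v.T / temperature            # (batch, batch) similarities
    idx = np.arange(a.shape[0])
    audio_to_visual = -log_softmax(logits, axis=1)[idx, idx].mean()
    visual_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_visual + visual_to_audio)


def generative_loss(token_logits, target_ids):
    """Cross-entropy of predicted text tokens (assumed form).

    token_logits: (num_tokens, vocab_size); target_ids: (num_tokens,) ints.
    """
    log_probs = log_softmax(token_logits, axis=-1)
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()


def dual_objective(audio_feats, visual_feats, token_logits, target_ids,
                   lambda_gen=1.0):
    # lambda_gen is a hypothetical weight balancing the two terms.
    return (contrastive_loss(audio_feats, visual_feats)
            + lambda_gen * generative_loss(token_logits, target_ids))
```

In this sketch, features from the same clip that agree across modalities yield a small contrastive term, while mismatched pairs are penalized; the generative term ties the shared representation to the transcript.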
