AS · CV · SD · Jun 14, 2024

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

arXiv:2406.10082v3 · 51 citations
Originality: Highly original
AI Analysis

This work addresses the challenge of improving speech recognition and translation accuracy in noisy environments for users of AVSR systems. It offers a versatile, unified model; the approach is incremental, but it delivers strong performance gains.

The paper tackles the problem of limited video training data in Audio-Visual Speech Recognition (AVSR) by adapting the Whisper model to handle video inputs through gated cross attention. It achieves state-of-the-art word error rates (e.g., 0.68% ASR and 0.76% AVSR on LRS3) and outperforms audio-only Whisper in noisy conditions on both recognition and translation tasks.

Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data difference motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo which injects visual features into language models, we propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention. Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3, and state-of-the-art ASR WER (1.3%) and AVSR WER (1.4%) on LRS2. Audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is versatile and conducts all of these tasks using one set of parameters, while prior methods are trained separately on each language.
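The gated cross attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the Flamingo-style formulation, in which a learnable scalar gate is passed through tanh and initialized to zero so that the pretrained model's behaviour is unchanged at the start of training. The function name, the single attention head, and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are hypothetical simplifications; layer normalization and the gated feed-forward sublayer are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_h, video_feats, Wq, Wk, Wv, Wo, gate):
    """Single-head gated cross attention (illustrative sketch only).

    text_h:      (T_text, d)  decoder hidden states (queries)
    video_feats: (T_video, d) visual features (keys and values)
    gate:        scalar; tanh(gate) scales the visual contribution
    """
    q = text_h @ Wq
    k = video_feats @ Wk
    v = video_feats @ Wv
    # Scaled dot-product attention from decoder states to video frames.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    context = attn @ v @ Wo
    # Residual connection; with gate == 0 the layer is the identity.
    return text_h + np.tanh(gate) * context


# Toy dimensions chosen for illustration, not taken from the paper.
rng = np.random.default_rng(0)
d = 8
text_h = rng.normal(size=(5, d))
video_feats = rng.normal(size=(12, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))

at_init = gated_cross_attention(text_h, video_feats, Wq, Wk, Wv, Wo, gate=0.0)
print(np.allclose(at_init, text_h))
```

Under these assumptions, a zero gate leaves the decoder states untouched, so the audio-only model's outputs are preserved until training moves the gate away from zero and visual information begins to flow in.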

Code Implementations: 1 repo
