AS | CV | SD | Feb 3, 2025

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

arXiv:2502.01547v3 | 4 citations | h-index: 11 | IEEE Signal Processing Letters
AI Analysis

This paper addresses robust speech recognition across multiple languages in noisy environments. It offers a domain-specific improvement over existing audio-visual methods, most of which are trained only on English data.

The paper tackles multilingual audio-visual speech recognition in noisy conditions by proposing mWhisper-Flamingo, which integrates pre-trained audio and video models with decoder modality dropout, achieving state-of-the-art word error rates on a 9-language dataset and outperforming audio-only methods in noise.
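The word error rate (WER) mentioned above is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words. Published results usually normalize text (casing, punctuation) before scoring, which this sketch omits.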

Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
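The abstract describes decoder modality dropout as training on both paired audio-visual inputs and separate audio or visual inputs. One way to picture this is a per-example random choice of which modalities to pass on. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `apply_modality_dropout`, the use of `None` to mark a dropped modality, and the default probabilities are all hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple

Features = Sequence[float]


def apply_modality_dropout(
    audio: Features,
    video: Features,
    p_audio_only: float = 0.25,  # hypothetical value, not from the paper
    p_video_only: float = 0.25,  # hypothetical value, not from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[Features], Optional[Features]]:
    """Randomly keep both modalities, audio only, or video only.

    A dropped modality is returned as None; the remaining probability
    mass (1 - p_audio_only - p_video_only) keeps the paired input.
    """
    if p_audio_only < 0 or p_video_only < 0 or p_audio_only + p_video_only > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    if rng is None:
        rng = random.Random()

    u = rng.random()  # uniform in [0, 1)
    if u < p_audio_only:
        return audio, None   # audio-only example
    if u < p_audio_only + p_video_only:
        return None, video   # video-only example
    return audio, video      # paired audio-visual example
```

In a training loop, a function like this would be called once per example (or per batch) before the features reach the decoder, so the model sees all three input configurations over the course of training.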

Code Implementations: 1 repo