LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Doyeop Kwak, Jeongsoo Choi, Suyeon Lee, Joon Son Chung

arXiv:2604.27866


AI Analysis

This benchmark provides a more realistic evaluation for AVSR researchers, addressing the lack of diverse, in-the-wild conditions in existing benchmarks.

The authors introduce LRS-VoxMM, a new benchmark for audio-visual speech recognition derived from real-world conversations. It is considerably harder than LRS3, and results on it show that visual information becomes more important as the audio degrades.

We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual information in challenging real-world conditions.
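The abstract names three distortion types (additive noise, reverberation, and bandwidth limitation) but does not say how the released sets were generated. The following is a minimal sketch of one common way to apply each distortion to a mono waveform, not the authors' pipeline. The function names, the SNR-based mixing rule, and the ideal FFT-domain low-pass filter are all assumptions made for illustration.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB (hypothetical helper)."""
    # Repeat or trim the noise so it matches the speech length.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and keep the original length."""
    return np.convolve(speech, rir)[: len(speech)]


def limit_bandwidth(speech, sample_rate, cutoff_hz):
    """Remove all content above cutoff_hz with an ideal low-pass filter in the FFT domain."""
    spectrum = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(speech))
```

In this sketch, a lower `snr_db` or `cutoff_hz` gives a more severe degradation, which is the regime where the abstract reports the visual stream contributing most. A real evaluation set would more likely use recorded noise and measured room impulse responses than the synthetic inputs used to exercise these helpers.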
