CV, Apr 2, 2024

BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

arXiv:2404.02098v1 | 26 citations | h-index: 36 | ICASSP
Originality: Incremental advance
AI Analysis

This work reduces reliance on costly transcribed data for speech recognition by pre-training on unlabelled audio-visual data, an approach that scales well for researchers and practitioners in audio-visual AI. The advance is incremental, since it builds on the existing RAVEn method.

The paper tackles the problem of learning speech representations from unlabelled audio-visual data. It proposes BRAVEn, an extension to RAVEn, which achieves state-of-the-art results among self-supervised methods for visual and auditory speech recognition: word error rates of 20.0% for VSR and 1.7% for ASR on the LRS3 test set, using only 30 hours of labelled data.

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
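The results above are reported as word error rate (WER). As a reading aid, here is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this is not the paper's evaluation code, and real evaluations usually normalise text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenisation is a plain whitespace split.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words gives a WER of 0.2 (20%).
example_wer = word_error_rate(
    "she sells sea shells today",
    "she sells see shells today",
)
```

Under this definition, a WER of 20.0% corresponds to an average of one word-level error per five reference words, and 1.7% to roughly one error per 59 reference words.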

Code Implementations: 1 repo
