CV · Apr 2, 2024

BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic

arXiv:2404.02098v1 · 16.826 citations · h-index: 36 · Has Code · ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reducing reliance on costly transcribed data for speech recognition, offering a scalable approach for researchers and practitioners in audio-visual AI. The advance is incremental, since it builds on the existing RAVEn method.

The paper tackles the problem of learning speech representations from unlabelled audio-visual data by proposing BRAVEn, an extension to RAVEn. BRAVEn achieves state-of-the-art results among self-supervised methods for visual and auditory speech recognition, with word error rates of 20.0% for VSR and 1.7% for ASR on the LRS3 test set using only 30 hours of labelled data.

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
