MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
MuAViC provides a new benchmark for researchers in audio-visual speech processing, though the contribution is incremental: it extends existing datasets to multilingual and translation tasks.
The authors address the problem of building robust speech recognition and translation models by introducing MuAViC, a multilingual audio-visual corpus with 1200 hours of speech in 9 languages. Their baseline results show that the corpus is effective for this purpose.
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X and 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.