AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
This work addresses the need for audio-visual speech recognition systems that require less labeled data. The contribution appears incremental, since it extends contextualized representation prediction, already established in the uni-modal case, to joint audio-visual input.
The paper introduces AV-data2vec, a self-supervised method for audio-visual speech recognition that learns joint representations of the audio and video modalities. On the LRS3 dataset it consistently outperforms existing methods when the amount of data and the model size are held equal.
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings when given the same amount of data and the same model size.
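The idea of predicting contextualized representations with a shared encoder can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the additive fusion of audio and video features, zero-masking, the tanh layers standing in for transformer blocks, averaging the teacher's top-k layers, and the exponential-moving-average (EMA) teacher are all assumptions borrowed from the general uni-modal recipe, not details stated in the text above.

```python
import numpy as np

def encode(layers, x):
    """Toy stand-in for a shared transformer encoder; returns every layer's output."""
    outputs, h = [], x
    for w in layers:
        h = np.tanh(h @ w)  # placeholder for a transformer block (assumption)
        outputs.append(h)
    return outputs

def contextualized_targets(teacher_layers, fused, top_k):
    """Average the teacher's top-k layer outputs computed on the unmasked input."""
    outs = encode(teacher_layers, fused)
    return np.mean(outs[-top_k:], axis=0)

def masked_regression_loss(student_layers, teacher_layers, audio, video, mask, top_k=2):
    """Mean squared error between student predictions and teacher targets at masked frames."""
    fused = audio + video  # assumed fusion: both streams already share one feature size
    targets = contextualized_targets(teacher_layers, fused, top_k)
    masked_input = np.where(mask[:, None], 0.0, fused)  # zero out masked frames (assumption)
    prediction = encode(student_layers, masked_input)[-1]
    diff = prediction[mask] - targets[mask]
    return float(np.mean(diff ** 2))

def ema_update(teacher_layers, student_layers, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_layers, student_layers)]

# Hypothetical toy sizes: 6 frames, 4 features, 3 layers.
rng = np.random.default_rng(0)
frames, dim = 6, 4
audio = rng.normal(size=(frames, dim))
video = rng.normal(size=(frames, dim))
student = [0.5 * rng.normal(size=(dim, dim)) for _ in range(3)]
teacher = [w.copy() for w in student]
mask = np.array([False, True, True, False, False, True])

loss = masked_regression_loss(student, teacher, audio, video, mask)
teacher = ema_update(teacher, student)
```

In this sketch the teacher sees the full fused input while the student sees a masked copy, so the regression targets carry context from frames the student cannot observe. A real system would replace the toy layers with a transformer and update the student by gradient descent on the loss.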