AS, AI, CL | Feb 10, 2023

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

arXiv:2302.06419v2 | 48 citations | h-index: 52
Originality: Incremental advance
AI Analysis

This work targets audio-visual speech recognition systems that need less labeled data. The advance appears incremental, since it extends contextualized-representation prediction, already established for single modalities, to the audio-visual setting.

The paper introduces AV-data2vec, a self-supervised method that learns joint representations of the audio and video modalities. On the LRS3 dataset, it consistently outperforms existing methods when given the same amount of data and the same model size.

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations based on predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
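
To make "predicting contextualized representations" more concrete, here is a minimal pure-Python sketch of a data2vec-style training signal. It assumes the usual uni-modal recipe: a teacher whose weights are an exponential moving average (EMA) of the student's, targets formed by averaging the teacher's top layers, and a regression loss on masked time steps. The abstract does not spell out these details, so the function names, the `tau` and `top_k` parameters, and the squared-error loss are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical, simplified sketch; all names are illustrative, not from the paper's code.

def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau * t + (1 - tau) * s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def contextualized_targets(layer_outputs, top_k):
    """Average the last top_k teacher layer outputs at every time step.

    layer_outputs: list over layers; each layer is a list over time steps
    of feature vectors (lists of floats).
    """
    chosen = layer_outputs[-top_k:]
    num_steps = len(chosen[0])
    dim = len(chosen[0][0])
    return [
        [sum(layer[t][d] for layer in chosen) / len(chosen) for d in range(dim)]
        for t in range(num_steps)
    ]


def masked_regression_loss(predictions, targets, mask):
    """Mean squared error computed over masked time steps only (assumed loss)."""
    total, count = 0.0, 0
    for pred, tgt, is_masked in zip(predictions, targets, mask):
        if is_masked:
            total += sum((p - q) ** 2 for p, q in zip(pred, tgt))
            count += len(pred)
    return total / count if count else 0.0


# Toy usage: two teacher layers, two time steps, feature dimension 2.
teacher_layers = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[2.0, 4.0], [4.0, 6.0]],
]
targets = contextualized_targets(teacher_layers, top_k=2)
student_predictions = [[1.0, 2.0], [1.0, 4.0]]
loss = masked_regression_loss(student_predictions, targets, mask=[False, True])
new_teacher = ema_update([1.0, 2.0], [3.0, 4.0], tau=0.5)
```

In a real system the student and teacher would be the shared transformer encoder described above, fed audio, video, or both. The lists here only stand in for its per-layer outputs.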
