CL AI CV SD AS IV | Jul 14, 2022

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

arXiv:2207.07036v27.255 citations | h-index: 41 | Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of limited labeled data and high deployment costs for multimodal speech models, offering a unified solution that improves robustness and efficiency for speech processing applications.

The paper tackles the challenge of developing robust audio-visual speech models by introducing u-HuBERT, a self-supervised pre-training framework that unifies multimodal and unimodal speech processing. A single model achieves state-of-the-art performance and generalizes zero-shot to unlabeled modalities; for example, it yields word error rates of 1.2%/1.4%/27.2% on LRS3 for audio-visual/audio/visual inputs.
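For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition. It is not the evaluation code used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 1.2% corresponds to roughly 12 word errors (substitutions, deletions, or insertions) per 1,000 reference words.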

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost to deploy one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert
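The abstract names modality dropout but does not describe its mechanics. The sketch below shows one common way the idea can be realized: with some probability, one modality's per-frame features are zero-filled before the two streams are concatenated, so the model learns to work from audio-visual, audio-only, or visual-only input. The function name `modality_dropout`, the parameters `p_drop` and `p_keep_audio`, the zero-filling, and the concatenation-based fusion are all assumptions for illustration, not details taken from the abstract or the released code.

```python
import random


def modality_dropout(audio, video, p_drop=0.5, p_keep_audio=0.5, rng=random):
    """Randomly drop one modality, then fuse the streams frame by frame.

    audio, video: lists of per-frame feature vectors (lists of floats)
    with the same number of frames. All names and defaults here are
    hypothetical and chosen only for this sketch.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of frames")

    if rng.random() < p_drop:
        if rng.random() < p_keep_audio:
            # Keep audio only: zero-fill the video features.
            video = [[0.0] * len(frame) for frame in video]
        else:
            # Keep video only: zero-fill the audio features.
            audio = [[0.0] * len(frame) for frame in audio]

    # Assumed fusion: concatenate along the feature dimension.
    return [a + v for a, v in zip(audio, video)]
```

The same zero-filling can be applied when a training example has only one modality to begin with, which is one way a single model could consume both multimodal and unimodal data.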
