LG · SD · AS · ML · Sep 29, 2023

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

arXiv:2309.17395v1 · 6 citations · h-index: 54
Originality: Incremental advance
AI Analysis

This work addresses speech recognition for audio-visual systems, offering an incremental improvement by leveraging unlabeled data to enhance visual-only performance.

The paper tackles audio-visual speech recognition by introducing AV-CPL, a semi-supervised method that applies continuous pseudo-labeling to a mix of labeled and unlabeled videos. It achieves significant improvements in visual speech recognition on the LRS3 dataset while maintaining practical audio-only and audio-visual speech recognition performance.

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the LRS3 dataset while maintaining practical ASR and AVSR performance. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.
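The continuous pseudo-labeling loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a toy nearest-centroid classifier on 1-D points stands in for the AVSR model, and every name here (`predict`, `train_step`, `continuous_pseudo_labeling`), the supervised warm-up phase, the step counts, and the batch size are assumptions made for illustration. The only ideas taken from the abstract are that a single model both trains and generates pseudo-labels, and that those pseudo-labels are regenerated continuously as the model improves.

```python
import random


def predict(centroids, x):
    """Return the label whose centroid is closest to x (stand-in for decoding a transcript)."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def train_step(centroids, batch, lr=0.5):
    """Move each centroid toward the mean of the batch points that carry its label."""
    for label in centroids:
        points = [x for x, y in batch if y == label]
        if points:
            mean = sum(points) / len(points)
            centroids[label] += lr * (mean - centroids[label])


def continuous_pseudo_labeling(labeled, unlabeled, warmup_steps=5,
                               cpl_steps=20, batch_size=8, seed=0):
    """Toy CPL loop: the same model is trained and used to label unlabeled data."""
    rng = random.Random(seed)
    centroids = {0: -0.1, 1: 0.1}  # arbitrary starting point (assumption)

    # Phase 1 (assumed): supervised warm-up on labeled data only.
    for _ in range(warmup_steps):
        train_step(centroids, labeled)

    # Phase 2: regenerate pseudo-labels with the current model at every step,
    # then train on labeled and pseudo-labeled examples together.
    for _ in range(cpl_steps):
        sample = rng.sample(unlabeled, k=min(batch_size, len(unlabeled)))
        pseudo = [(x, predict(centroids, x)) for x in sample]
        train_step(centroids, labeled + pseudo)

    return centroids


if __name__ == "__main__":
    labeled = [(-1.0, 0), (1.0, 1)]  # two labeled points
    unlabeled = [-2.2, -2.0, -1.8, -2.1, -1.9, 1.9, 2.0, 2.2, 2.1, 1.8]
    print(continuous_pseudo_labeling(labeled, unlabeled))
```

Because the pseudo-labels come from the model being trained, no external recognizer is needed. The trade-off is that early mistakes can reinforce themselves, which is why this sketch starts from a supervised warm-up.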

