AS | CV | SD | May 15, 2022

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

arXiv:2205.07180v2 | 24 citations | h-index: 41
Originality: Synthesis-oriented
AI Analysis

This work improves the performance and noise robustness of speaker verification. It is an incremental contribution: it applies an existing pre-training method to a specific domain.

The paper studies speaker representation learning with self-supervised pre-training on audio-visual inputs, focusing on the AV-HuBERT framework. It reports that pre-training improves label efficiency roughly tenfold for speaker verification, and that adding visual lip information reduces the equal error rate (EER) by 38% in clean conditions and 75% in noisy conditions.

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency by roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.
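As a rough illustration of the metric behind these figures, the sketch below computes an equal error rate (EER) from verification trial scores and the relative reduction between two EER values. It is a minimal example and not the paper's evaluation code: the function names `equal_error_rate` and `relative_reduction` are hypothetical, the threshold sweep is a simple approximation, and all scores and EER values shown are made up for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER for a set of verification trials.

    scores: similarity per trial (higher means more likely the same speaker).
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    impostor = scores[labels == 0]

    # Sweep every observed score as a decision threshold;
    # a trial is accepted when score >= threshold.
    thresholds = np.unique(scores)
    far = np.array([np.mean(impostor >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(target < t) for t in thresholds])     # false rejects

    # The EER is where the two error rates cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Made-up trial scores, not data from the paper.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)

# Hypothetical EERs for two systems, chosen only to show the arithmetic.
reduction = relative_reduction(baseline=0.10, improved=0.062)
```

A percentage such as "reducing EER by 38%" is a relative reduction of this kind: the drop in EER divided by the baseline EER, not a difference in absolute percentage points.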

Code Implementations: 1 repo