SD | CV | LG | AS | IV | Feb 22, 2023

Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

arXiv:2302.11254v1 | 14 citations | h-index: 43
Originality: Incremental advance
AI Analysis

This work addresses speaker verification for security or biometric applications by improving accuracy through cross-modal learning, representing an incremental advance over existing fusion methods.

The paper tackles speaker verification by leveraging the correlation between audio and visual speech through a cross-modal co-learning paradigm, achieving 60% and 20% average relative performance improvements over independently trained audio-only/visual-only systems and baseline fusion systems, respectively.
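
As a rough illustration of what a relative improvement means, the sketch below computes it for an error-rate metric. The summary does not name the metric, so treating it as an error rate (equal error rate is a common choice in speaker verification) is an assumption, and the numbers in the example are hypothetical, not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction by which an error rate drops relative to the baseline.

    Assumes a lower-is-better metric such as an error rate; this choice
    is an assumption, not something stated in the summary.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers: an error rate falling from 5.0% to 2.0%
# corresponds to a 60% relative improvement.
print(f"{relative_improvement(5.0, 2.0):.0%}")
```

Note that a relative figure says nothing about the absolute gap: the same 60% would also describe a drop from 0.5% to 0.2%.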

Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
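
The max-feature-map (MFM) operation mentioned above can be sketched in isolation. In its usual formulation, MFM splits the feature channels into two halves and keeps the element-wise maximum, halving the channel count. The snippet below shows only that operation on a plain list. How the paper embeds it inside the Transformer variant is not described in the abstract, so this is not a reconstruction of the authors' booster; `max_feature_map` is a hypothetical helper name.

```python
def max_feature_map(features):
    """Element-wise maximum of the two halves of a feature vector.

    Illustrative only: a list stands in for the channel dimension of
    one frame, and the input values below are made up.
    """
    if len(features) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(features) // 2
    # Pair channel i with channel i + half and keep the larger value.
    return [max(a, b) for a, b in zip(features[:half], features[half:])]


frame = [0.2, -1.0, 3.5, 0.7, 0.1, 2.0]
print(max_feature_map(frame))  # [0.7, 0.1, 3.5]
```

Because each output channel is the winner of a pair, MFM acts as a competitive activation that also compresses the representation.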

Code Implementations: 1 repo
