AI · CV · MM · Mar 24, 2024

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

arXiv:2403.16071v282 citations · h-index: 12 · LREC
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of speaker variability in lip reading systems. The advance is incremental: it builds on existing deep learning approaches.

The paper tackles cross-speaker lip reading by reducing visual variations across speakers, and reports improved performance under both intra-speaker and inter-speaker conditions on public datasets.

Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system may perform poorly when handling a brand-new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers, so that the model does not overfit to specific speakers. In this work, considering both the input visual clues and the latent representations of a hybrid CTC/attention architecture, we propose to exploit lip landmark-guided fine-grained visual clues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under the intra-speaker and inter-speaker conditions.
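
To make the idea of landmark-guided input concrete, the following is a minimal sketch of one plausible reading: cropping a small patch around each lip landmark rather than feeding the whole mouth region. The abstract does not specify how the visual clues are extracted, so the function name `extract_landmark_patches`, the `patch_size` parameter, the grayscale frame, and the (x, y) landmark layout are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_landmark_patches(frame, landmarks, patch_size=8):
    """Crop a square patch around each landmark of a grayscale frame.

    Hypothetical helper for illustration only.
    frame:     (H, W) array.
    landmarks: (K, 2) array of (x, y) pixel coordinates.
    Returns an array of shape (K, patch_size, patch_size).
    """
    half = patch_size // 2
    # Edge-pad so patches near the image border keep a fixed size.
    padded = np.pad(frame, half, mode="edge")
    patches = []
    for x, y in np.round(landmarks).astype(int):
        # Clamp so an off-image landmark cannot index out of range.
        x = int(np.clip(x, 0, frame.shape[1] - 1))
        y = int(np.clip(y, 0, frame.shape[0] - 1))
        # In padded coordinates the landmark sits at (y + half, x + half),
        # so this slice places it at offset (half, half) inside the patch.
        patches.append(padded[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)
```

Under these assumptions, a frame with K detected lip landmarks yields K small patches that carry local lip shape and texture while discarding most of the surrounding face, which is one way the speaker-specific appearance mentioned in the abstract could be reduced.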
