CV · MM · SD · AS · Sep 9, 2022

Learning Audio-Visual embedding for Person Verification in the Wild

arXiv:2209.04093v2 · 4 citations · h-index: 29
AI Analysis

This work addresses person verification in the wild and offers an incremental improvement for applications such as security and biometrics.

The paper tackles person verification with a novel audio-visual embedding strategy that combines weight-enhanced attentive statistics pooling with gated attention fusion. It reports equal error rates (EER) of 0.18%, 0.27%, and 0.49% on the three official VoxCeleb1 trial lists, which the authors describe as state of the art.
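For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (different-person trials accepted) equals the false rejection rate (same-person trials rejected). The sketch below approximates it by sweeping a threshold over the observed scores. The function name and the toy scores are illustrative assumptions; this is not the official VoxCeleb scoring tool, and the paper's exact scoring procedure is not stated here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores of same-person trials.
    nontarget_scores: similarity scores of different-person trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # False rejection: a same-person trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-person trial scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not taken from the paper).
same_person = [0.9, 0.8, 0.7, 0.4]
different_person = [0.6, 0.3, 0.2, 0.1]
print(f"EER: {equal_error_rate(same_person, different_person):.2%}")
```

With only a handful of trials the two error curves rarely cross exactly, which is why the sketch averages the two rates at the closest threshold. Evaluation toolkits typically interpolate between thresholds instead.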

Audio-visual embeddings have already been observed to be more robust than single-modality embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling for the first time in face verification. Second, we find that a strong correlation exists between modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn the implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which is, to our knowledge, the best published result for person verification.
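The two building blocks named in the abstract can be sketched in a few lines. The code below shows plain attentive statistics pooling (an attention-weighted mean and standard deviation over frames) and a generic sigmoid-gated fusion of two embeddings. It is a minimal sketch under stated assumptions: it does not reproduce the paper's weight-enhanced or joint (cycle-consistent) pooling, the gate formulation is a common choice rather than the authors' confirmed design, and all parameter names (`w`, `b`, `v`, `gate_w`, `gate_b`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, w, b, v):
    """Pool T frame vectors (each of size D) into one vector of size 2 * D.

    w (H x D), b (H) and v (H) are hypothetical attention parameters.
    """
    scores = []
    for frame in frames:
        # Frame score: v^T tanh(W x + b).
        hidden = [
            math.tanh(sum(w_jd * x_d for w_jd, x_d in zip(row, frame)) + b_j)
            for row, b_j in zip(w, b)
        ]
        scores.append(sum(v_j * h_j for v_j, h_j in zip(v, hidden)))
    alpha = softmax(scores)  # one weight per frame, summing to 1

    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(dim)]
    second = [sum(a * f[d] ** 2 for a, f in zip(alpha, frames)) for d in range(dim)]
    # Weighted standard deviation; clamp at 0 to absorb rounding error.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean + std  # concatenation: [mean ; std]


def gated_fusion(audio, visual, gate_w, gate_b):
    """Blend two D-sized embeddings with a per-dimension sigmoid gate.

    gate_w (D x 2D) and gate_b (D) are hypothetical gate parameters.
    """
    joint = audio + visual  # list concatenation: [audio ; visual]
    fused = []
    for row, bias, a, vis in zip(gate_w, gate_b, audio, visual):
        z = 1.0 / (1.0 + math.exp(-(sum(g * x for g, x in zip(row, joint)) + bias)))
        # z near 1 trusts the audio embedding, z near 0 the visual one.
        fused.append(z * a + (1.0 - z) * vis)
    return fused


# Toy usage: two frames of size 2, attention parameters set to zero
# so every frame receives the same weight.
pooled = attentive_stats_pooling([[1.0, 2.0], [3.0, 6.0]], [[0.0, 0.0]], [0.0], [0.0])
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
print(pooled, fused)
```

In a trained system the attention and gate parameters are learned, so informative frames receive larger weights and the gate can lean on whichever modality is more reliable for a given input, for example the face when the audio is noisy.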

