CVAS Nov 27, 2018

Noise-tolerant Audio-visual Online Person Verification using an Attention-based Neural Network Fusion

arXiv:1811.10813v1, 55 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reliable person verification in real-world scenarios with noisy or missing data, though it is incremental as it builds on existing multi-modal methods.

The paper tackles the problem of robust person verification in noisy or incomplete audio-visual data by proposing an attention-based neural network that learns to select salient modalities, achieving favorable performance on the VoxCeleb2 dataset and robustness to extreme corruption or missing modalities.

In this paper, we present a multi-modal online person verification system using both speech and visual signals. Inspired by neuroscientific findings on the association of voice and face, we propose an attention-based end-to-end neural network that learns multi-sensory associations for the task of person verification. The attention mechanism in our proposed network learns to conditionally select a salient modality between speech and facial representations, providing a balance between complementary inputs. By virtue of this capability, the network is robust to missing or corrupted data from either modality. On the VoxCeleb2 dataset, we show that our method performs favorably against competing multi-modal methods. Even in extreme cases of large corruption or an entirely missing modality, our method is more robust than unimodal methods.
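
The modality selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes that the speech and face embeddings share one dimension, that a single linear layer scores the two modalities from their concatenation, and that a softmax turns those scores into fusion weights. The names `attention_fuse`, `weights`, and `bias` are hypothetical, and the parameters below are set by hand rather than learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_emb, face_emb, weights, bias):
    """Fuse two same-sized embeddings with per-modality attention.

    weights: 2 rows, each of length len(audio_emb) + len(face_emb)
             (hypothetical scoring layer, one row per modality)
    bias:    2 floats
    Returns (fused_embedding, [alpha_audio, alpha_face]).
    """
    concat = list(audio_emb) + list(face_emb)
    # One scalar score per modality, computed from both inputs.
    scores = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weights, bias)
    ]
    alpha = softmax(scores)
    # Weighted sum of the two embeddings.
    fused = [alpha[0] * a + alpha[1] * f for a, f in zip(audio_emb, face_emb)]
    return fused, alpha


dim = 4
audio = [1.0, 2.0, 3.0, 4.0]
face = [3.0, 2.0, 1.0, 0.0]

# Zero scoring weights: both modalities are trusted equally.
zero_w = [[0.0] * (2 * dim) for _ in range(2)]
fused_equal, alpha_equal = attention_fuse(audio, face, zero_w, [0.0, 0.0])

# A large bias toward audio stands in for a case where the face input
# is corrupted and the scoring layer has learned to ignore it.
fused_audio, alpha_audio = attention_fuse(audio, face, zero_w, [50.0, 0.0])
```

With equal scores the fused vector is the plain average of the two embeddings. When one score dominates, the fused vector collapses to that modality's embedding, which is the behaviour that would let a trained network fall back on the cleaner input.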

Code Implementations: 1 repo
