CV · Sep 23, 2020

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud

arXiv:2009.11204v23.311 citations

Originality: Incremental advance

AI Analysis

This work addresses V-VAD for scenarios where audio is unreliable or missing, offering a domain-specific solution with incremental improvements.

The paper addresses visual voice activity detection (V-VAD) by proposing two deep architectures, one based on facial landmarks and one based on optical flow, and by introducing WildVVAD, an automatically annotated in-the-wild dataset. Its empirical evaluation shows an advantage when the proposed models are trained on this dataset.

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
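The abstract describes the annotation idea only at a high level (A-VAD combined with face detection and tracking), so the following is a minimal sketch of how such a combination could assign labels, not the authors' pipeline. It assumes each video shows a single tracked face, that an audio VAD has already produced non-overlapping speech intervals, and that a face track is labeled by how much of it overlaps speech. The names `label_tracks`, `min_len`, and `purity` and their default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class LabeledClip:
    track_id: int
    start: float
    end: float
    speaking: bool


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def label_tracks(
    face_tracks: Dict[int, Interval],
    speech: List[Interval],
    min_len: float = 1.0,   # hypothetical: discard very short tracks
    purity: float = 0.9,    # hypothetical: required voiced/unvoiced fraction
) -> List[LabeledClip]:
    """Label each face track as speaking or silent from audio VAD output.

    Assumes `speech` intervals do not overlap each other and that the
    tracked face is the only possible speaker in the video.
    """
    clips = []
    for track_id, (start, end) in face_tracks.items():
        duration = end - start
        if duration < min_len:
            continue
        voiced = sum(overlap((start, end), s) for s in speech)
        ratio = voiced / duration
        if ratio >= purity:
            clips.append(LabeledClip(track_id, start, end, True))
        elif ratio <= 1.0 - purity:
            clips.append(LabeledClip(track_id, start, end, False))
        # Tracks with mixed speech and silence are dropped as ambiguous.
    return clips


if __name__ == "__main__":
    tracks = {0: (0.0, 2.0), 1: (5.0, 7.0), 2: (1.5, 3.5)}
    for clip in label_tracks(tracks, speech=[(0.0, 2.5)]):
        print(clip)
```

Dropping tracks that mix speech and silence trades dataset size for label purity; a real pipeline would also need to handle videos with several visible faces, which this sketch ignores.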

