AS LG SD Nov 10, 2019

Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders

arXiv:1911.03930v1, 21 citations
Originality: Incremental advance
AI Analysis

This work addresses robustness in audio-visual speech enhancement for applications like hearing aids or video conferencing, but it is incremental as it builds on existing VAE-based methods.

The paper addresses the lack of robustness of audio-visual speech enhancement to noisy visual data, such as non-frontal faces or occluded lips. It proposes a per-frame mixture of variational autoencoders that switches between a trained audio-only model and a trained audio-visual model, with the model parameters estimated by variational expectation-maximization. The authors report promising performance in their experiments.

Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model for the noise variance to perform unsupervised speech enhancement. When the visual data is clean, speech enhancement with the audio-visual VAE performs better than with the audio-only VAE, which is trained on audio-only data. However, the audio-visual VAE is not robust against noisy visual data, e.g., when the speaker's face is not frontal or the lip region is occluded in some video frames. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame VAE mixture model. This mixture model consists of a trained audio-only VAE and a trained audio-visual VAE. The motivation is to skip noisy visual frames by switching to the audio-only VAE model. We present a variational expectation-maximization method to estimate the parameters of the model. Experiments show the promising performance of the proposed method.
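
The per-frame switching idea described in the abstract can be illustrated with a toy two-component mixture. The sketch below is not the paper's variational EM algorithm, which also involves the VAE latent variables and the NMF noise model. It only shows how, given a per-frame log-likelihood under each component, a posterior weight for the audio-visual component could be computed and the mixing prior re-estimated. The function names, the `prior_av` parameter, and the log-likelihood values are all hypothetical and chosen for illustration.

```python
import math


def frame_responsibilities(loglik_a, loglik_av, prior_av=0.5):
    """Posterior probability, per frame, of the audio-visual component.

    loglik_a / loglik_av: per-frame log-likelihoods under the audio-only
    and audio-visual components (assumed to be given; made-up values here).
    prior_av: assumed prior probability of the audio-visual component.
    """
    resp = []
    for la, lav in zip(loglik_a, loglik_av):
        a = math.log(1.0 - prior_av) + la   # audio-only, prior-weighted
        b = math.log(prior_av) + lav        # audio-visual, prior-weighted
        m = max(a, b)                       # subtract max for numerical stability
        za, zb = math.exp(a - m), math.exp(b - m)
        resp.append(zb / (za + zb))
    return resp


def update_prior(resp):
    """Re-estimate the mixing prior as the mean responsibility (standard mixture M-step)."""
    return sum(resp) / len(resp)


# Invented example: the second frame stands in for a frame with corrupted
# visual input, where the audio-visual component fits much worse.
loglik_a = [-10.0, -12.0, -9.0]
loglik_av = [-8.0, -20.0, -9.0]
resp = frame_responsibilities(loglik_a, loglik_av)
new_prior = update_prior(resp)
```

In this toy setting, a frame whose audio-visual likelihood is much lower than its audio-only likelihood receives a weight close to zero, which mirrors the stated motivation of skipping noisy visual frames by switching to the audio-only model.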

