AS · LG · SD · IV · Nov 15, 2018

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

arXiv:1811.06234v1 · 22 citations
Originality: Synthesis-oriented
AI Analysis

This work provides guidance for practitioners in audio-visual speech enhancement by systematically evaluating training choices, though it is incremental as it focuses on comparing existing methods rather than introducing new ones.

The paper experimentally compares different training targets and objective functions for deep-learning-based audio-visual speech enhancement. It finds that approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, while direct estimation of the log magnitude spectrum performs equally well in terms of estimated speech quality.

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e., the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
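To make the distinction between a training target and an objective function concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes magnitude spectrograms stored as arrays, a mean-squared-error objective, and an ideal amplitude mask defined as the clipped ratio of clean to noisy magnitude. The function names, the `eps` and `clip` parameters, and the toy data are all illustrative assumptions.

```python
import numpy as np


def ideal_amplitude_mask(clean_mag, noisy_mag, eps=1e-8, clip=10.0):
    """Mask target: ratio of clean to noisy magnitude, clipped (assumed definition)."""
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, clip)


def log_magnitude(mag, eps=1e-8):
    """Log magnitude spectrum target; eps avoids log(0)."""
    return np.log(mag + eps)


def mse(estimate, target):
    """Mean squared error, used here as the objective function."""
    return float(np.mean((estimate - target) ** 2))


def mask_loss(pred_mask, clean_mag, noisy_mag):
    """Network outputs a mask; the loss is computed against the mask target."""
    return mse(pred_mask, ideal_amplitude_mask(clean_mag, noisy_mag))


def masked_spectrum_loss(pred_mask, clean_mag, noisy_mag):
    """Network outputs a mask; the loss is computed on the masked noisy spectrum."""
    return mse(pred_mask * noisy_mag, clean_mag)


def log_spectrum_loss(pred_log_mag, clean_mag):
    """Network outputs the log magnitude spectrum directly."""
    return mse(pred_log_mag, log_magnitude(clean_mag))


# Toy data (hypothetical): 4 frames x 5 frequency bins. Adding magnitudes
# is a simplification that ignores phase; it only serves the illustration.
rng = np.random.default_rng(0)
clean_mag = rng.uniform(0.1, 1.0, size=(4, 5))
noise_mag = rng.uniform(0.1, 1.0, size=(4, 5))
noisy_mag = clean_mag + noise_mag

oracle_mask = ideal_amplitude_mask(clean_mag, noisy_mag)
identity_mask = np.ones_like(noisy_mag)  # leaves the noisy input unchanged

print(mask_loss(oracle_mask, clean_mag, noisy_mag))
print(masked_spectrum_loss(identity_mask, clean_mag, noisy_mag))
print(log_spectrum_loss(log_magnitude(clean_mag), clean_mag))
```

In this sketch the same network output (a mask) can be scored by two different objectives, which is the kind of target/objective pairing the study compares. Which pairing works best in practice is an empirical question that the toy data here does not answer.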

