CLOct 16, 2020

Multimodal Speech Recognition with Unstructured Audio Masking

arXiv:2010.08642v1997 citations
Originality Incremental advance
AI Analysis

This work addresses the need for more robust speech recognition in noisy environments, but it is incremental as it builds on prior methods with a more realistic masking scenario.

The paper tackled the problem of making multimodal speech recognition more realistic by simulating unstructured audio masking during training, and showed that the model can recover various masked words and attend to visual signals when audio is corrupted.

Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words are systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes