CL · MM · AS · Feb 13, 2020

Looking Enhances Listening: Recovering Missing Speech Using Images

arXiv:2002.05639v1 · 15 citations
Originality: Incremental advance
AI Analysis

This work addresses the robustness of ASR systems in noisy environments, but it is incremental because it builds on existing multimodal approaches.

The paper tackles the problem of improving automatic speech recognition (ASR) under noisy conditions by using visual context, and shows that multimodal ASR models can recover masked words with up to 35% relative improvement.

Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding their transcriptions using the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.
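To make the "relative improvement in masked word recovery" figure concrete, here is a minimal sketch of how such a number could be computed. The abstract does not define the metric, so the definition used here (the fraction of masked reference words that reappear anywhere in the model's transcript) is an assumption, and the function names `recovery_rate` and `relative_improvement` are hypothetical. The example scores are illustrative, not results from the paper.

```python
def recovery_rate(references, hypotheses, masked_positions):
    """Fraction of masked reference words that appear in the hypothesis.

    Assumed definition for illustration; the paper's exact metric may differ.
    references / hypotheses: lists of whitespace-tokenised transcripts.
    masked_positions: per-utterance lists of word indices masked in the audio.
    """
    recovered = 0
    total = 0
    for ref, hyp, positions in zip(references, hypotheses, masked_positions):
        ref_words = ref.split()
        hyp_words = set(hyp.split())
        for i in positions:
            total += 1
            if ref_words[i] in hyp_words:
                recovered += 1
    return recovered / total if total else 0.0


def relative_improvement(baseline, multimodal):
    """Relative gain of the multimodal score over the unimodal baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (multimodal - baseline) / baseline


if __name__ == "__main__":
    # Illustrative numbers only: a baseline recovering 40% of masked words
    # and a multimodal model recovering 54% give a 35% relative improvement.
    print(f"{relative_improvement(0.40, 0.54):.0%}")
```

Note that a relative improvement is measured against the baseline score, so a 35% relative gain is a much smaller change in absolute terms when the baseline recovery rate is low.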

