CL / SD / AS, Jun 30, 2019

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

arXiv:1907.00477v2, 8 citations
Originality: Incremental advance
AI Analysis

The paper's results reveal a limitation in current methods for integrating modalities in robust speech recognition, which matters for applications in noisy environments.

The paper investigated whether visual context improves multimodal speech recognition under noisy conditions, finding that while multimodal models show up to 4.2% WER improvements over unimodal ones, they fail to effectively use visual information when audio is corrupted.

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition (MMASR) in adversarial settings, where we deprive the models of part of the audio signal at inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (up to 4.2% WER improvements), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and we need better visually grounded adaptation techniques.
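
For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The abstract does not say whether the 4.2% figure is an absolute or a relative difference in WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which real evaluation toolkits may refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")
```

Comparing a unimodal and a multimodal system then amounts to computing this score for each system over the same test set and taking the difference.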

