AS | AI | CL | May 31, 2023

ViLaS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

arXiv:2305.19972v2 | 5 citations
Originality: Incremental advance
AI Analysis

This work addresses ASR performance in scenarios with multimodal context. The advance is incremental, since it builds on prior multimodal ASR research.

The authors tackle the problem of improving automatic speech recognition (ASR) by integrating visual and textual context. They propose the ViLaS model and a training strategy for modal-incomplete scenarios, and report empirical results on Flickr8K and a new dataset, VSDial.

Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have focused primarily on visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit ASR in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K and self-constructed VSDial datasets. We explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition.

