AS LG SD
Oct 12, 2021

Multi-Modal Pre-Training for Automated Speech Recognition

arXiv:2110.09890v2
16 citations
Originality: Incremental advance
AI Analysis

This work addresses the robustness of ASR in noisy environments. It is an incremental advance that augments existing methods with global context.

The paper tackles the vulnerability of automated speech recognition to local and global noise by introducing a multi-modal pre-training approach that integrates global environmental context. It reports improvements over baseline methods of up to 7% on Librispeech, and gains on internal datasets ranging from 6% (larger models) to 45% (smaller models).

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).
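The abstract refers to self-supervised learning based on masked language modeling. As a rough illustration of that general idea only, the sketch below hides a random subset of feature frames and scores a prediction on the hidden positions alone. It is not the paper's method: the function names `mask_frames` and `masked_mse`, the masking probability, the zero mask value, and the squared-error objective are all assumptions chosen for illustration.

```python
import random


def mask_frames(frames, mask_prob=0.15, mask_value=0.0, seed=0):
    """Return a masked copy of `frames` and the indices that were masked.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Hypothetical helper: the masking scheme is an assumption, not the paper's.
    """
    rng = random.Random(seed)
    masked = [list(f) for f in frames]  # copy so the input is not mutated
    masked_idx = []
    for i, frame in enumerate(masked):
        if rng.random() < mask_prob:
            masked[i] = [mask_value] * len(frame)
            masked_idx.append(i)
    return masked, masked_idx


def masked_mse(predicted, target, masked_idx):
    """Mean squared error over the masked positions only (assumed objective)."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy usage: 20 frames of 2 features each.
frames = [[float(i), float(i) + 1.0] for i in range(20)]
masked, idx = mask_frames(frames, mask_prob=0.5, seed=1)
# A model would predict the hidden frames from `masked`; here the masked
# input itself stands in as a deliberately poor "prediction".
loss = masked_mse(masked, frames, idx)
```

In a real pre-training setup, a model would consume `masked` and be trained to reduce this loss, so that it learns to infer hidden content from the surrounding context.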
