Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding
This work addresses the challenge of spoken language understanding for systems using ASR, though it is incremental as it builds on existing fine-tuning techniques.
The paper tackles the problem of applying pre-trained language models to noisy ASR transcripts. It proposes a confusion-aware fine-tuning method that makes contextualized embeddings more robust to ASR errors, yielding significant performance improvements on the ATIS dataset.
Employing pre-trained language models (LMs) to extract contextualized word representations has achieved state-of-the-art performance on various NLP tasks. However, applying this technique to noisy transcripts generated by automatic speech recognition (ASR) remains problematic. Therefore, this paper focuses on making contextualized representations more ASR-robust. We propose a novel confusion-aware fine-tuning method to mitigate the impact of ASR errors on pre-trained LMs. Specifically, we fine-tune LMs to produce similar representations for acoustically confusable words, which are obtained from word confusion networks (WCNs) produced by ASR. Experiments on the benchmark ATIS dataset show that the proposed method significantly improves the performance of spoken language understanding on ASR transcripts. Our source code is available at https://github.com/MiuLab/SpokenVec
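The confusion-aware objective described above can be illustrated with a minimal sketch. It is not the released implementation. The abstract does not specify the loss, so cosine distance is an assumption here. The toy WCN, the one-bin-per-reference-word alignment, the static embedding table (a stand-in for contextualized LM outputs), and the function names `confusable_pairs` and `confusion_loss` are all hypothetical.

```python
import math

def confusable_pairs(wcn, reference):
    """Pair each reference word with the competing hypotheses in its WCN bin.

    Assumes (for illustration) that the WCN has exactly one bin per
    reference word; each bin is a list of (word, posterior) tuples.
    """
    pairs = []
    for bin_, ref_word in zip(wcn, reference):
        for word, _posterior in bin_:
            if word != ref_word:
                pairs.append((ref_word, word))
    return pairs

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def confusion_loss(pairs, embed):
    """Mean distance between representations of confusable word pairs.

    Minimizing this during fine-tuning would pull the representations of
    acoustically confusable words closer together (assumed loss form).
    """
    if not pairs:
        return 0.0
    return sum(cosine_distance(embed[a], embed[b]) for a, b in pairs) / len(pairs)

# Toy word confusion network for the utterance "show flights to".
wcn = [
    [("show", 0.9), ("so", 0.1)],
    [("flights", 0.6), ("fights", 0.4)],
    [("to", 1.0)],
]
reference = ["show", "flights", "to"]

# Hypothetical 2-d embeddings standing in for LM representations.
embed = {
    "show": [1.0, 0.0],
    "so": [0.0, 1.0],
    "flights": [1.0, 1.0],
    "fights": [1.0, 1.0],
    "to": [0.0, 1.0],
}

pairs = confusable_pairs(wcn, reference)
loss = confusion_loss(pairs, embed)
print(pairs, loss)
```

In an actual fine-tuning setup, the embeddings would come from the LM's forward pass on each hypothesis in context, and a term like this would presumably be combined with the usual language-modeling objective. That combination is likewise an assumption rather than something stated in the abstract.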