CL SD AS | Jul 22, 2023

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer

Meta AI

arXiv:2307.12134v1 | 1.74 citations | h-index: 43

Originality: Incremental advance

AI Analysis

This work addresses robustness issues in SLU for on-device streaming applications, but it is incremental as it builds on existing E2E SLU methods.

The paper targets the vulnerability of end-to-end spoken language understanding systems to ASR transcription errors. It proposes a system that fuses audio and text representations according to the estimated modality confidence of the ASR hypotheses, and it reports accuracy improvements on the STOP dataset.

End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have recently become more promising. This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models, and it outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on the STOP dataset and share analysis that demonstrates the effectiveness of our approach.
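The idea of fusing audio and text representations by modality confidence can be illustrated with a toy sketch. The abstract does not state the fusion rule, so the code below assumes a simple convex combination driven by a scalar confidence score in [0, 1]. The function name `fuse_by_confidence`, the scalar form of the confidence, and the toy vectors are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_by_confidence(
    audio: Sequence[float],
    text: Sequence[float],
    confidence: float,
) -> List[float]:
    """Blend two same-length feature vectors.

    ASSUMPTION: a convex combination with a scalar weight. A confidence of 1.0
    trusts the text (ASR hypothesis) representation fully; a confidence of 0.0
    falls back to the audio representation alone.
    """
    if len(audio) != len(text):
        raise ValueError("audio and text vectors must have the same length")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Element-wise weighted sum of the two modalities.
    return [confidence * t + (1.0 - confidence) * a for a, t in zip(audio, text)]


# Toy vectors standing in for learned representations (hypothetical values).
audio_repr = [1.0, 2.0]
text_repr = [3.0, 4.0]

high_conf = fuse_by_confidence(audio_repr, text_repr, 1.0)  # text only
low_conf = fuse_by_confidence(audio_repr, text_repr, 0.0)   # audio only
mixed = fuse_by_confidence(audio_repr, text_repr, 0.5)      # equal blend
```

Under this assumption, a low-confidence ASR hypothesis contributes little to the fused vector, which is one plausible way for a model to stay robust when the transcript is unreliable.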

