CL SD AS
Nov 16, 2020

End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features

Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury

arXiv:2011.08238v1
1.413 citations

Originality: Incremental advance

AI Analysis

This work addresses spoken language understanding for applications like voice assistants, but it is incremental as it adapts existing NLP methods to SLU.

The paper addresses spoken language understanding (SLU) with a modular end-to-end transformer architecture that supports self-supervised pre-trained acoustic features, pre-trained model initialization, and multi-task training. In experiments on the ATIS dataset, the self-supervised features outperformed traditional filterbank features in almost all cases and, when combined with multi-task training, almost eliminated the need for pre-trained model initialization.

Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular end-to-end (E2E) SLU architecture based on transformer networks, which allows the use of self-supervised pre-trained acoustic features, pre-trained model initialization, and multi-task training. We perform several SLU experiments on the ATIS dataset, predicting intent and entity labels/values. These experiments investigate how pre-trained model initialization and multi-task training interact with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all experiments, but also that, when combined with multi-task training, these features almost eliminate the need for pre-trained model initialization.
