CLCVSDASJul 13, 2023

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

arXiv:2307.07057v17 citationsh-index: 32
Originality Highly original
AI Analysis

This work addresses efficient and accurate spoken language understanding for applications like voice assistants, though it is incremental in leveraging existing ASR pretraining.

The authors tackled speech intent classification and slot filling by using a pretrained ASR encoder to initialize an end-to-end Conformer-Transformer model, achieving state-of-the-art results on the SLURP dataset with 90.14% intent accuracy and 82.27% SLURP-F1.

We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add Adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full finetuning to achieve comparable results. In addition, we provide an in-depth comparison on end-to-end models versus cascading models (ASR+NLU), and show that E2E models are better than cascaded models unless an oracle ASR model is provided. Last but not least, our model is the first E2E model that achieves the same performance as cascading models with oracle ASR. Code, checkpoints and configs are available.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes