CL SD AS

May 4, 2023

End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

Jixuan Wang, Martin Radfar, Kai Wei, Clement Chung

arXiv:2305.02937v2

0.5

Originality: Incremental advance

AI Analysis

This work addresses computational inefficiency in end-to-end SLU for speech processing applications, representing an incremental improvement over existing methods.

The paper tackles the challenge of extracting semantic meanings directly from audio in spoken language understanding by using joint CTC loss and self-supervised acoustic encoders, achieving a 4% absolute improvement on DSTC2 and 1.3% on SLURP datasets over state-of-the-art models.

It is challenging to extract semantic meanings directly from audio signals in spoken language understanding (SLU), due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics, which, however, require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings and use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset and 1.3% absolute improvement over the SOTA SLU model on the SLURP dataset.
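To make the idea of a joint CTC and SLU objective concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the weighted-sum form, the `ctc_weight` parameter, the blank index, and all function names are assumptions made for illustration, and the abstract does not say how the two losses are combined. The CTC term is computed with the standard forward algorithm over per-frame token probabilities, and the SLU term is an ordinary cross-entropy over utterance-level intent logits.

```python
import math

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def ctc_loss(frame_probs, target):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    frame_probs: list of T per-frame probability distributions over tokens.
    target: list of token ids, without blanks.
    Works in the probability domain, so it only suits short toy inputs.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for tok in target:
        ext += [tok, BLANK]

    # alpha[s] = total probability of all alignments ending at ext[s]
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for probs in frame_probs[1:]:
        new = [0.0] * len(ext)
        for s, tok in enumerate(ext):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            if s >= 2 and tok != BLANK and tok != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * probs[tok]
        alpha = new

    # Valid alignments end on the last token or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


def cross_entropy(logits, label):
    """Cross-entropy of one utterance-level prediction, from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(frame_probs, transcript, intent_logits, intent_label,
               ctc_weight=0.5):
    """Hypothetical joint objective: SLU loss plus a weighted CTC loss."""
    slu = cross_entropy(intent_logits, intent_label)
    ctc = ctc_loss(frame_probs, transcript)
    return slu + ctc_weight * ctc


# Toy usage: 2 frames, vocabulary {blank, token 1}, 2 intent classes.
frames = [[0.5, 0.5], [0.5, 0.5]]
loss = joint_loss(frames, transcript=[1], intent_logits=[0.0, 0.0],
                  intent_label=0)
```

In the toy call, three of the four possible two-frame alignments collapse to the transcript `[1]`, so the CTC term is `-log(0.75)`. Because CTC scores every frame independently given the encoder output, training and inference avoid the step-by-step decoding that the abstract identifies as the expensive part of sequence-to-sequence ASR.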
