On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding
This work addresses spoken language understanding and may benefit multilingual speech applications, though it appears incremental because it builds on existing models.
The paper studies semantically-aligned speech representations for spoken language understanding (SLU), showing that the SAMU-XLSR model significantly outperforms the baseline XLS-R model in end-to-end SLU and improves language portability.
In this paper, we examine the use of semantically-aligned speech representations for end-to-end spoken language understanding (SLU). We employ the recently introduced SAMU-XLSR model, which is designed to generate a single utterance-level embedding that captures the semantics of the utterance and is semantically aligned across languages. This model combines the acoustic frame-level speech representation learning model (XLS-R) with the Language Agnostic BERT Sentence Embedding (LaBSE) model. We show that using the SAMU-XLSR model instead of the initial XLS-R model significantly improves performance in end-to-end SLU. Finally, we present the benefits of using this model for language portability in SLU.
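As a rough illustration of what aligning an utterance-level speech embedding with a text sentence embedding can look like, the sketch below pools frame-level vectors into a single vector and scores it against a sentence embedding with cosine distance. The mean pooling, the cosine-distance objective, the function names, and the toy vectors are all assumptions made for illustration. The abstract does not state how SAMU-XLSR pools frames or which loss it uses, and no real XLS-R or LaBSE model is loaded here.

```python
import math
import random


def mean_pool(frames):
    """Collapse frame-level vectors into one utterance-level vector.

    Assumption: plain averaging; the actual pooling used by the model
    is not described in the abstract.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(frames, text_embedding):
    """Cosine distance between the pooled speech vector and a text embedding.

    Assumption: both vectors already share the same dimensionality; a real
    system would typically need a projection layer to guarantee this.
    """
    return 1.0 - cosine_similarity(mean_pool(frames), text_embedding)


# Toy stand-ins: random vectors in place of frame-level acoustic features,
# and a fixed vector in place of a sentence embedding of the transcript.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
text_embedding = [1.0, 0.0, 0.0, 0.0]

loss = alignment_loss(frames, text_embedding)
print(f"alignment loss: {loss:.3f}")
```

Under these assumptions, driving the loss toward zero pulls the utterance-level speech vector toward the sentence embedding of its transcript. If the text embedding space is shared across languages, the speech vectors inherit that cross-lingual alignment, which is the property the abstract relies on for language portability.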