CL LG SD AS
Nov 15, 2022

Introducing Semantics into Speech Encoders

Meta AI, MIT
arXiv:2211.08402v1223 citations, h-index: 52
Originality: Incremental advance
AI Analysis

This work addresses the need for semantic speech understanding without costly labeled audio transcriptions, offering an incremental improvement over existing self-supervised speech encoders while performing comparably to supervised methods.

The paper tackles the problem of self-supervised speech encoders lacking semantic information by proposing an unsupervised method to incorporate semantics from large language models, improving intent classification by over 10% and spoken question answering by over 2%.

Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined systems that feed supervised automatic speech recognition (ASR) output into a large language model (LLM) achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and improve spoken question answering F1 score by over 2%. Our unsupervised approach achieves performance similar to that of supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.
