CL, SD, AS · Jun 13, 2024

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

arXiv:2406.09345v1 · 11 citations
Originality: Incremental advance
AI Analysis

This work addresses spoken language understanding tasks such as spoken question answering. It is incremental: it builds on existing methods for combining speech input with LLMs, replacing continuous speech encoder outputs with discrete units.

The paper tackles the integration of speech input with large language models for spoken language understanding. It proposes discrete speech units in place of continuous encoder outputs, and reports robust performance on seen and unseen domains along with instruction-following capability in spoken question answering.

The integration of pre-trained text-based large language models (LLMs) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel-frequency cepstral coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
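The k-means step mentioned in the abstract can be illustrated with a minimal sketch. It assumes toy 2-D vectors stand in for the frame-level features of a self-supervised speech encoder, and it uses a plain-Python k-means; the function names, feature dimension, and cluster count are hypothetical and not taken from the paper.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(frame, centroids):
    """Index of the centroid closest to a frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))

def kmeans(frames, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in frames:
            clusters[nearest(f, centroids)].append(f)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[idx] = [sum(m[d] for m in members) / len(members)
                                  for d in range(dim)]
    return centroids

def to_discrete_units(frames, centroids):
    """Map each frame to the ID of its nearest centroid (its discrete unit)."""
    return [nearest(f, centroids) for f in frames]

# Toy stand-in for encoder outputs: 60 frames of 2-D features (assumption).
rng = random.Random(1)
frames = [[c + rng.gauss(0, 0.05), c + rng.gauss(0, 0.05)]
          for c in (0.0, 5.0, 10.0) for _ in range(20)]

centroids = kmeans(frames, k=3)
units = to_discrete_units(frames, centroids)
print(units[:10])
```

Each frame becomes an integer cluster ID. In the setup the abstract describes, a sequence of such IDs would then be passed through the speech adapter to reach the LLM token embedding space; that step is not shown here.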

