AS AI CL

Jun 8, 2023

Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

DeepMind

arXiv:2306.07944v18.610 citations

h-index: 42

Originality: Incremental advance

AI Analysis

This paper addresses speech understanding challenges in dialog systems, offering incremental improvements in dialog state tracking accuracy and ASR word error rate.

The paper tackles the performance drop in applying Large Language Models to speech by proposing a joint speech and language model with a Speech2Text adapter and retrieval augmentation, improving dialog state tracking accuracy from 24.7% to 34.6% and reducing ASR word error rate from 9.4% to 8.5%.

Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank filtering, we can reduce the speech sequence length to that of text. On the speech MultiWoz dataset (DSTC11 challenge), SLM substantially improves dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
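The blank-filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each speech frame comes with a vector of CTC scores, that the blank symbol has index 0, and that a frame is dropped when blank is its top-scoring label. The names `filter_blank_frames` and `BLANK_ID` are hypothetical.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

BLANK_ID = 0  # assumption: index of the CTC blank symbol in the label set


def filter_blank_frames(
    frames: Sequence[Frame],
    ctc_scores: Sequence[Sequence[float]],
    blank_id: int = BLANK_ID,
) -> List[Frame]:
    """Keep only the frames whose top-scoring CTC label is not blank."""
    if len(frames) != len(ctc_scores):
        raise ValueError("frames and ctc_scores must have the same length")

    kept: List[Frame] = []
    for frame, scores in zip(frames, ctc_scores):
        # Greedy (argmax) CTC decision for this single frame.
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted != blank_id:
            kept.append(frame)
    return kept


# Four frames, three labels (label 0 is blank); frames 0 and 2 are blank.
frames = ["f0", "f1", "f2", "f3"]
ctc_scores = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
filtered = filter_blank_frames(frames, ctc_scores)
```

Because CTC models typically assign most frames to blank, dropping those frames shortens the speech sequence toward the length of the corresponding text, which is the effect the abstract describes.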
