CL, SD · Feb 6

Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

Ju Lin, Jing Pan, Ruizhi Li, Ming Sun, Yuzong Liu, Alaa Hassan, Jing Zheng, Florian Metze

arXiv:2602.07211v1 · 0.6 · h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses the problem of directional speech understanding in noisy, multi-talker environments for smart glasses users, representing an incremental advancement in speech LLMs.

The paper tackles the challenge of enabling large language models (LLMs) to understand multi-talker and multi-channel speech, specifically for smart glasses, by proposing two novel integration approaches. It achieves strong performance in speech recognition and translation tasks.

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to apply them directly to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically for the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.
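
To make the second approach more concrete, here is a minimal sketch of how a serialized output training target is commonly built: the transcripts of overlapping talkers are ordered by start time and joined into one sequence with a speaker-change token. The abstract does not describe the paper's actual target format, so the `<sc>` token, the `Utterance` fields, and the speaker labels are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

# Hypothetical separator token; the abstract does not name the one used in the paper.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    speaker: str   # illustrative label, e.g. "wearer" or "partner"
    start: float   # start time in seconds
    text: str      # reference transcript


def serialize_target(utterances, sep=SPEAKER_CHANGE):
    """Order utterances by start time and join them with a separator token."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


example = [
    Utterance("partner", 1.2, "where is the station"),
    Utterance("wearer", 0.0, "excuse me"),
]
serialized = serialize_target(example)
print(serialized)
```

A single decoder trained on targets like this can emit all talkers in one pass, which is what distinguishes the end-to-end system from the cascaded one, where a separation front-end would first produce one stream per talker.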
