CL · Sep 25, 2024

Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu

arXiv:2409.17353v32.74 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the need for more natural and efficient real-time audio interaction in conversational AI, though it appears incremental because it builds on existing ASR-to-TTS pipelines.

The paper targets two problems in speech-to-speech conversational LLMs: latency and the loss of audio features. It proposes implicitly internalizing the ASR chain of thought into the speech LLM, which reduces latency and improves native speech understanding for real-time audio interaction.

Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
