CL LG SD AS

Sep 23, 2024

Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents

Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, Shyamnath Gollakota

arXiv:2409.15594v1

21.868 citations

h-index: 49

Originality: Highly original

AI Analysis

This work addresses the challenge of making LLMs more human-like in real-time spoken interaction. The contribution is incremental in that it builds on existing LLM capabilities by adding time synchronization.

The paper tackles the problem of enabling large language models to engage in full-duplex spoken dialogue, which permits synchronous behaviors such as overlapping speech and backchanneling that turn-based methods cannot express. The resulting model outperforms the state of the art in dialogue meaningfulness while maintaining naturalness, using both synthetic and real-world spoken dialogue data.

Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex", allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony, as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that it runs synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform the state of the art in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms. Webpage: https://syncllm.cs.washington.edu/.
