CL · AI · SD · AS · Jun 1, 2025

NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

arXiv:2506.00975v4 · 18 citations · h-index: 12 · ICML
Originality: Highly original
AI Analysis

This work addresses the challenge of fluid spoken dialogue for real-time applications, proposing a novel method for a known bottleneck in speech language modeling.

The paper tackles the problem of enabling speech language models to engage in natural spoken interactions by exploiting dual-channel speech data. It introduces a generative modeling paradigm called Next-Token-Pair Prediction (NTPP). The authors report that NTPP significantly improves conversational abilities in terms of turn-taking prediction, response coherence, and naturalness, while achieving substantially lower inference latency than existing methods.

Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
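
The abstract names the paradigm but does not spell out its mechanics, so the following is only an illustrative sketch of one plausible reading of "next-token-pair prediction": the two channels are treated as time-aligned discrete token streams, zipped into one sequence of pairs, and a decoder would be trained to predict the pair at step t from all earlier pairs. The function names, the token IDs, and the use of 0 for silence are hypothetical, and the paper's actual tokenization and objective may differ.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def to_token_pairs(channel_a: List[int], channel_b: List[int]) -> List[Pair]:
    """Zip two time-aligned token streams into one sequence of pairs."""
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must be time-aligned and equally long")
    return list(zip(channel_a, channel_b))


def next_pair_examples(pairs: List[Pair]) -> List[Tuple[List[Pair], Pair]]:
    """Build (context, target) examples: predict pair t from the pairs before t."""
    return [(pairs[:t], pairs[t]) for t in range(1, len(pairs))]


# Hypothetical token IDs; 0 stands for silence on that channel (an assumption).
speaker_a = [5, 7, 0, 0]
speaker_b = [0, 0, 3, 9]

pairs = to_token_pairs(speaker_a, speaker_b)
examples = next_pair_examples(pairs)
```

Under this reading, both speakers' tokens for a time step are targets of the same prediction, which is what would let a single decoder-only model represent overlap and turn-taking rather than strictly alternating turns.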

