CL · SD · AS · May 21, 2025

SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

NVIDIA
arXiv:2505.15670v4 · 21 citations · h-index: 18
Originality: Highly original
AI Analysis

This work addresses the lack of real-time adaptability in spoken dialogue systems for human-computer interaction. It is assessed as a novel method rather than an incremental improvement.

The authors address real-time adaptability in speech-to-speech language models by proposing a duplex architecture that directly models simultaneous user and agent streams. The reported results are improved reasoning, turn-taking, and barge-in abilities, with the bitrate halved to 0.6 kbps relative to previous works.

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that does not require speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (to 0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
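
To make the 0.6 kbps figure concrete, the sketch below computes the bitrate of a discrete audio codec as frame rate × number of codebooks × bits per code. The abstract states only the resulting bitrate and that it is half that of previous works; the frame rate, codebook count, and codebook size used here are hypothetical values chosen so the arithmetic lands on 0.6 kbps, not the paper's reported configuration.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bitrate of a discrete codec: each frame emits one code per codebook,
    and each code costs log2(codebook_size) bits."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# Hypothetical agent codec settings (assumed for illustration only).
agent_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=4, codebook_size=4096)

# Hypothetical baseline with twice as many codebooks per frame.
baseline_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, codebook_size=4096)

print(f"agent:    {agent_bps / 1000:.1f} kbps")
print(f"baseline: {baseline_bps / 1000:.1f} kbps")
```

Under these assumed settings, halving the number of codebooks per frame halves the bitrate. The same reduction could equally come from a lower frame rate or smaller codebooks.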

