CL · Jun 11, 2025

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma

arXiv:2506.09349v36.72 citations · h-index: 10 · Has Code

Originality: Highly original

AI Analysis

This work addresses efficient and effective speech-text integration in voice conversation models. Its approach reduces computational overhead while maintaining high performance, which matters for applications in speech synthesis and human-computer interaction.

The paper tackles the problem of end-to-end speech generation with large language models by introducing DrVoice, a parallel speech-text voice conversation model that uses dual-resolution speech representations to reduce computational cost and improve modality alignment. The model achieves new state-of-the-art results on OpenAudioBench and Big Bench Audio benchmarks and competitive performance on others, establishing it as a leading open-source speech foundation model in the ~7B parameter range.

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches fall primarily into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM's autoregressive process, so that text generation is unaware of the concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5 Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5 Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens, which in turn better exploits the LLM's capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on the OpenAudioBench and Big Bench Audio benchmarks while achieving performance comparable to the SOTA on the VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model among ~7B models.
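To give a rough sense of what lowering the LLM input rate from 12.5 Hz to 5 Hz means in sequence length, the following back-of-the-envelope sketch counts speech positions for a clip at both rates. The 10-second clip length and the function names are hypothetical, and the quadratic cost figure assumes standard full self-attention over the speech positions only; it is not a measurement from the paper.

```python
def token_count(duration_s: float, rate_hz: float) -> int:
    """Number of speech positions the LLM sees for a clip at a given frame rate."""
    return round(duration_s * rate_hz)


def attention_pairs(n_tokens: int) -> int:
    """Pairwise interactions in full self-attention over n_tokens positions (n^2)."""
    return n_tokens * n_tokens


duration_s = 10.0  # hypothetical clip length, chosen only for illustration

baseline = token_count(duration_s, 12.5)  # rate the abstract attributes to current methods
reduced = token_count(duration_s, 5.0)    # rate the abstract reports for the LLM input

# Sequence length shrinks linearly with the frame rate ...
length_ratio = baseline / reduced
# ... while the assumed full self-attention cost shrinks quadratically.
attention_ratio = attention_pairs(baseline) / attention_pairs(reduced)

print(baseline, reduced, length_ratio, attention_ratio)
# prints: 125 50 2.5 6.25
```

Under these assumptions the speech sequence is 2.5 times shorter, and the attention cost over those positions is about 6.25 times smaller. Actual savings depend on the full model and on the text tokens sharing the context.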
