AS CL SD Nov 15, 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath

arXiv:2511.12347v16.66 citations h-index: 16 EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, unified tools in multilingual speech applications, though the advance is incremental because it builds on existing autoregressive and language-model approaches.

The paper tackles the problem of multilingual speech synthesis and editing by introducing VoiceCraft-X, a unified autoregressive model that achieves high-quality, natural-sounding speech across 11 languages, even with limited per-language data.

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
