SD · AI · CL · Mar 18

MOSS-TTS Technical Report

arXiv:2603.18090 · 99.23 citations · h-index: 14
AI Analysis

This work addresses the need for efficient and controllable text-to-speech systems, though it appears incremental as it builds on existing tokenization and transformer methods.

The authors tackled the problem of scalable speech generation by introducing MOSS-TTS, a foundation model using discrete audio tokens and autoregressive modeling, which supports zero-shot voice cloning, token-level duration control, and stable long-form generation across multilingual settings.

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
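The tokenizer figures quoted in the abstract (24 kHz audio, 12.5 frames per second) imply some simple arithmetic, sketched below. Only the sample rate and frame rate come from the report. The function names, and the codebook count and codebook size used in the bitrate example, are hypothetical values chosen for illustration and are not the released configuration.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the abstract
FRAME_RATE_HZ = 12.5     # from the abstract


def samples_per_frame() -> float:
    """Number of raw audio samples summarized by one token frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_HZ


def frames_for_duration(seconds: float) -> int:
    """Token frames needed to cover a clip of the given length (rounded up)."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def bitrate_bps(num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of an RVQ stream.

    Assumption: each RVQ layer emits one index per frame, costing
    log2(codebook_size) bits. Using fewer layers lowers the bitrate,
    which is one common way a variable-bitrate RVQ can be realized.
    """
    return FRAME_RATE_HZ * num_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    print(samples_per_frame())       # samples per frame
    print(frames_for_duration(60))   # frames in one minute of audio
    print(bitrate_bps(8, 1024))      # hypothetical 8 layers x 1024 entries
```

At this frame rate, one minute of audio is only a few hundred frames, so an autoregressive model processes a short sequence for a long clip. A low frame rate is also what makes specifying a target duration as a frame count practical.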

