AS CL SD Dec 22, 2024

KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, Lei Xie

arXiv:2412.16846v24.34 citations, h-index: 14

Originality: Highly original

AI Analysis

This work addresses speech synthesis for applications that need high-quality, adaptable output, and it presents a novel method rather than an incremental improvement.

The paper tackles text-to-speech synthesis by introducing KALL-E, an autoregressive model that predicts continuous speech distributions from text. It eliminates diffusion components and reports superior synthesis quality, along with adaptation to a target speaker from a single sample.

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback-Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous speech representations in TTS.
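The Kullback-Leibler objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the target latent distribution for each frame and the model's predicted distribution are diagonal Gaussians parameterized by a mean and a log-variance; the abstract does not state the distribution family or the direction of the divergence. The function names `gaussian_kl` and `sequence_kl_loss` are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Assumption: each distribution is given as per-dimension means and
    log-variances. The paper may parameterize its distributions differently.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL divergence for one pair of univariate Gaussians.
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sequence_kl_loss(targets, predictions):
    """Average per-frame KL(target || prediction) over a sequence.

    targets, predictions: lists of (means, log_variances) pairs, one per frame.
    The choice of KL direction here is an assumption made for this sketch.
    """
    kls = [
        gaussian_kl(t_mu, t_logvar, p_mu, p_logvar)
        for (t_mu, t_logvar), (p_mu, p_logvar) in zip(targets, predictions)
    ]
    return sum(kls) / len(kls)


# One frame, two latent dimensions, unit variances (log-variance 0).
targets = [([0.0, 0.0], [0.0, 0.0])]
predictions = [([1.0, 0.0], [0.0, 0.0])]
loss = sequence_kl_loss(targets, predictions)
```

In this toy case the prediction is off by one unit in a single dimension, so the loss evaluates to 0.5; it drops to 0 when the predicted distribution matches the target exactly. In an autoregressive setup, the prediction for each frame would be conditioned on the text and on earlier frames.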
