DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors
This work addresses scalability and accessibility issues in TTS for researchers and developers by removing domain-specific dependencies such as phonemes and durations. The contribution is incremental in that it builds on existing diffusion and transformer methods.
The paper tackles the reliance of text-to-speech (TTS) models on domain-specific factors such as phonemes and durations, which limits scalability. It introduces DiTTo-TTS, a Diffusion Transformer-based model that, scaled to 82K hours of training data and 790M parameters, achieves zero-shot performance superior or comparable to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity without these factors.
Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot performance to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, all without relying on domain-specific factors. Speech samples are available at https://ditto-tts.github.io.
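Finding (2) contrasts variable-length with fixed-length modeling. The abstract does not say how the predicted speech length is used, so the following is only a minimal sketch under one assumption: each utterance's length (in latent frames) defines a mask, and the training loss is averaged over real frames only, ignoring padding. The helper names `length_mask` and `masked_mse` are hypothetical and are not taken from the paper.

```python
from typing import List


def length_mask(lengths: List[int], max_len: int) -> List[List[int]]:
    """Build a 0/1 mask per utterance: 1 for real latent frames, 0 for padding.

    `lengths` stands in for per-utterance speech lengths (for example, the
    output of a length predictor); this usage is an assumption.
    """
    return [[1 if t < n else 0 for t in range(max_len)] for n in lengths]


def masked_mse(
    pred: List[List[float]],
    target: List[List[float]],
    mask: List[List[int]],
) -> float:
    """Mean squared error averaged over unmasked (real) frames only."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    # Guard against an all-padding batch.
    return total / count if count else 0.0


# Two utterances of 2 and 3 frames, padded to a common length of 3.
mask = length_mask([2, 3], max_len=3)
loss = masked_mse(
    pred=[[1.0, 2.0, 9.0], [0.0, 0.0, 0.0]],
    target=[[1.0, 4.0, 0.0], [0.0, 0.0, 0.0]],
    mask=mask,
)
```

In this sketch the padded third frame of the first utterance does not contribute to the loss, whereas a fixed-length setup without a mask would also penalize the model on padding.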