Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS
This work addresses prosody modeling for TTS systems, but its contribution is incremental: it builds on existing methods such as diffusion models and contrastive learning.
The paper tackles prosody modeling in diffusion-based text-to-speech (TTS) by investigating multi-stage pretraining with masked language modeling and cross-modal contrastive learning. It finds that a two-stage curriculum achieves the best synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures, whereas same-phoneme refinement degrades synthesis despite improving prosodic retrieval.
We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling (MLM) followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
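To make the SigLIP-style objective concrete, the following is a minimal sketch of a pairwise sigmoid contrastive loss over a batch of paired text and audio embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature and bias values (which are learnable parameters in SigLIP), and the toy embeddings are all hypothetical, and speaker conditioning and mixed-phoneme batch construction are not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(text_emb, audio_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all text-audio pairs in a batch.

    text_emb[i] and audio_emb[i] are assumed to be a matched pair; every
    other combination (i != j) is treated as a negative. The temperature
    and bias are fixed here for illustration (assumed values).
    """
    text = [l2_normalize(v) for v in text_emb]
    audio = [l2_normalize(v) for v in audio_emb]
    n = len(text)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(text[i], audio[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch: two pairs whose text and audio embeddings are perfectly aligned.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
audio_batch = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = sigmoid_contrastive_loss(text_batch, audio_batch)
# Reversing the audio order mismatches every pair, so the loss should rise.
shuffled_loss = sigmoid_contrastive_loss(text_batch, audio_batch[::-1])
```

Unlike a softmax-based contrastive loss, each text-audio pair here is scored as an independent binary decision. Under that formulation the composition of the batch determines which negatives the encoder sees: mixed-phoneme batches supply negatives that differ in phonetic content, while same-phoneme batches leave prosody as the main distinguishing cue.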