CL, Apr 30

JaiTTS: A Thai Voice Cloning Model

arXiv:2604.27607
AI Analysis

This work provides a high-quality Thai voice cloning model that handles code-switching and numerals without normalization, benefiting Thai speech applications.

JaiTTS-v1.0 achieves state-of-the-art Thai voice cloning with a CER of 1.94%, surpassing human ground truth (1.98%) on short-duration tasks, and wins 283 of 400 pairwise comparisons against commercial systems.

We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is common in realistic settings, without explicit text normalization. We evaluate the model on both short-duration and long-duration speech generation, which together reflect many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art character error rate (CER) of 1.94%, surpassing the human ground truth of 1.98% on short-duration tasks while performing on par with human ground truth on long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses.
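For readers unfamiliar with the CER metric quoted above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. The abstract does not describe the paper's actual evaluation pipeline, so the function names `edit_distance` and `cer` are illustrative, and counting Unicode code points (rather than Thai grapheme clusters) is an assumption made here for simplicity.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative usage: one wrong character out of four.
print(cer("abcd", "abxd"))
```

Under this definition a CER of 1.94% corresponds to roughly two character-level errors per hundred reference characters. In TTS evaluation the hypothesis is typically an automatic transcription of the synthesized audio, though whether the paper follows that setup is not stated in the abstract.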
