CL | Sep 6, 2025

LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization

arXiv:2509.05863v1 | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses speaker identity preservation in multilingual TTS for applications such as speech-to-speech translation. The advance is incremental, offering improvements over existing methods.

The paper tackles speaker identity preservation in multilingual text-to-speech for speech-to-speech translation. It introduces LatinX, a model aligned with Direct Preference Optimization (DPO), which reduces Word Error Rate and improves objective speaker similarity over the fine-tuned baseline. Human evaluations show stronger perceived speaker similarity than the XTTSv2 baseline.

We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker's identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.
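The third training stage can be illustrated with a minimal sketch: rank synthesized candidates by WER and speaker similarity to form a (chosen, rejected) pair, then score the pair with the standard DPO objective. The abstract does not say how the two metrics are combined, so the weighted sum, the `min_margin` threshold, the `Candidate` fields, and all default values below are assumptions made for illustration, not the authors' procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    audio_id: str  # hypothetical identifier for one synthesized sample
    wer: float     # word error rate of an ASR transcript (lower is better)
    sim: float     # speaker similarity to the reference voice (higher is better)


def make_pair(cands, wer_weight=1.0, sim_weight=1.0, min_margin=0.05):
    """Return (chosen, rejected), or None if the candidates cannot be ranked.

    Assumption: the two metrics are merged into one score by a weighted sum.
    """
    if len(cands) < 2:
        return None

    def score(c):
        return sim_weight * c.sim - wer_weight * c.wer

    ranked = sorted(cands, key=score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Skip pairs whose scores are too close to give a clear preference.
    if score(best) - score(worst) < min_margin:
        return None
    return best, worst


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (pol_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    samples = [
        Candidate("a", wer=0.10, sim=0.80),
        Candidate("b", wer=0.30, sim=0.60),
    ]
    pair = make_pair(samples)
    if pair is not None:
        chosen, rejected = pair
        print(chosen.audio_id, rejected.audio_id)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the loss falls below log 2 once the policy assigns relatively more probability to the chosen sample than the reference model does, which is the direction in which DPO pushes the model.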
