CL, SD | Jun 5

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.072405.1
Originality Synthesis-oriented
AI Analysis

This work addresses the challenge of preserving speaker identity and intelligibility in cross-lingual voice cloning for speech translation, but the gains are incremental and task-specific.

The authors tackle cross-lingual voice cloning by enhancing a multilingual TTS model with language tag prompting, RL fine-tuning, and lexical matching, achieving improved intelligibility and pronunciation of domain-specific terms.

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
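As a rough illustration of two ideas named in the abstract, the sketch below prefixes the target text with a language tag and finds terms that appear in both the target text and the reference transcript. The abstract does not specify the tag format, the matching rule, or how matches are fed to the model, so everything here is an assumption: `add_language_tag`, `lexical_overlap`, the `<|de|>` tag syntax, and the `min_len` threshold are hypothetical names and choices, not the authors' implementation or the FishAudio-S2-Pro API.

```python
import re


def add_language_tag(text: str, lang: str) -> str:
    """Prefix the text with a language tag (hypothetical tag format)."""
    return f"<|{lang}|> {text}"


def lexical_overlap(target_text: str, reference_transcript: str,
                    min_len: int = 4) -> list[str]:
    """Return words shared by both texts, case-insensitive, in target order.

    Assumption: a simple exact word match stands in for the paper's
    reference-conditioned lexical matching, whose details are not given.
    """
    def tokenize(s: str) -> list[str]:
        return re.findall(r"\w+", s.lower())

    ref_words = set(tokenize(reference_transcript))
    seen: set[str] = set()
    matches: list[str] = []
    for word in tokenize(target_text):
        # Skip short words, which are mostly function words.
        if len(word) >= min_len and word in ref_words and word not in seen:
            seen.add(word)
            matches.append(word)
    return matches


if __name__ == "__main__":
    target = "Wir nutzen Transformer und Kubernetes"
    reference = "We use a Transformer on Kubernetes clusters"
    print(add_language_tag(target, "de"))
    print(lexical_overlap(target, reference))
```

Under these assumptions, a sentence with an empty overlap list would fall outside the "matched subsets" the abstract mentions, which is where the lexical matching gains are reported.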
