CL · SD · Jun 5

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.072405.1

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of preserving speaker identity and intelligibility in cross-lingual voice cloning for speech translation, but the gains are incremental and task-specific.

The authors tackle cross-lingual voice cloning by enhancing a multilingual TTS model with language tag prompting, RL fine-tuning, and lexical matching, achieving improved intelligibility and pronunciation of domain-specific terms.

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
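To make two of the ideas in the abstract concrete, here is a minimal sketch of language tag prompting and of detecting lexical overlap between a reference transcript and the target text. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the tag format or the matching procedure, so the `<|lang|>` prefix, the helper names `add_language_tag` and `lexical_overlap`, and the token-level matching rule are all hypothetical.

```python
import re


def add_language_tag(text: str, lang: str) -> str:
    """Prefix the target text with a language tag.

    The "<|lang|>" format is an assumption for illustration only;
    the actual tag format used with the TTS model is not given in the abstract.
    """
    return f"<|{lang}|> {text}"


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lexical_overlap(reference_transcript: str, target_text: str) -> list[str]:
    """Return target tokens that also occur in the reference transcript.

    Tokens are returned in target order without duplicates. A non-empty
    result marks a "matched" case, where the reference audio already
    contains a pronunciation of a term that appears in the target text.
    """
    reference_tokens = set(tokenize(reference_transcript))
    seen: set[str] = set()
    shared: list[str] = []
    for token in tokenize(target_text):
        if token in reference_tokens and token not in seen:
            seen.add(token)
            shared.append(token)
    return shared


if __name__ == "__main__":
    reference = "We fine-tune the Transformer with LoRA"
    target = "Wir trainieren den Transformer mit LoRA"
    print(add_language_tag(target, "de"))
    print(lexical_overlap(reference, target))
```

In this sketch, domain-specific terms that survive translation unchanged (here "Transformer" and "LoRA") are the ones that show up as overlap, which is the situation the abstract describes as "when lexical overlap is present".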
