AS CL Apr 28

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

arXiv:2604.26136
AI Analysis

For researchers and practitioners in spoken language technology, this work provides a practical method to improve cross-lingual voice cloning in specialized domains, though it is an incremental application of existing techniques.

The paper tackles cross-lingual voice cloning for scientific speech in Arabic, Chinese, and French. Fine-tuning on synthetic data, generated via multi-model ensemble distillation from the ACL 60/60 corpus, yields consistent improvements in intelligibility (WER and CER) while preserving speaker similarity.

Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the Cross-Lingual Voice Cloning shared task at the International Conference on Spoken Language Translation (IWSLT 2026). First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model, employing data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of fine-tuning on this synthetic data and demonstrate consistent improvements in intelligibility, measured by word error rate (WER) and character error rate (CER), across languages while preserving speaker similarity.
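The intelligibility metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of WER and CER as Levenshtein edit distance divided by reference length, computed on a transcript of the synthesized audio. The function names are hypothetical, and the paper's actual scoring pipeline (ASR model, text normalization, tokenization) is not described here, so this is not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; whitespace is dropped (an assumption of this sketch)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical reference text and ASR transcript of the cloned speech.
    print(wer("the model clones the voice", "the model clone the voice"))
    print(cer("语音克隆", "语音克龙"))
```

CER is the more common choice for Chinese, where text is not whitespace-delimited into words, while WER is typically reported for languages such as French.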

