CL · SD · AS · Jun 24, 2024

Towards Zero-Shot Text-To-Speech for Arabic Dialects

arXiv:2406.16751v3 · 30 citations · Has Code
AI Analysis

This paper addresses the gap in zero-shot TTS for Arabic, a language with over 450 million speakers, but the contribution is incremental because it builds on existing methods.

The paper tackled zero-shot text-to-speech for Arabic dialects by adapting datasets and fine-tuning an open-source model, achieving convincing performance in generating dialectal speech as shown by evaluations on 31 unseen speakers.

Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, but they still lag behind for other languages because of insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune XTTS, an open-source architecture (https://docs.coqui.ai/en/latest/models/xtts.html, https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f, https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc). We then evaluate our models on a dataset comprising 31 unseen speakers and on an in-house dialectal dataset. Our automated and human evaluation results show convincing performance, and the models are capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.
