CL
Dec 13, 2025

F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

arXiv:2512.12297v1
2.7
Has Code

Originality Synthesis-oriented

AI Analysis

It enables Romanian TTS with voice cloning for users, but is incremental as it adapts an existing model.

This work extended the F5-TTS model to support Romanian text-to-speech using a lightweight input adapter, achieving voice cloning and code-switching capabilities but with residual English accents.

This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model, and train it as an extension of the textual embedding matrix of the text encoder. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce natural-sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness, and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at https://github.com/racai-ro/Ro-F5TTS.
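The idea of extending a frozen embedding matrix with trainable rows for new characters can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the class `ExtendedEmbedding`, the constants `BASE_VOCAB` and `ROMANIAN_CHARS`, the random stand-in for pretrained rows, and the toy `sgd_step` are all hypothetical, and the ConvNeXt sub-network mentioned in the abstract is omitted entirely.

```python
import random

# Assumed toy vocabularies; the real F5-TTS vocabulary is much larger.
BASE_VOCAB = "abcdefghijklmnopqrstuvwxyz "
ROMANIAN_CHARS = "ăâîșț"


class ExtendedEmbedding:
    """Frozen base embedding table plus trainable rows for new characters."""

    def __init__(self, base_vocab, new_chars, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.base_index = {ch: i for i, ch in enumerate(base_vocab)}
        self.new_index = {ch: i for i, ch in enumerate(new_chars)}
        # Stand-in for pretrained rows; tuples make them immutable (frozen).
        self.base_rows = tuple(
            tuple(rng.uniform(-1, 1) for _ in range(dim)) for _ in base_vocab
        )
        # Trainable extension rows, initialised to zero in this sketch.
        self.new_rows = [[0.0] * dim for _ in new_chars]

    def lookup(self, ch):
        """Return a copy of the embedding row for one character."""
        if ch in self.new_index:
            return list(self.new_rows[self.new_index[ch]])
        return list(self.base_rows[self.base_index[ch]])

    def embed(self, text):
        """Embed a string character by character."""
        return [self.lookup(ch) for ch in text]

    def sgd_step(self, text, grads, lr=0.1):
        """Update only the extension rows; base rows are never touched."""
        for ch, grad in zip(text, grads):
            if ch in self.new_index:
                row = self.new_rows[self.new_index[ch]]
                for k in range(self.dim):
                    row[k] -= lr * grad[k]
```

In this sketch, a gradient step over a Romanian word changes only the rows for characters such as "ț" and "ă", while the rows shared with the original vocabulary stay exactly as they were, which mirrors the stated goal of preserving the model's existing capabilities.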
