Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
This work provides a method for fine-grained and compositional accent control in multilingual TTS, which is significant for the large population of non-native English speakers and for broader applications in multicultural communication.
This paper addresses the lack of accented English in Text-To-Speech (TTS) systems by proposing "Accent Vector," a controllable representation for accent manipulation in multilingual TTS. It achieves this without requiring accented training data, instead deriving accent characteristics by fine-tuning a TTS system on native speech of a different language and applying these as task vectors to English speech.
Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due to limited accented data. We propose \textit{Accent Vector}, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. \textit{Accent Vector} is derived by fine-tuning a TTS system on native speech of a different language (i.e., non-English) and computing task vectors that capture accent characteristics (i.e., in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, it generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
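The scaling and interpolation described above can be illustrated with a minimal sketch. It assumes the common task-arithmetic formulation, in which a task vector is the element-wise difference between fine-tuned and pretrained weights and is added back to the pretrained weights with a scalar coefficient; the abstract does not spell out the exact formulation, so this is an assumption. The function names (`task_vector`, `apply_vectors`), the parameter name `layer.weight`, and the toy weight values are all hypothetical and stand in for a real TTS model's parameters.

```python
from typing import Dict, List

# Assumption: model weights are represented as parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Element-wise difference (finetuned - pretrained) for every parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_vectors(
    pretrained: Weights, vectors: List[Weights], scales: List[float]
) -> Weights:
    """Return pretrained + sum_i scales[i] * vectors[i]."""
    if len(vectors) != len(scales):
        raise ValueError("need exactly one scale per vector")
    # Copy so the pretrained weights are left untouched.
    result = {name: list(values) for name, values in pretrained.items()}
    for vec, scale in zip(vectors, scales):
        for name, delta in vec.items():
            result[name] = [w + scale * d for w, d in zip(result[name], delta)]
    return result


# Toy weights (hypothetical values, not taken from the paper).
base = {"layer.weight": [1.0, 2.0, 3.0]}
ft_lang_a = {"layer.weight": [1.5, 2.0, 2.0]}  # fine-tuned on language A
ft_lang_b = {"layer.weight": [1.0, 4.0, 3.0]}  # fine-tuned on language B

vec_a = task_vector(base, ft_lang_a)
vec_b = task_vector(base, ft_lang_b)

# Scaling one vector controls accent strength.
half_strength = apply_vectors(base, [vec_a], [0.5])
# Combining several scaled vectors corresponds to a mixed accent.
mixed = apply_vectors(base, [vec_a, vec_b], [0.5, 0.5])
```

Under this assumed formulation, a scale of 0 recovers the pretrained model and a scale of 1 recovers the fine-tuned one, so intermediate values give a continuous strength control, and mixing is a weighted sum of several vectors.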