Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS
This work addresses the problem of non-native speech that is unclear to listeners, though it appears incremental, since it combines existing techniques such as VITS with synthetic ground-truth audio.
The paper tackles the problem of non-native speech having both accent and pronunciation issues by developing an accent conversion approach that improves pronunciation using synthetic ground-truth audio from native TTS. The system produces audio closely resembling native accents while retaining speaker identity and improving pronunciation, as shown in evaluation results.
Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only converts the accent but also improves the pronunciation of non-native accented speakers. Given the non-native audio and the corresponding transcript, we generate ideal ground-truth audio that has native-like pronunciation while keeping the original duration and prosody. This ground-truth data helps the model learn a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity but also improves pronunciation, as demonstrated by the evaluation results.
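The abstract does not say how the synthetic ground truth is made to keep the original duration. As one possible illustration only, the sketch below assumes that word-level alignments are available for both the non-native recording and the native TTS output, and computes the per-word time-scale factor that would bring the TTS timing in line with the source. `WordSpan`, `stretch_factors`, and the alignment values are hypothetical names and data introduced here; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordSpan:
    """A word with start/end times in seconds (hypothetical alignment record)."""
    word: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def stretch_factors(source: List[WordSpan], tts: List[WordSpan]) -> List[float]:
    """Return per-word factors by which TTS word durations would be scaled
    to match the durations in the non-native source recording.

    Assumption: both alignments cover the same word sequence, in order.
    """
    if len(source) != len(tts):
        raise ValueError("alignments must cover the same word sequence")
    factors = []
    for s, t in zip(source, tts):
        if s.word.lower() != t.word.lower():
            raise ValueError(f"word mismatch: {s.word!r} vs {t.word!r}")
        if t.duration <= 0:
            raise ValueError(f"non-positive TTS duration for {t.word!r}")
        factors.append(s.duration / t.duration)
    return factors


# Made-up alignments: the non-native speaker is twice as slow as the TTS voice.
non_native = [WordSpan("thank", 0.0, 0.5), WordSpan("you", 0.5, 1.0)]
native_tts = [WordSpan("thank", 0.0, 0.25), WordSpan("you", 0.25, 0.5)]
factors = stretch_factors(non_native, native_tts)
```

A factor above 1 means the TTS word would need to be lengthened, and a factor below 1 means it would need to be shortened. Applying the factors to the waveform, and preserving prosody beyond timing, is outside the scope of this sketch.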