Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
The model enables real-time accent conversion for non-native speakers, though the contribution is incremental because it builds on existing architectures.
The paper tackles the problem of converting non-native speech to a native-like accent in real time by proposing the first streaming accent conversion model, which achieves performance comparable to top models while maintaining stable latency.
We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves performance comparable to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
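As a rough illustration of what stream processing involves, the sketch below splits a frame sequence into fixed-size segments, each with a bounded left context and a short right-context look-ahead, which is the general style of segment-wise processing that Emformer-type encoders use. It is not the paper's inference mechanism: the function names `stream_segments` and `lookahead_wait_frames`, the parameter values, and the use of plain Python lists in place of acoustic features are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

Window = Tuple[List[float], List[float], List[float]]


def stream_segments(
    frames: Sequence[float],
    segment_length: int,
    left_context: int,
    right_context: int,
) -> Iterator[Window]:
    """Yield (left, center, right) windows over a frame sequence.

    Hypothetical helper: each center segment is what a streaming encoder
    would emit output for; left/right are the extra frames it may look at.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if left_context < 0 or right_context < 0:
        raise ValueError("context sizes must be non-negative")

    total = len(frames)
    for start in range(0, total, segment_length):
        end = min(start + segment_length, total)
        # Past frames are always available, but we cap how many are reused.
        left = list(frames[max(0, start - left_context):start])
        center = list(frames[start:end])
        # Look-ahead frames must arrive before this segment can be processed.
        right = list(frames[end:min(end + right_context, total)])
        yield left, center, right


def lookahead_wait_frames(segment_length: int, right_context: int) -> int:
    """Frames that must be buffered before a full segment can be processed."""
    return segment_length + right_context


if __name__ == "__main__":
    demo_frames = [float(i) for i in range(10)]
    for left, center, right in stream_segments(demo_frames, 4, 2, 1):
        print(left, center, right)
```

Because the segment length and look-ahead are fixed, the number of frames buffered per step is constant regardless of utterance length, which is one plausible reading of why a segment-wise design can keep latency stable.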