Compact Neural TTS Voices for Accessibility
This work addresses the need for practical, high-quality TTS in accessibility applications on handheld devices. It is an incremental improvement over existing deployable neural TTS.
The paper tackles the high latency and large disk footprint of neural TTS systems for accessibility, describing a compact neural TTS system with latency on the order of 15 ms and a low disk footprint that runs on low-power devices.
Contemporary text-to-speech (TTS) solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering them impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, their latency remains higher than that of SPSS and USEL, while their disk footprint prohibits pre-installation of multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.