SD AS Jan 16, 2020

SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis

Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, Kurt Keutzer

arXiv:2001.05685v1 14.927 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficient speech synthesis on edge devices and represents an incremental improvement over existing vocoders.

The paper tackles the problem of real-time speech synthesis on edge devices by introducing SqueezeWave, a family of lightweight vocoders that achieve audio quality similar to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave.
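
To make the MAC comparison concrete, here is a minimal sketch of how multiply-accumulate operations are commonly counted for a standard 1D convolution, and how shrinking the temporal length and channel width compounds into a large reduction. The helper `conv1d_macs` and all layer shapes are hypothetical, chosen only for illustration; they are not the WaveGlow or SqueezeWave configurations.

```python
def conv1d_macs(length: int, in_channels: int, out_channels: int, kernel_size: int) -> int:
    """MACs for one standard 1D convolution that produces `length` output steps.

    Each output value needs in_channels * kernel_size multiply-accumulates,
    and there are length * out_channels output values.
    """
    return length * in_channels * out_channels * kernel_size


# Hypothetical layer shapes (assumptions, not taken from the paper).
baseline = conv1d_macs(length=2000, in_channels=256, out_channels=256, kernel_size=3)
reduced = conv1d_macs(length=250, in_channels=128, out_channels=128, kernel_size=3)

# 8x shorter sequence and 2x fewer channels on both sides multiply together.
reduction = baseline / reduced

print(f"baseline MACs: {baseline:,}")
print(f"reduced MACs:  {reduced:,}")
print(f"reduction:     {reduction:.0f}x")
```

Because the count is a product of length, input channels, output channels, and kernel size, modest cuts along several of these axes multiply rather than add.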
