SDLGASOct 12, 2020

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

arXiv:2010.05646v22673 citations
Originality Incremental advance
AI Analysis

This work addresses the need for efficient and high-fidelity speech synthesis for applications like text-to-speech, though it is incremental as it builds on existing GAN approaches.

The paper tackled the problem of inefficient and low-fidelity speech synthesis in GAN-based methods by proposing HiFi-GAN, which achieved high-quality audio generation with a mean opinion score similar to human quality and was 167.9 times faster than real-time on a GPU.

Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Code Implementations11 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes