AS · LG · SD · ML · Apr 10, 2019

RawNet: Fast End-to-End Neural Vocoder

arXiv:1904.05351v2 · 2 citations
Originality: Highly original
AI Analysis

This work addresses a bottleneck in speech synthesis: neural vocoders depend on human-designed acoustic features. It removes that dependency by learning the features directly from raw audio.

The authors tackled the problem of neural vocoders relying on handcrafted spectral features by proposing RawNet, an end-to-end model that learns features directly from raw audio, achieving better speech quality and faster generation speed.

Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrogram and fundamental frequency, which are crucial to speech synthesis. However, the feature extraction process tends to depend heavily on human knowledge, resulting in a less expressive description of the original audio. In this work, we propose RawNet, a complete end-to-end neural vocoder following the auto-encoder structure for speaker-dependent and speaker-independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher-level representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner. The coder and voder are jointly trained directly on the raw waveform without any human-designed features. The experimental results show that RawNet achieves better speech quality using a simplified model architecture and obtains a faster speech generation speed at the inference stage.
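
The coder/voder split described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the frame size, latent size, context length, single-layer networks, random weights, and the mean-squared-error loss are all assumptions chosen to keep the example short. It only shows the data flow: a coder maps raw audio to frame-level features, and an autoregressive voder regenerates audio one sample at a time from those features and its own past outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
FRAME = 80      # raw samples summarized by one feature vector
LATENT = 8      # size of each learned feature vector
CONTEXT = 16    # number of past output samples the voder sees

# Randomly initialized weights stand in for trained parameters.
W_coder = rng.normal(scale=0.1, size=(FRAME, LATENT))
W_voder = rng.normal(scale=0.1, size=(CONTEXT + LATENT,))

def coder(wave):
    """Map raw audio to one feature vector per frame (no handcrafted features)."""
    n_frames = len(wave) // FRAME
    frames = wave[: n_frames * FRAME].reshape(n_frames, FRAME)
    return np.tanh(frames @ W_coder)            # shape: (n_frames, LATENT)

def voder(features):
    """Generate audio sample by sample, conditioned on the coder features."""
    n_samples = features.shape[0] * FRAME
    out = np.zeros(CONTEXT + n_samples)         # leading zeros act as initial history
    for t in range(n_samples):
        cond = features[t // FRAME]             # hold each feature vector for FRAME samples
        past = out[t : t + CONTEXT]             # the CONTEXT most recent samples
        out[CONTEXT + t] = np.tanh(np.concatenate([past, cond]) @ W_voder)
    return out[CONTEXT:]

# A 0.1 s, 220 Hz test tone at 16 kHz stands in for recorded speech.
wave = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)
features = coder(wave)
recon = voder(features)

# Joint training would update both weight sets to reduce a reconstruction
# loss; mean squared error is used here only as a simple stand-in.
loss = np.mean((recon - wave) ** 2)
```

The Python loop in `voder` makes the cost of sample-by-sample generation visible: every output sample depends on the previous ones, so the steps cannot be run in parallel at inference time.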

Code Implementations: 2 repos