AS LG SD May 18, 2023

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

arXiv:2305.10823v1
1 citation
Originality: Incremental advance
AI Analysis

FastFit addresses the need for real-time audio synthesis in applications such as text-to-speech. The contribution is incremental because it builds on existing vocoder architectures.

The paper tackles the slow generation rates of iterative neural vocoders by introducing FastFit, which replaces the U-Net encoder with multiple STFTs and achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality.

This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation rates without sacrificing sample quality. We replaced each encoder block with an STFT whose parameters match the temporal resolution of the corresponding decoder block, and whose output feeds the skip connection to that block. FastFit reduces the number of parameters and the generation time of the model by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrated that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further showed that FastFit produces sound quality similar to that of other baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.
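
The idea of swapping learned encoder blocks for fixed STFTs can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The hop sizes `(8, 64, 256)`, the window length of four hops, the Hann window, and the use of magnitude spectra are all assumptions chosen for illustration, and the function names `stft_magnitude` and `multi_resolution_skips` are hypothetical. The sketch only shows how one STFT per decoder resolution yields a feature map with a matching number of time frames, which could then serve as a skip-connection input.

```python
import numpy as np


def stft_magnitude(signal, hop, win_factor=4):
    """Magnitude STFT with hop size `hop`.

    Returns an array of shape (len(signal) // hop, win_length // 2 + 1).
    The window length (win_factor * hop) is an illustrative assumption.
    """
    win_length = win_factor * hop
    window = np.hanning(win_length)
    # Reflect-pad both ends so every frame has a full window of samples.
    pad = win_length // 2
    padded = np.pad(signal, (pad, pad), mode="reflect")
    num_frames = len(signal) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + win_length] for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames * window, axis=-1))


def multi_resolution_skips(signal, hops):
    """One STFT per (assumed) decoder temporal resolution.

    Each entry has len(signal) // hop frames, so it lines up in time with a
    decoder block that operates at a downsampling factor of `hop`.
    """
    return {hop: stft_magnitude(signal, hop) for hop in hops}
```

For a 4096-sample input and the assumed hops, the three feature maps have 512, 64, and 16 frames respectively. Because the STFTs have no trainable weights, this stage adds no parameters, which is consistent with the parameter reduction the abstract reports.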
