SD, LG, AS | Dec 27, 2023

AE-Flow: AutoEncoder Normalizing Flow

arXiv:2312.16552v1 | 4 citations | h-index: 16 | ICASSP
Originality: Incremental advance
AI Analysis

This work proposes an incremental change to how normalizing flows are trained for voice conversion, aimed at better naturalness and speaker similarity.

The paper introduces AE-Flow, a training paradigm that adds supervision to normalizing flows used in text-to-speech and voice conversion. It adds a reconstruction loss to training without requiring parallel data. Compared with regular normalizing-flow training, AE-Flow systematically improves speaker similarity and naturalness. Compared with a state-of-the-art voice conversion baseline, it improves speaker similarity and intelligibility.

Recently, normalizing flows have been gaining traction in text-to-speech (TTS) and voice conversion (VC) due to their state-of-the-art (SOTA) performance. Normalizing flows are unsupervised generative models. In this paper, we introduce supervision to the training process of normalizing flows, without the need for parallel data. We call this training paradigm AutoEncoder Normalizing Flow (AE-Flow). It adds a reconstruction loss that forces the model to use information from the conditioning to reconstruct an audio sample. Our goal is to understand the impact of each component and to find the right combination of the negative log-likelihood (NLL) and the reconstruction loss when training normalizing flows with coupling blocks. For that reason, we compare a flow-based mapping model trained with: (i) the NLL loss, (ii) the NLL and reconstruction losses, and (iii) the reconstruction loss only. Additionally, we compare our model with a SOTA VC baseline. The models are evaluated in terms of naturalness, speaker similarity, and intelligibility in many-to-many and many-to-any VC settings. The results show that the proposed training paradigm systematically improves speaker similarity and naturalness when compared to regular training methods of normalizing flows. Furthermore, we show that our method improves speaker similarity and intelligibility over the state-of-the-art.
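
The three training configurations compared in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy coupling block, the L1 reconstruction distance, the weight names `nll_weight` and `rec_weight`, and all function names are hypothetical. Because a flow inverts exactly when both passes see the same conditioning, the sketch also assumes the encoding pass sees a zeroed conditioning vector while the decoding pass sees the real one, so that the reconstruction term is non-trivial. The abstract does not specify that detail.

```python
import math

def coupling_forward(x, cond, w):
    """Toy affine coupling block. The first half of x passes through
    unchanged and, with the conditioning, sets scale and shift for the
    second half. Returns (z, log_det)."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    h = sum(xa) + sum(cond)              # toy stand-in for a neural network
    log_s = math.tanh(w["scale"] * h)    # bounded log-scale
    t = w["shift"] * h
    zb = [v * math.exp(log_s) + t for v in xb]
    return xa + zb, log_s * len(xb)

def coupling_inverse(z, cond, w):
    """Exact inverse of coupling_forward for the same cond and w."""
    half = len(z) // 2
    za, zb = z[:half], z[half:]
    h = sum(za) + sum(cond)
    log_s = math.tanh(w["scale"] * h)
    t = w["shift"] * h
    xb = [(v - t) * math.exp(-log_s) for v in zb]
    return za + xb

def nll(z, log_det):
    """Negative log-likelihood of one sample under a standard normal prior."""
    d = len(z)
    return 0.5 * sum(v * v for v in z) + 0.5 * d * math.log(2 * math.pi) - log_det

def l1(x, x_hat):
    """Mean absolute error, used here as the reconstruction distance (assumed)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def ae_flow_loss(x, cond, w, nll_weight, rec_weight):
    """Weighted sum of NLL and reconstruction loss for one sample."""
    enc_cond = [0.0] * len(cond)         # ASSUMPTION: encoder pass is unconditioned
    z, log_det = coupling_forward(x, enc_cond, w)
    x_hat = coupling_inverse(z, cond, w) # decoder pass uses the real conditioning
    return nll_weight * nll(z, log_det) + rec_weight * l1(x, x_hat)

x = [0.5, -1.0, 2.0, 0.3]                # stand-in for an audio feature vector
cond = [1.0, 0.5]                        # stand-in for conditioning, e.g. a speaker embedding
w = {"scale": 0.3, "shift": 0.1}

loss_nll_only = ae_flow_loss(x, cond, w, nll_weight=1.0, rec_weight=0.0)  # (i)
loss_both = ae_flow_loss(x, cond, w, nll_weight=1.0, rec_weight=1.0)      # (ii)
loss_rec_only = ae_flow_loss(x, cond, w, nll_weight=0.0, rec_weight=1.0)  # (iii)
```

In this framing, configurations (i) and (iii) are the two ends of a single weighted objective, and finding "the right combination" amounts to choosing the two weights.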

