AS, AI, LG | May 26, 2023

Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

arXiv:2305.16699v1
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck for researchers and practitioners who use zero-shot speech synthesis models: it removes the need to tune loss-balance hyper-parameters by hand. The advance is incremental, since it builds on existing VITS-based methods.

The paper tackles the problem of tuning loss trade-offs in zero-shot speech synthesis models, which previously required a burdensome hyper-parameter search. It proposes a framework that finds the optimal balance automatically, without search, and reports state-of-the-art performance in zero-shot TTS and VC.

Recently, zero-shot TTS and VC methods have gained attention because they can generate voices that were not seen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while inheriting useful properties from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This is problematic, as it requires a burdensome procedure of tuning loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings. We provide an explanation for the results in the discussion.
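To make the loss-balance problem concrete, here is a minimal sketch of a training objective formed as a weighted sum of loss terms. The term names, loss values, and weights are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch shows the hyper-parameters a manual search would tune, not the proposed framework itself.

```python
# Hypothetical illustration: a weighted sum of training loss terms.
# Term names, loss values, and weights are assumptions, not from the paper.

def total_loss(losses, weights):
    """Combine per-term losses using fixed trade-off weights."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Example per-term loss values for a single training step (made up).
losses = {"reconstruction": 0.5, "kl": 2.0, "adversarial": 1.0}

# Two candidate settings that a manual search would have to compare.
weights_a = {"reconstruction": 1.0, "kl": 1.0, "adversarial": 1.0}
weights_b = {"reconstruction": 10.0, "kl": 1.0, "adversarial": 1.0}

print(total_loss(losses, weights_a))
print(total_loss(losses, weights_b))
```

Each weight setting yields a different objective and therefore a different trained model, so a conventional search has to retrain and evaluate once per candidate. That repeated cost is what the abstract describes as burdensome.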

Code Implementations: 1 repo