SD | LG | AS | May 19, 2025

Score-Based Training for Energy-Based TTS Models

arXiv:2505.13771v1 | h-index: 1 | INTERSPEECH
Originality: Incremental advance
AI Analysis

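This work addresses a specific bottleneck in EBM training for TTS: NCE and SSM disregard the form of the log-likelihood function, even though inference relies on first-order optimisation. It offers an incremental improvement over these prior methods.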

The paper tackles the training of energy-based models (EBMs) for text-to-speech (TTS). It proposes a new criterion that learns scores better suited to first-order optimisation schemes and contrasts it with existing methods such as noise contrastive estimation (NCE) and sliced score matching (SSM).

Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBMs) with intractable normalisation terms. The key idea of NCE is to learn by comparing the unnormalised log-likelihoods of reference and noisy samples, thus avoiding explicit computation of the normalisation terms. However, NCE critically relies on the quality of the noisy samples. Recently, sliced score matching (SSM) has been popularised by the closely related diffusion models (DMs). Unlike NCE, SSM learns the gradient of the log-likelihood, or score, by learning the distribution of its projections onto randomly chosen directions. However, both NCE and SSM disregard the form of the log-likelihood function, which is problematic given that EBMs and DMs make use of first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrast these approaches for training EBMs.
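To make the two baseline criteria in the abstract concrete, here is a minimal sketch of a binary NCE loss and a sliced score matching loss on a toy Gaussian energy model. It is not the criterion the paper proposes, and it is not the authors' code. The toy model, the parameter names (`mu`, `log_z`, `prec`), the Gaussian noise distribution, and the helper `make_toy_data` are all assumptions chosen for illustration; a TTS model would use a neural network and automatic differentiation instead of the closed-form score used here.

```python
import math
import random


def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    return -math.log1p(math.exp(-t)) if t >= 0 else t - math.log1p(math.exp(t))


def model_logp(x, mu, log_z):
    """Toy 1-D EBM (assumed): unnormalised log-likelihood minus a learned log_z."""
    return -0.5 * (x - mu) ** 2 - log_z


def noise_logp(x, noise_std):
    """Exact log-density of the zero-mean Gaussian noise distribution (assumed)."""
    return -0.5 * (x / noise_std) ** 2 - math.log(noise_std * math.sqrt(2 * math.pi))


def nce_loss(data, noise, mu, log_z, noise_std):
    """Binary NCE: classify reference vs. noisy samples from the log-ratio."""
    total = 0.0
    for x in data:   # reference samples should receive a positive logit
        total -= log_sigmoid(model_logp(x, mu, log_z) - noise_logp(x, noise_std))
    for x in noise:  # noisy samples should receive a negative logit
        total -= log_sigmoid(noise_logp(x, noise_std) - model_logp(x, mu, log_z))
    return total / (len(data) + len(noise))


def ssm_loss(data, mu, prec, seed=0):
    """Sliced score matching for the closed-form score s(x) = -prec * (x - mu).

    Estimates E_x E_v [ v^T (ds/dx) v + 0.5 * (v^T s(x))^2 ] with one random
    Gaussian direction v per sample.
    """
    rng = random.Random(seed)
    total = 0.0
    for x in data:
        v = [rng.gauss(0.0, 1.0) for _ in x]
        s = [-prec * (xi - mi) for xi, mi in zip(x, mu)]
        v_dot_s = sum(vi * si for vi, si in zip(v, s))
        v_jac_v = -prec * sum(vi * vi for vi in v)  # Jacobian of s is -prec * I
        total += v_jac_v + 0.5 * v_dot_s ** 2
    return total / len(data)


def make_toy_data(n=4000, seed=1):
    """Hypothetical helper: 1-D data and noise for NCE, 2-D data for SSM."""
    rng = random.Random(seed)
    data_1d = [rng.gauss(1.0, 1.0) for _ in range(n)]
    noise_1d = [rng.gauss(0.0, 2.0) for _ in range(n)]
    data_2d = [[rng.gauss(1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(n)]
    return data_1d, noise_1d, data_2d


if __name__ == "__main__":
    data_1d, noise_1d, data_2d = make_toy_data()
    true_log_z = 0.5 * math.log(2 * math.pi)
    print("NCE, true mean :", nce_loss(data_1d, noise_1d, 1.0, true_log_z, 2.0))
    print("NCE, wrong mean:", nce_loss(data_1d, noise_1d, -1.0, true_log_z, 2.0))
    print("SSM, true params:", ssm_loss(data_2d, [1.0, -1.0], 1.0))
    print("SSM, wrong prec :", ssm_loss(data_2d, [1.0, -1.0], 3.0))
```

In this sketch, `nce_loss` only ever evaluates log-likelihood differences at sampled points, which is why its quality depends on the noise distribution, while `ssm_loss` only touches the score and its Jacobian along random directions. Neither loss looks at the shape of the log-likelihood surface that a first-order optimiser would traverse at inference time, which is the gap the abstract points to.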

