CL, Aug 22, 2016

Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

arXiv:1608.06134v2, 18 citations
AI Analysis

This work addresses a specific bottleneck in speech synthesis, namely incremental generation, but its contribution is a modest step forward because it builds on existing statistical parametric methods.

The paper tackles the problem of duration modeling in statistical parametric speech synthesis by proposing a non-parametric approach that uses a recurrent model to predict phone transition probabilities at each acoustic frame, with generation based on median durations. Results show the method is competitive with baselines in approximating median durations of natural speech.

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling -- which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis -- our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
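The relationship between per-frame transition probabilities and the median duration can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the model emits, for each frame t = 1, 2, ..., the probability of leaving the current phone at that frame given that no transition has happened yet, and that durations are counted in whole frames starting at 1. The function names (`duration_pmf`, `median_duration`, `mean_duration`) and the example probabilities are hypothetical.

```python
from itertools import repeat
from typing import Iterable, List


def duration_pmf(transition_probs: List[float]) -> List[float]:
    """Return P(D = d) for d = 1..len(transition_probs).

    transition_probs[d - 1] is assumed to be the probability of a phone
    transition at frame d, given that no transition occurred earlier.
    """
    pmf = []
    survival = 1.0  # probability that the phone is still ongoing
    for p in transition_probs:
        pmf.append(survival * p)
        survival *= 1.0 - p
    return pmf


def median_duration(transition_probs: Iterable[float]) -> int:
    """Return the smallest d with P(D <= d) >= 0.5.

    The loop consumes one frame at a time and stops as soon as the
    cumulative probability reaches 0.5, so it never looks ahead. This is
    what makes the median usable for incremental generation.
    """
    survival = 1.0
    for d, p in enumerate(transition_probs, start=1):
        survival *= 1.0 - p
        if survival <= 0.5:
            return d
    raise ValueError("cumulative probability never reached 0.5")


def mean_duration(transition_probs: List[float]) -> float:
    """Return the mean of the same distribution (needs the whole sequence)."""
    pmf = duration_pmf(transition_probs)
    return sum(d * q for d, q in enumerate(pmf, start=1))


# Hypothetical per-frame outputs for a single phone.
probs = [0.1, 0.2, 0.5, 0.5, 1.0]
print(median_duration(probs))  # 3
print(mean_duration(probs))

# The median can also be read off a stream of frame-level outputs.
print(median_duration(repeat(0.25)))  # 3
```

Under these assumptions the median is available as soon as the running survival probability drops to 0.5 or below, whereas the mean requires the full distribution, including its tail.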

