CL AI Nov 17, 2023

A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness

arXiv:2311.10804v1
h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving expressiveness in Text-to-Speech (TTS) systems for applications such as voice assistants. The advance is incremental: it builds on existing methods without demonstrating clear state-of-the-art gains.

The study aims to improve expressiveness control in TTS models by augmenting a frozen pretrained model with a diffusion model conditioned on joint semantic audio/text embeddings. Its results offer insight into the difficulties involved but report no concrete numerical improvements.

This report explores the challenge of enhancing expressiveness control in Text-to-Speech (TTS) models by augmenting a frozen pretrained model with a Diffusion Model that is conditioned on joint semantic audio/text embeddings. The paper identifies the challenges encountered when working with a VAE-based TTS model and evaluates different image-to-image methods for altering latent speech features. Our results offer valuable insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.

