CL, AI, SD · Sep 29, 2025

Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization

arXiv:2509.25416v1 · 3 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of producing emotionally expressive speech for applications such as virtual assistants and audiobooks, though it appears incremental because it builds on existing diffusion TTS methods.

The paper tackles the problem of generating emotional speech in text-to-speech systems by introducing EASPO, a post-training framework that aligns diffusion models with fine-grained emotional preferences at intermediate denoising steps, resulting in superior performance in expressiveness and naturalness compared to existing methods.

Emotional text-to-speech seeks to convey affect while preserving intelligibility and prosody, yet existing methods rely on coarse labels or proxy classifiers and receive only utterance-level feedback. We introduce Emotion-Aware Stepwise Preference Optimization (EASPO), a post-training framework that aligns diffusion TTS with fine-grained emotional preferences at intermediate denoising steps. Central to our approach is EASPM, a time-conditioned model that scores noisy intermediate speech states and enables automatic preference pair construction. EASPO optimizes generation to match these stepwise preferences, enabling controllable emotional shaping. Experiments show superior performance over existing methods in both expressiveness and naturalness.
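The abstract does not spell out the scoring model or the training objective, so the following is only a minimal sketch of how stepwise preference pairs might be built and used. It assumes a DPO-style logistic loss as a stand-in objective, which the source does not confirm. All names here (`build_preference_pair`, `stepwise_preference_loss`, `toy_scorer`, `beta`) are hypothetical, and `toy_scorer` is a placeholder for a time-conditioned scorer such as EASPM, not the paper's model.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]  # stand-in for a noisy intermediate speech state


def build_preference_pair(
    candidates: List[State],
    t: int,
    scorer: Callable[[State, int], float],
) -> Tuple[State, State]:
    """Score each candidate at denoising step t and return
    (preferred, dispreferred) as the highest- and lowest-scoring states."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda x: scorer(x, t))
    return ranked[-1], ranked[0]


def stepwise_preference_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 1.0,
) -> float:
    """DPO-style loss (an assumption, not taken from the paper):
    -log(sigmoid(beta * margin)), where the margin compares the policy's
    log-probabilities of the two states against a frozen reference."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_scorer(state: State, t: int) -> float:
    """Hypothetical placeholder for a time-conditioned emotion scorer:
    mean of the state, with a small step-dependent offset."""
    return sum(state) / len(state) - 0.01 * t


if __name__ == "__main__":
    candidates = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5]]
    win, lose = build_preference_pair(candidates, t=10, scorer=toy_scorer)
    print("preferred:", win, "dispreferred:", lose)
    print("loss:", stepwise_preference_loss(-1.0, -2.0, -1.5, -1.5))
```

In this sketch the loss shrinks as the policy assigns relatively more probability to the preferred intermediate state than the reference does, which is one plausible way to realize the "match these stepwise preferences" step described above.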

