CL · AI · Oct 16, 2025

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

arXiv:2510.14628v1 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses emotionally flat speech synthesis in applications that require expressive audio. It is an incremental advance: it builds on existing methods and adds specific enhancements.

The paper tackles the challenge of emotional expressiveness in text-to-speech synthesis by proposing the RLAIF-SPA framework, which uses reinforcement learning from AI feedback to optimize both expressiveness and intelligibility. Compared with Chat-TTS, it reports a 26.1% reduction in word error rate and an improvement of over 10% in human evaluation.

Text-to-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, so the generated speech is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, which incorporates a Reinforcement Learning from AI Feedback (RLAIF) mechanism: Automatic Speech Recognition (ASR) judges semantic accuracy and a Large Language Model (LLM) judges prosodic-emotional label alignment, and both judgments serve as direct rewards for optimizing emotional expressiveness and intelligibility. Specifically, the framework leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
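The two reward signals described in the abstract can be illustrated with a minimal sketch. It assumes the ASR transcript and the LLM-judged labels have already been produced elsewhere and are passed in as plain values. The function names, the label vocabulary, the linear combination, and the equal weighting are all hypothetical choices made for illustration; the abstract does not specify how the paper combines the two signals.

```python
from typing import Dict

# The four fine-grained dimensions named in the abstract.
DIMENSIONS = ("structure", "emotion", "speed", "tone")


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def label_alignment(target: Dict[str, str], judged: Dict[str, str]) -> float:
    """Fraction of the four dimensions where the judged label matches the target."""
    matches = sum(target[d] == judged.get(d) for d in DIMENSIONS)
    return matches / len(DIMENSIONS)


def combined_reward(reference: str, transcript: str,
                    target: Dict[str, str], judged: Dict[str, str],
                    semantic_weight: float = 0.5) -> float:
    """Assumed linear mix of semantic accuracy and prosodic label alignment."""
    semantic = max(0.0, 1.0 - word_error_rate(reference, transcript))
    prosodic = label_alignment(target, judged)
    return semantic_weight * semantic + (1.0 - semantic_weight) * prosodic


# Hypothetical example values, not taken from the paper.
target_labels = {"structure": "statement", "emotion": "happy",
                 "speed": "fast", "tone": "rising"}
judged_labels = {"structure": "statement", "emotion": "happy",
                 "speed": "slow", "tone": "rising"}
reward = combined_reward("the cat sat on the mat", "the cat sat on a mat",
                         target_labels, judged_labels)
```

In this example one of six reference words is misrecognized and three of the four labels match, so the semantic term is 5/6 and the prosodic term is 0.75. In a real pipeline the transcript would come from an ASR model and the judged labels from an LLM, and the resulting scalar would feed a reinforcement learning update.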

