ASCL May 21, 2025

P2VA: Converting Persona Descriptions into Voice Attributes for Fair and Controllable Text-to-Speech

arXiv:2505.17093v2
Originality: Incremental advance
AI Analysis

The paper addresses a usability gap for TTS users who want to create voices from implicit persona descriptions. The advance appears incremental, since it builds on existing LLM and TTS technologies.

The paper tackles the problem of generating voices from persona descriptions in text-to-speech systems, where users often lack the expertise to specify voice attributes directly. It introduces the P2VA framework, whose P2VA-C variant is reported to reduce word error rate (WER) by 5% and improve mean opinion score (MOS) by 0.33 points.

While persona-driven large language models (LLMs) and prompt-based text-to-speech (TTS) systems have advanced significantly, a usability gap arises when users attempt to generate voices matching their desired personas from implicit descriptions. Most users lack specialized knowledge to specify detailed voice attributes, which often leads TTS systems to misinterpret their expectations. To address these gaps, we introduce Persona-to-Voice-Attribute (P2VA), the first framework enabling voice generation automatically from persona descriptions. Our approach employs two strategies: P2VA-C for structured voice attributes, and P2VA-O for richer style descriptions. Evaluation shows our P2VA-C reduces WER by 5% and improves MOS by 0.33 points. To the best of our knowledge, P2VA is the first framework to establish a connection between persona and voice synthesis. In addition, we discover that current LLMs embed societal biases in voice attributes during the conversion process. Our experiments and findings further provide insights into the challenges of building persona-voice systems.

