Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

arXiv:2605.00861


AI Analysis

For TTS researchers and developers, this provides a new evaluation method focusing on voice dynamics and expressiveness, though it is incremental as it applies known metrics to a new problem.

This study proposes voice mapping as a framework for evaluating TTS quality using crest factor, spectrum balance, and CPPs. It finds that VITS has the largest voice range, that Glow-TTS excels in soft phonation, and that CPPs values of 7-8 dB indicate natural voice quality, whereas values above 10 dB tend to sound robotic.

This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. It analyzes six influential TTS models, both historical and recent: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The metrics are crest factor, spectrum balance, and cepstral peak prominence (CPPs). The results indicate that voice range serves as a primary indicator of model capability, with VITS showing the largest range among the tested models. Glow-TTS performed best in soft phonation, as indicated by higher spectrum balance, despite its limited voice range. CPPs values of 7-8 dB indicate natural voice quality, whereas speech with CPPs above 10 dB tends to sound robotic. These findings underscore the need for voice mapping to evaluate vocal effort and to capture how TTS systems handle voice dynamics and expressiveness.
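To make the metrics more concrete, here is a minimal Python sketch of the simplest one, crest factor, using its standard definition (peak amplitude over RMS amplitude, expressed in dB). The abstract does not say how the study frames or windows the signal, so computing it over a whole sample list is an assumption. The helper `describe_cpps` is hypothetical: it only restates the CPPs ranges reported above and is not a classifier from the paper.

```python
import math


def crest_factor_db(samples):
    """Crest factor in dB: 20 * log10(peak amplitude / RMS amplitude)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for an all-zero signal")
    return 20.0 * math.log10(peak / rms)


def describe_cpps(cpps_db):
    """Hypothetical helper: map a CPPs value (dB) onto the reported ranges."""
    if 7.0 <= cpps_db <= 8.0:
        return "natural range"
    if cpps_db > 10.0:
        return "tends to sound robotic"
    # The abstract makes no claim about values outside those two ranges.
    return "not characterized in the abstract"


# Illustrative input (not data from the study): one second of a 100 Hz sine
# sampled at 16 kHz, which covers a whole number of periods.
sample_rate = 16000
sine = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]

print(f"sine crest factor: {crest_factor_db(sine):.2f} dB")
print(describe_cpps(7.5))
print(describe_cpps(10.5))
```

A pure sine has a crest factor of about 3 dB and a square wave 0 dB, which gives two reference points for checking the function. Spectrum balance and CPPs need spectral and cepstral analysis and are left out of this sketch.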

