SD CL AS

Jul 1, 2025

Multi-interaction TTS toward professional recording reproduction

Hiroki Kanagawa, Kenichi Fujita, Aya Watanabe, Yusuke Ijima

arXiv:2507.00808v27.01 citations

h-index: 5

13th edition of the Speech Synthesis Workshop

Originality: Incremental advance

AI Analysis

This work addresses the need for more intuitive and rapid style adjustment in TTS for users such as voice directors. It is an incremental advance because it builds on existing TTS frameworks.

The paper tackles the problem of fine-grained style refinement in text-to-speech synthesis by proposing a multi-step interaction method that allows users to iteratively refine synthesized speech, demonstrating its capability through experiments.

Voice directors often iteratively refine voice actors' performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech synthesis (TTS). As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user's intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthesized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model with its corresponding dataset enables iterative style refinements in accordance with users' directions, thus demonstrating its multi-interaction capability. Audio samples are available at https://ntt-hilab-gensp.github.io/ssw13multiinteractiontts/
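The multi-step interaction described in the abstract can be pictured as a loop: synthesize once, then apply each user direction to the current style and resynthesize. The sketch below illustrates only that control flow. The abstract does not specify the model architecture, so everything here is an assumption made for illustration: the `rate` and `pitch` controls, the `DIRECTION_EFFECTS` table, and the `synthesize`, `refine`, and `interact` functions are hypothetical stand-ins, and a real system would presumably use a learned model conditioned on the directions rather than a lookup table.

```python
# Toy sketch of an iterative "director feedback" loop around a TTS system.
# All names and controls below are hypothetical; this is not the paper's model.

# Hypothetical mapping from a user direction to a (control, delta) adjustment.
DIRECTION_EFFECTS = {
    "faster": ("rate", 0.1),
    "slower": ("rate", -0.1),
    "brighter": ("pitch", 0.1),
    "darker": ("pitch", -0.1),
}


def synthesize(text, style):
    """Stand-in for a TTS call: returns a description instead of audio."""
    return f"<audio text={text!r} rate={style['rate']:.1f} pitch={style['pitch']:.1f}>"


def refine(style, direction):
    """Return a new style with one direction applied; the input is not mutated."""
    control, delta = DIRECTION_EFFECTS[direction]
    updated = dict(style)
    updated[control] = round(updated[control] + delta, 2)
    return updated


def interact(text, directions):
    """Initial synthesis followed by one resynthesis per user direction."""
    style = {"rate": 1.0, "pitch": 1.0}
    outputs = [synthesize(text, style)]
    for direction in directions:
        # Each step starts from the previous style, so refinements accumulate.
        style = refine(style, direction)
        outputs.append(synthesize(text, style))
    return style, outputs


final_style, takes = interact("Hello", ["faster", "faster", "darker"])
for take in takes:
    print(take)
```

The point of the sketch is that each direction is applied relative to the previous take rather than to the initial synthesis, which is what distinguishes multi-step refinement from a single style prompt.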
