AS SD
Oct 22, 2020

How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?

arXiv:2010.11549v1
4 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of creating authentically entertaining speech synthesis for rakugo and identifies key factors beyond naturalness. It is incremental, however, because it contributes an evaluation and insights rather than a new synthesis method.

The paper evaluates speech synthesis for rakugo, a traditional Japanese entertainment form, by comparing synthesized speech with that of professional performers. Naturalness was comparable, but the synthesized speech entertained listeners less because its content was harder to understand and its characters were harder to tell apart.

We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy) toward speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of the synthesized speech was comparable to that of the human speech, but the synthesized speech entertained listeners less than the performers of any rank. However, we obtained some interesting insights into challenges to be solved in order to achieve a truly entertaining rakugo synthesizer. For example, naturalness was not the most important factor, even though it has generally been emphasized as the most important point to be evaluated in the conventional speech synthesis field. More important factors were the understandability of the content and the distinguishability of the characters in the rakugo story, in both of which the synthesized rakugo speech was inferior to the professional performers. We also found that fundamental frequency (f_o) modeling should be further improved to better entertain audiences. These results are important steps toward authentically entertaining speech synthesis.

