Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations
This work addresses personalized synthesis of non-speech vocalizations such as laughter, which could enhance human-computer interaction. The contribution appears incremental, and the authors themselves note limitations and directions for future improvement.
The authors tackled the problem of modeling non-speech vocalizations (NSVs) by formulating it as a text-to-speech task, verifying its viability and showing the model can control speaker timbre with few-shot training data, though heterogeneity in recording conditions was identified as a major obstacle.
We formulated non-speech vocalization (NSV) modeling as a text-to-speech task and verified its viability. Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to control speaker timbre even though the training data is few-shot per speaker. In addition, we substantiated that the heterogeneity in recording conditions is the major obstacle for NSV modeling. Finally, we discussed five improvements to our method for future research. Audio samples of synthesized NSVs are available on our demo page: https://resemble-ai.github.io/reLaugh.
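As a rough illustration of what "discrete speech units" means in a text-to-speech formulation, the sketch below maps frame-level feature vectors to the index of their nearest codebook centroid and then collapses consecutive repeats into a token sequence that can stand in for text. It is a toy under stated assumptions, not the authors' pipeline: the 2-D frames and the three-entry codebook are made up, whereas real HuBERT units are typically obtained by clustering hidden features of a pretrained HuBERT model, and the repeat-collapsing step is a common convention in unit-based systems that the text above does not confirm.

```python
from itertools import groupby


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (squared Euclidean distance)."""
    units = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single token."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical codebook and toy 2-D "features"; a real system would use
# high-dimensional features from a pretrained model and a learned codebook.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.9), (0.0, 0.0)]

units = quantize(frames, centroids)   # one unit index per frame
tokens = collapse_repeats(units)      # text-like input sequence
print(units, tokens)
```

The resulting token sequence plays the role that phonemes or characters play in an ordinary text-to-speech model, which is why laughter and other vocalizations without a written form can still be handled in that framework.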