Synthetic Data Generation for Phrase Break Prediction with Large Language Model
This work addresses data scarcity and annotation costs for researchers and developers in speech technology, but its contribution is incremental: it applies an existing LLM-based method to a new domain.
The paper tackles the high manual annotation costs and data variability of phrase break prediction for text-to-speech systems by using large language models to generate synthetic data, and reports that this approach mitigates these data challenges across multiple languages.
Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but rely heavily on large volumes of human annotation from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates the acquisition of consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing the need for manual annotation. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, targeting the challenges of both manual annotation and speech-related tasks. We compare the synthetic annotations with traditional annotations and assess their effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.