SD CL AS | Jun 5, 2025

Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto

arXiv:2506.04527v17.01 citations | h-index: 16 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the need for parallel data of speech, labels, and graphemes in downstream tasks such as text-to-speech and accent estimation. It is rated as an incremental improvement over existing methods.

The paper tackles the problem of generating phonemic and prosodic labels from speech that are coherent with the corresponding graphemes. The authors report significantly improved consistency between graphemes and predicted labels, as well as improved accuracy on an accent estimation task.

We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on the corresponding graphemes by two methods: 1) adding implicit grapheme conditioning through a prompt encoder using pre-trained BERT features, and 2) explicitly pruning the label hypotheses inconsistent with the graphemes during inference. These methods enable obtaining parallel data of speech, labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on an accent estimation task confirmed that the parallel data created by the proposed method effectively improves the estimation accuracy.
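The abstract does not say how hypotheses are checked against the graphemes, so the following is only a minimal sketch of the general idea of explicit pruning during beam search. It assumes a hypothetical lookup that gives the label sequences permitted for a grapheme string; the names `is_consistent`, `prune`, and the toy readings for "read" are illustrative and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# A hypothesis is (labels decoded so far, log-probability).
Hypothesis = Tuple[Tuple[str, ...], float]


def is_consistent(prefix: Sequence[str], allowed: List[Tuple[str, ...]]) -> bool:
    """Return True if `prefix` is a prefix of at least one allowed label sequence."""
    prefix = tuple(prefix)
    return any(prefix == cand[: len(prefix)] for cand in allowed)


def prune(
    hyps: List[Hypothesis],
    allowed: List[Tuple[str, ...]],
    beam_size: int,
) -> List[Hypothesis]:
    """Drop hypotheses that contradict the grapheme, then keep the best `beam_size`."""
    kept = [h for h in hyps if is_consistent(h[0], allowed)]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:beam_size]


# Assumption: toy readings for the grapheme string "read" (hypothetical lexicon entry).
allowed_for_read = [("r", "iy", "d"), ("r", "eh", "d")]

# Partial hypotheses from one decoding step, with made-up scores.
beam = [
    (("r", "ay"), -0.2),  # best score, but matches no reading of "read"
    (("l", "iy"), -0.3),  # inconsistent first label
    (("r", "iy"), -0.4),
    (("r", "eh"), -0.9),
]

pruned = prune(beam, allowed_for_read, beam_size=2)
print(pruned)
```

In this toy run the two highest-scoring hypotheses are discarded because they cannot be extended to any permitted reading, so the beam keeps only grapheme-consistent candidates even though their acoustic scores are lower.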
