AS CL LG SD

May 30, 2023

Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

Theodoros Kouzelis, Georgios Paraskevopoulos, Athanasios Katsamanis, Vassilis Katsouros

arXiv:2306.00996v17.39 citations

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the difficulty of obtaining time-aligned data for speech disorder studies and offers a practical solution for researchers and clinicians working with disfluent speech. The contribution is incremental, since it modifies an existing alignment graph construction.

The paper tackles forced alignment of disfluent speech, where audio-text mismatches degrade aligner performance. It proposes a weakly-supervised method that combines phoneme-level modeling with CTC-based models and Weighted Finite State Transducers, and reports a 23-25% relative improvement in recall over baselines on a corrupted version of the TIMIT test set and on the UCLASS dataset.

The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During the graph construction, we allow the modeling of common speech disfluencies, i.e. repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the use of Oracle Error Rate, our method can be effectively used in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over our baselines.
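The idea of letting the alignment graph absorb repetitions and omissions can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a plain Viterbi search in place of a WFST toolkit, omits the CTC blank symbol, and limits repetitions to stepping back one phoneme. The function names, the penalty values, and the frame costs are all hypothetical and chosen only for illustration.

```python
import math

def make_costs(spoken, inventory, hit=0.1, miss=5.0):
    """Toy frame costs (negative log-probabilities): low for the phoneme
    actually spoken in each frame, high for every other phoneme."""
    return [{p: (hit if p == s else miss) for p in inventory} for s in spoken]

def align(frame_costs, phonemes, skip_penalty=2.0, repeat_penalty=2.0):
    """Viterbi alignment of frames to reference phoneme positions.

    State i means "the current frame is assigned to phonemes[i]".
    Allowed moves between consecutive frames (all assumptions of this sketch):
      stay    i -> i      (phoneme spans several frames)
      advance i -> i + 1  (next reference phoneme)
      skip    i -> i + 2  (reference phoneme i + 1 was omitted)
      back    i -> i - 1  (previous phoneme is uttered again)
    """
    n, n_frames = len(phonemes), len(frame_costs)
    moves = [(0, 0.0), (1, 0.0), (2, skip_penalty), (-1, repeat_penalty)]
    best = [[math.inf] * n for _ in range(n_frames)]
    back = [[-1] * n for _ in range(n_frames)]
    best[0][0] = frame_costs[0][phonemes[0]]  # must start on the first phoneme

    for t in range(1, n_frames):
        for i in range(n):
            if best[t - 1][i] == math.inf:
                continue
            for step, penalty in moves:
                j = i + step
                if 0 <= j < n:
                    cost = best[t - 1][i] + penalty + frame_costs[t][phonemes[j]]
                    if cost < best[t][j]:
                        best[t][j] = cost
                        back[t][j] = i

    if best[-1][n - 1] == math.inf:
        raise ValueError("no valid alignment ends on the last phoneme")

    # Trace back from the final reference phoneme to recover the state path.
    path = [n - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

reference = ["k", "ae", "t"]
# Disfluent audio "ca-cat": the first two phonemes are repeated.
repeated = make_costs(["k", "ae", "k", "ae", "t"], reference)
print(align(repeated, reference))  # [0, 1, 0, 1, 2]
# Audio with the vowel omitted.
omitted = make_costs(["k", "k", "t", "t"], reference)
print(align(omitted, reference))   # [0, 0, 2, 2]
```

In both cases the reference transcript stays non-verbatim ("k ae t"), and the extra arcs let the aligner account for the mismatch instead of forcing frames onto the wrong phonemes. Setting a penalty to `math.inf` disables the corresponding arc, which recovers a strictly left-to-right aligner for comparison.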
