Automatic recognition of suprasegmentals in speech
This work aims to improve speech recognition systems for linguistic research and AI applications. Its contribution is incremental, since it builds on existing methods such as wav2vec 2.0.
The study addresses automatic recognition of suprasegmentals in speech by fine-tuning wav2vec 2.0 with CTC. It improves on the state of the art for syllables, tones, and pitch accents, and Mandarin tone recognition improves significantly when tonal finals or tonal syllables are used as recognition units.
This study reports our efforts to improve automatic recognition of suprasegmentals by fine-tuning wav2vec 2.0 with connectionist temporal classification (CTC), a method that has been successful in automatic speech recognition. We demonstrate that the method improves on the state of the art in automatic recognition of syllables, tones, and pitch accents. Using segmental information, by employing tonal finals or tonal syllables as recognition units, significantly improves Mandarin tone recognition. Language models are helpful when tonal syllables are the recognition units, but not when tones are the recognition units. Finally, Mandarin tone recognition can benefit from English phoneme recognition when the two tasks are combined in fine-tuning wav2vec 2.0.
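To make the choice of recognition units concrete, the following is a minimal sketch of how a tone-numbered pinyin transcript might be mapped to the three kinds of CTC targets mentioned above (tones, tonal finals, tonal syllables), together with the standard greedy CTC collapse rule. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tone-digit notation such as `zhong1`, the simplified initial inventory (which treats `y` and `w` as initials), and the decision to emit the initial as its own unit in the tonal-final scheme are all hypothetical.

```python
# Hypothetical illustration; none of these names or conventions come from the paper.
# Simplified pinyin initials, with two-letter initials listed first so they match first.
# Treating "y" and "w" as initials is a simplifying assumption.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split a tone-numbered syllable such as 'zhong1' into (initial, final, tone)."""
    base, tone = syllable[:-1], syllable[-1]  # assumes the last character is the tone digit
    for initial in INITIALS:
        if base.startswith(initial):
            return initial, base[len(initial):], tone
    return "", base, tone  # zero-initial syllable such as 'an1'


def to_units(syllables, scheme):
    """Convert syllables to CTC target units under one of three schemes."""
    units = []
    for syllable in syllables:
        initial, final, tone = split_syllable(syllable)
        if scheme == "tone":
            units.append(tone)            # tones only, no segmental information
        elif scheme == "tonal_final":
            if initial:
                units.append(initial)     # assumption: the initial is a separate unit
            units.append(final + tone)    # the final carries the tone
        elif scheme == "tonal_syllable":
            units.append(syllable)        # the whole syllable carries the tone
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return units


def ctc_greedy_collapse(frame_labels, blank="_"):
    """Standard CTC collapse: merge repeated labels, then drop blanks."""
    output, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output
```

For example, `to_units(["zhong1", "guo2"], "tonal_final")` yields `["zh", "ong1", "g", "uo2"]`, while the `"tone"` scheme yields only `["1", "2"]`. The tone-only targets discard the segmental context that the tonal-final and tonal-syllable schemes keep.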