AS SD, Jul 28, 2020

Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

arXiv:2007.14351v1, 4.35 citations

Originality: Incremental advance

AI Analysis

This work addresses a specific problem in ASR for tone languages, offering incremental improvements in modeling synchronization.

The paper investigates whether phones and tones should be modeled synchronously or asynchronously in acoustic models for automatic speech recognition. It finds that synchronous models reduce the joint phone+tone error rate, while asynchronous training lowers the tone error rate; both kinds of model are effective in multilingual and cross-lingual settings.

Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but asynchronous training results in a lower tone error rate.
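The abstract reports error rates on three tiers: phones, tones, and the joint phone+tone tier. The sketch below illustrates how one hypothesis can score differently on each tier. It is a toy example and not the paper's scoring code. The label inventory ("H" and "L" tones, `None` for toneless phones), the function names, and the use of plain Levenshtein distance normalized by reference length are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def split_tiers(joint):
    """Split joint (phone, tone) labels into a phone tier and a tone tier.

    Toneless phones (tone is None) contribute nothing to the tone tier.
    """
    phones = [p for p, _ in joint]
    tones = [t for _, t in joint if t is not None]
    return phones, tones


# Hypothetical toy data: each joint label is a (phone, tone) pair.
ref = [("m", None), ("a", "H"), ("m", None), ("a", "L")]
hyp = [("m", None), ("a", "L"), ("m", None), ("a", "L")]

ref_phones, ref_tones = split_tiers(ref)
hyp_phones, hyp_tones = split_tiers(hyp)

print("joint error rate:", error_rate(ref, hyp))
print("phone error rate:", error_rate(ref_phones, hyp_phones))
print("tone error rate: ", error_rate(ref_tones, hyp_tones))
```

In this toy case a single tone substitution leaves the phone tier untouched but counts as one error in four on the joint tier and one in two on the tone tier, which is why the tiers can rank models differently.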
