CL · Jul 25, 2024

Scaling A Simple Approach to Zero-Shot Speech Recognition

Jinming Zhao, Vineel Pratap, Michael Auli

arXiv:2407.17852v1 · 10 citations · h-index: 52

Originality: Highly original

AI Analysis

This work addresses the challenge of expanding speech recognition to low-resource languages, offering a significant improvement over existing zero-shot methods.

The paper tackles zero-shot speech recognition for languages without labeled data. It introduces MMS Zero-shot, which combines romanization with an acoustic model trained on 1,078 languages and reduces the average character error rate by a relative 46% over 100 unseen languages compared to the best previous work.

Despite rapid progress in increasing the language coverage of automatic speech recognition, the field is still far from covering all languages with a known writing script. Recent work showed promising results with a zero-shot approach requiring only a small amount of text data; however, accuracy depends heavily on the quality of the phonemizer, which is often weak for unseen languages. In this paper, we present MMS Zero-shot, a conceptually simpler approach based on romanization and an acoustic model trained on data in 1,078 different languages, or three orders of magnitude more than prior art. MMS Zero-shot reduces the average character error rate by a relative 46% over 100 unseen languages compared to the best previous work. Moreover, the error rate of our approach is only 2.5x higher than that of in-domain supervised baselines, even though our approach uses no labeled data for the evaluation languages at all.
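
The abstract reports results as character error rate (CER) and as a relative reduction in CER. As a rough illustration of those two quantities, here is a minimal sketch assuming the common definition of CER as character-level edit distance divided by reference length. The function names and the example values are hypothetical and are not taken from the paper, whose exact scoring and text normalization may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction, e.g. 0.46 corresponds to 'a relative 46%'."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Hypothetical strings and error rates, for illustration only.
    print(cer("salama dunia", "salana dunia"))
    print(relative_reduction(0.50, 0.27))
```

Under this definition, a "relative 46%" reduction means the new error rate is 54% of the baseline error rate, not that it dropped by 46 absolute points.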
