The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR
This work addresses limited speech data and script differences in multilingual ASR for linguistically diverse regions such as India, and it represents an incremental improvement over existing CLS methods.
The paper tackles the challenge of building multilingual ASR systems in linguistically diverse regions like India by exploiting phonetic similarities among languages through a Common Label Set (CLS). It shows that infusing language-specific information, via a Language ID or a CLS-to-native-script conversion, significantly reduces Word Error Rate (WER) compared to a CLS baseline.
Building a multilingual Automatic Speech Recognition (ASR) system in a linguistically diverse country like India is challenging because of differences in scripts and the limited availability of speech data. This problem can be addressed by exploiting the fact that many of these languages are phonetically similar: their text can be converted into a Common Label Set (CLS) by mapping similar sounds to common labels. In this paper, new approaches are explored and compared to improve the performance of a CLS-based multilingual ASR model. Language-specific information is infused into the ASR model either by providing a Language ID or by using a CLS-to-native-script converter on top of the CLS multilingual model. These methods give a significant reduction in Word Error Rate (WER) compared to the CLS baseline. They are further evaluated on out-of-distribution data to check their robustness.
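To make the idea of a Common Label Set concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy, hand-written character table for two languages (the table `CLS_TABLES`, the tag format `<hi>`, and the function names are all hypothetical), maps native-script text to shared labels, prepends a language ID token, and inverts the mapping to recover native script. The actual CLS inventory, the way the Language ID is supplied to the model, and the design of the CLS-to-native-script converter are not specified here and may differ substantially.

```python
from typing import Dict, List

# Toy, hypothetical tables: a few Devanagari (Hindi) and Tamil characters
# that sound alike are mapped to the same common label.
CLS_TABLES: Dict[str, Dict[str, str]] = {
    "hi": {"\u0915": "k", "\u0932": "l", "\u092e": "m", "\u093e": "aa"},
    "ta": {"\u0b95": "k", "\u0bb2": "l", "\u0bae": "m", "\u0bbe": "aa"},
}


def to_cls(text: str, lang: str) -> List[str]:
    """Map native-script characters to common labels (raises KeyError if unmapped)."""
    table = CLS_TABLES[lang]
    return [table[ch] for ch in text]


def tag_with_language(labels: List[str], lang: str) -> List[str]:
    """Prepend a language ID token; the '<lang>' format is an assumption."""
    return [f"<{lang}>"] + labels


def cls_to_native(labels: List[str], lang: str) -> str:
    """Invert the toy table to recover native script from common labels."""
    inverse = {label: ch for ch, label in CLS_TABLES[lang].items()}
    return "".join(inverse[label] for label in labels)


# The same sound sequence written in two scripts (k, aa, l, aa).
HI_WORD = "\u0915\u093e\u0932\u093e"  # Devanagari
TA_WORD = "\u0b95\u0bbe\u0bb2\u0bbe"  # Tamil

hi_labels = to_cls(HI_WORD, "hi")
ta_labels = to_cls(TA_WORD, "ta")
tagged = tag_with_language(hi_labels, "hi")
```

In this sketch both words collapse to the same label sequence, which is what lets a single model share acoustic evidence across languages. The language tag then tells a downstream converter which inverse table to apply, so `cls_to_native(ta_labels, "ta")` returns the Tamil spelling while the same labels with `"hi"` return the Devanagari one.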