CL, SD, AS | Aug 20, 2023

Indonesian Automatic Speech Recognition with XLSR-53

arXiv:2308.11589v1 | 12 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of building competitive ASR systems for non-English languages like Indonesian with reduced data requirements, representing an incremental improvement over prior research.

This study tackled the problem of developing Indonesian Automatic Speech Recognition (ASR) with limited training data by using the XLSR-53 pre-trained model, achieving a Word Error Rate (WER) of 20% on a test set, which could be reduced to 12% with a language model.

This study develops Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model, where XLSR stands for cross-lingual speech representations. Using this pre-trained model significantly reduces the amount of non-English training data required to achieve a competitive Word Error Rate (WER). The total amount of data used in this study is 24 hours, 18 minutes, and 1 second: (1) TITML-IDN, 14 hours and 31 minutes; (2) Magic Data, 3 hours and 33 minutes; and (3) Common Voice, 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model built in this study is competitive with similar models evaluated on the test split of the Common Voice dataset. Adding a language model lowers the WER by around 8 percentage points, from 20% to 12%. The results thus improve on previous research and contribute to a better Indonesian ASR built with a smaller amount of data.
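Since the abstract reports its results as WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function names and the sample sentences below are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline (text normalization, tooling) is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative change between two error rates (hypothetical helper)."""
    return (before - after) / before


# Illustrative sentences (assumed, not from the paper's test set):
# one substituted word out of five reference words.
wer = word_error_rate("saya pergi ke pasar sekarang", "saya pergi ke pasar besok")

# The abstract's figures: 20% without and 12% with a language model.
absolute_drop = 0.20 - 0.12
relative_drop = relative_reduction(0.20, 0.12)
```

Under this definition, the drop from 20% to 12% is 8 percentage points in absolute terms, which corresponds to a 40% relative reduction in errors.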

