BERT-LID: Leveraging BERT to Improve Spoken Language Identification
This work addresses a bottleneck in multilingual speech systems by improving accuracy on short-duration speech, though the contribution is incremental: it adapts an existing model (BERT) to a specific domain.
The paper tackles poor spoken language identification performance on short utterances (<=1s) by proposing BERT-LID, a BERT-based system that improves baseline accuracy by about 6.5% on long segments and about 19.9% on short segments.
Language identification is the task of automatically determining the language of a spoken segment. It has a profound impact on the multilingual interoperability of an intelligent speech system. Although language identification attains high accuracy on medium or long utterances (>3s), performance on short utterances (<=1s) is still far from satisfactory. We propose a BERT-based language identification system (BERT-LID) to improve language identification performance, especially on short-duration speech segments. We extend the original BERT model to take as input the phonetic posteriorgrams (PPGs) derived from a front-end phone recognizer. We then apply the best-performing deep classifier on top of it for language identification. Our BERT-LID model improves the baseline accuracy by about 6.5% on long-segment identification and 19.9% on short-segment identification, demonstrating the effectiveness of BERT-LID for language identification.
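To make the data flow concrete, the following is a minimal sketch of how per-frame phone posteriors could be turned into encoder inputs and then into language scores. It is not the paper's implementation. The assumptions are: each PPG frame is a softmax over phone logits, a linear projection stands in for BERT's token-embedding lookup, the Transformer encoder layers are omitted entirely, and mean pooling plus a single linear layer stands in for the deep classifier. All names (make_ppg, project_ppg, classify) and all sizes are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_ppg(frame_logits):
    """Turn per-frame phone logits into a PPG: one posterior distribution per frame."""
    return [softmax(frame) for frame in frame_logits]


def project_ppg(ppg, weight):
    """Map each PPG frame (num_phones values) to a hidden vector (hidden_size values).

    Assumption: this linear map replaces the token-embedding lookup of a
    text BERT, since the input is a distribution rather than a token id.
    """
    hidden_size = len(weight[0])
    return [
        [sum(p * weight[k][j] for k, p in enumerate(frame)) for j in range(hidden_size)]
        for frame in ppg
    ]


def mean_pool(vectors):
    """Average the frame vectors into one utterance-level vector."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


def classify(pooled, class_weight):
    """Linear layer plus softmax: a stand-in for the deep classifier."""
    scores = [sum(x * w for x, w in zip(pooled, row)) for row in class_weight]
    return softmax(scores)


# Toy sizes, chosen only for illustration.
NUM_PHONES, HIDDEN, NUM_LANGS, NUM_FRAMES = 5, 4, 3, 10
rng = random.Random(0)

# Random numbers stand in for the phone recognizer output and trained weights.
frame_logits = [[rng.gauss(0, 1) for _ in range(NUM_PHONES)] for _ in range(NUM_FRAMES)]
embed_weight = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_PHONES)]
class_weight = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_LANGS)]

ppg = make_ppg(frame_logits)                # NUM_FRAMES x NUM_PHONES
embedded = project_ppg(ppg, embed_weight)   # NUM_FRAMES x HIDDEN
# A real system would run the Transformer encoder over `embedded` here.
lang_probs = classify(mean_pool(embedded), class_weight)
print(lang_probs)
```

The printed list has one probability per candidate language. In a trained system, the index of the largest value would be the predicted language.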