CL SD AS · Nov 15, 2023

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma

arXiv:2311.08966v12.18 citations · h-index: 13

Originality: Incremental advance

AI Analysis

The paper addresses rare word recognition in streaming ASR for practical applications and represents an incremental advance in deep biasing methods.

The paper tackles large-scale deep biasing in streaming Transducer-based ASR, where performance drops when the bias list contains many distractors or words with similar grapheme sequences. It combines phoneme and textual information for rare words and adds training on text-only data, and it reports state-of-the-art rare word error rates on the LibriSpeech corpus.

Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities, which is essential in practical applications, especially for streaming Automatic Speech Recognition (ASR). However, deep biasing with large-scale rare words remains challenging, as the performance drops significantly when more distractors exist and there are words with similar grapheme sequences in the bias list. In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling. Moreover, the introduction of training with text-only data containing more rare words benefits large-scale deep biasing. The experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art performance on rare word error rate for different scales and levels of bias lists.
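The abstract's claim that phoneme and textual information together help separate words with similar pronunciation or spelling can be illustrated with a small sketch. This is not the paper's model: it only compares edit distances over graphemes and over phonemes for a hypothetical four-word bias list. The `BIAS_LIST` entries, the hand-written ARPAbet-style phoneme sequences (stress markers omitted), and the `compare` helper are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical bias list: word -> hand-written ARPAbet-style phonemes.
# Illustrative only; a real system would use a lexicon or a G2P model.
BIAS_LIST = {
    "though": ["DH", "OW"],
    "tough": ["T", "AH", "F"],
    "knight": ["N", "AY", "T"],
    "night": ["N", "AY", "T"],
}


def compare(w1, w2):
    """Return grapheme-level and phoneme-level edit distances for two words."""
    return {
        "grapheme": edit_distance(w1, w2),
        "phoneme": edit_distance(BIAS_LIST[w1], BIAS_LIST[w2]),
    }


if __name__ == "__main__":
    for pair in [("though", "tough"), ("knight", "night")]:
        print(pair, compare(*pair))
```

In this toy list, "though" and "tough" differ by one letter but share no phonemes, so the phoneme view separates them easily, while "knight" and "night" are pronounced identically and only the spelling tells them apart. That complementarity is the intuition behind using both feature types; how the paper actually encodes and fuses them is not shown here.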
