Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
This work addresses out-of-vocabulary word recognition in automatic speech recognition for users dealing with domain-specific words or words whose spelling does not match their pronunciation. The contribution is incremental, as it builds on existing context biasing methods.
The paper tackles the failure of automatic speech recognition systems to recognize words with a pronunciation-orthography mismatch, such as named entities. It proposes a method that lets users add corrections on the fly during inference, yielding up to 8% relative improvement in biased word error rate while maintaining a competitive overall word error rate.
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principle open-vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method that allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that this method yields a relative improvement in biased word error rate of up to 8%, while maintaining a competitive overall word error rate.
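To make the reported metric concrete, the following is a minimal sketch of how a word error rate and a relative improvement between two systems can be computed. It is not the paper's evaluation code: the function names `word_error_rate` and `relative_improvement`, the example sentences, and the numbers are all hypothetical, and the biased word error rate used in the paper restricts scoring to the words in the biasing list, which this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Hypothetical example: one substitution error on a named entity.
wer = word_error_rate("call doctor nguyen now", "call doctor win now")

# Hypothetical error rates chosen only to illustrate an 8% relative gain.
gain = relative_improvement(0.25, 0.23)
```

Note that a relative improvement of 8% means the error rate is multiplied by 0.92, not that it drops by 8 percentage points.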