Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization
This paper addresses the problem of recognizing users' personal content in speech recognition systems, though it builds incrementally on previous work.
The paper tackles the challenge of recognizing personal content, such as contact names, in end-to-end speech recognition by proposing a novel method for generating pronunciation-driven subword tokenizations for personal entities. The results show that combining this technique with contextual biasing and wordpiece prior normalization achieves personal named entity accuracy comparable to that of a competitive hybrid system.
Recent advances in deep learning and automatic speech recognition have improved the accuracy of end-to-end speech recognition systems, but recognition of personal content such as contact names remains a challenge. In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. Building on previous work, we present a novel method for generating additional subword tokenizations for personal entities from their pronunciations. We show that using this technique in combination with two established techniques, contextual biasing and wordpiece prior normalization, we are able to achieve personal named entity accuracy on par with a competitive hybrid system.
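To make the idea of wordpiece prior normalization more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the formula, so the sketch assumes the common form in which a scaled log prior is subtracted from each wordpiece's log posterior, so that rare wordpieces (typical of names) are penalized less. The function name `normalize_by_prior`, the weight `alpha`, and the toy probabilities are all hypothetical.

```python
import math


def normalize_by_prior(log_posteriors, log_priors, alpha=1.0):
    """Subtract scaled log priors from per-frame wordpiece log posteriors.

    log_posteriors: list of frames, each a list of per-wordpiece log probabilities.
    log_priors: per-wordpiece log prior probabilities (assumed to be estimated
        elsewhere, e.g. from training data; the source does not specify how).
    alpha: hypothetical scaling weight; alpha=0 leaves the scores unchanged.
    """
    return [
        [lp - alpha * prior for lp, prior in zip(frame, log_priors)]
        for frame in log_posteriors
    ]


# Toy 3-wordpiece vocabulary in which wordpiece 2 is rare (illustrative values only).
log_priors = [math.log(0.6), math.log(0.3), math.log(0.1)]
frame = [math.log(0.5), math.log(0.3), math.log(0.2)]

scores = normalize_by_prior([frame], log_priors, alpha=1.0)[0]

# Index of the highest-scoring wordpiece before and after normalization.
best_before = max(range(len(frame)), key=frame.__getitem__)
best_after = max(range(len(scores)), key=scores.__getitem__)
```

In this toy frame the most frequent wordpiece wins before normalization, while the rare wordpiece wins afterwards because its posterior is high relative to its prior. A real CTC system would also have to decide how to treat the blank symbol, which this sketch leaves out.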