Transformer-based Model for ASR N-Best Rescoring and Rewriting
This work addresses accuracy issues in voice assistants for complex domains; it is incremental in that it builds on existing ASR and Transformer methods.
The paper aims to improve on-device ASR for complex queries by proposing a Transformer model that rescores and rewrites by exploring the full context of the N-best hypotheses in parallel, achieving up to an average 8.6% relative WER reduction over the ASR system by itself.
Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer-based model capable of rescoring and rewriting by exploring the full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that works well for both the rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.
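For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. It is not the paper's evaluation code: the function names `word_errors`, `wer`, and `relative_wer_reduction` are hypothetical, whitespace tokenization is an assumption, and the numbers in the usage note are illustrative rather than results from the paper.

```python
from typing import Sequence


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions).

    Assumes simple whitespace tokenization; real evaluations usually apply
    text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration with made-up numbers, a baseline WER of 10.0% that drops to 9.14% after rescoring and rewriting corresponds to `relative_wer_reduction(0.10, 0.0914)`, which is approximately 0.086, that is, an 8.6% relative reduction.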