Hybrid Autoregressive Transducer (HAT)
This work addresses speech recognition accuracy for users of voice search systems, presenting an incremental improvement with novel training and inference simplifications.
The paper tackles the problem of improving automatic speech recognition by proposing the Hybrid Autoregressive Transducer (HAT) model, which preserves modularity and addresses exposure bias, resulting in significant improvements in word error rate (WER) on a large-scale voice search task compared to state-of-the-art approaches.
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model, which can be used to decide whether inference with an external language model is beneficial. The paper also presents a finite-context version of the HAT model that addresses the exposure bias problem and significantly simplifies overall training and inference. We evaluate the proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to state-of-the-art approaches.
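The abstract does not spell out how the output distribution is built. As a rough illustration of how a time-synchronous model can keep the label distribution separate from the blank (no-output) decision, the sketch below assumes a Bernoulli (sigmoid) blank probability and a softmax over labels scaled by the remaining mass. The function name, argument names, and example logits are hypothetical, and this is a toy sketch under those assumptions rather than the paper's implementation.

```python
import math


def hat_style_distribution(blank_logit, label_logits):
    """Return [P(blank), P(label_1), ..., P(label_V)] for one (time, label-history) step.

    Assumption: blank is scored by its own sigmoid instead of sharing a softmax
    with the labels, so the label softmax can be read as a separate distribution.
    """
    # Numerically stable sigmoid for the blank probability.
    if blank_logit >= 0:
        p_blank = 1.0 / (1.0 + math.exp(-blank_logit))
    else:
        e = math.exp(blank_logit)
        p_blank = e / (1.0 + e)

    # Max-shifted softmax over the non-blank labels.
    m = max(label_logits)
    exps = [math.exp(z - m) for z in label_logits]
    total = sum(exps)

    # Labels share the probability mass left over after the blank decision.
    label_probs = [(1.0 - p_blank) * e / total for e in exps]
    return [p_blank] + label_probs


# Hypothetical logits for a three-label vocabulary.
dist = hat_style_distribution(0.0, [1.0, 2.0, 3.0])
```

Because the blank probability and the label softmax are computed separately, the combined vector still sums to one, while the label softmax on its own stays available as a distribution over the next label.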