Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding
This work offers a substantial decoding speed-up for HAT-based automatic speech recognition (ASR) systems, an incremental improvement over existing methods.
This paper proposes an internal acoustic model (IAM) training strategy to improve Hybrid Autoregressive Transducer (HAT)-based speech recognition. The joint training of IAM and HAT leads to statistically significant relative error reductions and, when combined with dual blank thresholding, achieves a 42-75% decoding speed-up with no major performance degradation.
A hybrid autoregressive transducer (HAT) is a variant of the neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. The IAM consists of encoder and joint networks, which are fully shared and jointly trained with the HAT. This joint training not only enhances HAT training efficiency but also encourages the IAM and HAT to emit blanks synchronously. Because a frame judged to be blank lets the decoder skip the more expensive non-blank computation, this synchrony makes blank thresholding more effective and decoding faster. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines HAT- and IAM-blank thresholding, together with a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
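To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard HAT factorisation (a sigmoid over a dedicated blank logit, and a softmax over the label logits scaled by one minus the blank probability). The dual-thresholding step is a hypothetical illustration: the function names, the threshold values of 0.9, the order of the two checks, and the use of a sigmoid for the IAM blank score are all assumptions, since the abstract does not specify them.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hat_posterior(blank_logit, label_logits):
    """HAT-style factorisation: blank is a Bernoulli variable, and the
    label distribution is a softmax scaled by (1 - p_blank)."""
    p_blank = sigmoid(blank_logit)
    label_probs = [(1.0 - p_blank) * p for p in softmax(label_logits)]
    return p_blank, label_probs


def dual_blank_step(iam_blank_logit, hat_blank_logit, label_logits_fn,
                    iam_threshold=0.9, hat_threshold=0.9):
    """Hypothetical per-frame decision with two blank thresholds.

    label_logits_fn is a zero-argument callable standing in for the
    expensive non-blank computation; it is only called when neither
    blank score exceeds its threshold. Thresholds are illustrative.
    """
    # Assumed first check: the IAM blank score, which needs no label history.
    if sigmoid(iam_blank_logit) >= iam_threshold:
        return "skip_iam", None
    # Assumed second check: the HAT blank score.
    p_blank = sigmoid(hat_blank_logit)
    if p_blank >= hat_threshold:
        return "skip_hat", None
    # Neither threshold fired: pay for the full label distribution.
    label_probs = [(1.0 - p_blank) * p for p in softmax(label_logits_fn())]
    return "expand", (p_blank, label_probs)
```

In this sketch the speed-up comes from frames where either blank score clears its threshold, because `label_logits_fn` is never evaluated for them. The more often the IAM and HAT agree on blank frames, the more frames take one of the two early exits.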