Automatic context window composition for distant speech recognition
This work addresses the challenge of improving speech recognition accuracy in noisy, reverberant settings, which is crucial for applications such as voice assistants and automated transcription. The contribution is incremental, as it builds on existing asymmetric context methods.
The paper tackles the problem of optimizing asymmetric context windows for distant speech recognition in reverberant conditions. It proposes a gradient analysis-based method for automatic context window composition, which results in more effective DNN training and improved recognition performance across various acoustic environments and tasks.
Distant speech recognition is being revolutionized by deep learning, which has significantly outperformed previous HMM-GMM systems. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. In this regard, asymmetric context windows that embed more past than future frames have recently been used with feed-forward neural networks. This context configuration turns out to be useful not only for low-latency speech recognition, but also for boosting recognition performance under reverberant conditions. This paper investigates the mechanisms occurring inside DNNs that lead to an effective application of asymmetric contexts. In particular, we propose a novel method for automatic context window composition based on a gradient analysis. The experiments, performed with different acoustic environments, features, DNN architectures, microphone settings, and recognition tasks, show that our simple and efficient strategy leads to a less redundant frame configuration, which makes DNN training more effective in reverberant scenarios.
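
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the implementation used in the paper. The function names `splice_frames` and `rank_context_frames` are hypothetical, as are the edge-padding choice and the use of mean absolute gradient as the ranking score. The sketch also assumes that the gradients of the training cost with respect to the spliced DNN input have already been obtained by backpropagation; it only shows how an asymmetric window can be built and how a per-frame gradient summary could be read off.

```python
import numpy as np


def splice_frames(features, n_past, n_future):
    """Stack each frame with n_past preceding and n_future following frames.

    features: array of shape (n_frames, n_dims).
    Returns an array of shape (n_frames, (n_past + n_future + 1) * n_dims).
    An asymmetric window simply uses n_past > n_future.
    """
    n_frames = features.shape[0]
    # Assumption: repeat the first/last frame so edge positions get a full window.
    padded = np.pad(features, ((n_past, n_future), (0, 0)), mode="edge")
    # Slice number k holds, for every frame t, the frame at relative offset k - n_past.
    windows = [padded[k:k + n_frames] for k in range(n_past + n_future + 1)]
    return np.concatenate(windows, axis=1)


def rank_context_frames(input_gradients, n_past, n_future):
    """Rank context positions by mean absolute gradient (hypothetical criterion).

    input_gradients: array of shape (n_examples, window * n_dims), laid out
    like the output of splice_frames.
    Returns (offsets, scores), sorted from most to least influential position.
    """
    window = n_past + n_future + 1
    n_dims = input_gradients.shape[1] // window
    # Average the gradient magnitude over examples and feature dimensions.
    scores = np.abs(input_gradients).reshape(-1, window, n_dims).mean(axis=(0, 2))
    offsets = np.arange(-n_past, n_future + 1)
    order = np.argsort(scores)[::-1]
    return offsets[order], scores[order]
```

For example, `splice_frames(feats, n_past=7, n_future=3)` would embed seven past and three future frames around each current frame. The ranking returned by `rank_context_frames` could then suggest which relative positions carry little gradient and are candidates for removal, although how the paper turns such an analysis into a final window is not specified in this abstract.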