Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model
This work addresses computational bottlenecks in speech recognition systems, offering an incremental improvement for faster training and decoding.
The paper addresses the poor fit between frame stacking and conventional context-dependent state acoustic models by proposing a frame retaining method applied at decoding time. Combined with frame stacking, it achieves an almost linear training speedup and a 41% relative reduction in real-time factor, with no degradation in recognition performance on a Mandarin voice search dataset.
Frame stacking is widely used in end-to-end neural network training such as connectionist temporal classification (CTC), where it leads to more accurate models and faster decoding. However, it is not well suited to conventional neural network acoustic models based on context-dependent states if the decoder is left unchanged. In this paper, we propose a novel frame retaining method that is applied during decoding. A system that combines frame retaining with frame stacking reduces the time consumption of both training and decoding. Long short-term memory (LSTM) recurrent neural networks (RNNs) using it achieve an almost linear training speedup and a 41% relative reduction in real-time factor (RTF). At the same time, recognition performance shows no degradation, or improves slightly, on the Shenma voice search dataset in Mandarin.
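The following is a minimal sketch of the two operations named in the abstract, not the authors' implementation. It assumes that frame stacking concatenates `stack` consecutive feature frames into one super-frame and advances by `stack` frames, and that frame retaining reuses each network output for the original frames it covers, so that an unchanged decoder still sees one score vector per input frame. The abstract does not specify these details; the function names `stack_frames` and `retain_frames`, the padding rule, and the default `stack=3` are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def stack_frames(frames: Sequence[Frame], stack: int = 3) -> List[Frame]:
    """Concatenate every `stack` consecutive frames into one super-frame.

    Assumption: the last frame is repeated as padding when the number of
    frames is not a multiple of `stack`.
    """
    if stack < 1:
        raise ValueError("stack must be a positive integer")
    if not frames:
        return []
    padded = list(frames)
    while len(padded) % stack:
        padded.append(padded[-1])
    stacked = []
    for start in range(0, len(padded), stack):
        super_frame: Frame = []
        for frame in padded[start:start + stack]:
            super_frame.extend(frame)
        stacked.append(super_frame)
    return stacked


def retain_frames(outputs: Sequence[Frame], stack: int, num_frames: int) -> List[Frame]:
    """Repeat each low-rate network output `stack` times (assumed retaining rule).

    The result is trimmed to `num_frames` so it lines up with the original
    frame sequence that the decoder expects.
    """
    retained = []
    for output in outputs:
        retained.extend([output] * stack)
    return retained[:num_frames]


# Hypothetical usage: 7 frames of 2-dimensional features, stacked by 3.
features = [[float(2 * t), float(2 * t + 1)] for t in range(7)]
stacked = stack_frames(features, stack=3)
# Stand-in for acoustic model outputs: one vector per super-frame.
outputs = [[0.1 * i, 1.0 - 0.1 * i] for i in range(len(stacked))]
retained = retain_frames(outputs, stack=3, num_frames=len(features))
```

Under these assumptions the network is evaluated once per super-frame, here 3 times instead of 7, which is where the training and decoding savings would come from, while the decoder still receives 7 score vectors.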