Supervised Initialization of LSTM Networks for Fundamental Frequency Detection in Noisy Speech Signals
This work addresses a domain-specific problem in speech processing for noisy signals, offering an incremental improvement over existing methods.
The paper tackles the challenge of detecting fundamental frequency in noisy speech by proposing a supervised initialization method for LSTM networks based on an auto-associative network. According to objective measures, the method improves detection accuracy and training efficiency at various noise levels.
Fundamental frequency is one of the most important parameters of human speech, relevant to the classification of accent, gender, speaking style, speaker identity, and age, among others. Detecting this parameter reliably remains an important challenge for severely degraded signals. In previous work on detecting fundamental frequency in noisy speech with deep learning, networks such as Long Short-Term Memory (LSTM) have been initialized with random weights and then trained with the back-propagation through time algorithm. This work proposes a more efficient initialization based on supervised training of an auto-associative network. This initialization provides a better starting point for detecting fundamental frequency in noisy speech. Its advantages are evident in objective measures of both detection accuracy and network training, in the presence of additive white noise at different signal-to-noise ratios.
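The noise condition mentioned above, additive white noise at a chosen signal-to-noise ratio (SNR), can be illustrated with a minimal sketch. It is not taken from the paper: the function names `add_white_noise` and `measured_snr_db`, the synthetic 120 Hz tone standing in for voiced speech, the 16 kHz sampling rate, and the SNR values in the loop are all assumptions made for illustration. The sketch relies only on the standard definition SNR_dB = 10 log10(P_signal / P_noise).

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return `signal` plus zero-mean Gaussian white noise at the target SNR (dB).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)
    # => P_noise = P_signal / 10 ** (SNR_dB / 10)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Rescale the noise so its realized power matches the target.
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR (dB) of `noisy` relative to the known `clean` signal."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed stand-in for voiced speech: one second of a 120 Hz tone at 16 kHz.
fs = 16000
clean = [math.sin(2 * math.pi * 120 * t / fs) for t in range(fs)]

# Arbitrary example SNR values, not the levels used in the paper.
for snr in (10, 0, -5):
    noisy = add_white_noise(clean, snr, seed=0)
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

Because the noise is rescaled after it is drawn, the measured SNR matches the requested value up to floating-point error, which makes it straightforward to build test sets at fixed noise levels.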