End-to-end Models with auditory attention in Multi-channel Keyword Spotting
This work addresses multi-channel keyword spotting for voice-activated systems and reports improved robustness to noise.
The paper tackles multi-channel keyword spotting with an attention-based end-to-end model that outperforms a baseline model with signal pre-processing on both clean and noisy data; with transfer learning and multi-target spectral mapping, it achieves a 30% absolute improvement in wake-up rate at 0.1 false alarms per hour in noisy conditions.
In this paper, we propose an attention-based end-to-end model for multi-channel keyword spotting (KWS), which is trained to optimize the KWS result directly. As a result, our model outperforms the baseline model with signal pre-processing techniques on both the clean and noisy test data. We also find that multi-task learning yields better performance when the training and test data are similar, and that transfer learning and multi-target spectral mapping dramatically enhance robustness to noisy environments. At 0.1 false alarms (FA) per hour, the model with transfer learning and multi-target mapping gains an absolute 30% improvement in wake-up rate on the noisy data with an SNR of about -20 dB.
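To make the reported operating point concrete, the following is a minimal sketch of how a wake-up rate at a fixed false-alarm budget might be computed from per-utterance detection scores. It is not the evaluation code used in this work: the function name `wake_up_rate_at_fa`, the convention that a detection fires when the score exceeds the threshold, and the toy scores are all assumptions made for illustration.

```python
def wake_up_rate_at_fa(keyword_scores, negative_scores, negative_hours, fa_per_hour=0.1):
    """Return (wake-up rate, threshold) at a false-alarm budget.

    Hypothetical helper: a detection is assumed to fire when score > threshold.
    """
    # Number of false alarms tolerated over the whole non-keyword audio.
    # The small epsilon guards against floating-point results such as 2.9999999.
    allowed_fa = int(fa_per_hour * negative_hours + 1e-9)

    # Sort non-keyword scores from highest to lowest.
    ranked = sorted(negative_scores, reverse=True)

    # Using the (allowed_fa + 1)-th highest negative score as the threshold
    # lets at most `allowed_fa` negatives exceed it.
    if allowed_fa < len(ranked):
        threshold = ranked[allowed_fa]
    else:
        threshold = float("-inf")

    # Wake-up rate: fraction of keyword utterances whose score exceeds the threshold.
    detected = sum(1 for score in keyword_scores if score > threshold)
    return detected / len(keyword_scores), threshold


# Toy example (made-up scores): 10 hours of non-keyword audio at 0.1 FA/hour
# allows one false alarm in total.
rate, threshold = wake_up_rate_at_fa(
    keyword_scores=[0.9, 0.8, 0.4, 0.95],
    negative_scores=[0.1, 0.5, 0.7, 0.2],
    negative_hours=10,
)
print(rate, threshold)
```

Under this convention, an absolute improvement of 30% means the wake-up rate returned at the same false-alarm budget rises by 0.30, for example from 0.40 to 0.70.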