End-to-End Multi-Look Keyword Spotting
This work addresses unreliable keyword spotting for users in noisy, far-field environments and represents an incremental advance over existing methods.
The paper tackles the degradation of keyword spotting performance in far-field and noisy conditions by proposing an end-to-end multi-look neural network that integrates enhanced signals from multiple directions with attention, resulting in significant improvements over baseline and recent beamformer-based systems.
The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose a multi-look neural network model for speech enhancement that simultaneously steers toward multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model, which integrates the enhanced signals from the multiple look directions and uses an attention mechanism to dynamically weight the model toward the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves KWS performance over both the baseline KWS system and a recent beamformer-based multi-beam KWS system.
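To make the idea of attending over look directions concrete, the following is a minimal NumPy sketch of one common way such a fusion could work: additive attention that scores each look direction per frame and takes a weighted sum of the enhanced features. This is an illustration under stated assumptions, not the architecture from the paper. The function name `attend_over_looks`, the parameters `w` and `v`, and the tensor shapes are all hypothetical, and random arrays stand in for the enhanced features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attend_over_looks(look_feats, w, v):
    """Fuse per-direction features with additive attention (hypothetical sketch).

    look_feats: (K, T, D) enhanced features for K look directions,
                T frames, D feature dimensions.
    w:          (D, H) projection matrix (assumed, learned in practice).
    v:          (H,) scoring vector (assumed, learned in practice).
    Returns the fused features (T, D) and the attention weights (K, T).
    """
    hidden = np.tanh(look_feats @ w)        # (K, T, H)
    scores = hidden @ v                     # (K, T)
    weights = softmax(scores, axis=0)       # normalize across look directions
    fused = np.sum(weights[..., None] * look_feats, axis=0)  # (T, D)
    return fused, weights


# Toy usage with random stand-in data: 4 look directions, 50 frames, 40-dim features.
rng = np.random.default_rng(0)
K, T, D, H = 4, 50, 40, 16
look_feats = rng.standard_normal((K, T, D))
w = rng.standard_normal((D, H)) * 0.1
v = rng.standard_normal(H)
fused, weights = attend_over_looks(look_feats, w, v)
```

In this sketch the weights for each frame sum to one across the look directions, so a direction whose features score higher contributes more to the fused representation passed to the keyword detector. In a jointly trained system, `w` and `v` would be learned together with the enhancement and KWS components rather than drawn at random.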