Sequence-to-sequence Models for Small-Footprint Keyword Spotting
This work addresses efficient keyword spotting for real-world wake-up systems and reports an improvement over a recently proposed attention-based end-to-end model.
The paper proposes a sequence-to-sequence model for keyword spotting that simplifies the production pipeline while meeting high-accuracy, low-latency, and small-footprint requirements; with 73K parameters, it achieves a false rejection rate of ~3.05% at 0.1 false alarms per hour.
In this paper, we propose a sequence-to-sequence model for keyword spotting (KWS). Compared with other end-to-end architectures for KWS, our model simplifies the pipeline of a production-quality KWS system and satisfies the requirements of high accuracy, low latency, and small footprint. We also evaluate the performance of different encoder architectures, including LSTM and GRU. Experiments on real-world wake-up data show that our approach outperforms the recently proposed attention-based end-to-end model. Specifically, with 73K parameters, our sequence-to-sequence model achieves a $\sim$3.05\% false rejection rate (FRR) at 0.1 false alarms (FA) per hour.
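
To make the reported operating point concrete, the following is a minimal sketch of how an FRR at a fixed FA-per-hour budget might be computed from detector scores. It is not the evaluation code behind these results: the function name `frr_at_fa_per_hour`, the score lists, and the assumption that each negative score corresponds to one potential false trigger are all hypothetical and chosen only for illustration.

```python
import math


def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour):
    """Return (threshold, frr) at a false-alarm budget.

    Assumptions (illustrative, not from the paper):
      - pos_scores: one detector score per utterance that contains the keyword
      - neg_scores: one detector score per candidate trigger in keyword-free audio
      - neg_hours: total duration of the keyword-free audio, in hours
      - the detector fires when score > threshold
    """
    if not pos_scores:
        raise ValueError("pos_scores must not be empty")

    # Maximum number of false alarms permitted by the budget.
    # The small epsilon guards against floating-point products such as 2.9999999.
    allowed = math.floor(target_fa_per_hour * neg_hours + 1e-9)

    neg_sorted = sorted(neg_scores, reverse=True)
    if allowed >= len(neg_sorted):
        # Every negative may fire without exceeding the budget.
        threshold = float("-inf")
    else:
        # At most `allowed` negative scores are strictly above this value.
        threshold = neg_sorted[allowed]

    # A keyword utterance is falsely rejected when the detector does not fire.
    rejected = sum(1 for s in pos_scores if s <= threshold)
    return threshold, rejected / len(pos_scores)


if __name__ == "__main__":
    # Toy numbers, invented for the example.
    thr, frr = frr_at_fa_per_hour(
        pos_scores=[0.95, 0.85, 0.7, 0.99],
        neg_scores=[0.9, 0.8, 0.3, 0.2, 0.1],
        neg_hours=10.0,
        target_fa_per_hour=0.1,
    )
    print(f"threshold={thr}, FRR={frr:.2%}")
```

In this sketch the threshold is chosen as the lowest value that keeps the number of false alarms within the budget, so the FRR it returns is the best achievable at that operating point for the given scores.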