PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting
This work addresses privacy and personalization needs in voice assistants, but it is incremental as it builds on existing multi-task and lightweight network approaches.
The paper tackles the problem of personalized keyword spotting by introducing a multi-task learning framework that simultaneously performs keyword spotting and speaker verification, achieving better performance with fewer parameters and lower computational resources compared to baselines.
As advances in technologies such as the Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) drive wider use of intelligent voice assistants, the demand for privacy and personalization has grown. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We adopt a training criterion distinct from softmax-based loss that transforms multi-class classification into multiple binary classifications, eliminating inter-category competition, and we employ an optimization strategy for multi-task loss weighting during training. We evaluated our PCOV-KWS system on multiple datasets, demonstrating that it outperforms the baselines while requiring fewer parameters and less computation.
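The contrast between a softmax-based loss and a set of independent binary classifications can be illustrated with a minimal sketch. The abstract does not give the exact form of the training criterion or of the loss-weighting strategy, so everything below is an assumption: the per-class sigmoid binary cross-entropy stands in for the binary criterion, the fixed weights `w_kws` and `w_sv` are placeholders for whatever weighting the optimization strategy produces, and all function names are hypothetical.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def binary_terms(logits, target_index):
    """Per-class binary cross-entropy with an independent sigmoid per class.

    Assumed stand-in for the paper's criterion. Each term depends only on
    its own logit, so classes do not compete with one another.
    """
    terms = []
    for i, z in enumerate(logits):
        y = 1.0 if i == target_index else 0.0
        # Equals -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        terms.append(softplus(z) - y * z)
    return terms


def softmax_ce(logits, target_index):
    """Standard softmax cross-entropy, shown for comparison.

    The normalizer couples every class, so raising one logit changes
    the loss contribution of all the others.
    """
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_index]


def multitask_loss(kws_loss, sv_loss, w_kws=0.5, w_sv=0.5):
    # Hypothetical fixed weights; the paper's weighting strategy is not
    # described in the abstract and is not reproduced here.
    return w_kws * kws_loss + w_sv * sv_loss


# Changing a non-target logit leaves the target's binary term untouched,
# but it does change the softmax loss.
a = [2.0, -1.0, 0.5]
b = [2.0, 3.0, 0.5]
target_term_a = binary_terms(a, 0)[0]
target_term_b = binary_terms(b, 0)[0]
softmax_a = softmax_ce(a, 0)
softmax_b = softmax_ce(b, 0)
```

In this sketch, `target_term_a` and `target_term_b` are identical, while `softmax_b` is larger than `softmax_a`. That is the sense in which the binary formulation removes inter-category competition.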