SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting
This work addresses user-defined keyword spotting on edge devices and offers a domain-specific, incremental improvement.
The paper tackles user-defined keyword spotting on resource-constrained edge devices by exploiting the maximum keyword length as a constraint, and proposes SLiCK to improve efficiency and accuracy. On the LibriPhrase hard dataset, it increases AUC from 88.52 to 94.9 and reduces EER from 18.82 to 11.1.
User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, a property that has been largely under-leveraged in prior work. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: a Matcher (utterance-level matching task and the novel subsequence-level matching task) and an Encoder (phoneme recognition task). The proposed method improves on the baseline results on the LibriPhrase hard dataset, increasing AUC from $88.52$ to $94.9$ and reducing EER from $18.82$ to $11.1$.
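For readers less familiar with the two reported metrics, the following is a minimal sketch of how AUC and EER can be computed from audio-text matching scores. It is not the paper's evaluation code: the function names (`roc_curve_points`, `auc`, `eer`), the toy scores, and the use of linear interpolation for EER are illustrative assumptions. Labels are assumed to be 1 for a matching audio-keyword pair and 0 otherwise.

```python
from typing import List, Tuple


def roc_curve_points(
    scores: List[float], labels: List[int]
) -> Tuple[List[float], List[float], List[float]]:
    """Sweep a threshold from high to low and return (FPR, TPR, FNR) lists.

    Hypothetical helper: labels are 1 (keyword matches audio) or 0 (no match).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr, fnr = [0.0], [0.0], [1.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Pairs with identical scores are accepted together at one threshold.
        s = scores[order[i]]
        while i < len(order) and scores[order[i]] == s:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
        fnr.append((pos - tp) / pos)
    return fpr, tpr, fnr


def auc(fpr: List[float], tpr: List[float]) -> float:
    """Area under the ROC curve via the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


def eer(fpr: List[float], fnr: List[float]) -> float:
    """Equal error rate: the error level at which FPR equals FNR.

    Assumption: linear interpolation between adjacent ROC points.
    """
    for k in range(1, len(fpr)):
        d0 = fpr[k - 1] - fnr[k - 1]  # negative until the curves cross
        d1 = fpr[k] - fnr[k]
        if d1 >= 0:
            t = -d0 / (d1 - d0)
            return fpr[k - 1] + t * (fpr[k] - fpr[k - 1])
    return 1.0  # not reached: the sweep always ends at FPR = 1, FNR = 0


# Toy example with made-up scores (not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_fpr, toy_tpr, toy_fnr = roc_curve_points(toy_scores, toy_labels)
print(f"AUC = {auc(toy_fpr, toy_tpr):.4f}, EER = {eer(toy_fpr, toy_fnr):.4f}")
```

Both functions return fractions in the range 0 to 1. The figures quoted in the abstract appear to be on a 0 to 100 scale, so multiplying by 100 would put the sketch's output on the same footing. A higher AUC and a lower EER both indicate better separation of matching and non-matching pairs.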