Open-vocabulary Keyword-spotting with Adaptive Instance Normalization
This paper addresses the problem of detecting user-defined keywords in speech for ASR applications, showing incremental improvements from a novel adaptation of adaptive instance normalization.
The paper tackles open-vocabulary keyword spotting by proposing AdaKWS, a method in which a text encoder outputs keyword-conditioned normalization parameters that are used to process the audio input, yielding significant improvements over baselines on multi-lingual benchmarks and on low-resource languages.
Open-vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and the keyword into a joint embedding space to obtain an affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters. These parameters are used to process the auditory input. We provide an extensive evaluation on challenging and diverse multi-lingual benchmarks and show significant improvements over recent keyword spotting and ASR baselines. Furthermore, we study the effectiveness of our approach on low-resource languages that were unseen during training. The results demonstrate a substantial performance improvement over baseline methods.
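To make the idea of keyword-conditioned normalization concrete, the following is a minimal sketch of generic adaptive instance normalization, not the AdaKWS implementation. It assumes audio features arranged as channels over time, normalizes each channel, and then applies a per-channel scale (gamma) and shift (beta) derived from the keyword. The abstract does not specify the architecture, so the function names, the feature layout, and in particular `toy_text_encoder` (a deterministic stand-in for a trained text encoder) are hypothetical.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel to zero mean and (near) unit variance over time.

    features: list of channels, each a list of floats over time (assumed layout).
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        normalized.append([(v - mean) / std for v in channel])
    return normalized


def toy_text_encoder(keyword, num_channels):
    """Hypothetical stand-in for a trained text encoder.

    Maps a keyword to per-channel (gamma, beta). A real system would learn
    this mapping; here it is a fixed function of the character codes.
    """
    if not keyword:
        raise ValueError("keyword must be non-empty")
    codes = [ord(c) for c in keyword]
    gamma = [1.0 + 0.01 * (codes[i % len(codes)] % 7) for i in range(num_channels)]
    beta = [0.1 * (codes[(i + 1) % len(codes)] % 5) for i in range(num_channels)]
    return gamma, beta


def adaptive_instance_norm(features, gamma, beta):
    """Apply keyword-conditioned scale and shift to instance-normalized features."""
    normalized = instance_norm(features)
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(normalized, gamma, beta)
    ]


# Usage: two feature channels over four time steps, conditioned on a keyword.
audio_features = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, 0.0]]
gamma, beta = toy_text_encoder("hello", num_channels=2)
conditioned = adaptive_instance_norm(audio_features, gamma, beta)
```

After this operation, each channel's mean equals the keyword-derived shift and its standard deviation is approximately the keyword-derived scale, so the same audio is processed differently for each keyword. How such conditioned features would be turned into a detection score is not covered by this sketch.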