ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
This work aims to improve keyword spotting accuracy for users with varying pronunciation traits, though it appears incremental: it integrates prosody modeling into existing phoneme-based approaches.
The paper tackles keyword spotting by modeling user-specific pronunciation traits such as prosody, which current systems often ignore. The resulting framework achieves performance comparable to state-of-the-art models on standard benchmarks and shows strong robustness for personalized keywords.
Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder where one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.
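The abstract does not give the fusion rule or the contrastive objective, so the following is only a minimal sketch of how the two pieces could look. It assumes a sigmoid-gated convex combination of the phoneme and prosody embeddings and an InfoNCE-style loss over cosine similarities. The function names (`gated_fusion`, `info_nce`), the gate parameters, and the temperature value are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce(anchor: Sequence[float],
             positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """InfoNCE-style contrastive loss (assumed objective, not the paper's).

    Returns -log softmax of the positive pair's similarity against the
    similarities of all negative pairs.
    """
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def gated_fusion(phoneme: Sequence[float],
                 prosody: Sequence[float],
                 gate_weights: Sequence[Sequence[float]],
                 gate_bias: Sequence[float]) -> List[float]:
    """Assumed fusion rule: per-dimension gate over the two streams.

    g = sigmoid(W [phoneme; prosody] + b)
    fused = g * phoneme + (1 - g) * prosody
    """
    if len(phoneme) != len(prosody):
        raise ValueError("both streams must have the same dimension")
    concat = list(phoneme) + list(prosody)
    fused = []
    for i, (p, q) in enumerate(zip(phoneme, prosody)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = 1.0 / (1.0 + math.exp(-logit))
        fused.append(gate * p + (1.0 - gate) * q)
    return fused
```

With zero gate weights and biases, the gate is 0.5 in every dimension and the fused vector is the plain average of the two streams. A learned gate would instead shift weight toward whichever stream is more informative for the current input, which is one plausible reading of "dynamically combines".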