ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

arXiv:2603.18024
h-index: 2

AI Analysis

This work aims to improve keyword spotting accuracy for users with varying pronunciation traits. The contribution appears incremental: it adds prosody modeling to existing phoneme-based approaches.

The paper addresses user-specific pronunciation traits such as prosody, which current keyword spotting systems often ignore. The resulting framework performs comparably to state-of-the-art models on standard benchmarks and shows strong robustness for personalized keywords.

Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder where one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.
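
The abstract does not say how the collaborative fusion module weights the two streams. As a rough illustration only, the sketch below uses a generic scalar gated fusion, which is one common way to "dynamically combine" two embeddings; it is not the ProKWS method. The function name `gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the toy vectors are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(phoneme_vec, prosody_vec, gate_weights, gate_bias=0.0):
    """Blend two equal-length embeddings with one learned scalar gate.

    Assumption: a generic gated fusion, not the module from the paper.
    The gate is computed from the concatenation of both embeddings, so
    the mixing ratio depends on the input rather than being fixed.
    """
    if len(phoneme_vec) != len(prosody_vec):
        raise ValueError("both embeddings must have the same length")
    concat = list(phoneme_vec) + list(prosody_vec)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")

    # Linear score over the concatenated features, squashed to (0, 1).
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)

    # gate near 1 favors the phoneme stream; near 0 favors the prosody stream.
    fused = [gate * p + (1.0 - gate) * s
             for p, s in zip(phoneme_vec, prosody_vec)]
    return gate, fused


# Toy inputs: with all-zero gate weights the gate is exactly 0.5,
# so the fused vector is the element-wise average of the two streams.
gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

In a trained system the gate parameters would be learned jointly with both encoders, and the gate would more likely be a vector or an attention map than a single scalar; the scalar form is used here only to keep the example short.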
