CL · SD · AS · Jul 6, 2023

On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

arXiv:2307.02720v1 · 8 citations · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses on-device keyword spotting for voice assistants such as Alexa. It is an incremental advance because it builds on existing knowledge distillation and self-supervised learning methods.

The paper tackled the challenge of applying large self-supervised models for keyword spotting under on-device budget constraints and biased datasets by proposing a knowledge distillation-based architecture, achieving exceptional performance in normal and noisy conditions on a 16.6k-hour dataset.

Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
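
The abstract names dual-view cross-correlation distillation as a learning objective but does not define it. The following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a Barlow Twins-style cross-correlation loss applied twice: once across feature dimensions and once across frames, with the latter obtained by transposing the inputs. The function names, the `off_diag_weight` value, and the assumption that student features are already projected to the teacher's dimensionality are all hypothetical. The codebook objective mentioned in the abstract is not covered.

```python
import numpy as np


def standardize(x, eps=1e-6):
    """Scale each column to zero mean and (approximately) unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def cross_correlation_loss(student, teacher, off_diag_weight=5e-3):
    """Barlow Twins-style loss between two (n, d) matrices of equal shape.

    The diagonal of the cross-correlation matrix is pushed toward 1
    (student matches teacher) and the off-diagonal toward 0 (redundancy
    reduction). off_diag_weight is an illustrative value, not from the paper.
    """
    n = student.shape[0]
    c = standardize(student).T @ standardize(teacher) / n  # shape (d, d)
    diag = np.diag(c)
    on_diag = ((diag - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (diag ** 2).sum()
    return float(on_diag + off_diag_weight * off_diag)


def dual_view_loss(student, teacher):
    """Assumed 'dual-view' objective: feature view plus frame (transposed) view."""
    feature_view = cross_correlation_loss(student, teacher)
    frame_view = cross_correlation_loss(student.T, teacher.T)
    return feature_view + frame_view


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feats = rng.normal(size=(64, 8))   # 64 frames, 8-dim features
    unrelated = rng.normal(size=(64, 8))       # student that ignores the teacher
    print("aligned student:  ", dual_view_loss(teacher_feats, teacher_feats))
    print("unrelated student:", dual_view_loss(unrelated, teacher_feats))
```

Under these assumptions, a student whose features match the teacher's yields a much lower loss than an unrelated student, which is the behaviour a distillation objective of this kind is meant to reward.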

