CL, SD, AS · Jan 25, 2020

Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling

arXiv:2001.09246v1 · 21 citations
AI Analysis

This work addresses keyword spotting for speech recognition systems. It offers an incremental improvement: reducing the dependency on LVCSR (large vocabulary continuous speech recognition) makes on-device adaptation easier.

The paper proposes a smoothed max pooling loss for keyword spotting. The loss jointly trains an encoder to detect keyword parts and a decoder to detect the whole keyword in a semi-supervised manner. The resulting system outperforms a baseline model, which the authors attribute to increased optimizability.

We propose smoothed max pooling loss and its application to keyword spotting systems. The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect whole keyword) in a semi-supervised manner. The proposed new loss function allows training a model to detect parts and whole of a keyword, without strictly depending on frame-level labeling from LVCSR (Large vocabulary continuous speech recognition), making further optimization possible. The proposed system outperforms the baseline keyword spotting model in [1] due to increased optimizability. Further, it can be more easily adapted for on-device learning applications due to reduced dependency on LVCSR.
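The abstract does not spell out how the loss is computed, so the following is only a rough sketch of one plausible reading: per-frame scores are smoothed over time, and the loss is taken at the best smoothed frame inside a coarse interval instead of at an exact LVCSR-aligned frame. The moving-average filter, the sigmoid, the interval arguments, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math

def smooth(scores, width=3):
    """Moving-average smoothing over time (window is truncated at the edges)."""
    half = width // 2
    out = []
    for t in range(len(scores)):
        window = scores[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

def smoothed_max_pool_loss(logits, start, end, width=3):
    """Negative log-probability at the best smoothed frame inside [start, end).

    logits: per-frame scores for one keyword part (hypothetical input).
    start, end: a coarse interval in which the part is expected to occur,
    so no exact frame-level alignment is needed (assumption of this sketch).
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid per frame
    smoothed = smooth(probs, width)
    best = max(smoothed[start:end])  # max pooling over time within the interval
    return -math.log(best)

# Example: scores peak around frame 3; the interval only says "frames 1..5".
loss = smoothed_max_pool_loss([-4, -3, 2, 3, 2, -3, -4], start=1, end=6)
```

Under these assumptions, smoothing before the max rewards a sustained run of high scores over a single spiky frame, and the loss falls as the model becomes confident anywhere inside the interval.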

