AS AI LG, Jan 12, 2021

Learning Efficient Representations for Keyword Spotting with Triplet Loss

arXiv:2101.04792v4, 71 citations
Originality: Incremental advance
AI Analysis

This work addresses keyword spotting in speech recognition by combining triplet loss-based embeddings with kNN classification, which substantially improves the reported accuracy. The advance is incremental because it adapts techniques already established in computer vision to speech.

The paper tackles keyword spotting by applying triplet loss-based metric embeddings, which are common in computer vision but rarely used in speech recognition, to improve classification accuracy. It reports improvements of 26% to 38% on the LibriWords datasets and of about 20% to over 50% on the Google Speech Commands datasets, reaching 98.55% accuracy on V1 10+2-class classification.

In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably, person re-identification. On the other hand, in the area of speech recognition the metric embeddings generated by the triplet loss are rarely used even for classification problems. We fill this gap by showing that a combination of two representation learning techniques, a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy for convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic similarity-based triplet mining approach. We also improve the current best published SOTA for Google Speech Commands dataset V1 10+2-class classification by about 34%, achieving 98.55% accuracy, V2 10+2-class classification by about 20%, achieving 98.37% accuracy, and V2 35-class classification by over 50%, achieving 97.0% accuracy.
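The two ingredients named in the abstract, a triplet loss over embeddings and kNN classification in the embedding space, can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the margin value, the plain majority-vote kNN, and the toy 2-D vectors standing in for learned embeddings are all assumptions made for illustration, and the paper's phonetic similarity-based triplet mining and its specific kNN variant are not reproduced here.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin of 0.5 is an arbitrary illustrative value.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_predict(query, train_embeddings, train_labels, k=3):
    """Plain majority vote among the k nearest training embeddings."""
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: euclidean(query, train_embeddings[i]),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D vectors standing in for learned embeddings of two hypothetical keywords.
train_embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],  # keyword "yes"
    [1.0, 1.1], [1.1, 1.0], [1.0, 1.0],  # keyword "no"
]
train_labels = ["yes", "yes", "yes", "no", "no", "no"]

# A far-away negative already satisfies the margin, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 0.1], [1.0, 1.0])
# A negative as close as the positive violates the margin and is penalised.
hard = triplet_loss([0.0, 0.0], [0.0, 0.1], [0.1, 0.0])

prediction = knn_predict([0.9, 0.9], train_embeddings, train_labels, k=3)
print(easy, hard, prediction)
```

In this toy setup the loss is nonzero only when the negative is not at least a margin farther from the anchor than the positive, which is why the choice of triplets during training matters. Classification then needs no softmax layer: a query is labelled by its nearest stored embeddings.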

Code Implementations: 1 repo
