ML · LG · Mar 10, 2018

Speech Recognition: Keyword Spotting Through Image Recognition

arXiv:1803.03759v2 · 22 citations
Originality: Synthesis-oriented
AI Analysis

This is an incremental approach to speech recognition: it applies existing image-based methods to audio data, with potential regularization benefits.

The paper tackles keyword spotting in noisy audio by recasting speech recognition as image classification. It tests several CNN architectures, including an adversarially trained one, on a 10-word dataset that also contains unknown words and silence. No concrete performance numbers are provided.

The problem of identifying voice commands has always been a challenge due to the presence of noise and variability in speed, pitch, etc. We will compare the efficacies of several neural network architectures for the speech recognition problem. In particular, we will build a model to determine whether a one-second audio clip contains a particular word (out of a set of 10), an unknown word, or silence. The models to be implemented are a CNN recommended by the TensorFlow Speech Recognition tutorial, a low-latency CNN, and an adversarially trained CNN. The result is a demonstration of how to convert a problem in audio recognition to the better-studied domain of image classification, where the powerful techniques of convolutional neural networks are fully developed. Additionally, we demonstrate the applicability of the technique of Virtual Adversarial Training (VAT) to this problem domain, functioning as a powerful regularizer with promising potential future applications.
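
The conversion from audio to an image-like input can be illustrated with a log-magnitude spectrogram. The following is a minimal sketch, not the paper's actual preprocessing: the 16 kHz sample rate, the 25 ms window, the 10 ms hop, the function name `log_spectrogram`, and the synthetic 440 Hz tone standing in for a recorded clip are all assumptions made for illustration.

```python
import numpy as np

def log_spectrogram(clip, frame_len=400, hop=160, eps=1e-10):
    """Turn a 1-D waveform into a 2-D array (time frames x frequency bins).

    frame_len and hop are assumed values (25 ms and 10 ms at 16 kHz),
    not parameters taken from the paper.
    """
    n_frames = 1 + (len(clip) - frame_len) // hop
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Apply a Hann window to each frame to reduce spectral leakage.
    frames = clip[idx] * np.hanning(frame_len)
    # Magnitude of the real FFT of each frame, then log compression.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)

# Synthetic one-second clip (assumed 16 kHz) in place of a real recording.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

image = log_spectrogram(clip)
print(image.shape)
```

The resulting 2-D array can be fed to a CNN in the same way as a single-channel image, with time along one axis and frequency along the other.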
