AS · LG · SD · ML · Apr 14, 2019

SpeechYOLO: Detection and Localization of Speech Objects

arXiv:1904.07704v2 · 19 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses speech processing tasks such as keyword spotting by applying computer vision techniques. It is incremental: it adapts an existing method to a new domain without major innovations.

The paper tackles the problem of detecting and localizing speech utterances by adapting the YOLO object detection method from vision to audio, treating audio fragments as objects. The result is SpeechYOLO, which shows favorable performance in keyword spotting tasks on read and spontaneous speech corpora compared to other localization and classification algorithms.

In this paper, we propose to apply object detection methods from the vision domain to the speech recognition domain by treating audio fragments as objects. More specifically, we present SpeechYOLO, which is inspired by the YOLO algorithm for object detection in images. The goal of SpeechYOLO is to localize the boundaries of utterances within the input signal and to classify them correctly. Our system is composed of a convolutional neural network with a simple least-mean-squares loss function. We evaluated the system on several keyword spotting tasks that include corpora of read speech and spontaneous speech. Our system compares favorably with other algorithms trained for both localization and classification.
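
To make the idea of treating audio fragments as objects more concrete, here is a minimal sketch of how labelled utterances might be encoded on a grid of time cells, together with a plain squared-error loss. It is not taken from the paper: the function names (`encode_targets`, `squared_error_loss`), the per-cell layout `[confidence, center_offset, width, class one-hot]`, and the single box per cell are all assumptions made for illustration, and the paper's actual loss may weight or structure its terms differently.

```python
# Hypothetical sketch: 1-D, YOLO-style target encoding for speech "objects".
# All names and the per-cell layout are illustrative assumptions.

def encode_targets(utterances, duration, num_cells, num_classes):
    """Map (start, end, class_id) utterances to one target vector per time cell.

    Each cell holds [confidence, center_offset, width, one-hot class ...].
    center_offset is relative to the cell; width is relative to the whole input.
    """
    cell_len = duration / num_cells
    targets = [[0.0] * (3 + num_classes) for _ in range(num_cells)]
    for start, end, class_id in utterances:
        center = (start + end) / 2.0
        # The cell containing the utterance's center is responsible for it.
        cell = min(int(center / cell_len), num_cells - 1)
        targets[cell][0] = 1.0
        targets[cell][1] = (center - cell * cell_len) / cell_len
        targets[cell][2] = (end - start) / duration
        targets[cell][3 + class_id] = 1.0
    return targets


def squared_error_loss(predictions, targets):
    """Sum of squared errors over all cells.

    Confidence error counts in every cell; localization and class errors
    count only in cells that contain an object.
    """
    loss = 0.0
    for pred, tgt in zip(predictions, targets):
        loss += (pred[0] - tgt[0]) ** 2
        if tgt[0] == 1.0:
            loss += sum((p - t) ** 2 for p, t in zip(pred[1:], tgt[1:]))
    return loss


# One keyword of class 1 spanning 0.2 s to 0.6 s in a 1-second input, 4 cells.
targets = encode_targets([(0.2, 0.6, 1)], duration=1.0, num_cells=4, num_classes=3)
```

In this sketch, the cell containing the utterance's midpoint carries its offset, width, and class, so a network that outputs one such vector per cell can be trained to localize and classify in a single pass.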

