SD AS ML · Nov 14, 2019

Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous

arXiv:1911.05894v1 · 17.032 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reducing the amount of labeled data needed for sound recognition. The advance is incremental, since it builds on existing unsupervised and active learning methods.

The paper tackles the problem of learning sound representations and recognition with minimal supervision by combining self-supervised, clustering, and active learning objectives, achieving a new state-of-the-art in unsupervised audio representation and up to a 20-fold reduction in label requirements for classification.

Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
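The cluster-based active learning idea in (iii) can be illustrated with a toy sketch: group embeddings into clusters, request one label per cluster, and copy that label to every member. This is not the authors' procedure. The plain k-means step, the farthest-point initialization, the choice of the first member as the queried example, and all names (`kmeans`, `propagate_labels`, `oracle`, the toy `embeddings`) are assumptions made for illustration only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=10):
    """Plain k-means; returns one cluster id per point."""
    # Farthest-point initialization keeps this example deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def propagate_labels(assign, k, oracle):
    """Ask the oracle for one label per cluster and copy it to all members."""
    labels = [None] * len(assign)
    queries = 0
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        label = oracle(idx[0])  # assumption: query the first member only
        queries += 1
        for i in idx:
            labels[i] = label
    return labels, queries


# Hypothetical 2-D "embeddings" forming two well-separated groups.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
true_labels = ["dog", "dog", "dog", "siren", "siren", "siren"]

assign = kmeans(embeddings, k=2)
labels, queries = propagate_labels(assign, 2, oracle=lambda i: true_labels[i])
print(f"{queries} label queries covered {len(labels)} examples")
```

In this toy setting two label requests cover all six examples, which is the kind of saving the abstract refers to; on real audio, clusters are rarely pure, so propagated labels would be noisy and would act as weak supervision rather than ground truth.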
