Automatic Classification of Music Genre using Masked Conditional Neural Networks
This paper addresses the problem of automatic music genre classification for audio processing applications. It offers a novel method that improves on existing approaches, although the contribution is incremental in the broader context of neural network architectures.
The authors tackle music genre classification by proposing Masked Conditional Neural Networks (MCLNN), which are designed for multidimensional temporal signals and enforce sparseness so that the network learns in frequency bands, reducing its susceptibility to frequency shifts. They report competitive performance on the Ballroom music dataset, outperforming models based on state-of-the-art Convolutional Neural Networks.
Neural network based architectures used for sound recognition are usually adapted from other application domains such as image recognition, and so may not fully exploit the time-frequency representation of a signal. The ConditionaL Neural Networks (CLNN) and its extension, the Masked ConditionaL Neural Networks (MCLNN), are designed for multidimensional temporal signal recognition. The CLNN is trained over a window of frames to preserve the inter-frame relation, and the MCLNN enforces a systematic sparseness over the network's links that mimics a filterbank-like behavior. The masking operation induces the network to learn in frequency bands, which decreases the network's susceptibility to frequency shifts in time-frequency representations. Additionally, the mask allows a range of feature combinations to be explored concurrently, analogous to manually handcrafting the optimum collection of features for a recognition task. The MCLNN has achieved competitive performance on the Ballroom music dataset compared to several hand-crafted attempts and has outperformed models based on state-of-the-art Convolutional Neural Networks.
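The masking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `band_mask` and `masked_window_output`, the `bandwidth` and `stride` parameters, and the exact band layout are assumptions chosen for illustration, and bias terms and activation functions are omitted. The sketch builds a binary mask that restricts each hidden unit to one contiguous band of frequency bins, then sums the masked linear responses over a window of frames, with one hypothetical weight matrix per frame in the window.

```python
def band_mask(n_features, n_hidden, bandwidth, stride):
    """Binary mask (illustrative layout, an assumption): hidden unit j is
    connected only to a contiguous band of `bandwidth` feature bins."""
    mask = [[0] * n_hidden for _ in range(n_features)]
    for j in range(n_hidden):
        start = (j * stride) % n_features
        # Bands that would run past the last bin are truncated.
        for i in range(start, min(start + bandwidth, n_features)):
            mask[i][j] = 1
    return mask


def masked_window_output(window, weights, mask):
    """Sum the masked linear responses of every frame in the window.

    window  : list of frames, each a list of n_features values
    weights : one n_features x n_hidden matrix per frame (hypothetical)
    mask    : n_features x n_hidden binary matrix shared by all frames
    """
    n_hidden = len(mask[0])
    out = [0.0] * n_hidden
    for frame, w in zip(window, weights):
        for i, x in enumerate(frame):
            for j in range(n_hidden):
                # A masked-out link contributes nothing, whatever its weight.
                out[j] += x * w[i][j] * mask[i][j]
    return out


# Toy example: 6 frequency bins, 3 hidden units, non-overlapping bands of 2 bins.
mask = band_mask(n_features=6, n_hidden=3, bandwidth=2, stride=2)
window = [[1.0] * 6, [2.0] * 6, [3.0] * 6]  # three frames
weights = [[[1.0] * 3 for _ in range(6)] for _ in window]
output = masked_window_output(window, weights, mask)
print(output)  # [12.0, 12.0, 12.0]
```

In this toy setting each hidden unit sees only two bins per frame, so its response depends on energy within its own band and is unaffected by content elsewhere in the spectrum. Choosing a `stride` smaller than `bandwidth` would make neighbouring bands overlap, which is one way to picture the filterbank-like behavior the text describes.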