Musical Tempo and Key Estimation using Convolutional Neural Networks with Directional Filters
This work addresses music information retrieval tasks relevant to applications such as audio analysis, but its contribution is incremental: it builds on existing CNN methods and adds domain-specific adaptations.
The paper tackles musical tempo and key estimation by exploiting the semantics of the spectrogram axes with CNNs that use directional filters. It shows that axis-aligned architectures perform comparably to VGG-style networks while being less vulnerable to confounding factors and requiring fewer parameters.
In this article, we explore how the different semantics of a spectrogram's time and frequency axes can be exploited for musical tempo and key estimation using Convolutional Neural Networks (CNNs). By addressing both tasks with the same network architectures, ranging from shallow, domain-specific approaches to deep variants with directional filters, we show that axis-aligned architectures perform comparably to common VGG-style networks developed for computer vision, while being less vulnerable to confounding factors and requiring fewer model parameters.
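To make the notion of a directional (axis-aligned) filter concrete, the following minimal sketch contrasts a filter that spans only the time axis, one that spans only the frequency axis, and a square filter of the kind used in VGG-style networks. It is an illustration under stated assumptions, not the architecture from the article: the spectrogram size (40 bins by 256 frames), the kernel length of 5, the averaging weights, and the helper name `conv2d_valid` are all hypothetical choices made for this example.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """'Valid' 2-D cross-correlation of x (frequency, time) with kernel.

    Hypothetical helper for illustration; a real model would use a
    deep-learning framework's convolution layer instead.
    """
    windows = np.lib.stride_tricks.sliding_window_view(x, kernel.shape)
    # windows has shape (out_freq, out_time, k_freq, k_time)
    return np.einsum("ftij,ij->ft", windows, kernel)


rng = np.random.default_rng(0)
# Assumed input: 40 frequency bins x 256 time frames of random data.
spec = rng.random((40, 256))

# Directional filters: each one spans a single axis.
temporal = np.ones((1, 5)) / 5   # 1 x 5, looks along time only
spectral = np.ones((5, 1)) / 5   # 5 x 1, looks along frequency only
# Square filter, as in VGG-style networks: mixes both axes.
square = np.ones((5, 5)) / 25

out_t = conv2d_valid(spec, temporal)   # frequency resolution preserved
out_f = conv2d_valid(spec, spectral)   # time resolution preserved
out_sq = conv2d_valid(spec, square)    # both axes shrink

print("temporal:", out_t.shape, "weights:", temporal.size)
print("spectral:", out_f.shape, "weights:", spectral.size)
print("square:  ", out_sq.shape, "weights:", square.size)
```

In this sketch a directional filter of length 5 has 5 weights where the square filter has 25, which is one way an axis-aligned design can end up with fewer parameters. The temporal filter leaves the frequency axis untouched and the spectral filter leaves the time axis untouched, so each responds to structure along one axis only.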