Deep convolutional networks on the pitch spiral for musical instrument recognition
This work addresses musical instrument recognition for music analysis. Its contribution is incremental: it benchmarks and combines existing convolutional strategies rather than introducing a fundamentally new approach.
The paper tackles audio-based musical instrument recognition with deep convolutional networks that use different weight sharing strategies in the time-frequency domain. The best classification accuracy is achieved by a hybrid architecture that combines temporal, time-frequency, and pitch spiral kernels.
Musical performance combines a wide range of pitches, nuances, and expressive techniques. Audio-based classification of musical instruments thus requires building signal representations that are invariant to such transformations. This article investigates the construction of learned convolutional architectures for instrument recognition, given a limited amount of annotated training data. In this context, we benchmark three different weight sharing strategies for deep convolutional networks in the time-frequency domain: temporal kernels; time-frequency kernels; and a linear combination of time-frequency kernels which are one octave apart, akin to a Shepard pitch spiral. We provide an acoustical interpretation of these strategies within the source-filter framework of quasi-harmonic sounds with a fixed spectral envelope, which are archetypal of musical notes. The best classification accuracy is obtained by hybridizing all three convolutional layers into a single deep learning architecture.
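The three weight sharing strategies can be contrasted with a minimal NumPy sketch. This is not the authors' implementation: the function names, the assumption that a temporal kernel spans every frequency bin (so weights are shared along time only), and the plain reshape of a log-frequency axis of `n_octaves * bins_per_octave` bins into (octave, chroma) are illustrative assumptions. The spiral kernel is written literally as a sum of 2-D time-frequency responses taken exactly one octave apart; continuity across octave boundaries on the chroma axis is ignored here.

```python
import numpy as np


def conv_valid_2d(x, k):
    """'Valid' 2-D cross-correlation (no padding), as computed by a CNN layer."""
    out_h = x.shape[0] - k.shape[0] + 1
    out_w = x.shape[1] - k.shape[1] + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out


def temporal_conv(X, K):
    """Temporal kernel (assumed to span all F bins): weights shared along time only.
    X: (F, T), K: (F, kt) -> output (T - kt + 1,)."""
    assert K.shape[0] == X.shape[0]
    return conv_valid_2d(X, K)[0]


def time_frequency_conv(X, K):
    """Time-frequency kernel: K of shape (kf, kt) slides along log-frequency and time.
    X: (F, T) -> output (F - kf + 1, T - kt + 1)."""
    return conv_valid_2d(X, K)


def spiral_conv(X, K, bins_per_octave):
    """Spiral kernel: linear combination of time-frequency kernels one octave apart.
    X: (F, T) with F = n_octaves * bins_per_octave; K: (ko, kq, kt).
    Output: (n_octaves - ko + 1, bins_per_octave - kq + 1, T - kt + 1)."""
    n_oct = X.shape[0] // bins_per_octave
    # Roll the log-frequency axis onto the spiral: Xs[o, q, t] == X[o * Q + q, t].
    Xs = X[: n_oct * bins_per_octave].reshape(n_oct, bins_per_octave, X.shape[1])
    ko = K.shape[0]
    # Sum ko 2-D responses, one per octave offset d, then slide along octaves.
    return np.stack([
        sum(conv_valid_2d(Xs[o + d], K[d]) for d in range(ko))
        for o in range(n_oct - ko + 1)
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, n_oct, T = 12, 4, 20  # 12 bins per octave, 4 octaves, 20 frames
    X = rng.standard_normal((n_oct * Q, T))
    print(temporal_conv(X, rng.standard_normal((n_oct * Q, 3))).shape)  # (18,)
    print(time_frequency_conv(X, rng.standard_normal((5, 3))).shape)     # (44, 18)
    print(spiral_conv(X, rng.standard_normal((2, 5, 3)), Q).shape)       # (3, 8, 18)
```

Under these assumptions, transposing the input by exactly one octave (dropping its lowest `bins_per_octave` rows) shifts the spiral output by one step along its octave axis, whereas the temporal kernel has no such pitch equivariance and the time-frequency kernel is equivariant only to shifts along the log-frequency axis.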