Investigating kernel shapes and skip connections for deep learning-based harmonic-percussive separation
This work addresses audio source separation for music processing, but its contribution is incremental: it builds on existing deep learning methods and optimizes them.
The paper tackles harmonic-percussive source separation with an efficient encoder-decoder network. Dense skip connections reduce the number of trainable parameters, and different kernel sizes are explored to help the network learn the time-frequency patterns of each source. On the MUSDB18 dataset, the network maintains state-of-the-art separation performance with fewer parameters and less training time.
In this paper, we propose an efficient deep learning encoder-decoder network for Harmonic-Percussive Source Separation (HPSS). We show that a dense arrangement of skip connections between the model layers greatly reduces the number of trainable parameters. We also explore different kernel sizes for the 2D filters of the convolutional layers, with the objective of allowing the network to learn the different time-frequency patterns associated with percussive and harmonic sources more efficiently. Training and evaluation were carried out on the training and test sets of the MUSDB18 dataset. Results show that the proposed network learns high-level features automatically and maintains HPSS performance at a state-of-the-art level while reducing the number of parameters and the training time.
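The effect of dense skip connections and kernel shape on model size can be illustrated with simple parameter arithmetic. The sketch below is not the architecture of the paper: the layer counts, channel widths, growth rate, and kernel shapes are hypothetical values chosen only for illustration, and the function names are made up for this example. It counts the weights and biases of a plain convolutional stack and of a densely connected stack in which each layer emits only a few new feature maps and reuses all earlier ones. It also shows that a tall kernel (many frequency bins, one frame) and a wide kernel (one bin, many frames) of the same area cost the same number of parameters.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of one 2D convolution with a (freq, time) kernel."""
    k_freq, k_time = kernel
    return in_ch * out_ch * k_freq * k_time + out_ch


def plain_stack_params(in_ch, width, depth, kernel):
    """Plain chain: the first layer maps in_ch to width, the rest width to width."""
    total = conv2d_params(in_ch, width, kernel)
    total += (depth - 1) * conv2d_params(width, width, kernel)
    return total


def dense_stack_params(in_ch, growth, depth, kernel):
    """Dense chain: each layer sees the input plus all earlier layer outputs."""
    total = 0
    channels = in_ch
    for _ in range(depth):
        total += conv2d_params(channels, growth, kernel)
        channels += growth  # new feature maps are concatenated to the old ones
    return total


# Hypothetical sizes, assumed for illustration only.
SQUARE = (3, 3)
TALL = (9, 1)  # many frequency bins, one time frame (percussive-like pattern)
WIDE = (1, 9)  # one frequency bin, many time frames (harmonic-like pattern)

plain = plain_stack_params(in_ch=1, width=64, depth=4, kernel=SQUARE)
dense = dense_stack_params(in_ch=1, growth=12, depth=4, kernel=SQUARE)
tall = conv2d_params(16, 16, TALL)
wide = conv2d_params(16, 16, WIDE)

print("plain stack parameters:", plain)
print("dense stack parameters:", dense)
print("tall and wide kernels cost the same:", tall == wide)
```

The comparison is not like-for-like, since the dense stack ends with fewer channels than the plain one. It only shows where the saving comes from: narrow layers that reuse earlier feature maps through concatenation. Kernel shape, in contrast, changes which time-frequency patterns a filter can cover without changing the parameter budget for a fixed kernel area.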