Deep Neural Network approaches for Analysing Videos of Music Performances
This work addresses the automation of gesture analysis in music performance videos, a task of practical use to researchers who study such recordings. The contribution is incremental: it builds on a prior study and adds specific enhancements.
The paper automates gesture labelling in music performance videos using a 3D CNN. It improves gesture identification accuracy by 12 percentage points (51% vs. 39% in prior work) and validates the approach on multiple gesture class sets and on additional videos.
This paper presents a framework that automates the labelling of gestures in musical performance videos with a 3D Convolutional Neural Network (CNN). While this idea was proposed in a previous study, this paper introduces several novelties: (i) it presents a method that overcomes the class imbalance challenge and makes learning possible for co-existent gestures, using a batch balancing approach and spatial-temporal representations of gestures; (ii) it performs a detailed study on 7 and 18 categories of gestures produced during video-recorded performances (guitar playing) of musical pieces; (iii) it investigates the possibility of using audio features; (iv) it extends the analysis to multiple videos. The novel methods improve gesture identification performance by 12 percentage points compared to the previous work (51% in this study versus 39% in the previous work). We successfully validate the proposed methods on 7 superclasses (72%), an ensemble of the 18 gestures/classes, and additional videos (75%).
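The abstract does not describe how batch balancing is implemented. As an illustration only, one common way to counter class imbalance is to draw the same number of samples from every class in each batch, sampling rare classes with replacement. The sketch below assumes one label per video clip (so it does not cover co-existent gestures); the function name `balanced_batches` and its parameters are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size, num_batches, seed=0):
    """Yield lists of clip indices in which every class appears equally often.

    labels: one class id per clip (hypothetical clip-level labels).
    This is an illustrative sketch, not the paper's method.
    """
    rng = random.Random(seed)

    # Group clip indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    classes = sorted(by_class)
    # Equal quota per class; any remainder of batch_size is dropped.
    per_class = max(1, batch_size // len(classes))

    for _ in range(num_batches):
        batch = []
        for c in classes:
            # Sample with replacement so rare classes can fill their quota.
            batch.extend(rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    # Heavily imbalanced toy label set: 90 / 8 / 2 clips per class.
    toy_labels = [0] * 90 + [1] * 8 + [2] * 2
    for batch in balanced_batches(toy_labels, batch_size=6, num_batches=3):
        print(sorted(toy_labels[i] for i in batch))
```

With this kind of sampling, a class that makes up 2% of the clips still contributes a third of each batch in the toy example, at the cost of repeating its few clips often. Whether the paper balances per class, per gesture combination, or by some other scheme is not stated in the abstract.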