Apr 14, 2021

Revisiting the Onsets and Frames Model with Additive Attention

Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

arXiv:2104.06607v1 · 12.6 · 21 citations · h-index: 31 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work offers useful insights for researchers in automatic music transcription, although its contribution is incremental: it revisits an existing model rather than introducing major new methods.

The paper analyzes the Onsets-and-Frames model for automatic music transcription and finds that onsets are the most important attentive feature, that rule-based post-processing largely drives the model's state-of-the-art performance, and that additive attention over more than a moderate temporal context brings no further benefit.
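To make "additive attention over a temporal context" concrete, here is a minimal sketch of standard Bahdanau-style additive attention restricted to a window of ±`window` frames around each frame. It is not the paper's modified mechanism, whose details are not given here; the function names, the identity projection matrices, and the toy feature values are hypothetical choices for illustration. The `window` parameter plays the role of the temporal context whose enlargement, according to the paper, stops helping beyond a moderate size.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector x (length cols)."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_additive_attention(
    frames: List[Vector], w_query: Matrix, w_key: Matrix, v: Vector, window: int
) -> Tuple[List[Vector], List[Vector]]:
    """Generic additive attention limited to +/- `window` neighbouring frames.

    score(t, k) = v . tanh(w_query @ h_t + w_key @ h_k), for k in the window
    (clipped at the sequence edges). Returns, for every frame t, the context
    vector (attention-weighted sum of the frames in its window) and the weights.
    This is a hypothetical sketch, not the paper's modified attention.
    """
    num_frames, dim = len(frames), len(frames[0])
    contexts, weights_per_frame = [], []
    for t in range(num_frames):
        start, stop = max(0, t - window), min(num_frames, t + window + 1)
        query = matvec(w_query, frames[t])
        scores = []
        for k in range(start, stop):
            key = matvec(w_key, frames[k])
            hidden = [math.tanh(q + kk) for q, kk in zip(query, key)]
            scores.append(sum(vi * hd for vi, hd in zip(v, hidden)))
        weights = softmax(scores)
        context = [
            sum(w * frames[k][d] for w, k in zip(weights, range(start, stop)))
            for d in range(dim)
        ]
        contexts.append(context)
        weights_per_frame.append(weights)
    return contexts, weights_per_frame


# Toy usage: 5 frames of 2-dimensional features (hypothetical values).
frames = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.8], [0.9, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
contexts, weights = local_additive_attention(frames, identity, identity, [1.0, 1.0], window=1)
```

In a trained model the projection matrices and `v` would be learned; they are fixed here only so the example runs deterministically. Because every context vector is a convex combination of the frames in its window, inspecting the weights shows which neighbouring frames (or, in a feature-level variant, which input features) the model relies on.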

Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on leveraging deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination of the Onsets-and-Frames AMT model and pinpoint the essential components contributing to strong AMT performance. This is achieved by exploiting a modified additive attention mechanism. The experimental results suggest that the attention mechanism beyond a moderate temporal context does not benefit the model, and that rule-based post-processing is largely responsible for the SOTA performance. We also demonstrate that onsets are the most significant attentive feature regardless of model complexity. The findings encourage AMT research to place more weight on both a robust onset detector and an effective post-processor.
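The kind of rule-based post-processing credited above can be illustrated with a simplified, single-pitch version of the onset-gated decoding rule associated with Onsets and Frames: a note may only start on a frame where the onset detector fires, and it continues while the frame (or onset) detector stays active. This is a sketch under stated assumptions, not the paper's exact post-processor: the function name `decode_pitch`, the 0.5 thresholds, and the re-onset handling (a fresh onset during a sounding note closes it and starts a new one) are hypothetical choices.

```python
from typing import List, Tuple


def decode_pitch(
    onset_probs: List[float],
    frame_probs: List[float],
    onset_threshold: float = 0.5,  # assumed value
    frame_threshold: float = 0.5,  # assumed value
) -> List[Tuple[int, int]]:
    """Onset-gated note decoding for one pitch (hypothetical sketch).

    Returns (start_frame, end_frame_exclusive) pairs. Frame activations
    without a preceding onset never create a note.
    """
    if len(onset_probs) != len(frame_probs):
        raise ValueError("onset and frame sequences must have equal length")
    notes: List[Tuple[int, int]] = []
    start = None  # start frame of the currently sounding note, if any
    prev_onset = False
    for t, (p_on, p_fr) in enumerate(zip(onset_probs, frame_probs)):
        onset = p_on >= onset_threshold
        active = p_fr >= frame_threshold or onset
        if start is not None and onset and not prev_onset:
            # Fresh onset while a note sounds: close it and re-strike.
            notes.append((start, t))
            start = t
        elif start is None and onset:
            # The gate: a note can only begin on an onset frame.
            start = t
        elif start is not None and not active:
            # Detectors went silent: end the note.
            notes.append((start, t))
            start = None
        prev_onset = onset
    if start is not None:
        notes.append((start, len(frame_probs)))
    return notes


# Toy usage with hypothetical detector outputs for one pitch.
onsets = [0.9, 0.2, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1]
frames = [0.9, 0.9, 0.8, 0.2, 0.9, 0.9, 0.9, 0.1]
notes = decode_pitch(onsets, frames)
```

In the toy input, frame 4 is active but has no onset, so the gate suppresses it instead of emitting a spurious note; the result is two notes spanning frames 0 to 3 and 5 to 7. This dependence on the onset signal is one way to see why a robust onset detector and the post-processor can matter so much.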

