TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music
This work aims to improve melody extraction accuracy for music information retrieval applications, and is an incremental advance over existing methods.
The paper tackles singing melody extraction from polyphonic music by proposing TONet, a plug-and-play model that improves tone and octave perception through a novel input representation and network architecture, resulting in significant gains in octave and tone accuracy across various datasets.
Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not reach human-level performance in perceiving either the tone (pitch class) or the octave of the melody. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perception by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. We run experiments to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve singing melody extraction performance across various datasets, with substantial gains in octave and tone accuracy.
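To make the idea of rearranging frequency bins more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a log-frequency representation whose bins are ordered octave by octave, with a fixed number of bins per octave, and it reorders the bins so that bins one octave apart (the same tone) become adjacent. The function names `group_by_tone` and `split_tone_octave`, the parameter `bins_per_octave`, and the exact layout are illustrative assumptions; the abstract does not specify how Tone-CFP is laid out.

```python
import numpy as np


def group_by_tone(spec: np.ndarray, bins_per_octave: int) -> np.ndarray:
    """Reorder log-frequency bins so that same-tone bins are adjacent.

    Assumption (not from the paper): `spec` has shape (n_bins, n_frames)
    and bin k belongs to octave k // bins_per_octave and tone
    k % bins_per_octave.
    """
    n_bins, n_frames = spec.shape
    if n_bins % bins_per_octave != 0:
        raise ValueError("n_bins must be a multiple of bins_per_octave")
    n_octaves = n_bins // bins_per_octave

    # View the bins as (octave, tone, frame), then swap the first two axes
    # so that the octave index varies fastest within each tone.
    by_octave = spec.reshape(n_octaves, bins_per_octave, n_frames)
    by_tone = by_octave.transpose(1, 0, 2)
    return by_tone.reshape(n_bins, n_frames)


def split_tone_octave(bin_index: int, bins_per_octave: int) -> tuple[int, int]:
    """Split a log-frequency bin index into (tone, octave) under the same layout."""
    return bin_index % bins_per_octave, bin_index // bins_per_octave


if __name__ == "__main__":
    # Toy input: 2 octaves x 3 tones, 2 frames; row r holds the values 2r and 2r + 1.
    toy = np.arange(12).reshape(6, 2)
    grouped = group_by_tone(toy, bins_per_octave=3)
    # Original rows 0 and 3 (tone 0 in octaves 0 and 1) now sit next to each other.
    print(grouped[:, 0])
    print(split_tone_octave(4, bins_per_octave=3))
```

Under this layout, a network operating on the reordered input sees octave-related bins in a local neighbourhood, which is one plausible reading of "explicitly groups harmonics". The tone and octave indices returned by `split_tone_octave` illustrate how a single frequency target could be split into separate tone and octave targets.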