SD · AS · Jun 26, 2019

Learning a Joint Embedding Space of Monophonic and Mixed Music Signals for Singing Voice

arXiv:1906.11139v1 · 21 citations
Originality: Incremental advance
AI Analysis

This paper addresses singer identification across different audio domains for music retrieval applications. The advance is incremental because it builds on existing metric learning methods.

The paper tackles the semantic gap between monophonic vocal tracks and mixed music signals for singer identification by learning a joint embedding space, enabling cross-domain tasks like retrieving mixed tracks from monophonic queries without needing source separation.

Previous approaches to singer identification have used either monophonic vocal tracks or mixed tracks containing multiple instruments, leaving a semantic gap between these two domains of audio. In this paper, we present a system to learn a joint embedding space of monophonic and mixed tracks for singing voice. We use a metric learning method, which ensures that tracks from both domains of the same singer are mapped closer to each other than those of different singers. We train the system on a large synthetic dataset generated by music mashup to reflect real-world music recordings. Our approach opens up new possibilities for cross-domain tasks, e.g., given a monophonic track of a singer as a query, retrieving mixed tracks sung by the same singer from the database. Also, it requires no additional vocal enhancement steps such as source separation. We show the effectiveness of our system for singer identification and query-by-singer in both the same-domain and cross-domain tasks.
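
The metric learning idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the loss or the similarity measure, so the triplet-style margin loss, the cosine similarity, the margin value, and the function names below (`l2_normalize`, `triplet_loss`, `rank_by_singer`) are all assumptions made for illustration, not the authors' implementation. The embeddings stand in for the outputs of whatever encoder maps monophonic or mixed tracks into the joint space.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hypothetical margin loss on cosine similarity.

    anchor, positive: embeddings of the same singer, possibly from
    different domains (e.g. monophonic anchor, mixed positive).
    negative: embedding of a different singer.
    The loss is zero once the positive is more similar to the anchor
    than the negative by at least `margin` (an assumed value).
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    sim_pos = np.sum(a * p, axis=-1)
    sim_neg = np.sum(a * n, axis=-1)
    return np.maximum(0.0, margin - sim_pos + sim_neg).mean()


def rank_by_singer(query, database):
    """Return database indices sorted from most to least similar to the query."""
    scores = l2_normalize(database) @ l2_normalize(query)
    return np.argsort(-scores)
```

With embeddings trained this way, cross-domain query-by-singer reduces to a nearest-neighbour search: `rank_by_singer(mono_query, mixed_database)` orders the mixed tracks by similarity to a monophonic query, with no source separation step in between.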

Code Implementations: 1 repo