SD, CL, AS
Mar 2, 2021

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

arXiv:2103.01893v4
9 citations
AI Analysis

This work addresses singing language identification in music, with applications such as content tagging. Its contribution is incremental: it builds on existing multimodal and dropout techniques.

The paper tackles singing language identification by proposing LRID-Net, a multimodal model that takes audio and textual metadata as input. It shows that multimodal input improves performance, and that modality dropout lets the model handle missing modalities without degrading full-input performance.

We propose a multimodal singing language classification model that uses both audio content and textual metadata. LRID-Net, the proposed model, takes an audio signal and a language probability vector estimated from the metadata and outputs the probabilities of the target languages. Optionally, LRID-Net is equipped with modality dropout to handle a missing modality. In the experiment, we trained several LRID-Nets with varying modality dropout configurations and tested them with various combinations of input modalities. The experimental results demonstrate that using multimodal input improves performance. The results also suggest that adopting modality dropout does not degrade the performance of the model when both modalities are available, while enabling the model to handle missing-modality cases to some extent.
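To make the idea of modality dropout concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a missing modality is represented by an all-zero vector and that at most one modality is dropped per training example. The function name `modality_dropout` and the parameters `p_drop_audio` and `p_drop_text` are hypothetical, introduced only for illustration.

```python
import random


def modality_dropout(audio, text_probs, p_drop_audio=0.0, p_drop_text=0.0, rng=random):
    """Blank out at most one modality for a single training example.

    Assumption (not from the paper): a dropped modality is replaced by zeros.
    `audio` is a list of audio features; `text_probs` is the language
    probability vector estimated from the metadata.
    """
    if p_drop_audio < 0 or p_drop_text < 0 or p_drop_audio + p_drop_text > 1.0:
        raise ValueError("drop probabilities must be non-negative and sum to at most 1")

    # A single random draw decides the outcome, so both modalities
    # can never be dropped for the same example.
    r = rng.random()
    if r < p_drop_audio:
        audio = [0.0] * len(audio)
    elif r < p_drop_audio + p_drop_text:
        text_probs = [0.0] * len(text_probs)
    return audio, text_probs
```

During training, a step like this would be applied to each example before the inputs reach the network. At test time the same zero vector could stand in for a modality that is genuinely unavailable, which is one way a single model might serve both the full-input and missing-input cases.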

