SDNov 14, 2022Code
MedleyVox: An Evaluation Dataset for Multiple Singing Voices SeparationChang-Bin Jeon, Hyeongi Moon, Keunwoo Choi et al.
Separation of multiple singing voices into each voice is a rarely studied area in music source separation research. The absence of a benchmark dataset has hindered its progress. In this paper, we present an evaluation dataset and provide baseline studies for multiple singing voices separation. First, we introduce MedleyVox, an evaluation dataset for multiple singing voices separation. We specify the problem definition in this dataset by categorizing it into i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation. Second, to overcome the absence of existing multi-singing datasets for a training purpose, we present a strategy for construction of multiple singing mixtures using various single-singing datasets. Third, we propose the improved super-resolution network (iSRNet), which greatly enhances initial estimates of separation networks. Jointly trained with the Conv-TasNet and the multi-singing mixture construction strategy, the proposed iSRNet achieved comparable performance to ideal time-frequency masks on duet and unison subsets of MedleyVox. Audio samples, the dataset, and codes are available on our website (https://github.com/jeonchangbin49/MedleyVox).
60.9IRJun 2
TALKPLAY: Multimodal Music Recommendation with Large Language ModelsSeungheon Doh, Keunwoo Choi, Juhan Nam
We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
73.9IRJun 2
TalkPlay-Tools: Conversational Music Recommendation with LLM Tool CallingSeungheon Doh, Keunwoo Choi, Juhan Nam
While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.
SDJun 22, 2022
Jointist: Joint Learning for Multi-instrument Transcription and Its ApplicationsKin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong et al.
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for an explicit multi-instrument functionality while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In the experiment on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful to spectrograms in solving downbeat detection, chord recognition, and key estimation.
SDJul 21, 2022
A Proposal for Foley Sound Synthesis ChallengeKeunwoo Choi, Sangshin Oh, Minsung Kang et al.
"Foley" refers to sound effects that are added to multimedia during post-production to enhance its perceived acoustic properties, e.g., by simulating the sounds of footsteps, ambient environmental sounds, or visible objects on the screen. While foley is traditionally produced by foley artists, there is increasing interest in automatic or machine-assisted techniques building upon recent advances in sound synthesis and generative models. To foster more participation in this growing research area, we propose a challenge for automatic foley synthesis. Through case studies on successful previous challenges in audio and machine learning, we set the goals of the proposed challenge: rigorous, unified, and efficient evaluation of different foley synthesis systems, with an overarching goal of drawing active participation from the research community. We outline the details and design considerations of a foley sound synthesis challenge, including task definition, dataset requirements, and evaluation criteria.
SDFeb 1, 2023
Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint TrainingKin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong et al.
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage points (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist. Demo available at \url{https://jointist.github.io/Demo}.
ASJul 10, 2023
A Demand-Driven Perspective on Generative Audio AISangshin Oh, Minsung Kang, Hyeongi Moon et al.
To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.
SDFeb 21, 2025Code
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio GenerationYoonjin Chung, Pilsun Eu, Junwon Lee et al.
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
LGJan 28Code
TwinWeaver: An LLM-Based Foundation Model Framework for Pan-Cancer Digital TwinsNikita Makarov, Maria Bordukova, Lena Voith von Voithenberg et al.
Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multi-modal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting error, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p<0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75-0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.
SDSep 12, 2021Code
Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source SeparationQiuqiang Kong, Yin Cao, Haohe Liu et al.
Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-frequency bins have ideal ratio mask values of over~1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs) where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset, especially, a SDR of 8.98~dB on vocals, outperforming the previous best performance of 7.24~dB. The source code is available at: https://github.com/bytedance/music_source_separation
22.1CLMar 30
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and FaithfulnessJinyoung Kim, Hyeongsoo Lim, Eunseo Seo et al.
Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non-English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa-Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa-Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea-specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white-box and black-box ones. Our benchmark, evaluation code, and leaderboard are publicly available at https://ksbench.github.io/Korean-Benchmark/.
SDOct 23, 2024
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio GenerationJunwon Lee, Modan Tailleur, Laurie M. Heller et al.
Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining objective metric, namely Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
IRAug 18, 2025
TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music RecommendationKeunwoo Choi, Seungheon Doh, Juhan Nam
We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl.ai/talkplaydata2.
AIJan 15, 2025
Sound Scene Synthesis at the DCASE 2024 ChallengeMathieu Lagrange, Junwon Lee, Modan Tailleur et al.
This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which are evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.
SDNov 26, 2021
Semi-Supervised Music Tagging TransformerMinz Won, Keunwoo Choi, Xavier Serra
We present Music Tagging Transformer that is trained with a semi-supervised approach. The proposed model captures local acoustic characteristics in shallow convolutional layers, then temporally summarizes the sequence of the extracted features using stacked self-attention layers. Through a careful model assessment, we first show that the proposed architecture outperforms the previous state-of-the-art music tagging models that are based on convolutional neural networks under a supervised scheme. The Music Tagging Transformer is further improved by noisy student training, a semi-supervised approach that leverages both labeled and unlabeled data combined with data augmentation. To our best knowledge, this is the first attempt to utilize the entire audio of the million song dataset.
SDNov 23, 2021
Music Classification: Beyond Supervised Learning, Towards Real-world ApplicationsMinz Won, Janne Spijkervet, Keunwoo Choi
Music classification is a music information retrieval (MIR) task to classify music items to labels such as genre, mood, and instruments. It is also closely related to other concepts such as music similarity and musical preference. In this tutorial, we put our focus on two directions - the recent training schemes beyond supervised learning and the successful application of music classification models. The target audience for this web book is researchers and practitioners who are interested in state-of-the-art music classification research and building real-world applications. We assume the audience is familiar with the basic machine learning concepts. In this book, we present three lectures as follows: 1. Music classification overview: Task definition, applications, existing approaches, datasets, 2. Beyond supervised learning: Semi- and self-supervised learning for music classification, 3. Towards real-world applications: Less-discussed, yet important research issues in practice.
SDOct 18, 2021
SpecTNT: a Time-Frequency Transformer for Music AudioWei-Tsung Lu, Ju-Chiang Wang, Minz Won et al.
Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use Transformer as a temporal feature aggregator that acts similar to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. Later, the FCTs are linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Then, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and other design choices are further examined through ablation studies.
SDMar 2, 2021
Listen, Read, and Identify: Multimodal Singing Language Identification of MusicKeunwoo Choi, Yuxuan Wang
We propose a multimodal singing language classification model that uses both audio content and textual metadata. LRID-Net, the proposed model, takes an audio signal and a language probability vector estimated from the metadata and outputs the probabilities of the target languages. Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality. In the experiment, we trained several LRID-Nets with varying modality dropout configuration and tested them with various combinations of input modalities. The experiment results demonstrate that using multimodal input improves performance. The results also suggest that adopting modality dropout does not degrade the performance of the model when there are full modality inputs while enabling the model to handle missing modality cases to some extent.
SDOct 28, 2020
Large-Scale MIDI-based Composer ClassificationQiuqiang Kong, Keunwoo Choi, Yuxuan Wang
Music classification is a task to classify a music piece into labels such as genres or composers. We propose large-scale MIDI based composer classification systems using GiantMIDI-Piano, a transcription-based dataset. We propose to use piano rolls, onset rolls, and velocity rolls as input representations and use deep neural networks as classifiers. To our knowledge, we are the first to investigate the composer classification problem with up to 100 composers. By using convolutional recurrent neural networks as models, our MIDI based composer classification system achieves a 10-composer and a 100-composer classification accuracies of 0.648 and 0.385 (evaluated on 30-second clips) and 0.739 and 0.489 (evaluated on music pieces), respectively. Our MIDI based composer system outperforms several audio-based baseline classification systems, indicating the effectiveness of using compact MIDI representations for composer classification.
SDOct 2, 2020
Deep Composer Classification Using Symbolic RepresentationSunghyeon Kim, Hyeyoon Lee, Sunjong Park et al.
In this study, we train deep neural networks to classify composer on a symbolic domain. The model takes a two-channel two-dimensional input, i.e., onset and note activations of time-pitch representation, which is converted from MIDI recordings and performs a single-label classification. On the experiments conducted on MAESTRO dataset, we report an F1 value of 0.8333 for the classification of 13~classical composers.
ASJul 24, 2020
Dereverberation using joint estimation of dry speech signal and acoustic systemSanna Wager, Keunwoo Choi, Simon Durand
The purpose of speech dereverberation is to remove quality-degrading effects of a time-invariant impulse response filter from the signal. In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response. We explore deep learning models that apply to each task separately, and how these can be combined in a joint model with shared parameters.
SDJun 9, 2019
Deep Unsupervised Drum TranscriptionKeunwoo Choi, Kyunghyun Cho
We introduce DrummerNet, a drum transcription system that is trained in an unsupervised manner. DrummerNet does not require any ground-truth transcription and, with the data-scalability of deep neural networks, learns from a large unlabeled dataset. In DrummerNet, the target drum signal is first passed to a (trainable) transcriber, then reconstructed in a (fixed) synthesizer according to the transcription estimate. By training the system to minimize the distance between the input and the output audio signals, the transcriber learns to transcribe without ground truth transcription. Our experiment shows that DrummerNet performs favorably compared to many other recent drum transcription systems, both supervised and unsupervised.
SDJun 4, 2018
Revisiting Singing Voice Detection: a Quantitative Review and the Future OutlookKyungyun Lee, Keunwoo Choi, Juhan Nam
Since the vocal component plays a crucial role in popular music, singing voice detection has been an active research topic in music information retrieval. Although several proposed algorithms have shown high performances, we argue that there still is a room to improve to build a more robust singing voice detection system. In order to identify the area of improvement, we first perform an error analysis on three recent singing voice detection systems. Based on the analysis, we design novel methods to test the systems on multiple sets of internally curated and generated data to further examine the pitfalls, which are not clearly revealed with the current datasets. From the experiment results, we also propose several directions towards building a more robust singing voice detector.
MMFeb 14, 2018
Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encodersAdib Mehrabi, Keunwoo Choi, Simon Dixon et al.
The expressive nature of the voice provides a powerful medium for communicating sonic ideas, motivating recent research on methods for query by vocalisation. Meanwhile, deep learning methods have demonstrated state-of-the-art results for matching vocal imitations to imitated sounds, yet little is known about how well learned features represent the perceptual similarity between vocalisations and queried sounds. In this paper, we address this question using similarity ratings between vocal imitations and imitated drum sounds. We use a linear mixed effect regression model to show how features learned by convolutional auto-encoders (CAEs) perform as predictors for perceptual similarity between sounds. Our experiments show that CAEs outperform three baseline feature sets (spectrogram-based representations, MFCCs, and temporal features) at predicting the subjective similarity ratings. We also investigate how the size and shape of the encoded layer effects the predictive power of the learned features. The results show that preservation of temporal information is more important than spectral resolution for this application.
SDDec 14, 2017
DLR : Toward a deep learned rhythmic representation for music content analysisYeonwoo Jeong, Keunwoo Choi, Hosan Jeong
In the use of deep neural networks, it is crucial to provide appropriate input representations for the network to learn from. In this paper, we propose an approach to learn a representation that focus on rhythmic representation which is named as DLR (Deep Learning Rhythmic representation). The proposed approach aims to learn DLR from the raw audio signal and use it for other music informatics tasks. A 1-dimensional convolutional network is utilised in the learning of DLR. In the experiment, we present the results from the source task and the target task as well as visualisations of DLRs. The results reveals that DLR provides compact rhythmic information which can be used on multi-tagging task.
CVSep 13, 2017
A Tutorial on Deep Learning for Music Information RetrievalKeunwoo Choi, György Fazekas, Kyunghyun Cho et al.
Following their success in Computer Vision and other areas, deep learning techniques have recently become widely adopted in Music Information Retrieval (MIR) research. However, the majority of works aim to adopt and assess methods that have been shown to be effective in other domains, while there is still a great need for more original research focusing on music primarily and utilising musical knowledge and insight. The goal of this paper is to boost the interest of beginners by providing a comprehensive tutorial and reducing the barriers to entry into deep learning for MIR. We lay out the basic principles and review prominent works in this hard to navigate the field. We then outline the network structures that have been successful in MIR problems and facilitate the selection of building blocks for the problems at hand. Finally, guidelines for new tasks and some advanced topics in deep learning are discussed to stimulate new research in this fascinating field.
SDSep 6, 2017
A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music TaggingKeunwoo Choi, György Fazekas, Kyunghyun Cho et al.
In this paper, we empirically investigate the effect of audio preprocessing on music tagging with deep neural networks. We perform comprehensive experiments involving audio preprocessing using different time-frequency representations, logarithmic magnitude compression, frequency weighting, and scaling. We show that many commonly used input preprocessing techniques are redundant except magnitude compression.
SDJun 19, 2017
Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with KerasKeunwoo Choi, Deokjin Joo, Juho Kim
We introduce Kapre, Keras layers for audio and music signal preprocessing. Music research using deep neural networks requires a heavy and tedious preprocessing stage, for which audio processing parameters are often ignored in parameter optimisation. To solve this problem, Kapre implements time-frequency conversions, normalisation, and data augmentation as Keras layers. We report simple benchmark results, showing real-time on-GPU preprocessing adds a reasonable amount of computation.
IRJun 7, 2017
The Effects of Noisy Labels on Deep Convolutional Neural Networks for Music TaggingKeunwoo Choi, George Fazekas, Kyunghyun Cho et al.
Deep neural networks (DNN) have been successfully applied to music classification including music tagging. However, there are several open questions regarding the training, evaluation, and analysis of DNNs. In this article, we investigate specific aspects of neural networks, the effects of noisy labels, to deepen our understanding of their properties. We analyse and (re-)validate a large music tagging dataset to investigate the reliability of training and evaluation. Using a trained network, we compute label vector similarities which is compared to groundtruth similarity. The results highlight several important aspects of music tagging and neural networks. We show that networks can be effective despite relatively large error rates in groundtruth datasets, while conjecturing that label noise can be the cause of varying tag-wise performance differences. Lastly, the analysis of our trained network provides valuable insight into the relationships between music tags. These results highlight the benefit of using data-driven methods to address automatic music tagging.
CVMar 27, 2017
Transfer learning for music classification and regression tasksKeunwoo Choi, György Fazekas, Mark Sandler et al.
In this paper, we present a transfer learning approach for music classification and regression tasks. We propose to use a pre-trained convnet feature, a concatenated feature vector using the activations of feature maps of multiple layers in a trained convolutional network. We show how this convnet feature can serve as general-purpose music representation. In the experiments, a convnet is trained for music tagging and then transferred to other music-related classification and regression tasks. The convnet feature outperforms the baseline MFCC feature in all the considered tasks and several previous approaches that are aggregating MFCCs as well as low- and high-level music features.
NESep 14, 2016
Convolutional Recurrent Neural Networks for Music ClassificationKeunwoo Choi, George Fazekas, Mark Sandler et al.
We introduce a convolutional recurrent neural network (CRNN) for music tagging. CRNNs take advantage of convolutional neural networks (CNNs) for local feature extraction and recurrent neural networks for temporal summarisation of the extracted features. We compare CRNN with three CNN structures that have been used for music tagging while controlling the number of parameters with respect to their performance and training time per sample. Overall, we found that CRNNs show a strong performance with respect to the number of parameter and training time, indicating the effectiveness of its hybrid structure in music feature extraction and feature summarisation.
MMAug 17, 2016
Towards Music Captioning: Generating Music Playlist DescriptionsKeunwoo Choi, George Fazekas, Brian McFee et al.
Descriptions are often provided along with recommendations to help users' discovery. Recommending automatically generated music playlists (e.g. personalised playlists) introduces the problem of generating descriptions. In this paper, we propose a method for generating music playlist descriptions, which is called as music captioning. In the proposed method, audio content analysis and natural language processing are adopted to utilise the information of each track.
LGJul 8, 2016
Explaining Deep Convolutional Neural Networks on Music ClassificationKeunwoo Choi, George Fazekas, Mark Sandler
Deep convolutional neural networks (CNNs) have been actively adopted in the field of music information retrieval, e.g. genre classification, mood detection, and chord recognition. However, the process of learning and prediction is little understood, particularly when it is applied to spectrograms. We introduce auralisation of a CNN to understand its underlying mechanism, which is based on a deconvolution procedure introduced in [2]. Auralisation of a CNN is converting the learned convolutional features that are obtained from deconvolution into audio signals. In the experiments and discussions, we explain trained features of a 5-layer CNN based on the deconvolved spectrograms and auralised signals. The pairwise correlations per layers with varying different musical attributes are also investigated to understand the evolution of the learnt features. It is shown that in the deep layers, the features are learnt to capture textures, the patterns of continuous distributions, rather than shapes of lines.
AIJun 7, 2016
Towards Playlist Generation Algorithms Using RNNs Trained on Within-Track TransitionsKeunwoo Choi, George Fazekas, Mark Sandler
We introduce a novel playlist generation algorithm that focuses on the quality of transitions using a recurrent neural network (RNN). The proposed model assumes that optimal transitions between tracks can be modelled and predicted by internal transitions within music tracks. We introduce modelling sequences of high-level music descriptors using RNNs and discuss an experiment involving different similarity functions, where the sequences are provided by a musical structural analysis algorithm. Qualitative observations show that the proposed approach can effectively model transitions of music tracks in playlists.
SDJun 1, 2016
Automatic tagging using deep convolutional neural networksKeunwoo Choi, George Fazekas, Mark Sandler
We present a content-based automatic music tagging algorithm using fully convolutional neural networks (FCNs). We evaluate different architectures consisting of 2D convolutional layers and subsampling layers only. In the experiments, we measure the AUC-ROC scores of the architectures with different complexities and input types using the MagnaTagATune dataset, where a 4-layer architecture shows state-of-the-art performance with mel-spectrogram input. Furthermore, we evaluated the performances of the architectures with varying the number of layers on a larger dataset (Million Song Dataset), and found that deeper models outperformed the 4-layer architecture. The experiments show that mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.
AIApr 18, 2016
Text-based LSTM networks for Automatic Music CompositionKeunwoo Choi, George Fazekas, Mark Sandler
In this paper, we introduce new methods and discuss results of text-based LSTM (Long Short-Term Memory) networks for automatic music composition. The proposed network is designed to learn relationships within text documents that represent chord progressions and drum tracks in two case studies. In the experiments, word-RNNs (Recurrent Neural Networks) show good results for both cases, while character-based RNNs (char-RNNs) only succeed to learn chord progressions. The proposed system can be used for fully automatic composition or as semi-automatic systems that help humans to compose music by controlling a diversity parameter of the model.
MMNov 22, 2015
Understanding Music PlaylistsKeunwoo Choi, George Fazekas, Mark Sandler
As music streaming services dominate the music industry, the playlist is becoming an increasingly crucial element of music consumption. Con- sequently, the music recommendation problem is often casted as a playlist generation prob- lem. Better understanding of the playlist is there- fore necessary for developing better playlist gen- eration algorithms. In this work, we analyse two playlist datasets to investigate some com- monly assumed hypotheses about playlists. Our findings indicate that deeper understanding of playlists is needed to provide better prior infor- mation and improve machine learning algorithms in the design of recommendation systems.