Ben Sangbae Chon

h-index: 1

5 papers

8 citations

Novelty: 28%

AI Score: 24

Ranked #172,499 of 194,257 authors (top 89%); #1,541 in SD (top 85%)

5 Papers

5.7 | SD | Nov 14, 2022 | Code

MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation

Chang-Bin Jeon, Hyeongi Moon, Keunwoo Choi et al.

Separation of multiple singing voices into individual voices is a rarely studied area in music source separation research. The absence of a benchmark dataset has hindered its progress. In this paper, we present an evaluation dataset and provide baseline studies for multiple singing voices separation. First, we introduce MedleyVox, an evaluation dataset for multiple singing voices separation. We specify the problem definition in this dataset by categorizing it into i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation. Second, to overcome the absence of existing multi-singing datasets for training, we present a strategy for constructing multiple singing mixtures from various single-singing datasets. Third, we propose the improved super-resolution network (iSRNet), which greatly enhances initial estimates of separation networks. Jointly trained with the Conv-TasNet and the multi-singing mixture construction strategy, the proposed iSRNet achieved comparable performance to ideal time-frequency masks on duet and unison subsets of MedleyVox. Audio samples, the dataset, and code are available on our website (https://github.com/jeonchangbin49/MedleyVox).
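The "ideal time-frequency masks" mentioned above are oracle references computed from the ground-truth sources. As a rough illustration, the sketch below computes one common variant, the ideal ratio mask, from per-source magnitude spectrograms. The abstract does not say which mask variant the paper uses, so the choice of ratio mask, the function name `ideal_ratio_masks`, and the toy values are all assumptions made for illustration.

```python
def ideal_ratio_masks(source_mags, eps=1e-8):
    """Compute an ideal ratio mask per source (illustrative sketch only).

    source_mags: list of sources, each a 2-D list [freq][time] of
    non-negative magnitudes. Returns one mask per source, same shape.
    """
    n_freq = len(source_mags[0])
    n_time = len(source_mags[0][0])
    masks = [[[0.0] * n_time for _ in range(n_freq)] for _ in source_mags]
    for f in range(n_freq):
        for t in range(n_time):
            # Total magnitude in this bin across all sources; eps avoids 0/0.
            total = sum(src[f][t] for src in source_mags) + eps
            for i, src in enumerate(source_mags):
                masks[i][f][t] = src[f][t] / total
    return masks


# Hypothetical 2x2 magnitude spectrograms for two singing voices.
voice_a = [[1.0, 0.0], [2.0, 3.0]]
voice_b = [[1.0, 4.0], [2.0, 1.0]]
mask_a, mask_b = ideal_ratio_masks([voice_a, voice_b])
```

Multiplying each mask by the mixture spectrogram gives an upper-bound style estimate for that source, which is why such masks serve as a reference point when judging a learned separator.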

4.3 | AS | Jul 10, 2023

A Demand-Driven Perspective on Generative Audio AI

Sangshin Oh, Minsung Kang, Hyeongi Moon et al.

To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.

2.3 | AS | Jun 16, 2023 | Code

FALL-E: A Foley Sound Synthesis Model and Strategies

Minsung Kang, Sangshin Oh, Hyeongi Moon et al.

This paper introduces FALL-E, a foley synthesis system and its training/inference strategies. The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch using our extensive datasets, and utilized a pre-trained language model. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input. Moreover, we leveraged external language models to improve text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. FALL-E was evaluated by an objective measure as well as listening tests in the DCASE 2023 challenge Task 7. The submission placed second on average, with the best score for diversity, second place for audio quality, and third place for class fitness.

13.8 | MM | Jul 7

Multimodal Video-to-Music Recommendation via Semantic Retrieval and Temporal Reranking

Seungheon Doh, Minhee Lee, Sangmoon Lee et al.

We present VTMR, a two-stage framework for Video-To-Music Recommendation. In Stage 1, VTMR aligns comprehensive video and music signals in a joint audio-visual-text representation space and efficiently retrieves semantically compatible candidates using coarse global embeddings. In Stage 2, it reranks the retrieved candidates by attending to the temporal sequences of both video and music, thereby capturing fine-grained temporal correspondence. Evaluated on the video-to-music recommendation task, the multimodal retrieval stage improves R@10 from 14.2 to 15.9 and Median Rank from 75 to 58 over the strongest baseline; the temporal reranker further boosts R@10 to 18.3 and Median Rank to 46, demonstrating complementary gains from richer query encoding and temporal alignment. A human preference study confirms that VTMR is on par with a commercial baseline in overall preference, while outperforming a generative baseline in music quality.
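The two metrics quoted above, R@10 and Median Rank, are standard retrieval measures. The sketch below shows how they are typically computed from per-query similarity scores. The helper names and the toy scores are hypothetical, and the tie-handling rule (only strictly higher scores outrank the target) is an assumption rather than the paper's stated protocol.

```python
from statistics import median


def rank_of_target(scores, target_index):
    """1-based rank of the ground-truth item under descending score order."""
    target_score = scores[target_index]
    # Only candidates scoring strictly higher push the target down.
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item lands in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median of the ground-truth ranks (lower is better)."""
    return median(ranks)


# Hypothetical (scores over candidates, index of the correct track) per query.
toy_queries = [
    ([0.9, 0.1, 0.3], 0),
    ([0.2, 0.8, 0.5], 2),
    ([0.7, 0.6, 0.1], 2),
    ([0.4, 0.9, 0.3], 0),
]
ranks = [rank_of_target(scores, target) for scores, target in toy_queries]
print(ranks, recall_at_k(ranks, 2), median_rank(ranks))
```

Higher R@K and lower Median Rank both indicate better retrieval, which matches the direction of the improvements reported in the abstract.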

1.9 | SD | Oct 23, 2020

GSEP: A robust vocal and accompaniment separation system using gated CBHG module and loudness normalization

Soochul Park, Ben Sangbae Chon

In audio signal processing research, source separation has long been a popular topic, and the recent adoption of deep neural networks has brought a significant improvement in performance. This improvement has encouraged the industry to productize deep-learning-based audio products and services, including karaoke in music streaming apps and dialogue enhancement in UHDTV. For these early markets, we defined a set of design principles for the vocal and accompaniment separation model in terms of robustness, quality, and cost. In this paper, we introduce GSEP (Gaudio source SEParation system), a robust vocal and accompaniment separation system using a Gated-CBHG module, mask warping, and loudness normalization. Experiments verified that the proposed system satisfies all three principles and outperforms the state-of-the-art systems in both objective measures and subjective assessment.
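The abstract does not describe how GSEP performs loudness normalization. As a loose illustration of the general idea (bringing inputs to a common level before processing), the sketch below scales a signal to a target RMS level in dBFS. This is a simplified stand-in: it is not a perceptual loudness measure such as ITU-R BS.1770, and the function names and the -20 dBFS target are assumptions.

```python
import math


def rms_dbfs(samples):
    """RMS level of a signal in dB relative to full scale (1.0)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)


def normalize_rms(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs (sketch only)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        # Silence has no level to normalize; return it unchanged.
        return list(samples)
    gain = 10 ** ((target_dbfs - current) / 20)
    return [x * gain for x in samples]


# Hypothetical quiet input at about -40 dBFS RMS.
quiet = [0.01, -0.01, 0.01, -0.01]
leveled = normalize_rms(quiet)
print(round(rms_dbfs(quiet), 2), round(rms_dbfs(leveled), 2))
```

Normalizing level ahead of a separation model is one way to make its behavior less dependent on how loud the input recording happens to be.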