LG SD AS ML

Nov 21, 2019

WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation

Amir Zadeh, Tianjun Ma, Soujanya Poria, Louis-Philippe Morency

arXiv:1911.09783v1

6.68 citations

h-index: 79

Originality: Incremental advance

AI Analysis

This work addresses the separation of mixed audio sources for audio processing applications. It is an incremental advance: it builds on existing transformer-based methods and contributes a new dataset.

The paper tackles monoaural audio source separation by introducing the WildMix dataset, which includes diverse in-the-wild recordings from 25 sound classes, and the Spectro-Temporal Transformer (STT) model, which outperforms previous baselines on this dataset.

Monoaural audio source separation is a challenging research area in machine learning. In this area, a mixture containing multiple audio sources is given, and a model is expected to disentangle the mixture into isolated atomic sources. In this paper, we first introduce a challenging new dataset for monoaural source separation called WildMix. WildMix is designed with the goal of extending the boundaries of source separation beyond what previous datasets in this area would allow. It contains diverse in-the-wild recordings from 25 different sound classes, combined with each other using arbitrary composition policies. Source separation often requires modeling long-range dependencies in both temporal and spectral domains. To this end, we introduce a novel transformer-based model called Spectro-Temporal Transformer (STT). STT utilizes a specialized encoder, called Spectro-Temporal Encoder (STE). STE highlights temporal and spectral components of sources within a mixture, using a self-attention mechanism. It subsequently disentangles them in a hierarchical manner. In our experiments, STT swiftly outperforms various previous baselines for monoaural source separation on the challenging WildMix dataset.
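To make the idea of attending over both temporal and spectral axes concrete, the following is a minimal sketch in plain Python. It is not the authors' Spectro-Temporal Encoder: it assumes identity query/key/value projections, a single attention head, and a tiny hypothetical spectrogram (`spec`) with made-up values. The function names (`self_attention`, `transpose`) are illustrative only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Scaled dot-product self-attention over a list of vectors.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), which a trained model would replace with
    learned linear maps.
    """
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * seq[j][i] for j in range(len(seq)))
                        for i in range(d)])
    return outputs, weights

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Hypothetical toy magnitude spectrogram: rows are frequency bins,
# columns are time frames.
spec = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.7, 0.9],
    [0.0, 0.1, 0.8, 0.7],
]

# Spectral attention: each frequency bin attends to every other bin.
spectral_out, spectral_w = self_attention(spec)

# Temporal attention: each time frame attends to every other frame.
temporal_out_t, temporal_w = self_attention(transpose(spec))
temporal_out = transpose(temporal_out_t)  # back to (frequency, time) layout
```

Here `spectral_w` is a 3×3 matrix of weights between frequency bins and `temporal_w` is a 4×4 matrix of weights between time frames. Because every position can attend to every other position along its axis, this kind of mechanism can capture the long-range dependencies the abstract mentions.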
