SD LG AS | Apr 7, 2022

Heterogeneous Target Speech Separation

Efthymios Tzinis, Gordon Wichern, Aswin Subramanian, Paris Smaragdis, Jonathan Le Roux

arXiv:2204.03594v1 | 16.333 citations | h-index: 23

Originality: Highly original

AI Analysis

This paper addresses the challenge of robust, flexible speech separation for applications that must adapt to diverse real-world conditions. It presents a new paradigm rather than an incremental improvement.

The paper tackles the problem of single-channel target source separation by introducing a framework that uses non-mutually exclusive concepts (e.g., loudness, gender) as conditioning, enabling generalization to new concepts and out-of-domain data while outperforming single-domain specialist models.

We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts used as conditioning. Our experiments show that training separation models with heterogeneous conditions facilitates generalization to new concepts with unseen out-of-domain data while also performing substantially better than single-domain specialist models. Notably, such training leads to more robust learning of new, harder discriminative concepts for source separation and can yield improvements over permutation invariant training with oracle source selection. We analyze the intrinsic behavior of source separation training with heterogeneous metadata and propose ways to alleviate problems that emerge under challenging separation conditions. We release the collection of preparation recipes for all datasets used, to further promote research on this challenging task.
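To make the idea of non-mutually exclusive conditioning concrete, here is a minimal sketch of how one training example might be assembled from a two-source mixture. It is an illustration under stated assumptions, not the authors' pipeline: the concept vocabulary `CONCEPTS`, the one-hot encoding, the metadata layout, and the helpers `one_hot`, `select_target`, and `make_example` are all hypothetical. The point it shows is that the same source can be the target under several concepts at once (here, both "loudest" and "female").

```python
import numpy as np

# Hypothetical concept vocabulary; the paper's actual set and encoding may differ.
CONCEPTS = ["loudest", "quietest", "female", "male"]


def one_hot(concept):
    """Encode a concept name as a one-hot conditioning vector (assumed encoding)."""
    vec = np.zeros(len(CONCEPTS), dtype=np.float32)
    vec[CONCEPTS.index(concept)] = 1.0
    return vec


def select_target(sources, metadata, concept):
    """Return the index of the single source matching `concept`, or None.

    Energy concepts are computed from the signals; gender concepts are
    read from the (assumed) per-source metadata dictionaries.
    """
    energies = [float(np.mean(s ** 2)) for s in sources]
    if concept == "loudest":
        return int(np.argmax(energies))
    if concept == "quietest":
        return int(np.argmin(energies))
    matches = [i for i, m in enumerate(metadata) if m["gender"] == concept]
    # The concept is only discriminative if exactly one source matches it.
    return matches[0] if len(matches) == 1 else None


def make_example(sources, metadata, concept):
    """Build (mixture, condition vector, target) or None if the concept is ambiguous."""
    idx = select_target(sources, metadata, concept)
    if idx is None:
        return None
    mixture = np.sum(sources, axis=0)
    return mixture, one_hot(concept), sources[idx]


# Synthetic stand-ins for two speech sources (noise, for illustration only).
rng = np.random.default_rng(0)
sources = [0.5 * rng.standard_normal(8000), 0.1 * rng.standard_normal(8000)]
metadata = [{"gender": "female"}, {"gender": "male"}]
```

In this sketch, a conditioned separation model would receive the mixture and the condition vector and be trained to output the selected target. Returning `None` for an ambiguous concept (for example, "female" when both speakers are female) is one simple way to skip conditions that do not identify a single source.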
