Gordon Wichern

SD
h-index25
32papers
1,610citations
Novelty48%
AI Score53

32 Papers

SDNov 22, 2022
Latent Iterative Refinement for Modular Source Separation

Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern et al.

Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user fetches a static computation graph and runs the full model on some specified observed mixture signal to get the estimated source signals. Additionally, many of those models consist of several basic processing blocks which are applied sequentially. We argue that we can significantly increase resource efficiency during both training and inference stages by reformulating a model's training and inference procedures as iterative mappings of latent signal representations. First, we can apply the same processing block more than once on its output to refine the input signal and consequently improve parameter efficiency. During training, we can follow a block-wise procedure which enables a reduction on memory requirements. Thus, one can train a very complicated network structure using significantly less computation compared to end-to-end training. During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.

SDNov 11, 2022
Optimal Condition Training for Target Source Separation

Efthymios Tzinis, Gordon Wichern, Paris Smaragdis et al.

Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy parameter updates using the highest performing condition among equivalent conditions associated with a given target source. Our experiments show that the complementary information carried by the diverse semantic concepts significantly helps to disentangle and isolate sources of interest much more efficiently compared to single-conditioned models. Moreover, we propose a variation of OCT with condition refinement, in which an initial conditional vector is adapted to the given mixture and transformed to a more amenable representation for target source extraction. We showcase the effectiveness of OCT on diverse source separation experiments where it improves upon permutation invariant models with oracle assignment and obtains state-of-the-art performance in the more challenging task of text-based source separation, outperforming even dedicated text-only conditioned models.

SDApr 7, 2022
Heterogeneous Target Speech Separation

Efthymios Tzinis, Gordon Wichern, Aswin Subramanian et al.

We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts used as conditioning. Our experiments show that training separation models with heterogeneous conditions facilitates the generalization to new concepts with unseen out-of-domain data while also performing substantially higher than single-domain specialist models. Notably, such training leads to more robust learning of new harder source separation discriminative concepts and can yield improvements over permutation invariant training with oracle source selection. We analyze the intrinsic behavior of source separation training with heterogeneous metadata and propose ways to alleviate emerging problems with challenging separation conditions. We release the collection of preparation recipes for all datasets used to further promote research towards this challenging task.

LGNov 14, 2022
Meta-Learning of Neural State-Space Models Using Data From Similar Systems

Ankush Chakrabarty, Gordon Wichern, Christopher R. Laughman

Deep neural state-space models (SSMs) provide a powerful tool for modeling dynamical systems solely using operational data. Typically, neural SSMs are trained using data collected from the actual system under consideration, despite the likely existence of operational data from similar systems which have previously been deployed in the field. In this paper, we propose the use of model-agnostic meta-learning (MAML) for constructing deep encoder network-based SSMs, by leveraging a combination of archived data from similar systems (used to meta-train offline) and limited data from the actual system (used for rapid online adaptation). We demonstrate using a numerical example that meta-learning can result in more accurate neural SSM models than supervised- or transfer-learning, despite few adaptation steps and limited online data. Additionally, we show that by carefully partitioning and adapting the encoder layers while fixing the state-transition operator, we can achieve comparable performance to MAML while reducing online adaptation complexity.

ASMar 8, 2022
Locate This, Not That: Class-Conditioned Sound Event DOA Estimation

Olga Slizovskaia, Gordon Wichern, Zhong-Qiu Wang et al.

Existing systems for sound event localization and detection (SELD) typically operate by estimating a source location for all classes at every time instant. In this paper, we propose an alternative class-conditioned SELD model for situations where we may not be interested in localizing all classes all of the time. This class-conditioned SELD model takes as input the spatial and spectral features from the sound file, and also a one-hot vector indicating the class we are currently interested in localizing. We inject the conditioning information at several points in our model using feature-wise linear modulation (FiLM) layers. Through experiments on the DCASE 2020 Task 3 dataset, we show that the proposed class-conditioned SELD model performs better in terms of common SELD metrics than the baseline model that locates all classes simultaneously, and also outperforms specialist models that are trained to locate only a single class of interest. We also evaluate performance on the DCASE 2021 Task 3 dataset, which includes directional interference (sound events from classes we are not interested in localizing) and notice especially strong improvement from the class-conditioned model.

ASNov 15, 2022
Reverberation as Supervision for Speech Separation

Rohith Aralikatti, Christoph Boeddeker, Gordon Wichern et al.

This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal's auditory system. We assume the availability of two-channel mixtures at training time, and train a neural network to separate the sources given one of the channels as input such that the other channel may be predicted from the separated sources. As the relationship between the room impulse responses (RIRs) of each channel depends on the locations of the sources, which are unknown to the network, the network cannot rely on learning that relationship. Instead, our proposed loss function fits each of the separated sources to the mixture in the target channel via Wiener filtering, and compares the resulting mixture to the ground-truth one. We show that minimizing the scale-invariant signal-to-distortion ratio (SI-SDR) of the predicted right-channel mixture with respect to the ground truth implicitly guides the network towards separating the left-channel sources. On a semi-supervised reverberant speech separation task based on the WHAMR! dataset, using training data where just 5% (resp., 10%) of the mixtures are labeled with associated isolated sources, we achieve 70% (resp., 78%) of the SI-SDR improvement obtained when training with supervision on the full training set, while a model trained only on the labeled data obtains 43% (resp., 45%).

ASNov 2, 2022
Late Audio-Visual Fusion for In-The-Wild Speaker Diarization

Zexu Pan, Gordon Wichern, François G. Germain et al.

Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.

65.4SDMar 23
Velocity Potential Neural Field for Efficient Ambisonics Impulse Response Modeling

Yoshiki Masuyama, Francois G. Germain, Gordon Wichern et al.

First-order Ambisonics (FOA) is a standard spatial audio format based on spherical harmonic decomposition. Its zeroth- and first-order components capture the sound pressure and particle velocity, respectively. Recently, physics-informed neural networks have been applied to the spatial interpolation of FOA signals, regularizing the network outputs based on soft penalty terms derived from physical principles, e.g., the linearized momentum equation. In this paper, we reformulate the task so that the predicted FOA signal automatically satisfies the linearized momentum equation. Our network approximates a scalar function called velocity potential, rather than the FOA signal itself. Then, the FOA signal can be readily recovered through the partial derivatives of the velocity potential with respect to the network inputs (i.e., time and microphone position) according to physics of sound propagation. By deriving the four channels of FOA from the single-channel velocity potential, the reconstructed signal follows the physical principle at any time and position by construction. Experimental results on room impulse response reconstruction confirm the effectiveness of the proposed framework.

SDApr 4, 2023
Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT

Ke Chen, Gordon Wichern, François G. Germain et al.

In spite of the progress in music source separation research, the small amount of publicly-available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely-unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model. We first investigate the potential impact of the original HuBERT model by inserting an adapted version of it into the well-known Demucs V2 time-domain separation model architecture. We then propose a time-frequency-domain self-supervised model, Pac-HuBERT (for primitive auditory clustering HuBERT), that we later use in combination with a Res-U-Net decoder for source separation. Pac-HuBERT uses primitive auditory features of music as unsupervised clustering labels to initialize the self-supervised pretraining process using the Free Music Archive (FMA) dataset. The resulting framework achieves better source-to-distortion ratio (SDR) performance on the MusDB18 test set than the original Demucs V2 and Res-U-Net models. We further demonstrate that it can boost performance with small amounts of supervised data. Ultimately, our proposed framework is an effective solution to the challenge of limited clean source data for music source separation.

48.9ASMay 7
Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

Julius Richter, Yoshiki Masuyama, Christoph Boeddeker et al.

We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains up +1.0 NISQA for speech separation.

ASMay 19, 2025
Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses

Christopher Ick, Gordon Wichern, Yoshiki Masuyama et al.

The characteristics of a sound field are intrinsically linked to the geometric and spatial properties of the environment surrounding a sound source and a listener. The physics of sound propagation is captured in a time-domain signal known as a room impulse response (RIR). Prior work using neural fields (NFs) has allowed learning spatially-continuous representations of RIRs from finite RIR measurements. However, previous NF-based methods have focused on monaural omnidirectional or at most binaural listeners, which does not precisely capture the directional characteristics of a real sound field at a single point. We propose a direction-aware neural field (DANF) that more explicitly incorporates the directional information by Ambisonic-format RIRs. While DANF inherently captures spatial relations between sources and listeners, we further propose a direction-aware loss. In addition, we investigate the ability of DANF to adapt to new rooms in various ways including low-rank adaptation.

ASJun 18, 2025
Factorized RVQ-GAN For Disentangled Speech Tokenization

Sameer Khurana, Dominik Klement, Antoine Laurent et al.

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels-acoustic, phonetic, and lexical-within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.

LGJan 10, 2025
Meta-Learning for Physically-Constrained Neural System Identification

Ankush Chakrabarty, Gordon Wichern, Vedang M. Deshpande et al.

We present a gradient-based meta-learning framework for rapid adaptation of neural state-space models (NSSMs) for black-box system identification. When applicable, we also incorporate domain-specific physical constraints to improve the accuracy of the NSSM. The major benefit of our approach is that instead of relying solely on data from a single target system, our framework utilizes data from a diverse set of source systems, enabling learning from limited target data, as well as with few online training iterations. Through benchmark examples, we demonstrate the potential of our approach, study the effect of fine-tuning subnetworks rather than full fine-tuning, and report real-world case studies to illustrate the practical application and generalizability of the approach to practical problems with physical-constraints. Specifically, we show that the meta-learned models result in improved downstream performance in model-based state estimation in indoor localization and energy systems.

LGMay 31, 2025
Probabilistic Forecasting for Building Energy Systems using Time-Series Foundation Models

Young Jin Park, Francois Germain, Jing Liu et al. · mit

Decision-making in building energy systems critically depends on the predictive accuracy of relevant time-series models. In scenarios lacking extensive data from a target building, foundation models (FMs) represent a promising technology that can leverage prior knowledge from vast and diverse pre-training datasets to construct accurate probabilistic predictors for use in decision-making tools. This paper investigates the applicability and fine-tuning strategies of time-series foundation models (TSFMs) in building energy forecasting. We analyze both full fine-tuning and parameter-efficient fine-tuning approaches, particularly low-rank adaptation (LoRA), by using real-world data from a commercial net-zero energy building to capture signals such as room occupancy, carbon emissions, plug loads, and HVAC energy consumption. Our analysis reveals that the zero-shot predictive performance of TSFMs is generally suboptimal. To address this shortcoming, we demonstrate that employing either full fine-tuning or parameter-efficient fine-tuning significantly enhances forecasting accuracy, even with limited historical data. Notably, fine-tuning with low-rank adaptation (LoRA) substantially reduces computational costs without sacrificing accuracy. Furthermore, fine-tuned TSFMs consistently outperform state-of-the-art deep forecasting models (e.g., temporal fusion transformers) in accuracy, robustness, and generalization across varying building zones and seasonal conditions. These results underline the efficacy of TSFMs for practical, data-constrained building energy management systems, enabling improved decision-making in pursuit of energy efficiency and sustainability.

ASApr 19, 2025
Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training

Christopher Ick, Gordon Wichern, Yoshiki Masuyama et al.

This report details MERL's system for room impulse response (RIR) estimation submitted to the Generative Data Augmentation Workshop at ICASSP 2025 for Augmenting RIR Data (Task 1) and Improving Speaker Distance Estimation (Task 2). We first pre-train a neural acoustic field conditioned by room geometry on an external large-scale dataset in which pairs of RIRs and the geometries are provided. The neural acoustic field is then adapted to each target room by using the enrollment data, where we leverage either the provided room geometries or geometries retrieved from the external dataset, depending on availability. Lastly, we predict the RIRs for each pair of source and receiver locations specified by Task 1, and use these RIRs to train the speaker distance estimation model in Task 2.

ASOct 19, 2021
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

Darius Petermann, Gordon Wichern, Zhong-Qiu Wang et al.

The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad categories of speech, music, and sound effects (understood to include ambient noise and natural sound events) has been left largely unexplored, despite a wide range of potential applications. This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster (DnR) dataset to foster research on this topic. DnR is built from three well-established audio datasets (LibriSpeech, FMA, FSD50k), taking care to reproduce conditions similar to professionally produced content in terms of source overlap and relative loudness, and made available at CD quality. We benchmark standard source separation algorithms on DnR, and further introduce a new multi-resolution model to better address the variety of acoustic characteristics of the three source types. Our best model produces SI-SDR improvements over the mixture of 11.0 dB for music, 11.2 dB for speech, and 10.8 dB for sound effects.

SDOct 1, 2021
Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

A promising approach for multi-microphone speech separation involves two deep neural networks (DNN), where the predicted target speech from the first DNN is used to compute signal statistics for time-invariant minimum variance distortionless response (MVDR) beamforming, and the MVDR result is then used as extra features for the second DNN to predict target speech. Previous studies suggested that the MVDR result can provide complementary information for the second DNN to better predict target speech. However, on fixed-geometry arrays, both DNNs can take in, for example, the real and imaginary (RI) components of the multi-channel mixture as features to leverage the spatial and spectral information for enhancement. It is not explained clearly why the linear MVDR result can be complementary and why it is still needed, considering that the DNNs and the beamformer use the same input, and the DNNs perform non-linear filtering and could render the linear filtering of MVDR unnecessary. Similarly, in monaural cases, one can replace the MVDR beamformer with a monaural weighted prediction error (WPE) filter. Although the linear WPE filter and the DNNs use the same mixture RI components as input, the WPE result is found to significantly improve the second DNN. This study provides a novel explanation from the perspective of the low-distortion nature of such algorithms, and finds that they can consistently improve phase estimation. Equipped with this understanding, we investigate several low-distortion target estimation algorithms including several beamformers, WPE, forward convolutive prediction, and their combinations, and use their results as extra features to train the second network to achieve better enhancement. Evaluation results on single- and multi-microphone speech dereverberation and enhancement tasks indicate the effectiveness of the proposed approach, and the validity of the proposed view.

SDAug 16, 2021
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation

Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

A promising approach for speech dereverberation is based on supervised learning, where a deep neural network (DNN) is trained to predict the direct sound from noisy-reverberant speech. This data-driven approach is based on leveraging prior knowledge of clean speech patterns and seldom explicitly exploits the linear-filter structure in reverberation, i.e., that reverberation results from a linear convolution between a room impulse response (RIR) and a dry source signal. In this work, we propose to exploit this linear-filter structure within a deep learning based monaural speech dereverberation framework. The key idea is to first estimate the direct-path signal of the target speaker using a DNN and then identify signals that are decayed and delayed copies of the estimated direct-path signal, as these can be reliably considered as reverberation. They can be either directly removed for dereverberation, or used as extra features for another DNN to perform better dereverberation. To identify the copies, we estimate the underlying filter (or RIR) by efficiently solving a linear regression problem per frequency in the time-frequency domain. We then modify the proposed algorithm for speaker separation in reverberant and noisy-reverberant conditions. State-of-the-art speech dereverberation and speaker separation results are obtained on the REVERB, SMS-WSJ, and WHAMR! datasets.

SDAug 16, 2021
Convolutive Prediction for Reverberant Speech Separation

Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

We investigate the effectiveness of convolutive prediction, a novel formulation of linear prediction for speech dereverberation, for speaker separation in reverberant conditions. The key idea is to first use a deep neural network (DNN) to estimate the direct-path signal of each speaker, and then identify delayed and decayed copies of the estimated direct-path signal. Such copies are likely due to reverberation, and can be directly removed for dereverberation or used as extra features for another DNN to perform better dereverberation and separation. To identify such copies, we solve a linear regression problem per frequency efficiently in the time-frequency (T-F) domain to estimate the underlying room impulse response (RIR). In the multi-channel extension, we perform minimum variance distortionless response (MVDR) beamforming on the outputs of convolutive prediction. The beamforming and dereverberation results are used as extra features for a second DNN to perform better separation and dereverberation. State-of-the-art results are obtained on the SMS-WSJ corpus.

SDAug 11, 2021
On The Compensation Between Magnitude and Phase in Speech Separation

Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux

Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores if the evaluation metrics are objective time-domain metrics, they however produce worse scores on speech quality and intelligibility metrics and usually lead to worse speech recognition performance, compared with including a loss on magnitude. While this phenomenon has been experimentally observed by many studies, it is often not accurately explained and there lacks a thorough understanding on its fundamental cause. This paper provides a novel view from the perspective of the implicit compensation between estimated magnitude and phase. Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of our view.

LGJun 29, 2021
Attentive Neural Processes and Batch Bayesian Optimization for Scalable Calibration of Physics-Informed Digital Twins

Ankush Chakrabarty, Gordon Wichern, Christopher Laughman

Physics-informed dynamical system models form critical components of digital twins of the built environment. These digital twins enable the design of energy-efficient infrastructure, but must be properly calibrated to accurately reflect system behavior for downstream prediction and analysis. Dynamical system models of modern buildings are typically described by a large number of parameters and incur significant computational expenditure during simulations. To handle large-scale calibration of digital twins without exorbitant simulations, we propose ANP-BBO: a scalable and parallelizable batch-wise Bayesian optimization (BBO) methodology that leverages attentive neural processes (ANPs).

SDOct 22, 2020
Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

Yun-Ning Hung, Gordon Wichern, Jonathan Le Roux

Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources, and score is used only as a training target, not required for inference. Our model consists of a separator that outputs a time-frequency mask for each instrument, and a transcriptor that acts as a critic, providing both temporal and frequency supervision to guide the learning of the separator. A harmonic mask constraint is introduced as another way of leveraging score information during training, and we propose two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator. Results demonstrate that using score information outperforms temporal weak-labels, and adversarial structures lead to further improvements in both separation and transcription performance.

ASJul 25, 2020
AutoClip: Adaptive Gradient Clipping for Source Separation Networks

Prem Seetharaman, Gordon Wichern, Bryan Pardo et al.

Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization performance for audio source separation networks. Observation of the training dynamics of a separation network trained with and without AutoClip show that AutoClip guides optimization into smoother parts of the loss landscape. AutoClip is very simple to implement and can be integrated readily into a variety of applications across multiple domains.

SDNov 6, 2019
Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision

Fatemeh Pishdadian, Gordon Wichern, Jonathan Le Roux

While there has been much recent progress using deep learning techniques to separate speech and music audio signals, these systems typically require large collections of isolated sources during the training process. When extending audio source separation algorithms to more general domains such as environmental monitoring, it may not be possible to obtain isolated signals for training. Here, we propose objective functions and network architectures that enable training a source separation system with weak labels. In this scenario, weak labels are defined in contrast with strong time-frequency (TF) labels such as those obtained from isolated sources, and refer either to frame-level weak labels where one only has access to the time periods when different sources are active in an audio mixture, or to clip-level weak labels that only indicate the presence or absence of sounds in an entire audio clip. We train a separator that estimates a TF mask for each type of sound event, using a sound event classifier as an assessor of the separator's performance to bridge the gap between the TF-level separation and the ground truth weak labels only available at the frame or clip level. Our objective function requires the classifier applied to a separated source to assign high probability to the class corresponding to that source and low probability to all other classes. The objective function also enforces that the separated sources sum up to the mixture. We benchmark the performance of our algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments. Compared to training a network using isolated sources, our model achieves somewhat lower but still significant SI-SDR improvement, even in scenarios with significant sound event overlap.

SDOct 23, 2019
Bootstrapping deep music separation from primitive auditory grouping principles

Prem Seetharaman, Gordon Wichern, Jonathan Le Roux et al.

Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The brain uses primitive cues that are independent of the characteristics of any particular sound source to perform an initial segmentation of the audio scene. We present a method for bootstrapping a deep model for music source separation without ground truth by using multiple primitive cues. We apply our method to train a network on a large set of unlabeled music recordings from YouTube to separate vocals from accompaniment without the need for ground truth isolated sources or artificial training mixtures.

SDOct 22, 2019
WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

Matthew Maciejewski, Gordon Wichern, Emmett McQuinn et al.

While significant advances have been made with respect to the separation of overlapping speech signals, studies have been largely constrained to mixtures of clean, near anechoic speech, not representative of many real-world scenarios. Although the WHAM! dataset introduced noise to the ubiquitous wsj0-2mix dataset, it did not include reverberation, which is generally present in indoor recordings outside of recording studios. The spectral smearing caused by reverberation can result in significant performance degradation for standard deep learning-based speech separation systems, which rely on spectral structure and the sparsity of speech signals to tease apart sources. To address this, we introduce WHAMR!, an augmented version of WHAM! with synthetic reverberated sources, and provide a thorough baseline analysis of current techniques as well as novel cascaded architectures on the newly introduced conditions.

SDSep 18, 2019
Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity

Ethan Manilow, Gordon Wichern, Prem Seetharaman et al.

Music source separation performance has greatly improved in recent years with the advent of approaches based on deep learning. Such methods typically require large amounts of labelled training data, which in the case of music consist of mixtures and corresponding instrument stems. However, stems are unavailable for most commercial music, and only limited datasets have so far been released to the public. It can thus be difficult to draw conclusions when comparing various source separation methods, as the difference in performance may stem as much from better data augmentation techniques or training tricks to alleviate the limited availability of training data, as from intrinsically better model architectures and objective functions. In this paper, we present the synthesized Lakh dataset (Slakh) as a new tool for music source separation research. Slakh consists of high-quality renderings of instrumental mixtures and corresponding stems generated from the Lakh MIDI dataset (LMD) using professional-grade sample-based virtual instruments. A first version, Slakh2100, focuses on 2100 songs, resulting in 145 hours of mixtures. While not fully comparable because it is purely instrumental, this dataset contains an order of magnitude more data than MUSDB18, the {\it de facto} standard dataset in the field. We show that Slakh can be used to effectively augment existing datasets for musical instrument separation, while opening the door to a wide array of data-intensive music signal analysis tasks.

SDJul 2, 2019
WHAM!: Extending Speech Separation to Noisy Environments

Gordon Wichern, Joe Antognini, Michael Flynn et al.

Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we strive to move the field towards more realistic and challenging scenarios. To that end, we created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset, consisting of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area, and are made publicly available. We benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, we still observe substantial gains relative to the noisy signals for most approaches.

SDNov 7, 2018
Class-conditional embeddings for music source separation

Prem Seetharaman, Gordon Wichern, Shrikant Venkataramani et al.

Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods. While most musical source separation techniques learn an independent model for each instrument, we propose using a common embedding space for the time-frequency bins of all instruments in a mixture inspired by deep clustering and deep attractor networks. Additionally, an auxiliary network is used to generate parameters of a Gaussian mixture model (GMM) where the posterior distribution over GMM components in the embedding space can be used to create a mask that separates individual sources from a mixture. In addition to outperforming a mask-inference baseline on the MUSDB-18 dataset, our embedding space is easily interpretable and can be used for query-based separation.

SDNov 6, 2018
Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures

Prem Seetharaman, Gordon Wichern, Jonathan Le Roux et al.

Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditionally, such systems are trained on sound mixtures where the ground truth decomposition is already known. Since most real-world recordings do not have such a decomposition available, this limits the range of mixtures one can train on, and the range of mixtures the learned models may successfully separate. In this work, we use a simple blind spatial source separation algorithm to generate estimated decompositions of stereo mixtures. These estimates, together with a weighting scheme in the time-frequency domain, based on confidence in the separation quality, are used to train a deep learning model that can be used for single-channel separation, where no source direction information is available. This demonstrates how a simple cue such as the direction of origin of source can be used to bootstrap a model for source separation that can be used in situations where that cue is not available.

SDOct 2, 2018
Phasebook and Friends: Leveraging Discrete Representations for Source Separation

Jonathan Le Roux, Gordon Wichern, Shinji Watanabe et al.

Deep learning based speech enhancement and source separation systems have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source by estimating a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of phase estimation: the phase of the mixture is most often used when reconstructing the estimated time-domain signal. Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks. Magbook layers extend classical sigmoidal units and a recently introduced convex softmax activation for mask-based magnitude estimation. Phasebook layers use a similar structure to give an estimate of the phase mask without suffering from phase wrapping issues. Combook layers are an alternative to the magbook-phasebook combination that directly estimate complex masks. We present various training and inference schemes involving these representations, and explain in particular how to include them in an end-to-end learning framework. We also present an oracle study to assess upper bounds on performance for various types of masks using discrete phase representations. We evaluate the proposed methods on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.

CLJun 21, 2018
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Chiori Hori, Huda Alamri, Jue Wang et al.

Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer(QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.