Ritwik Giri

AS
h-index1
10papers
1,385citations
Novelty50%
AI Score42

10 Papers

ASMar 28, 2022
Improved singing voice separation with chromagram-based pitch-aware remixing

Siyuan Yuan, Zhepei Wang, Umut Isik et al.

Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR)

SDFeb 2
Masked Autoencoders as Universal Speech Enhancer

Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang et al.

Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.

ASFeb 12, 2021
Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders

Jonah Casebeer, Vinjai Vale, Umut Isik et al.

Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable quality speech output. However, these models are tightly coupled with speech content, and produce unintended outputs in noisy conditions. Based on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.

ASAug 11, 2020
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

Umut Isik, Ritwik Giri, Neerad Phansalkar et al.

Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.

SDJan 30, 2020
Channel-Attention Dense U-Net for Multichannel Speech Enhancement

Bahareh Tolooshams, Ritwik Giri, Andrew H. Song et al.

Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.

CLJan 19, 2020
From Speech-to-Speech Translation to Automatic Dubbing

Marcello Federico, Robert Enyedi, Roberto Barra-Chicote et al.

We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine tuning of the duration of each utterance, and, finally, audio rendering to enriches text-to-speech output with background noise and reverberation extracted from the original audio. We report on a subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.

CVMar 30, 2017
Relevance Subject Machine: A Novel Person Re-identification Framework

Igor Fedorov, Ritwik Giri, Bhaskar D. Rao et al.

We propose a novel method called the Relevance Subject Machine (RSM) to solve the person re-identification (re-id) problem. RSM falls under the category of Bayesian sparse recovery algorithms and uses the sparse representation of the input video under a pre-defined dictionary to identify the subject in the video. Our approach focuses on the multi-shot re-id problem, which is the prevalent problem in many video analytics applications. RSM captures the essence of the multi-shot re-id problem by constraining the support of the sparse codes for each input video frame to be the same. Our proposed approach is also robust enough to deal with time varying outliers and occlusions by introducing a sparse, non-stationary noise term in the model error. We provide a novel Variational Bayesian based inference procedure along with an intuitive interpretation of the proposed update rules. We evaluate our approach over several commonly used re-id datasets and show superior performance over current state-of-the-art algorithms. Specifically, for ILIDS-VID, a recent large scale re-id dataset, RSM shows significant improvement over all published approaches, achieving an 11.5% (absolute) improvement in rank 1 accuracy over the closest competing algorithm considered.

CVMay 6, 2016
Robust Bayesian Method for Simultaneous Block Sparse Signal Recovery with Applications to Face Recognition

Igor Fedorov, Ritwik Giri, Bhaskar D. Rao et al.

In this paper, we present a novel Bayesian approach to recover simultaneously block sparse signals in the presence of outliers. The key advantage of our proposed method is the ability to handle non-stationary outliers, i.e. outliers which have time varying support. We validate our approach with empirical results showing the superiority of the proposed method over competing approaches in synthetic data experiments as well as the multiple measurement face recognition problem.

MLApr 7, 2016
A Unified Framework for Sparse Non-Negative Least Squares using Multiplicative Updates and the Non-Negative Matrix Factorization Problem

Igor Fedorov, Alican Nalci, Ritwik Giri et al.

We study the sparse non-negative least squares (S-NNLS) problem. S-NNLS occurs naturally in a wide variety of applications where an unknown, non-negative quantity must be recovered from linear measurements. We present a unified framework for S-NNLS based on a rectified power exponential scale mixture prior on the sparse codes. We show that the proposed framework encompasses a large class of S-NNLS algorithms and provide a computationally efficient inference procedure based on multiplicative update rules. Such update rules are convenient for solving large sets of S-NNLS problems simultaneously, which is required in contexts like sparse non-negative matrix factorization (S-NMF). We provide theoretical justification for the proposed approach by showing that the local minima of the objective function being optimized are sparse and the S-NNLS algorithms presented are guaranteed to converge to a set of stationary points of the objective function. We then extend our framework to S-NMF, showing that our framework leads to many well known S-NMF algorithms under specific choices of prior and providing a guarantee that a popular subclass of the proposed algorithms converges to a set of stationary points of the objective function. Finally, we study the performance of the proposed approaches on synthetic and real-world data.

LGJul 17, 2015
Type I and Type II Bayesian Methods for Sparse Signal Recovery using Scale Mixtures

Ritwik Giri, Bhaskar D. Rao

In this paper, we propose a generalized scale mixture family of distributions, namely the Power Exponential Scale Mixture (PESM) family, to model the sparsity inducing priors currently in use for sparse signal recovery (SSR). We show that the successful and popular methods such as LASSO, Reweighted $\ell_1$ and Reweighted $\ell_2$ methods can be formulated in an unified manner in a maximum a posteriori (MAP) or Type I Bayesian framework using an appropriate member of the PESM family as the sparsity inducing prior. In addition, exploiting the natural hierarchical framework induced by the PESM family, we utilize these priors in a Type II framework and develop the corresponding EM based estimation algorithms. Some insight into the differences between Type I and Type II methods is provided and of particular interest in the algorithmic development is the Type II variant of the popular and successful reweighted $\ell_1$ method. Extensive empirical results are provided and they show that the Type II methods exhibit better support recovery than the corresponding Type I methods.