SDJul 23, 2023
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning RepresentationYoshiki Masuyama, Xuankai Chang, Wangyou Zhang et al. · cmu
Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%).
72.6ASMar 25
Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised LearningDaisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda et al.
Since the introduction of Masked Autoencoders, various improvements to masking techniques have been explored. In this paper, we rethink masking strategies for audio representation learning using masked prediction-based self-supervised learning (SSL) on general audio spectrograms. While recent informed masking techniques have attracted attention, we observe that they incur substantial computational overhead. Motivated by this observation, we propose dispersion-weighted masking (DWM), a lightweight masking strategy that leverages the spectral sparsity inherent in the frequency structure of audio content. Our experiments show that inverse block masking, commonly used in recent SSL frameworks, improves audio event understanding performance while introducing a trade-off in generalization. The proposed DWM alleviates these limitations and computational complexity, leading to consistent performance improvements. This work provides practical guidance on masking strategy design for masked prediction-based audio representation learning.
20.0LGMay 7
DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight CompressionNobutaka Ono
In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from $mn$ to $m+k+n$. For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.
MMApr 12, 2024
Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event AnalysisMasahiro Yasuda, Noboru Harada, Yasunori Ohishi et al.
Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensors. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when sensors were reduced.
ASFeb 12, 2021
Joint Dereverberation and Separation with Iterative Source SteeringTaishi Nakashima, Robin Scheibler, Masahito Togami et al.
We propose a new algorithm for joint dereverberation and blind source separation (DR-BSS). Our work builds upon the IRLMA-T framework that applies a unified filter combining dereverberation and separation. One drawback of this framework is that it requires several matrix inversions, an operation inherently costly and with potential stability issues. We leverage the recently introduced iterative source steering (ISS) updates to propose two algorithms mitigating this issue. Albeit derived from first principles, the first algorithm turns out to be a natural combination of weighted prediction error (WPE) dereverberation and ISS-based BSS, applied alternatingly. In this case, we manage to reduce the number of matrix inversion to only one per iteration and source. The second algorithm updates the ILRMA-T matrix using only sequential ISS updates requiring no matrix inversion at all. Its implementation is straightforward and memory efficient. Numerical experiments demonstrate that both methods achieve the same final performance as ILRMA-T in terms of several relevant objective metrics. In the important case of two sources, the number of iterations required is also similar.
SPApr 8, 2020
MM Algorithms for Joint Independent Subspace Analysis with Application to Blind Single and Multi-Source ExtractionRobin Scheibler, Nobutaka Ono
In this work, we propose efficient algorithms for joint independent subspace analysis (JISA), an extension of independent component analysis that deals with parallel mixtures, where not all the components are independent. We derive an algorithmic framework for JISA based on the majorization-minimization (MM) optimization technique (JISA-MM). We use a well-known inequality for super-Gaussian sources to derive a surrogate function of the negative log-likelihood of the observed data. The minimization of this surrogate function leads to a variant of the hybrid exact-approximate diagonalization problem, but where multiple demixing vectors are grouped together. In the spirit of auxiliary function based independent vector analysis (AuxIVA), we propose several updates that can be applied alternately to one, or jointly to two, groups of demixing vectors. Recently, blind extraction of one or more sources has gained interest as a reasonable way of exploiting larger microphone arrays to achieve better separation. In particular, several MM algorithms have been proposed for overdetermined IVA (OverIVA). By applying JISA-MM, we are not only able to rederive these in a general manner, but also find several new algorithms. We run extensive numerical experiments to evaluate their performance, and compare it to that of full separation with AuxIVA. We find that algorithms using pairwise updates of two sources, or of one source and the background have the fastest convergence, and are able to separate target sources quickly and precisely from the background. In addition, we characterize the performance of all algorithms under a large number of noise, reverberation, and background mismatch conditions.
SDOct 23, 2019
Fast Independent Vector Extraction by Iterative SINR MaximizationRobin Scheibler, Nobutaka Ono
We propose fast independent vector extraction (FIVE), a new algorithm that blindly extracts a single non-Gaussian source from a Gaussian background. The algorithm iteratively computes beamforming weights maximizing the signal-to-interference-and-noise ratio for an approximate noise covariance matrix. We demonstrate that this procedure minimizes the negative log-likelihood of the input data according to a well-defined probabilistic model. The minimization is carried out via the auxiliary function technique whereas, unlike related methods, the auxiliary function is globally minimized at every iteration. Numerical experiments are carried out to assess the performance of FIVE. We find that it is vastly superior to competing methods in terms of convergence speed, and has high potential for real-time applications.
SDMay 20, 2019
Independent Vector Analysis with more Microphones than SourcesRobin Scheibler, Nobutaka Ono
We extend frequency-domain blind source separation based on independent vector analysis to the case where there are more microphones than sources. The signal is modelled as non-Gaussian sources in a Gaussian background. The proposed algorithm is based on a parametrization of the demixing matrix decreasing the number of parameters to estimate. Furthermore, orthogonal constraints between the signal and background subspaces are imposed to regularize the separation. The problem can then be posed as a constrained likelihood maximization. We propose efficient alternating updates guaranteed to converge to a stationary point of the cost function. The performance of the algorithm is assessed on simulated signals. We find that the separation performance is on par with that of the conventional determined algorithm at a fraction of the computational cost.
SDApr 4, 2019
Multi-modal Blind Source Separation with Microphones and BlinkiesRobin Scheibler, Nobutaka Ono
We propose a blind source separation algorithm that jointly exploits measurements by a conventional microphone array and an ad hoc array of low-rate sound power sensors called blinkies. While providing less information than microphones, blinkies circumvent some difficulties of microphone arrays in terms of manufacturing, synchronization, and deployment. The algorithm is derived from a joint probabilistic model of the microphone and sound power measurements. We assume the separated sources to follow a time-varying spherical Gaussian distribution, and the non-negative power measurement space-time matrix to have a low-rank structure. We show that alternating updates similar to those of independent vector analysis and Itakura-Saito non-negative matrix factorization decrease the negative log-likelihood of the joint distribution. The proposed algorithm is validated via numerical experiments. Its median separation performance is found to be up to 8 dB more than that of independent vector analysis, with significantly reduced variability.
ASJun 27, 2018
Independent Deeply Learned Matrix Analysis for Multichannel Audio Source SeparationShinichi Mogami, Hayato Sumino, Daichi Kitamura et al.
In this paper, we address a multichannel audio source separation task and propose a new efficient method called independent deeply learned matrix analysis (IDLMA). IDLMA estimates the demixing matrix in a blind manner and updates the time-frequency structures of each source using a pretrained deep neural network (DNN). Also, we introduce a complex Student's t-distribution as a generalized source generative model including both complex Gaussian and Cauchy distributions. Experiments are conducted using music signals with a training dataset, and the results show the validity of the proposed method in terms of separation accuracy and computational cost.
SDAug 16, 2017
Independent Low-Rank Matrix Analysis Based on Complex Student's $t$-Distribution for Blind Audio Source SeparationShinichi Mogami, Daichi Kitamura, Yoshiki Mitsui et al.
In this paper, we generalize a source generative model in a state-of-the-art blind source separation (BSS), independent low-rank matrix analysis (ILRMA). ILRMA is a unified method of frequency-domain independent component analysis and nonnegative matrix factorization and can provide better performance for audio BSS tasks. To further improve the performance and stability of the separation, we introduce an isotropic complex Student's $t$-distribution as a source generative model, which includes the isotropic complex Gaussian distribution used in conventional ILRMA. Experiments are conducted using both music and speech BSS tasks, and the results show the validity of the proposed method.
AIApr 8, 2014
A Stochastic Temporal Model of Polyphonic MIDI Performance with OrnamentsEita Nakamura, Nobutaka Ono, Shigeki Sagayama et al.
We study indeterminacies in realization of ornaments and how they can be incorporated in a stochastic performance model applicable for music information processing such as score-performance matching. We point out the importance of temporal information, and propose a hidden Markov model which describes it explicitly and represents ornaments with several state types. Following a review of the indeterminacies, they are carefully incorporated into the model through its topology and parameters, and the state construction for quite general polyphonic scores is explained in detail. By analyzing piano performance data, we find significant overlaps in inter-onset-interval distributions of chordal notes, ornaments, and inter-chord events, and the data is used to determine details of the model. The model is applied for score following and offline score-performance matching, yielding highly accurate matching for performances with many ornaments and relatively frequent errors, repeats, and skips.
AIApr 8, 2014
Outer-Product Hidden Markov Model and Polyphonic MIDI Score FollowingEita Nakamura, Tomohiko Nakamura, Yasuyuki Saito et al.
We present a polyphonic MIDI score-following algorithm capable of following performances with arbitrary repeats and skips, based on a probabilistic model of musical performances. It is attractive in practical applications of score following to handle repeats and skips which may be made arbitrarily during performances, but the algorithms previously described in the literature cannot be applied to scores of practical length due to problems with large computational complexity. We propose a new type of hidden Markov model (HMM) as a performance model which can describe arbitrary repeats and skips including performer tendencies on distributed score positions before and after them, and derive an efficient score-following algorithm that reduces computational complexity without pruning. A theoretical discussion on how much such information on performer tendencies improves the score-following results is given. The proposed score-following algorithm also admits performance mistakes and is demonstrated to be effective in practical situations by carrying out evaluations with human performances. The proposed HMM is potentially valuable for other topics in information processing and we also provide a detailed description of inference algorithms.