Wolfgang Mack

AS
h-index1
8papers
140citations
Novelty39%
AI Score34

8 Papers

ASMar 13, 2023
Multi-Microphone Speaker Separation by Spatial Regions

Julian Wechsler, Srikanth Raj Chetupalli, Wolfgang Mack et al.

We consider the task of region-based source separation of reverberant multi-microphone recordings. We assume pre-defined spatial regions with a single active source per region. The objective is to estimate the signals from the individual spatial regions as captured by a reference microphone while retaining a correspondence between signals and spatial regions. We propose a data-driven approach using a modified version of a state-of-the-art network, where different layers model spatial and spectro-temporal information. The network is trained to enforce a fixed mapping of regions to network outputs. Using speech from LibriMix, we construct a data set specifically designed to contain the region information. Additionally, we train the network with permutation invariant training. We show that both training methods result in a fixed mapping of regions to network outputs, achieve comparable performance, and that the networks exploit spatial information. The proposed network outperforms a baseline network by 1.5 dB in scale-invariant signal-to-distortion ratio.

SDJul 26, 2025
Improving Audio Classification by Transitioning from Zero- to Few-Shot

James Taylor, Wolfgang Mack

State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.

ASFeb 7, 2025
Efficient Evaluation of Quantization-Effects in Neural Codecs

Wolfgang Mack, Ahmed Mustafa, Rafał Łaganowski et al.

Neural codecs, comprising an encoder, quantizer, and decoder, enable signal transmission at exceptionally low bitrates. Training these systems requires techniques like the straight-through estimator, soft-to-hard annealing, or statistical quantizer emulation to allow a non-zero gradient across the quantizer. Evaluating the effect of quantization in neural codecs, like the influence of gradient passing techniques on the whole system, is often costly and time-consuming due to training demands and the lack of affordable and reliable metrics. This paper proposes an efficient evaluation framework for neural codecs using simulated data with a defined number of bits and low-complexity neural encoders/decoders to emulate the non-linear behavior in larger networks. Our system is highly efficient in terms of training time and computational and hardware requirements, allowing us to uncover distinct behaviors in neural codecs. We propose a modification to stabilize training with the straight-through estimator based on our findings. We validate our findings against an internal neural audio codec and against the state-of-the-art descript-audio-codec.

ASFeb 1, 2022
New Insights on Target Speaker Extraction

Mohamed Elminshawi, Wolfgang Mack, Srikanth Raj Chetupalli et al.

Speaker extraction (SE) aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. Several forms of auxiliary information have been employed in single-channel SE, such as a speech snippet enrolled from the target speaker or visual information corresponding to the spoken utterance. The effectiveness of the auxiliary information in SE is typically evaluated by comparing the extraction performance of SE with uninformed speaker separation (SS) methods. Following this evaluation protocol, many SE studies have reported performance improvement compared to SS, attributing this to the auxiliary information. However, such studies have been conducted on a few datasets and have not considered recent deep neural network architectures for SS that have shown impressive separation performance. In this paper, we examine the role of the auxiliary information in SE for different input scenarios and over multiple datasets. Specifically, we compare the performance of two SE systems (audio-based and video-based) with SS using a common framework that utilizes the recently proposed dual-path recurrent neural network as the main learning machine. Experimental evaluation on various datasets demonstrates that the use of auxiliary information in the considered SE systems does not always lead to better extraction performance compared to the uninformed SS system. Furthermore, we offer insights into the behavior of the SE systems when provided with different and distorted auxiliary information given the same mixture input.

ASJan 3, 2022
Signal-Aware Direction-of-Arrival Estimation Using Attention Mechanisms

Wolfgang Mack, Julian Wechsler, Emanuël A. P. Habets

The direction-of-arrival (DOA) of sound sources is an essential acoustic parameter used, e.g., for multi-channel speech enhancement or source tracking. Complex acoustic scenarios consisting of sources-of-interest, interfering sources, reverberation, and noise make the estimation of the DOAs corresponding to the sources-of-interest a challenging task. Recently proposed attention mechanisms allow DOA estimators to focus on the sources-of-interest and disregard interference and noise, i.e., they are signal-aware. The attention is typically obtained by a deep neural network (DNN) from a short-time Fourier transform (STFT) based representation of a single microphone signal. Subsequently, attention has been applied as binary or ratio weighting to STFT-based microphone signal representations to reduce the impact of frequency bins dominated by noise, interference, or reverberation. The impact of attention on DOA estimators and different training strategies for attention and DOA DNNs are not yet studied in depth. In this paper, we evaluate systems consisting of different DNNs and signal processing-based methods for DOA estimation when attention is applied. Additionally, we propose training strategies for attention-based DOA estimation optimized via a DOA objective, i.e., end-to-end. The evaluation of the proposed and the baseline systems is performed using data generated with simulated and measured room impulse responses under various acoustic conditions, like reverberation times, noise, and source array distances. Overall, DOA estimation using attention in combination with signal-processing methods exhibits a far lower computational complexity than a fully DNN-based system; however, it yields comparable results.

ASNov 9, 2020
Informed Source Extraction With Application to Acoustic Echo Reduction

Mohamed Elminshawi, Wolfgang Mack, Emanuël A. P. Habets

Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector that encapsulates the characteristics of the target speaker. However, such modeling deliberately neglects the time-varying properties of the reference signal. In this work, we assume that a reference signal is available that is temporally correlated with the target signal. To take this correlation into account, we propose a time-varying source discriminative model that captures the temporal dynamics of the reference signal. We also show that existing methods and the proposed method can be generalized to non-speech sources as well. Experimental results demonstrate that the proposed method significantly improves the extraction performance when applied in an acoustic echo reduction scenario.

ASNov 9, 2020
Efficient Training Data Generation for Phase-Based DOA Estimation

Fabian Hübner, Wolfgang Mack, Emanuël A. P. Habets

Deep learning (DL) based direction of arrival (DOA) estimation is an active research topic and currently represents the state-of-the-art. Usually, DL-based DOA estimators are trained with recorded data or computationally expensive generated data. Both data types require significant storage and excessive time to, respectively, record or generate. We propose a low complexity online data generation method to train DL models with a phase-based feature input. The data generation method models the phases of the microphone signals in the frequency domain by employing a deterministic model for the direct path and a statistical model for the late reverberation of the room transfer function. By an evaluation using data from measured room impulse responses, we demonstrate that a model trained with the proposed training data generation method performs comparably to models trained with data generated based on the source-image method.

SDApr 17, 2019
Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters

Wolfgang Mack, Emanuël A. P. Habets

Signal extraction from a single-channel mixture with additional undesired signals is most commonly performed using time-frequency (TF) masks. Typically, the mask is estimated with a deep neural network (DNN), and element-wise applied to the complex mixture short-time Fourier transform (STFT) representation to perform the extraction. Ideal mask magnitudes are zero for solely undesired signals in a TF bin and undefined for total destructive interference. Usually, masks have an upper bound to provide well-defined DNN outputs at the cost of limited extraction capabilities. We propose to estimate with a DNN a complex TF filter for each mixture TF bin which maps an STFT area in the respective mixture to the desired TF bin to address destructive interference in mixture TF bins. The DNN is optimized by minimizing the error between the extracted and the ground-truth desired signal allowing to learn the TF filters without having to specify ground-truth TF filters. We compare our approach with complex and real-valued TF masks by separating speech from a variety of different sound and noise classes from the Google AudioSet corpus. We also process the mixture STFT with notch-filters and zero whole time-frames, to simulate packet-loss during transmission, to demonstrate the reconstruction capabilities of our approach. The proposed method outperformed the baselines, especially when notch-filters and time-frame zeroing were applied.