SD Aug 21, 2023
LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices
Joerg Schmalenstroeer, Tobias Gburrek, Reinhold Haeb-Umbach
We present LibriWASN, a data set whose design closely follows the LibriCSS meeting recognition data set, with the marked difference that the data is recorded with devices that are randomly positioned on a meeting table and whose sampling clocks are not synchronized. Nine different devices, five smartphones with a single recording channel and four microphone arrays, are used to record a total of 29 channels. Other than that, the data set closely follows the LibriCSS design: the same LibriSpeech sentences are played back from eight loudspeakers arranged around a meeting table, and the data is organized in subsets with different percentages of speech overlap. LibriWASN is meant as a test set for clock synchronization algorithms and for meeting separation, diarization, and transcription systems on ad-hoc wireless acoustic sensor networks. Due to its similarity to LibriCSS, meeting transcription systems developed for LibriCSS can readily be tested on LibriWASN. The data set is recorded in two different rooms and is complemented with ground-truth diarization information of who speaks when.
AS Oct 25, 2021
On Synchronization of Wireless Acoustic Sensor Networks in the Presence of Time-varying Sampling Rate Offsets and Speaker Changes
Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
A wireless acoustic sensor network records audio signals with sampling time and sampling rate offsets between the audio streams, if the analog-digital converters (ADCs) of the network devices are not synchronized. Here, we introduce a new sampling rate offset model to simulate time-varying sampling frequencies caused, for example, by temperature changes of ADC crystal oscillators, and propose an estimation algorithm to handle this dynamic aspect in combination with changing acoustic source positions. Furthermore, we show how deep neural network based estimates of the distances between microphones and human speakers can be used to determine the sampling time offsets. This enables a synchronization of the audio streams to reflect the physical time differences of flight.
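To make the effect of a sampling rate offset (SRO) concrete, the following minimal sketch accumulates a time-varying SRO, given in parts per million (ppm), into the resulting drift in samples between two audio streams. The function name, the frame length, and the linear SRO trajectory are illustrative assumptions and not the model proposed in the paper.

```python
def accumulated_drift(sro_ppm_per_frame, frame_len):
    """Return the cumulative sampling time offset (in samples) after each frame.

    sro_ppm_per_frame: hypothetical SRO trajectory, one value in ppm per frame.
    frame_len: number of samples per frame at the nominal sampling rate.
    """
    drift = 0.0
    trajectory = []
    for sro_ppm in sro_ppm_per_frame:
        # An SRO of eps ppm shifts the streams by eps * 1e-6 samples per sample.
        drift += sro_ppm * 1e-6 * frame_len
        trajectory.append(drift)
    return trajectory


# Illustrative values (assumed): the SRO rises linearly from 50 to 60 ppm
# over 100 frames of 16000 samples each (1 s per frame at 16 kHz).
sro = [50.0 + 10.0 * i / 99 for i in range(100)]
drift = accumulated_drift(sro, 16000)
```

Even a slowly changing offset of a few tens of ppm accumulates to a drift of many samples over minutes of audio, which is why a single constant SRO estimate may not be sufficient for long recordings.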
SD Jun 4, 2021
A Database for Research on Detection and Enhancement of Speech Transmitted over HF links
Jens Heitkaemper, Joerg Schmalenstroeer, Joerg Ullmann et al.
In this paper we present an open database for the development of detection and enhancement algorithms of speech transmitted over HF radio channels. It consists of audio samples recorded by various receivers at different locations across Europe, all monitoring the same single-sideband modulated transmission from a base station in Paderborn, Germany. Transmitted and received speech signals are precisely time aligned to offer parallel data for supervised training of deep learning based detection and enhancement algorithms. For the task of speech activity detection two exemplary baseline systems are presented, one based on statistical methods employing a multi-stage Wiener filter with minimum statistics noise floor estimation, and the other relying on a deep learning approach.
SD Mar 2, 2021
Open Range Pitch Tracking for Carrier Frequency Difference Estimation from HF Transmitted Speech
Joerg Schmalenstroeer, Jens Heitkaemper, Joerg Ullmann et al.
In this paper we investigate the task of detecting carrier frequency differences from demodulated single sideband signals by examining the pitch contours of the received baseband speech signal in the short-time spectral domain. From the detected pitch frequency trajectory and its harmonics, a carrier frequency difference, which is caused by demodulating the radio signal with the wrong carrier frequency, can be deduced. A computationally efficient realization in the power cepstral domain is presented. The core component, i.e., the pitch tracking algorithm, is shown to perform comparably to a state-of-the-art algorithm. The full carrier frequency difference estimation system is tested on recordings of real transmissions over HF links. A comparison with an existing approach shows improved estimation accuracy on both short and longer speech utterances.
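As a rough illustration of the underlying relation: demodulating with a carrier that is off by some frequency difference shifts every spectral component by that difference, so the pitch harmonics appear near k * f0 + shift instead of k * f0. The sketch below fits this line to a handful of harmonic peak frequencies by ordinary least squares. It is not the cepstral-domain method of the paper; the function name, the example peaks, and the assumption that the harmonic indices k are known are all illustrative.

```python
def fit_pitch_and_shift(harmonic_indices, peak_freqs_hz):
    """Least-squares fit of peak_k = k * f0 + shift; returns (f0, shift) in Hz."""
    n = len(harmonic_indices)
    mean_k = sum(harmonic_indices) / n
    mean_f = sum(peak_freqs_hz) / n
    # Slope of the fitted line is the pitch, intercept is the frequency shift.
    cov = sum((k - mean_k) * (f - mean_f)
              for k, f in zip(harmonic_indices, peak_freqs_hz))
    var = sum((k - mean_k) ** 2 for k in harmonic_indices)
    f0 = cov / var
    shift = mean_f - f0 * mean_k
    return f0, shift


# Assumed example: pitch of 120 Hz, every harmonic shifted by 35 Hz.
indices = [1, 2, 3, 4, 5]
peaks = [155.0, 275.0, 395.0, 515.0, 635.0]
f0, shift = fit_pitch_and_shift(indices, peaks)
```

If the harmonic indices are not known, this fit determines the shift only up to a multiple of f0, so a practical system needs additional information to resolve that ambiguity.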
AS Dec 11, 2020
Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks
Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
In this paper we present an approach to geometry calibration in wireless acoustic sensor networks whose nodes are assumed to be equipped with a compact microphone array. The proposed approach relies solely on estimates of the distances between acoustic sources and the nodes that record these sources. It consists of an iterative weighted least squares localization procedure, which is initialized by multidimensional scaling. The positions of the acoustic sources are estimated alongside the sensor node locations. Furthermore, we derive the Cramer-Rao lower bound (CRLB) for source and sensor position estimation, and show by simulation that the estimator is efficient.
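The weighted least squares idea can be sketched on a much simpler sub-problem: estimating one 2-D position from distance estimates to anchors whose positions are already known. The paper estimates node and source positions jointly and initializes with multidimensional scaling; here a fixed initial guess stands in for that initialization, and the function name, the anchor layout, and the unit weights are assumptions made for illustration.

```python
import math


def localize_wls(anchors, distances, weights, init, iterations=20):
    """Estimate a 2-D position from distances to known anchor positions.

    Minimizes sum_i w_i * (d_i - ||p - a_i||)^2 with Gauss-Newton steps.
    """
    x, y = init
    for _ in range(iterations):
        # Accumulate the 2x2 normal equations (J^T W J) delta = J^T W r.
        a11 = a12 = a22 = b1 = b2 = 0.0
        for (ax, ay), d, w in zip(anchors, distances, weights):
            dx, dy = x - ax, y - ay
            rng = math.hypot(dx, dy)
            if rng < 1e-12:
                continue  # Gradient is undefined at an anchor; skip this term.
            jx, jy = dx / rng, dy / rng  # Derivative of the range w.r.t. p.
            r = d - rng                  # Residual of the distance estimate.
            a11 += w * jx * jx
            a12 += w * jx * jy
            a22 += w * jy * jy
            b1 += w * jx * r
            b2 += w * jy * r
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break  # Degenerate geometry; stop instead of dividing by ~0.
        # Solve the 2x2 system by Cramer's rule and update the estimate.
        x += (b1 * a22 - b2 * a12) / det
        y += (a11 * b2 - a12 * b1) / det
    return x, y


# Assumed example: four anchors at the corners of a 4 m x 3 m table,
# noise-free distances to a point at (1, 2), start at the table center.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
distances = [math.hypot(1.0 - ax, 2.0 - ay) for ax, ay in anchors]
estimate = localize_wls(anchors, distances, [1.0] * 4, (2.0, 1.5))
```

In practice the weights would reflect how reliable each distance estimate is, so that less trustworthy estimates have less influence on the solution.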
AS Jun 24, 2020
Deep Neural Network based Distance Estimation for Geometry Calibration in Acoustic Sensor Networks
Tobias Gburrek, Joerg Schmalenstroeer, Andreas Brendel et al.
We present an approach to deep neural network based (DNN-based) distance estimation in reverberant rooms for supporting geometry calibration tasks in wireless acoustic sensor networks. Signal diffuseness information from acoustic signals is aggregated via the coherent-to-diffuse power ratio to obtain a distance-related feature, which is mapped to a source-to-microphone distance estimate by means of a DNN. This information is then combined with direction-of-arrival estimates from compact microphone arrays to infer the geometry of the sensor network. Unlike many other approaches to geometry calibration, the proposed scheme only requires that the sampling clocks of the sensor nodes are roughly synchronized. In simulations we show that the proposed DNN-based distance estimator generalizes to unseen acoustic environments and that precise estimates of the sensor node positions are obtained.
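The way a distance estimate and a direction-of-arrival (DoA) estimate together fix a source position relative to a node can be sketched in two dimensions as a polar-to-Cartesian conversion. This covers only that geometric step, not the DNN or the calibration procedure of the paper; the function name and the convention that the DoA is measured relative to the node orientation are assumptions.

```python
import math


def source_position(node_xy, node_orientation_rad, doa_rad, distance_m):
    """Place a source at the given distance along the DoA seen from a node.

    The DoA is assumed to be measured relative to the node's orientation,
    so both angles are added to obtain the direction in the global frame.
    """
    angle = node_orientation_rad + doa_rad
    return (node_xy[0] + distance_m * math.cos(angle),
            node_xy[1] + distance_m * math.sin(angle))


# Assumed example: node at (1, 2) rotated by 90 degrees, source straight
# ahead of the array (DoA 0) at an estimated distance of 3 m.
position = source_position((1.0, 2.0), math.pi / 2, 0.0, 3.0)
```

When several nodes observe the same sources, such relative positions constrain the unknown node positions and orientations, which is the information a geometry calibration scheme can exploit.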
AS May 20, 2020
Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments
Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
Speech activity detection (SAD), which often rests on the fact that the noise is "more" stationary than speech, is particularly challenging in non-stationary environments, because the time variance of the acoustic scene makes it difficult to discriminate speech from noise. We propose two approaches to SAD, where one is based on statistical signal processing, while the other utilizes neural networks. The former employs sophisticated signal processing to track the noise and speech energies and is meant to support the case for a resource-efficient, unsupervised signal processing approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to do temporal smoothing in the presence of non-stationary noise. The systems are tested on the Fearless Steps Challenge, which consists of the transmission data from the Apollo-11 space mission. The statistical SAD achieves detection performance comparable to earlier proposed neural network based SADs, while the neural network based approach leads to a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, which sets a new state of the art.