AS · Jun 23, 2023 · Code
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
Samuele Cornell, Matthew Wiesner, Shinji Watanabe et al. · cmu
The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The goal is for participants to devise a single system that can generalize across different array geometries and use cases with no a-priori information. Another departure from earlier CHiME iterations is that participants are allowed to use open-source pre-trained models and datasets. In this paper, we describe the challenge design, motivation, and fundamental research questions in detail. We also present the baseline system, which is fully array-topology agnostic and features multi-channel diarization, channel selection, guided source separation and a robust ASR model that leverages self-supervised speech representations (SSLR).
AS · Oct 2, 2023
One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition
Samuele Cornell, Jee-weon Jung, Shinji Watanabe et al. · cmu
This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
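The windowing and global-identity steps described in this abstract can be illustrated with a minimal sketch. It is not the SLIDAR implementation: the function names, the greedy cosine-similarity clustering, and the 0.7 threshold are all assumptions chosen for illustration, and the abstract does not say which clustering method the authors use.

```python
import math


def sliding_windows(total_len, win_len, hop_len):
    """Return (start, end) index pairs that cover total_len samples."""
    windows = []
    start = 0
    while start < total_len:
        windows.append((start, min(start + win_len, total_len)))
        if start + win_len >= total_len:
            break  # the last window already reaches the end of the signal
        start += hop_len
    return windows


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_global_speakers(local_embeddings, threshold=0.7):
    """Greedy clustering (an assumption, not the paper's method).

    Each local speaker embedding joins the most similar existing cluster,
    or opens a new one when no similarity reaches the threshold.
    """
    centroid_sums = []  # running sum of the embeddings in each cluster
    labels = []
    for emb in local_embeddings:
        best_idx, best_sim = None, threshold
        for idx, centroid in enumerate(centroid_sums):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            centroid_sums.append(list(emb))
            labels.append(len(centroid_sums) - 1)
        else:
            centroid_sums[best_idx] = [
                c + e for c, e in zip(centroid_sums[best_idx], emb)
            ]
            labels.append(best_idx)
    return labels
```

In this toy setup each window would contribute one embedding per locally detected speaker, and the returned labels map those local speakers to recording-level identities.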
AS · Mar 21, 2023
End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations
Giovanni Morrone, Samuele Cornell, Luca Serafini et al. · cmu
Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Part 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
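The "separate first, then run VAD on each stream" idea can be sketched as follows. This is only an illustration: the separator is assumed to have already produced one waveform per speaker, and the fixed-threshold energy VAD, the frame length, and all function names are assumptions rather than the modules studied in the paper.

```python
def frame_energy_vad(stream, frame_len, threshold):
    """Return one boolean per frame: True when mean squared amplitude exceeds threshold."""
    decisions = []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions


def to_segments(decisions):
    """Merge runs of active frames into (first_frame, end_frame_exclusive) pairs."""
    segments = []
    seg_start = None
    for i, active in enumerate(decisions):
        if active and seg_start is None:
            seg_start = i
        elif not active and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(decisions)))
    return segments


def diarize(separated_streams, frame_len=4, threshold=0.01):
    """Map each separated stream (one per speaker) to its active segments."""
    return {
        speaker: to_segments(frame_energy_vad(stream, frame_len, threshold))
        for speaker, stream in enumerate(separated_streams)
    }
```

Because each stream is processed independently, overlapped speech falls out naturally: two speakers can be active in the same frames.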
LG · Feb 23, 2019 · Code
Transfer Learning for Non-Intrusive Load Monitoring
Michele D'Incecco, Stefano Squartini, Mingjun Zhong
Non-intrusive load monitoring (NILM) is a technique to recover source appliances from only the recorded mains in a household. NILM is unidentifiable and thus a challenging problem, because the inferred power value of an appliance given only the mains may not be unique. To mitigate the unidentifiability problem, various methods incorporating domain knowledge into NILM have been proposed and shown to be effective experimentally. Recently, among these methods, deep neural networks have been shown to perform best. Arguably, the recently proposed sequence-to-point (seq2point) learning is promising for NILM. However, the experiments were only carried out on the same data domain. It is not clear if the method could be generalised or transferred to different domains, e.g., when the test data are drawn from a different country compared to the training data. We address this issue in the paper, and two transfer learning schemes are proposed, i.e., appliance transfer learning (ATL) and cross-domain transfer learning (CTL). For ATL, our results show that the latent features learnt by a 'complex' appliance, e.g., washing machine, can be transferred to a 'simple' appliance, e.g., kettle. For CTL, our conclusion is that the seq2point learning is transferable. Precisely, when the training and test data are in a similar domain, seq2point learning can be directly applied to the test data without fine tuning; when the training and test data are in different domains, seq2point learning needs fine tuning before applying to the test data. Interestingly, we show that only the fully connected layers need fine tuning for transfer learning. Source code can be found at https://github.com/MingjunZhong/transferNILM.
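The sequence-to-point framing can be made concrete with a small data-preparation sketch: each training example pairs a window of mains readings with the appliance reading at the window's midpoint. The function name and the zero padding at the edges are assumptions for illustration; the linked repository may prepare its data differently.

```python
def seq2point_pairs(mains, appliance, window_len):
    """Build (mains_window, appliance_midpoint) training pairs.

    The window length must be odd so that it has a single midpoint.
    Edges are zero-padded (an assumption) so every time step gets a window.
    """
    if window_len % 2 == 0:
        raise ValueError("window_len must be odd")
    if len(mains) != len(appliance):
        raise ValueError("mains and appliance must have the same length")
    half = window_len // 2
    padded = [0.0] * half + list(mains) + [0.0] * half
    pairs = []
    for t in range(len(mains)):
        window = padded[t:t + window_len]  # centred on original index t
        pairs.append((window, appliance[t]))
    return pairs
```

A network trained on such pairs predicts one appliance value per window, and sliding the window over the mains yields the full appliance trace.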
AS · Jul 24, 2025
Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
Samuele Cornell, Christoph Boeddeker, Taejin Park et al.
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
AS · Nov 8, 2021
Learning Filterbanks for End-to-End Acoustic Beamforming
Samuele Cornell, Manuel Pariente, François Grondin et al.
Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand, it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This also applies to most hybrid neural beamforming methods, which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time Fourier Transform, the analysis and synthesis filterbanks are also learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.
AS · Oct 5, 2021
Deep Optimization of Parametric IIR Filters for Audio Equalization
Giovanni Pepe, Leonardo Gabrielli, Stefano Squartini et al.
This paper describes a novel Deep Learning method for the design of IIR parametric filters for automatic audio equalization. A simple and effective neural architecture, named BiasNet, is proposed to determine the IIR equalizer parameters. An output denormalization technique is used to obtain accurate tuning of the IIR filters' center frequency, quality factor and gain. All layers involved in the proposed method are shown to be differentiable, allowing backpropagation to optimize the network weights and achieve, after a number of training iterations, the optimal output. The parameters are optimized with respect to a loss function based on a spectral distance between the measured and desired magnitude response, and a regularization term used to achieve a spatialization of the acoustic scene. Two scenarios with different characteristics were considered for the experimental evaluation: a room and a car cabin. The performance of the proposed method improves over the baseline techniques and achieves an almost flat band. Moreover, IIR filters provide a consistently lower computational cost during runtime with respect to FIR filters.
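To make the three equalizer parameters (center frequency, quality factor, gain) and the spectral-distance objective concrete, here is a minimal sketch. It assumes the widely used peaking-filter formulas from the Audio EQ Cookbook and a mean squared error in dB; neither choice is confirmed by the abstract, the function names are hypothetical, and the regularization term is left out.

```python
import cmath
import math


def peaking_biquad(f0, q, gain_db, fs):
    """Peaking-EQ biquad coefficients (Audio EQ Cookbook form, an assumption)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin)
    a = (1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin)
    return b, a


def magnitude_db(b, a, freq, fs):
    """Magnitude response of the biquad at freq, in dB."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


def spectral_distance(measured_db, desired_db, eq_db):
    """Mean squared dB error between the equalized and desired responses.

    The equalized response is the measured one plus the filter gain in dB.
    """
    errors = [(m + e - d) ** 2 for m, d, e in zip(measured_db, desired_db, eq_db)]
    return sum(errors) / len(errors)
```

With these formulas the filter gain at `f0` equals `gain_db`, so a 6 dB dip in the measured response at 1 kHz is cancelled there by `peaking_biquad(1000.0, 1.0, 6.0, 48000.0)`.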
AS · Apr 6, 2021
Learning to Rank Microphones for Distant Speech Recognition
Samuele Cornell, Alessio Brutti, Marco Matassoni et al.
Fully exploiting ad-hoc microphone networks for distant speech recognition is still an open issue. Empirical evidence shows that being able to select the best microphone leads to significant improvements in recognition without any additional effort on front-end processing. Current channel selection techniques rely on signal-, decoder- or posterior-based features. Signal-based features are inexpensive to compute but do not always correlate with recognition performance. Decoder- and posterior-based features, instead, exhibit better correlation but require substantial computational resources. In this work, we tackle the channel selection problem by proposing MicRank, a learning to rank framework where a neural network is trained to rank the available channels directly using the recognition performance on the training set. The proposed approach is agnostic with respect to the array geometry and type of recognition back-end. We investigate different learning to rank strategies using a purpose-built synthetic dataset and the CHiME-6 data. Results show that the proposed approach is able to considerably improve over previous selection techniques, reaching comparable and in some instances better performance than oracle signal-based measures.
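One way to train a network to rank channels from recognition performance is a pairwise objective, sketched below. This is a generic RankNet-style logistic loss chosen for illustration; the abstract only says that several learning to rank strategies are investigated, so the loss, the use of per-channel word error rate (WER) as the target, and the function names are assumptions.

```python
import math


def pairwise_rank_loss(scores, wers):
    """Average logistic loss over channel pairs with different WER.

    For every pair where channel i has lower WER than channel j, the loss
    is small when the predicted score of i exceeds the score of j.
    """
    total = 0.0
    n_pairs = 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if wers[i] < wers[j]:  # channel i should be ranked above channel j
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def select_channel(scores):
    """Pick the index of the channel with the highest predicted score."""
    return max(range(len(scores)), key=lambda idx: scores[idx])
```

At inference time only the scores are needed, so the selection step does not require running the recognizer on every channel.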
LG · Nov 12, 2020
Real-World Anomaly Detection by using Digital Twin Systems and Weakly-Supervised Learning
Andrea Castellani, Sebastian Schmitt, Stefano Squartini
The continuously growing amount of monitored data in the Industry 4.0 context requires strong and reliable anomaly detection techniques. The advancement of Digital Twin technologies allows for realistic simulations of complex machinery and is therefore ideally suited to generating synthetic datasets for use in anomaly detection approaches when compared to actual measurement data. In this paper, we present novel weakly-supervised approaches to anomaly detection for industrial settings. The approaches make use of a Digital Twin to generate a training dataset which simulates the normal operation of the machinery, along with a small set of labeled anomalous measurements from the real machinery. In particular, we introduce a clustering-based approach, called Cluster Centers (CC), and a neural architecture based on the Siamese Autoencoders (SAE), which are tailored for weakly-supervised settings with very few labeled data samples. The performance of the proposed methods is compared against various state-of-the-art anomaly detection algorithms on an application to a real-world dataset from a facility monitoring system, by using a multitude of performance measures. Also, the influence of hyper-parameters related to feature extraction and network architecture is investigated. We find that the proposed SAE-based solutions outperform state-of-the-art anomaly detection approaches very robustly for many different hyper-parameter settings on all performance measures.
AS · Nov 6, 2019
The Speed Submission to DIHARD II: Contributions & Lessons Learned
Md Sahidullah, Jose Patino, Samuele Cornell et al.
This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.
SD · Apr 3, 2019
End-to-end Binaural Sound Localisation from the Raw Waveform
Paolo Vecchiotti, Ning Ma, Stefano Squartini et al.
A novel end-to-end binaural sound localisation approach is proposed which estimates the azimuth of a sound source directly from the waveform. Instead of employing hand-crafted features commonly employed for binaural sound localisation, such as the interaural time and level difference, our end-to-end system approach uses a convolutional neural network (CNN) to extract specific features from the waveform that are suitable for localisation. Two systems are proposed which differ in the initial frequency analysis stage. The first system is auditory-inspired and makes use of a gammatone filtering layer, while the second system is fully data-driven and exploits a trainable convolutional layer to perform frequency analysis. In both systems, a set of dedicated convolutional kernels are then employed to search for specific localisation cues, which are coupled with a localisation stage using fully connected layers. Localisation experiments using binaural simulation in both anechoic and reverberant environments show that the proposed systems outperform a state-of-the-art deep neural network system. Furthermore, our investigation of the frequency analysis stage in the second system suggests that the CNN is able to exploit different frequency bands for localisation according to the characteristics of the reverberant environment.
AS · Oct 15, 2018
Polyphonic Sound Event Detection by using Capsule Neural Networks
Fabio Vesperini, Leonardo Gabrielli, Emanuele Principi et al.
Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, Deep Learning offers valuable techniques for this goal such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has been recently introduced in the image processing field with the intent to overcome some of the known limitations of CNNs, specifically regarding the scarce robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapped images. This motivated the authors to employ CapsNets to deal with the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" that encourages learning part-whole relationships and improves the detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing how the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to the state-of-the-art algorithms.
SD · Sep 14, 2018
A Multi-Stage Algorithm for Acoustic Physical Model Parameters Estimation
Leonardo Gabrielli, Stefano Tomassetti, Stefano Squartini et al.
One of the challenges in computational acoustics is the identification of models that can simulate and predict the physical behavior of a system generating an acoustic signal. Whenever such models are used for commercial applications, an additional constraint is the time-to-market, making automation of the sound design process desirable. In previous works, a computational sound design approach has been proposed for the parameter estimation problem involving timbre matching by deep learning, which was applied to the synthesis of pipe organ tones. In this work we refine previous results by embedding the former approach in a multi-stage algorithm that also adds heuristics and a stochastic optimization method operating on objective cost functions based on psychoacoustics. The optimization method is shown to refine the first estimate given by the deep learning approach and substantially improve the objective metrics, with the additional benefit of reducing the sound design process time. Subjective listening tests are also conducted to gather additional insights on the results.