François Grondin

h-index: 24

19 papers

1,386 citations

Novelty: 37%

AI Score: 28

Ranked #149,448 of 194,257 authors (top 77%) · #974 in AS (top 67%)

19 Papers

10.8 · AS · Jun 19, 2022 · Code

Resource-Efficient Separation Transformer

Luca Della Libera, Cem Subakan, Mirco Ravanelli et al. · cmu

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.
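The two cost-saving ideas in this abstract, non-overlapping chunks and one compact summary per chunk, can be sketched in a few lines. This is only an illustration and not the RE-SepFormer code: the abstract does not say how the summaries are computed, so the mean over each chunk used here is an assumption, and `chunk_summaries` is a hypothetical helper name.

```python
def chunk_summaries(frames, chunk_size):
    """Split latent frames into non-overlapping chunks and return one
    summary vector per chunk (assumption: the summary is the chunk mean)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    summaries = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]  # last chunk may be shorter
        dim = len(chunk[0])
        summaries.append(
            [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]
        )
    return summaries


# Five 2-dimensional latent frames, grouped into chunks of two frames.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
summaries = chunk_summaries(frames, chunk_size=2)
```

Under this assumption, attention across chunks would operate on `len(summaries)` vectors instead of `len(frames)` vectors, which is where the memory saving for long mixtures would come from.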

41.8 · AS · Jun 8, 2021 · Code

SpeechBrain: A General-Purpose Speech Toolkit

Mirco Ravanelli, Titouan Parcollet, Peter Plantinga et al.

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

6.4 · HC · Mar 10, 2021 · Code

OpenTera: A Microservice Architecture Solution for Rapid Prototyping of Robotic Solutions to COVID-19 Challenges in Care Facilities

Adina M. Panchea, Dominic Létourneau, Simon Brière et al.

As telecommunications technology progresses, telehealth frameworks are becoming more widely adopted in the context of long-term care (LTC) for older adults, both in care facilities and in homes. Today, robots could assist healthcare workers when they provide care to elderly patients, who constitute a particularly vulnerable population during the COVID-19 pandemic. Previous work on user-centered design of assistive technologies in LTC facilities for seniors has identified positive impacts. The need to deal with the effects of the COVID-19 pandemic emphasizes the benefits of this approach, but also highlights some new challenges for which robots could be interesting solutions to be deployed in LTC facilities. This requires customization of telecommunication and audio/video/data processing to address specific clinical requirements and needs. This paper presents OpenTera, an open source telehealth framework, aiming to facilitate prototyping of such solutions by software and robotic designers. Designed as a microservice-oriented platform, OpenTera is an end-to-end solution that employs a series of independent modules for tasks such as data and session management, telehealth, daily assistive tasks/actions, together with smart devices and environments, all connected through the framework. After explaining the framework, we illustrate how OpenTera can be used to implement robotic solutions for different applications identified in LTC facilities and homes, and we describe how we plan to validate them through field trials.

2.2 · RO · Feb 28, 2022

SmartBelt: A Wearable Microphone Array for Sound Source Localization with Haptic Feedback

Simon Michaud, Benjamin Moffett, Ana Tapia Rousiouk et al.

This paper introduces SmartBelt, a wearable microphone array on a belt that performs sound source localization and returns the direction of arrival with respect to the user's waist. One of the haptic motors on the belt then vibrates in the corresponding direction to provide useful feedback to the user. We also introduce a simple calibration step to adapt the belt to different waist sizes. Experiments are performed to confirm the accuracy of this wearable sound source localization system, and results show a Mean Average Error (MAE) of 2.90 degrees, and a correct haptic motor selection with a rate of 92.3%. Results suggest the device can provide useful haptic feedback, and will be evaluated in a study with people having hearing impairments.
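Mapping a direction of arrival to "the corresponding" haptic motor can be illustrated as a nearest-motor lookup. The abstract does not give the number of motors or their placement, so the sketch below assumes motors evenly spaced around the waist with motor 0 at 0 degrees; `select_motor` and the choice of eight motors are hypothetical.

```python
def select_motor(doa_degrees, num_motors):
    """Return the index of the motor closest to the direction of arrival.

    Assumption: motors are evenly spaced, with motor k at k * 360 / num_motors
    degrees. The real belt layout may differ.
    """
    spacing = 360.0 / num_motors
    wrapped = doa_degrees % 360.0  # map any angle into [0, 360)
    return int(wrapped / spacing + 0.5) % num_motors


# With eight motors (45 degrees apart), 44 degrees is closest to motor 1.
motor = select_motor(44.0, num_motors=8)
```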

6.6 · AS · Feb 6, 2022 · Code

Exploring Self-Attention Mechanisms for Speech Separation

Cem Subakan, Mirco Ravanelli, Samuele Cornell et al.

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.

3.3 · AS · Nov 8, 2021

Learning Filterbanks for End-to-End Acoustic Beamforming

Samuele Cornell, Manuel Pariente, François Grondin et al.

Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand, it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This also applies to most hybrid neural beamforming methods, which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time Fourier Transform, the analysis and synthesis filterbanks are also learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.

8.0 · AS · Oct 20, 2021

REAL-M: Towards Speech Separation on Real Mixtures

Cem Subakan, Mirco Ravanelli, Samuele Cornell et al.

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models.
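For context, the quantity that the blind estimator predicts is normally computed with access to the clean reference. The sketch below shows that conventional reference-based SI-SNR, using the common zero-mean formulation; it is not the neural estimator from the paper, and the function name `si_snr` and the `eps` stabilizer are illustrative choices.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Reference-based scale-invariant SNR in dB (zero-mean formulation)."""
    n = len(reference)
    if len(estimate) != n:
        raise ValueError("signals must have the same length")
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(est, ref))
    ref_energy = sum(b * b for b in ref) + eps
    target = [(dot / ref_energy) * b for b in ref]
    noise = [a - t for a, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


reference = [math.sin(0.3 * i) for i in range(64)]
estimate = [r + 0.1 * math.cos(1.7 * i) for i, r in enumerate(reference)]
noisier = [r + 0.5 * math.cos(1.7 * i) for i, r in enumerate(reference)]

score = si_snr(estimate, reference)
rescaled_score = si_snr([2.0 * x for x in estimate], reference)
noisier_score = si_snr(noisier, reference)
```

Rescaling the estimate leaves the score essentially unchanged, which is the "scale-invariant" property. On real mixtures no `reference` exists, which is the gap the blind estimator is meant to address.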

1.2 · AS · Oct 6, 2021

Lightweight Speech Enhancement in Unseen Noisy and Reverberant Conditions using KISS-GEV Beamforming

Thomas Bernard, François Grondin

This paper introduces a new method referred to as KISS-GEV (for Keep It Super Simple Generalized eigenvalue) beamforming. While GEV beamforming usually relies on a deep neural network for estimating target and noise time-frequency masks, this method uses a signal processing approach based on the direction of arrival (DoA) of the target. This considerably reduces the amount of computations involved at test time, and works for speech enhancement in unseen conditions as there is no need to train a neural network with noisy speech. The proposed method can also be used to separate speech from a mixture, provided the speech sources come from different directions. Results also show that the proposed method uses the same minimal DoA assumption as Delay-and-Sum beamforming, yet outperforms this traditional approach.

3.3 · AS · Mar 5, 2021 · Code

ODAS: Open embeddeD Audition System

François Grondin, Dominic Létourneau, Cédric Godin et al.

Artificial audition aims at providing hearing capabilities to machines, computers and robots. Existing frameworks in robot audition offer interesting sound source localization, tracking and separation performance, but involve a significant amount of computations that limit their use on robots with embedded computing capabilities. This paper presents ODAS, the Open embeddeD Audition System framework, which includes strategies to reduce the computational load and perform robot audition tasks on low-cost embedded computing systems. It presents key features of ODAS, along with cases illustrating its uses in different robots and artificial audition applications.

9.3 · SD · Oct 19, 2020 · Code

BIRD: Big Impulse Response Dataset

François Grondin, Jean-Samuel Lauzon, Simon Michaud et al.

This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources. The paper also introduces use cases to illustrate how BIRD can perform data augmentation with existing speech corpora.

7.9 · CV · Jul 31, 2020 · Code

Dynamic Object Tracking and Masking for Visual SLAM

Jonathan Vincent, Mathieu Labbé, Jean-Samuel Lauzon et al.

In dynamic environments, performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and mapping. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters and visual SLAM to improve both localization and mapping in dynamic environments (around 14 fps on a GTX 1080). Results on the dynamic sequences from the TUM dataset using RTAB-Map as visual SLAM suggest that the approach achieves similar localization performance compared to other state-of-the-art methods, while also providing the position of the tracked dynamic objects, a 3D map free of those dynamic objects, and better loop closure detection, with the whole pipeline able to run on a robot moving at moderate speed.

5.1 · AS · Jul 21, 2020

3D Localization of a Sound Source Using Mobile Microphone Arrays Referenced by SLAM

Simon Michaud, Samuel Faucher, François Grondin et al.

A microphone array can provide a mobile robot with the capability of localizing, tracking and separating distant sound sources in 2D, i.e., estimating their relative elevation and azimuth. To combine acoustic data with visual information in real world settings, spatial correlation must be established. The approach explored in this paper consists of having two robots, each equipped with a microphone array, localizing themselves in a shared reference map using SLAM. Based on their locations, data from the microphone arrays are used to triangulate in 3D the location of a sound source in relation to the same map. This strategy results in a novel cooperative sound mapping approach using mobile microphone arrays. Trials are conducted using two mobile robots localizing a static or a moving sound source to examine in which conditions this is possible. Results suggest that errors under 0.3 m are observed when the relative angle between the two robots is above 30 degrees for a static sound source, while errors under 0.3 m for angles between 40 degrees and 140 degrees are observed with a moving sound source.
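Triangulating a source from two direction-of-arrival measurements can be sketched as finding the point closest to two 3D rays. This is a generic geometric construction, not necessarily the procedure used in the paper; it assumes both directions are already expressed in the shared map frame, and the helper names (`doa_to_unit_vector`, `triangulate`, `bearing`) and the parallel-ray threshold are hypothetical.

```python
import math


def doa_to_unit_vector(azimuth, elevation):
    """Convert azimuth and elevation (radians, map frame) to a unit vector."""
    return (math.cos(elevation) * math.cos(azimuth),
            math.cos(elevation) * math.sin(azimuth),
            math.sin(elevation))


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def triangulate(p1, d1, p2, d2, min_denominator=1e-9):
    """Midpoint of the shortest segment joining rays p1 + t1*d1 and p2 + t2*d2."""
    w = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denominator = a * c - b * b  # equals sin^2 of the angle for unit vectors
    if denominator < min_denominator:
        raise ValueError("directions are nearly parallel; cannot triangulate")
    t1 = (b * e - c * d) / denominator
    t2 = (a * e - b * d) / denominator
    q1 = tuple(p + t1 * u for p, u in zip(p1, d1))
    q2 = tuple(p + t2 * u for p, u in zip(p2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(q1, q2))


def bearing(origin, target):
    """Noise-free azimuth and elevation from origin to target (for the demo)."""
    dx, dy, dz = (t - o for t, o in zip(target, origin))
    return math.atan2(dy, dx), math.atan2(dz, math.hypot(dx, dy))


source = (2.0, 3.0, 1.0)
robot_a, robot_b = (0.0, 0.0, 0.0), (4.0, 0.0, 0.0)
estimate = triangulate(
    robot_a, doa_to_unit_vector(*bearing(robot_a, source)),
    robot_b, doa_to_unit_vector(*bearing(robot_b, source)),
)
```

The denominator shrinks as the two directions become parallel, which makes the solution increasingly sensitive to angular error. That is consistent with the abstract's observation that accuracy depends on the relative angle between the robots.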

2.3 · AS · Feb 4, 2020

Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

Francois Grondin, Hao Tang, James Glass

This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed sound source localization method for real-time processing, SVD-PHAT, can be adapted for this task.

9.7 · AS · Oct 22, 2019

Sound Event Localization and Detection Using CRNN on Pairs of Microphones

Francois Grondin, James Glass, Iwona Sobieraj et al.

This paper proposes sound event localization and detection methods from multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
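The "six pairs" figure follows from counting unordered pairs: an array of M microphones has M(M-1)/2 of them. A minimal sketch, with `microphone_pairs` as a hypothetical helper name:

```python
from itertools import combinations


def microphone_pairs(num_mics):
    """Enumerate every unordered pair of microphone indices."""
    return list(combinations(range(num_mics), 2))


# Four microphones give the six pairs mentioned in the abstract.
pairs = microphone_pairs(4)
```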

2.3 · AS · Jul 29, 2019

Fast and Robust 3-D Sound Source Localization with DSVD-PHAT

Francois Grondin, James Glass

This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot with a four-microphone planar array mounted on its head. Results show that this method offers similar robustness to noise as the state-of-the-art Multiple Signal Classification based on Generalized Singular Value Decomposition (GSVD-MUSIC) method, and considerably reduces the computational load by a factor of 250. This performance gain thus makes DSVD-PHAT appealing for real-time application on robots with limited on-board computing power.

10.3 · AS · Dec 1, 2018

Lightweight and Optimized Sound Source Localization and Tracking Methods for Open and Closed Microphone Array Configurations

Francois Grondin, Francois Michaud

Human-robot interaction in natural settings requires filtering out the different sources of sounds from the environment. Such ability usually involves the use of microphone arrays to localize, track and separate sound sources online. Multi-microphone signal processing techniques can improve robustness to noise but the processing cost increases with the number of microphones used, limiting response time and widespread use on different types of mobile robots. Since sound source localization methods are the most expensive in terms of computing resources as they involve scanning a large 3D space, minimizing the amount of computations required would facilitate their implementation and use on robots. The robot's shape also brings constraints on the microphone array geometry and configurations. In addition, sound source localization methods usually return noisy features that need to be smoothed and filtered by tracking the sound sources. This paper presents a novel sound source localization method, called SRP-PHAT-HSDA, that scans space with coarse and fine resolution grids to reduce the number of memory lookups. A microphone directivity model is used to reduce the number of directions to scan and ignore non-significant pairs of microphones. A configuration method is also introduced to automatically set parameters that are normally empirically tuned according to the shape of the microphone array. For sound source tracking, this paper presents a modified 3D Kalman (M3K) method capable of simultaneously tracking in 3D the directions of sound sources. Using a 16-microphone array and low-cost hardware, results show that SRP-PHAT-HSDA and M3K perform at least as well as other sound source localization and tracking methods while using up to 4 and 30 times less computing resources respectively.

3.3 · AS · Nov 28, 2018

A Study of the Complexity and Accuracy of Direction of Arrival Estimation Methods Based on GCC-PHAT for a Pair of Close Microphones

Francois Grondin, James Glass

This paper investigates the accuracy of various Generalized Cross-Correlation with Phase Transform (GCC-PHAT) methods for a close pair of microphones. We investigate interpolation-based methods and also propose another approach based on Singular Value Decomposition (SVD). All investigated methods are implemented in C code, and the execution time is measured to determine which approach is the most appealing for real-time applications on low-cost embedded hardware.
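As background for the methods compared here, plain integer-lag GCC-PHAT for one microphone pair can be sketched as follows. This is not the paper's C implementation and does not include its interpolation-based or SVD-based variants; the naive O(n²) DFT is used only to keep the example dependency-free, and `gcc_phat_delay` is a hypothetical name.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Real part of the naive inverse DFT."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def gcc_phat_delay(x1, x2, max_lag, eps=1e-12):
    """Integer lag (in samples) by which x2 is delayed relative to x1."""
    n = len(x1)
    cross = [b * a.conjugate() for a, b in zip(dft(x1), dft(x2))]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    whitened = [c / (abs(c) + eps) for c in cross]
    correlation = idft_real(whitened)
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: correlation[lag % n])


# White noise frame and a copy circularly delayed by three samples.
rng = random.Random(0)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(64)]
x2 = x1[-3:] + x1[:-3]
delay = gcc_phat_delay(x1, x2, max_lag=8)
```

For a close pair of microphones the true delay spans only a few samples, so an integer-lag peak like this one is coarse. That limited resolution is a plausible motivation for the sub-sample approaches the abstract mentions.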

6.6 · AS · Nov 28, 2018

SVD-PHAT: A Fast Sound Source Localization Method

Francois Grondin, James Glass

This paper introduces a new localization method called SVD-PHAT. The SVD-PHAT method relies on Singular Value Decomposition of the SRP-PHAT projection matrix. A k-d tree is also proposed to speed up the search for the most likely direction of arrival of sound. We show that this method performs as accurately as SRP-PHAT, while significantly reducing the amount of computation required.

1.9 · CL · Jun 13, 2018

A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition

Hao Tang, Wei-Ning Hsu, Francois Grondin et al.

Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute. Most studies focus on tackling distant speech recognition as a separate problem, with little effort devoted to adapting close-talking speech recognizers to distant speech. In this work, we review several approaches from a domain adaptation perspective. These approaches, including speech enhancement, multi-condition training, data augmentation, and autoencoders, all involve a transformation of the data between domains. We conduct experiments on the AMI data set, where these approaches can be realized under the same controlled setting. These approaches lead to different amounts of improvement under their respective assumptions. The purpose of this paper is to quantify and characterize the performance gap between the two domains, setting up the basis for studying adaptation of speech recognizers from close-talking speech to distant speech. Our results also have implications for improving distant speech recognition.