François Grondin

AS
13papers
1,132citations
Novelty33%
AI Score41

13 Papers

ASJun 19, 2022
Resource-Efficient Separation Transformer

Luca Della Libera, Cem Subakan, Mirco Ravanelli et al. · cmu

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.

43.2ASApr 29
Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion

Farnaz Jazaeri, Homayoun Kamkar-Parsi, François Grondin et al.

For extracting a target speaker voice, direction-of-arrival (DOA) estimation is crucial for binaural hearing aids operating in noisy, multi-speaker environments. Among the solutions developed for this task, a deep learning convolutional recurrent neural network (CRNN) model leveraging spectral phase differences and magnitude ratios between microphone signals is a popular option. In this paper, we explore adding source-count information for multi-sources DOA estimation. The use of dual-task training with joint multi-sources DOA estimation and source counting is first considered. We then consider using the source count as an auxiliary feature in a standalone DOA estimation system, where the number of active sources (0, 1, or 2+) is integrated into the CRNN architecture through early, mid, and late fusion strategies. Experiments using real binaural recordings are performed. Results show that the dual-task training does not improve DOA estimation performance, although it benefits source-count prediction. However, a ground-truth (oracle) source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN. This highlights the potential of using source-count estimation for robust DOA estimation in binaural hearing aids.

ASJun 8, 2021Code
SpeechBrain: A General-Purpose Speech Toolkit

Mirco Ravanelli, Titouan Parcollet, Peter Plantinga et al.

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

HCMar 10, 2021Code
OpenTera: A Microservice Architecture Solution for Rapid Prototyping of Robotic Solutions to COVID-19 Challenges in Care Facilities

Adina M. Panchea, Dominic Létourneau, Simon Brière et al.

As telecommunications technology progresses, telehealth frameworks are becoming more widely adopted in the context of long-term care (LTC) for older adults, both in care facilities and in homes. Today, robots could assist healthcare workers when they provide care to elderly patients, who constitute a particularly vulnerable population during the COVID-19 pandemic. Previous work on user-centered design of assistive technologies in LTC facilities for seniors has identified positive impacts. The need to deal with the effects of the COVID-19 pandemic emphasizes the benefits of this approach, but also highlights some new challenges for which robots could be interesting solutions to be deployed in LTC facilities. This requires customization of telecommunication and audio/video/data processing to address specific clinical requirements and needs. This paper presents OpenTera, an open source telehealth framework, aiming to facilitate prototyping of such solutions by software and robotic designers. Designed as a microservice-oriented platform, OpenTera is an end-to-end solution that employs a series of independent modules for tasks such as data and session management, telehealth, daily assistive tasks/actions, together with smart devices and environments, all connected through the framework. After explaining the framework, we illustrate how OpenTera can be used to implement robotic solutions for different applications identified in LTC facilities and homes, and we describe how we plan to validate them through field trials.

ROFeb 28, 2022
SmartBelt: A Wearable Microphone Array for Sound Source Localization with Haptic Feedback

Simon Michaud, Benjamin Moffett, Ana Tapia Rousiouk et al.

This paper introduces SmartBelt, a wearable microphone array on a belt that performs sound source localization and returns the direction of arrival with respect to the user waist. One of the haptic motors on the belt then vibrates in the corresponding direction to provide useful feedback to the user. We also introduce a simple calibration step to adapt the belt to different waist sizes. Experiments are performed to confirm the accuracy of this wearable sound source localization system, and results show a Mean Average Error (MAE) of 2.90 degrees, and a correct haptic motor selection with a rate of 92.3%. Results suggest the device can provide useful haptic feedback, and will be evaluated in a study with people having hearing impairments.

ASNov 8, 2021
Learning Filterbanks for End-to-End Acoustic Beamforming

Samuele Cornell, Manuel Pariente, François Grondin et al.

Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.

ASOct 20, 2021
REAL-M: Towards Speech Separation on Real Mixtures

Cem Subakan, Mirco Ravanelli, Samuele Cornell et al.

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models.

ASOct 6, 2021
Lightweight Speech Enhancement in Unseen Noisy and Reverberant Conditions using KISS-GEV Beamforming

Thomas Bernard, François Grondin

This paper introduces a new method referred to as KISS-GEV (for Keep It Super Simple Generalized eigenvalue) beamforming. While GEV beamforming usually relies on deep neural network for estimating target and noise time-frequency masks, this method uses a signal processing approach based on the direction of arrival (DoA) of the target. This considerably reduces the amount of computations involved at test time, and works for speech enhancement in unseen conditions as there is no need to train a neural network with noisy speech. The proposed method can also be used to separate speech from a mixture, provided the speech sources come from different directions. Results also show that the proposed method uses the same minimal DoA assumption as Delay-and-Sum beamforming, yet outperforms this traditional approach.

IVSep 17, 2021
Neural Network Based Lidar Gesture Recognition for Realtime Robot Teleoperation

Simon Chamorro, Jack Collier, François Grondin

We propose a novel low-complexity lidar gesture recognition system for mobile robot control robust to gesture variation. Our system uses a modular approach, consisting of a pose estimation module and a gesture classifier. Pose estimates are predicted from lidar scans using a Convolutional Neural Network trained using an existing stereo-based pose estimation system. Gesture classification is accomplished using a Long Short-Term Memory network and uses a sequence of estimated body poses as input to predict a gesture. Breaking down the pipeline into two modules reduces the dimensionality of the input, which could be lidar scans, stereo imagery, or any other modality from which body keypoints can be extracted, making our system lightweight and suitable for mobile robot control with limited computing power. The use of lidar contributes to the robustness of the system, allowing it to operate in most outdoor conditions, to be independent of lighting conditions, and for input to be detected 360 degrees around the robot. The lidar-based pose estimator and gesture classifier use data augmentation and automated labeling techniques, requiring a minimal amount of data collection and avoiding the need for manual labeling. We report experimental results for each module of our system and demonstrate its effectiveness by testing it in a real-world robot teleoperation setting.

ASMar 5, 2021
ODAS: Open embeddeD Audition System

François Grondin, Dominic Létourneau, Cédric Godin et al.

Artificial audition aims at providing hearing capabilities to machines, computers and robots. Existing frameworks in robot audition offer interesting sound source localization, tracking and separation performance, although involve a significant amount of computations that limit their use on robots with embedded computing capabilities. This paper presents ODAS, the Open embeddeD Audition System framework, which includes strategies to reduce the computational load and perform robot audition tasks on low-cost embedded computing systems. It presents key features of ODAS, along with cases illustrating its uses in different robots and artificial audition applications.

SDOct 19, 2020
BIRD: Big Impulse Response Dataset

François Grondin, Jean-Samuel Lauzon, Simon Michaud et al.

This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used toperform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources. The paper also introduces use cases to illustrate how BIRD can perform data augmentation with existing speech corpora.

CVJul 31, 2020
Dynamic Object Tracking and Masking for Visual SLAM

Jonathan Vincent, Mathieu Labbé, Jean-Samuel Lauzon et al.

In dynamic environments, performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and mapping. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters and visual SLAM to improve both localization and mapping in dynamic environments (around 14 fps on a GTX 1080). Results on the dynamic sequences from the TUM dataset using RTAB-Map as visual SLAM suggest that the approach achieves similar localization performance compared to other state-of-the-art methods, while also providing the position of the tracked dynamic objects, a 3D map free of those dynamic objects, better loop closure detection with the whole pipeline able to run on a robot moving at moderate speed.

ASJul 21, 2020
3D Localization of a Sound Source Using Mobile Microphone Arrays Referenced by SLAM

Simon Michaud, Samuel Faucher, François Grondin et al.

A microphone array can provide a mobile robot with the capability of localizing, tracking and separating distant sound sources in 2D, i.e., estimating their relative elevation and azimuth. To combine acoustic data with visual information in real world settings, spatial correlation must be established. The approach explored in this paper consists of having two robots, each equipped with a microphone array, localizing themselves in a shared reference map using SLAM. Based on their locations, data from the microphone arrays are used to triangulate in 3D the location of a sound source in relation to the same map. This strategy results in a novel cooperative sound mapping approach using mobile microphone arrays. Trials are conducted using two mobile robots localizing a static or a moving sound source to examine in which conditions this is possible. Results suggest that errors under 0.3 m are observed when the relative angle between the two robots are above 30 degrees for a static sound source, while errors under 0.3 m for angles between 40 degrees and 140 degrees are observed with a moving sound source.