Mostafa Sadeghi

h-index13

10papers

72citations

Novelty45%

AI Score40

Ranked #71,675 of 194,257 authors (top 37%)#24,360 in CV (top 41%)

10 Papers

4.8CVApr 6, 2022

Expression-preserving face frontalization improves visually assisted speech processing

Zhiqi Kang, Mostafa Sadeghi, Radu Horaud et al.

Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations in order to boost the performance of visually assisted speech communication. The method alternates between the estimation of (i)~the rigid transformation (scale, rotation, and translation) and (ii)~the non-rigid deformation between an arbitrarily-viewed face and a face model. The method has two important merits: it can deal with non-Gaussian errors in the data and it incorporates a dynamical face deformation model. For that purpose, we use the generalized Student t-distribution in combination with a linear dynamic system in order to account for both rigid head motions and time-varying facial deformations caused by speech production. We propose to use the zero-mean normalized cross-correlation (ZNCC) score to evaluate the ability of the method to preserve facial expressions. The method is thoroughly evaluated and compared with several state of the art methods, either based on traditional geometric models or on deep learning. Moreover, we show that the method, when incorporated into deep learning pipelines, namely lip reading and speech enhancement, improves word recognition and speech intelligibilty scores by a considerable margin. Supplemental material is accessible at https://team.inria.fr/robotlearn/research/facefrontalization/

8.4CVSep 19, 2023

Diffusion-based speech enhancement with a weighted generative-supervised learning loss

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel

Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise centered at noisy speech, and subsequently learn a parameterized model to reverse this process, conditionally on noisy speech. Unlike supervised methods, generative-based SE approaches usually rely solely on an unsupervised loss, which may result in less efficient incorporation of conditioned noisy speech. To address this issue, we propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech at each reverse process iteration. Experimental results demonstrate the effectiveness of our proposed methodology.

7.6SDMay 23

Diffusion-based Frameworks for Unsupervised Speech Enhancement

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel et al.

This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new semi-supervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Code, demo, and supplementary materials are publicly available.

10.9SDFeb 2, 2024Code

Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly et al. · deepmind

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

1.2CVOct 26, 2020

Face Frontalization Based on Robustly Fitting a Deformable Shape Model to 3D Landmarks

Zhiqi Kang, Mostafa Sadeghi, Radu Horaud

Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a robust face alignment method that enables pixel-to-pixel warping. The method simultaneously estimates the rigid transformation (scale, rotation, and translation) and the non-rigid deformation between two 3D point sets: a set of 3D landmarks extracted from an arbitrary-viewed face, and a set of 3D landmarks parameterized by a frontally-viewed deformable face model. An important merit of the proposed method is its ability to deal both with noise (small perturbations) and with outliers (large errors). We propose to model inliers and outliers with the generalized Student's t-probability distribution function, a heavy-tailed distribution that is immune to non-Gaussian errors in the data. We describe in detail the associated expectation-maximization (EM) algorithm that alternates between the estimation of (i) the rigid parameters, (ii) the deformation parameters, and (iii) the Student-t distribution parameters. We also propose to use the zero-mean normalized cross-correlation, between a frontalized face and the corresponding ground-truth frontally-viewed face, to evaluate the performance of frontalization. To this end, we use a dataset that contains pairs of profile-viewed and frontally-viewed faces. This evaluation, based on direct image-to-image comparison, stands in contrast with indirect evaluation, based on analyzing the effect of frontalization on face recognition.

6.6ASAug 17, 2020

Deep Variational Generative Models for Audio-visual Speech Separation

Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci et al.

In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lips movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data. The visual modality also serves as a prior for latent variables, through a visual network. At test time, the learned generative model (both for speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep learning-based technique.

4.2CVApr 14, 2020

Unsupervised Performance Analysis of 3D Face Alignment with a Statistically Robust Confidence Test

Mostafa Sadeghi, Xavier Alameda-Pineda, Radu Horaud

This paper addresses the problem of analysing the performance of 3D face alignment (3DFA), or facial landmark localization. This task is usually supervised, based on annotated datasets. Nevertheless, in the particular case of 3DFA, the annotation process is rarely error-free, which strongly biases the results. Alternatively, unsupervised performance analysis (UPA) is investigated. The core ingredient of the proposed methodology is the robust estimation of the rigid transformation between predicted landmarks and model landmarks. It is shown that the rigid mapping thus computed is affected neither by non-rigid facial deformations, due to variabilities in expression and in identity, nor by landmark localization errors, due to various perturbations. The guiding idea is to apply the estimated rotation, translation and scale to a set of predicted landmarks in order to map them onto a mathematical home for the shape embedded in these landmarks (including possible errors). UPA proceeds as follows: (i) 3D landmarks are extracted from a 2D face using the 3DFA method under investigation; (ii) these landmarks are rigidly mapped onto a canonical (frontal) pose, and (iii) a statistically-robust confidence score is computed for each landmark. This allows to assess whether the mapped landmarks lie inside (inliers) or outside (outliers) a confidence volume. An experimental evaluation protocol, that uses publicly available datasets and several 3DFA software packages associated with published articles, is described in detail. The results show that the proposed analysis is consistent with supervised metrics and that it can be used to measure the accuracy of both predicted landmarks and of automatically annotated 3DFA datasets, to detect errors and to eliminate them. Source code and supplemental materials for this paper are publicly available at https://team.inria.fr/robotlearn/upa3dfa/.

1.3CVJan 6, 2015

A Study on Clustering for Clustering Based Image De-Noising

Hossein Bakhshi Golestani, Mohsen Joneidi, Mostafa Sadeghi

In this paper, the problem of de-noising of an image contaminated with Additive White Gaussian Noise (AWGN) is studied. This subject is an open problem in signal processing for more than 50 years. Local methods suggested in recent years, have obtained better results than global methods. However by more intelligent training in such a way that first, important data is more effective for training, second, clustering in such way that training blocks lie in low-rank subspaces, we can design a dictionary applicable for image de-noising and obtain results near the state of the art local methods. In the present paper, we suggest a method based on global clustering of image constructing blocks. As the type of clustering plays an important role in clustering-based de-noising methods, we address two questions about the clustering. The first, which parts of the data should be considered for clustering? and the second, what data clustering method is suitable for de-noising.? Then clustering is exploited to learn an over complete dictionary. By obtaining sparse decomposition of the noisy image blocks in terms of the dictionary atoms, the de-noised version is achieved. In addition to our framework, 7 popular dictionary learning methods are simulated and compared. The results are compared based on two major factors: (1) de-noising performance and (2) execution time. Experimental results show that our dictionary learning framework outperforms its competitors in terms of both factors.

2.3ITJul 29, 2013

Union of Low-Rank Subspaces Detector

Mohsen Joneidi, Parvin Ahmadi, Mostafa Sadeghi et al.

The problem of signal detection using a flexible and general model is considered. Due to applicability and flexibility of sparse signal representation and approximation, it has attracted a lot of attention in many signal processing areas. In this paper, we propose a new detection method based on sparse decomposition in a union of subspaces (UoS) model. Our proposed detector uses a dictionary that can be interpreted as a bank of matched subspaces. This improves the performance of signal detection, as it is a generalization for detectors. Low-rank assumption for the desired signals implies that the representations of these signals in terms of some proper bases would be sparse. Our proposed detector exploits sparsity in its decision rule. We demonstrate the high efficiency of our method in the cases of voice activity detection in speech processing.

3.1CVJun 12, 2013

Optimization of Clustering for Clustering-based Image Denoising

Mohsen Joneidi, Mostafa Sadeghi

In this paper, the problem of de-noising of an image contaminated with additive white Gaussian noise (AWGN) is studied. This subject has been continued to be an open problem in signal processing for more than 50 years. In the present paper, we suggest a method based on global clustering of image constructing blocks. Noting that the type of clustering plays an important role in clustering-based de-noising methods, we address two questions about the clustering. First, which parts of data should be considered for clustering? Second, what data clustering method is suitable for de-noising? Clustering is exploited to learn an over complete dictionary. By obtaining sparse decomposition of the noisy image blocks in terms of the dictionary atoms, the de-noised version is achieved. Experimental results show that our dictionary learning framework outperforms traditional dictionary learning methods such as K-SVD.