ASMar 23, 2022
The VoicePrivacy 2022 Challenge Evaluation PlanNatalia Tomashenko, Xin Wang, Xiaoxiao Miao et al.
For new participants - Executive summary: (1) The task is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content, paralinguistic attributes, intelligibility and naturalness. (2) Training, development and evaluation datasets are provided in addition to 3 different baseline anonymization systems, evaluation scripts, and metrics. Participants apply their developed anonymization systems, run evaluation scripts and submit objective evaluation results and anonymized speech data to the organizers. (3) Results will be presented at a workshop held in conjunction with INTERSPEECH 2022 to which all participants are invited to present their challenge systems and to submit additional workshop papers. For readers familiar with the VoicePrivacy Challenge - Changes w.r.t. 2020: (1) A stronger, semi-informed attack model in the form of an automatic speaker verification (ASV) system trained on anonymized (per-utterance) speech data. (2) Complementary metrics comprising the equal error rate (EER) as a privacy metric, the word error rate (WER) as a primary utility metric, and the pitch correlation and gain of voice distinctiveness as secondary utility metrics. (3) A new ranking policy based upon a set of minimum target privacy requirements.
CLMay 14, 2022
The VoicePrivacy 2020 Challenge Evaluation PlanNatalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang et al.
The VoicePrivacy Challenge aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this document, we formulate the voice anonymization task selected for the VoicePrivacy 2020 Challenge and describe the datasets used for system development and evaluation. We also present the attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and report objective evaluation results.
ASFeb 20, 2023
Federated Learning for ASR based on Wav2vec 2.0Tuan Nguyen, Salima Mdhaffar, Natalia Tomashenko et al.
This paper presents a study on the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self supervision. Carried out on the well-known TED-LIUM 3 dataset, our experiments show that such a model can obtain, with no use of a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set, without sharing any data from the different users. We also analyse the ASR performance for speakers depending to their participation to the federated learning. Since federated learning was first introduced for privacy purposes, we also measure its ability to protect speaker identity. To do that, we exploit an approach to analyze information contained in exchanged models based on a neural network footprint on an indicator dataset. This analysis is made layer-wise and shows which layers in an exchanged wav2vec 2.0 based model bring the speaker identity information.
LGAug 2, 2024
Explaining a probabilistic prediction on the simplex with Shapley compositionsPaul-Gauthier Noé, Miquel Perelló-Nieto, Jean-François Bonastre et al.
Originating in game theory, Shapley values are widely used for explaining a machine learning model's prediction by quantifying the contribution of each feature's value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a multidimensional simplex. In such a multiclass setting the Shapley values are typically computed separately on each class in a one-vs-rest manner, ignoring the compositional nature of the output distribution. In this paper, we introduce Shapley compositions as a well-founded way to properly explain a multiclass probabilistic prediction, using the Aitchison geometry from compositional data analysis. We prove that the Shapley composition is the unique quantity satisfying linearity, symmetry and efficiency on the Aitchison simplex, extending the corresponding axiomatic properties of the standard Shapley value. We demonstrate this proper multiclass treatment in a range of scenarios.
SDNov 18, 2025
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level PredictionsMarcel Gibier, Nolwenn Celton, Raphaël Duroselle et al.
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering(AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the Audioset ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6 % on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
SDNov 18, 2025
Segmentwise Pruning in Audio-Language ModelsMarcel Gibier, Raphaël Duroselle, Pierre Serrano et al.
Recent audio-language models have shown impressive performance across a wide range of audio tasks and are increasingly capable of handling long audio inputs. However, the computing costs in these models heavily depend on sequence length, which can become very large given the nature of audio data. In the vision-language domain, token pruning methods have proven effective in reducing token counts while preserving strong performance on standard benchmarks. In this work, we investigate the relevance and effectiveness of such token selection strategies in the context of audio-language models. We also improve them by proposing a lightweight strategy that takes the time dimension into account. While retaining only a quarter of the initial tokens, our approach results in a relative maximum decrease of 2% in CIDEr on Clotho v2 and a relative maximum decrease of 4% in accuracy on MMAU.
SDSep 15, 2025
Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based CountermeasuresPierre Serrano, Raphaël Duroselle, Florian Angulo et al.
Audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders show a high level of performance when combined with layer-weighted pooling methods, such as multi-head factorized attentive pooling (MHFA). However, they still struggle to generalize to out-of-domain (OOD) conditions. We tackle this problem by studying the behavior of six different pre-trained SSLs, on four different test corpora. We perform a layer-by-layer analysis to determine which layers contribute most. Next, we study the pooling head, comparing a strategy based on a single layer with automatic selection via MHFA. We observed that selecting the best layer gave very good results, while reducing system parameters by up to 80%. A wide variation in performance as a function of test corpus and SSL model is also observed, showing that the pre-training strategy of the encoder plays a role. Finally, score-level fusion of several encoders improved generalization to OOD attacks.
LGSep 3, 2025
The distribution of calibrated likelihood functions on the probability-likelihood Aitchison simplexPaul-Gauthier Noé, Andreas Nautsch, Driss Matrouf et al.
While calibration of probabilistic predictions has been widely studied, this paper rather addresses calibration of likelihood functions. This has been discussed, especially in biometrics, in cases with only two exhaustive and mutually exclusive hypotheses (classes) where likelihood functions can be written as log-likelihood-ratios (LLRs). After defining calibration for LLRs and its connection with the concept of weight-of-evidence, we present the idempotence property and its associated constraint on the distribution of the LLRs. Although these results have been known for decades, they have been limited to the binary case. Here, we extend them to cases with more than two hypotheses by using the Aitchison geometry of the simplex, which allows us to recover, in a vector form, the additive form of the Bayes' rule; extending therefore the LLR and the weight-of-evidence to any number of hypotheses. Especially, we extend the definition of calibration, the idempotence, and the constraint on the distribution of likelihood functions to this multiple hypotheses and multiclass counterpart of the LLR: the isometric-log-ratio transformed likelihood function. This work is mainly conceptual, but we still provide one application to machine learning by presenting a non-linear discriminant analysis where the discriminant components form a calibrated likelihood function over the classes, improving therefore the interpretability and the reliability of the method.
CLNov 7, 2021
Retrieving Speaker Information from Personalized Acoustic Models for Speech RecognitionSalima Mdhaffar, Jean-François Bonastre, Marc Tommasi et al.
The widespread of powerful personal devices capable of collecting voice of their users has opened the opportunity to build speaker adapted speech recognition system (ASR) or to participate to collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. fine-tuned AM with specific speaker data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve the gender of the speaker, but also his identity, by just exploiting the weight matrix changes of a neural acoustic model locally adapted to this speaker. Incidentally we observe phenomena that may be useful towards explainability of deep neural networks in the context of speech processing. Gender can be identified almost surely using only the first layers and speaker verification performs well when using middle-up layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection, and an Equal Error Rate of 9.07% for a speaker verification task by only exploiting the weights from personalized models that could be exchanged instead of user data.
CLNov 6, 2021
Privacy attacks for automatic speech recognition acoustic models in a federated learning frameworkNatalia Tomashenko, Salima Mdhaffar, Marc Tommasi et al.
This paper investigates methods to effectively retrieve speaker information from the personalized speaker adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide equal error rate (EER) of 1-2%.
CROct 12, 2021
A bridge between features and evidence for binary attribute-driven perfect privacyPaul-Gauthier Noé, Andreas Nautsch, Driss Matrouf et al.
Attribute-driven privacy aims to conceal a single user's attribute, contrary to anonymisation that tries to hide the full identity of the user in some data. When the attribute to protect from malicious inferences is binary, perfect privacy requires the log-likelihood-ratio to be zero resulting in no strength-of-evidence. This work presents an approach based on normalizing flow that maps a feature vector into a latent space where the evidence, related to the binary attribute, and an independent residual are disentangled. It can be seen as a non-linear discriminant analysis where the mapping is invertible allowing generation by mapping the latent variable back to the original space. This framework allows to manipulate the log-likelihood-ratio of the data and therefore allows to set it to zero for privacy. We show the applicability of the approach on an attribute-driven privacy task where the sex information is removed from speaker embeddings. Results on VoxCeleb2 dataset show the efficiency of the method that outperforms in terms of privacy and utility our previous experiments based on adversarial disentanglement.
CLSep 1, 2021
The VoicePrivacy 2020 Challenge: Results and findingsNatalia Tomashenko, Xin Wang, Emmanuel Vincent et al.
This paper presents the results and analyses stemming from the first VoicePrivacy 2020 Challenge which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design with an analysis of submitted systems and evaluation results. In particular, we describe the voice anonymization task and datasets used for system development and evaluation. Also, we present different attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and provide a summary description of the anonymization systems developed by the challenge participants. We report objective and subjective evaluation results for baseline and submitted systems. In addition, we present experimental results for alternative privacy metrics and attack models developed as a part of the post-evaluation analysis. Finally, we summarize our insights and observations that will influence the design of the next VoicePrivacy challenge edition and some directions for future voice anonymization research.
ASDec 8, 2020
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy PreservationPaul-Gauthier Noé, Mohammad Mohammadamini, Driss Matrouf et al.
In speech technologies, speaker's voice representation is used in many applications such as speech recognition, voice conversion, speech synthesis and, obviously, user authentication. Modern vocal representations of the speaker are based on neural embeddings. In addition to the targeted information, these representations usually contain sensitive information about the speaker, like the age, sex, physical state, education level or ethnicity. In order to allow the user to choose which information to protect, we introduce in this paper the concept of attribute-driven privacy preservation in speaker voice representation. It allows a person to hide one or more personal aspects to a potential malicious interceptor and to the application provider. As a first solution to this concept, we propose to use an adversarial autoencoding method that disentangles in the voice representation a given speaker attribute thus allowing its concealment. We focus here on the sex attribute for an Automatic Speaker Verification (ASV) task. Experiments carried out using the VoxCeleb datasets have shown that the proposed method enables the concealment of this attribute while preserving ASV ability.
ASAug 30, 2020
Speech Pseudonymisation Assessment Using Voice Similarity MatricesPaul-Gauthier Noé, Jean-François Bonastre, Driss Matrouf et al.
The proliferation of speech technologies and rising privacy legislation calls for the development of privacy preservation solutions for speech applications. These are essential since speech signals convey a wealth of rich, personal and potentially sensitive information. Anonymisation, the focus of the recent VoicePrivacy initiative, is one strategy to protect speaker identity information. Pseudonymisation solutions aim not only to mask the speaker identity and preserve the linguistic content, quality and naturalness, as is the goal of anonymisation, but also to preserve voice distinctiveness. Existing metrics for the assessment of anonymisation are ill-suited and those for the assessment of pseudonymisation are completely lacking. Based upon voice similarity matrices, this paper proposes the first intuitive visualisation of pseudonymisation performance for speech signals and two novel metrics for objective assessment. They reflect the two, key pseudonymisation requirements of de-identification and voice distinctiveness.
CLMay 4, 2020
Introducing the VoicePrivacy InitiativeNatalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang et al.
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy 2020 Challenge and describe the datasets used for system development and evaluation. We also present the attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and report objective evaluation results.
SDDec 15, 2016
LIA system description for NIST SRE 2016Mickael Rouvier, Pierre-Michel Bousquet, Moez Ajili et al.
This paper describes the LIA speaker recognition system developed for the Speaker Recognition Evaluation (SRE) campaign. Eight sub-systems are developed, all based on a state-of-the-art approach: i-vector/PLDA which represents the mainstream technique in text-independent speaker recognition. These sub-systems differ: on the acoustic feature extraction front-end (MFCC, PLP), at the i-vector extraction stage (UBM, DNN or two-feats posteriors) and finally on the data-shifting (IDVC, mean-shifting). The submitted system is a fusion at the score-level of these eight sub-systems.