Florian Lux

CL
h-index11
17papers
2,136citations
Novelty45%
AI Score47

17 Papers

CLOct 21, 2022Code
Low-Resource Multilingual and Zero-Shot Multispeaker TTS

Florian Lux, Julia Koch, Ngoc Thang Vu

While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language agnostic meta learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn speaking a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness and similarity to target speaker using objective metrics as well as human studies and provide our code and trained models open source.

CLOct 26, 2023Code
The IMS Toucan System for the Blizzard Challenge 2023

Florian Lux, Julia Koch, Sarina Meyer et al.

For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.

SDOct 13, 2022
Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy

Sarina Meyer, Pascal Tilli, Pavel Denisov et al.

In order to protect the privacy of speech data, speaker anonymization aims for hiding the identity of a speaker by changing the voice in speech recordings. This typically comes with a privacy-utility trade-off between protection of individuals and usability of the data for downstream applications. One of the challenges in this context is to create non-existent voices that sound as natural as possible. In this work, we propose to tackle this issue by generating speaker embeddings using a generative adversarial network with Wasserstein distance as cost function. By incorporating these artificial embeddings into a speech-to-text-to-speech pipeline, we outperform previous approaches in terms of privacy and utility. According to standard objective metrics and human evaluation, our approach generates intelligible and content-preserving yet privacy-protecting versions of the original recordings.

CLMar 7, 2022
Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features

Florian Lux, Ngoc Thang Vu

While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data. In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages. In conjunction with language agnostic meta learning, this enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.

SDJul 11, 2022
Speaker Anonymization with Phonetic Intermediate Representations

Sarina Meyer, Florian Lux, Pavel Denisov et al.

In this work, we propose a speaker anonymization pipeline that leverages high quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near complete elimination of speaker identity information from the input while preserving the original phonetic content as much as possible. Our experimental results on LibriSpeech and VCTK corpora reveal two key findings: 1) although automatic speech recognition produces imperfect transcriptions, our neural speech synthesis system can handle such errors, making our system feasible and robust, and 2) combining speaker embeddings from different resources is beneficial and their appropriate normalization is crucial. Overall, our final best system outperforms significantly the baselines provided in the Voice Privacy Challenge 2020 in terms of privacy robustness against a lazy-informed attacker while maintaining high intelligibility and naturalness of the anonymized speech.

SDJun 24, 2022
Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

Florian Lux, Julia Koch, Ngoc Thang Vu

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance level normalization in conjunction with an utterance level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features, that can be finetuned on individual samples within seconds. We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality and high similarity to both original voice and prosody, as our objective evaluation and human study show. All of our code and trained models are available, alongside static and interactive demos.

ASJul 11, 2022
PoeticTTS -- Controllable Poetry Reading for Literary Studies

Julia Koch, Florian Lux, Nadja Schauffler et al.

Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human like naturalness in order to enable literary scholars to systematically examine hypotheses on the interplay between text, spoken realisation, and the listener's perception of poems. To meet these special requirements for literary studies, we resynthesise poems by cloning prosodic values from a human reference recitation, and afterwards make use of fine-grained prosody control to manipulate the synthetic speech in a human-in-the-loop setting to alter the recitation w.r.t. specific phenomena. We find that finetuning our TTS model on poetry captures poetic intonation patterns to a large extent which is beneficial for prosody cloning and manipulation and verify the success of our approach both in an objective evaluation as well as in human studies.

SDOct 26, 2023
Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

Florian Lux, Pascal Tilli, Sarina Meyer et al.

Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.

CLOct 21, 2022
Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis

Florian Lux, Ching-Yi Chen, Ngoc Thang Vu

Embedding paralinguistic properties is a challenging task as there are only a few hours of training data available for domains such as emotional speech. One solution to this problem is to pretrain a general self-supervised speech representation model on large amounts of unlabeled speech. This pretrained model is then finetuned to a specific task. Paralinguistic properties however have notoriously high class variance, making the finetuning ineffective. In this work, we propose a two step approach to this. First we improve the embedding space, then we train an adapter to bridge the gap from the embedding space to a classification task. In order to improve the class invariance we use a combination of contrastive and non-contrastive losses to explicitly optimize for class invariant, yet discriminative features. Our approach consistently outperforms baselines that are finetuned end-to-end on multiple tasks and surpasses a benchmark on state-of-the-art emotion classification.

CLJul 3, 2024
Probing the Feasibility of Multilingual Speaker Anonymization

Sarina Meyer, Florian Lux, Ngoc Thang Vu

In speaker anonymization, speech recordings are modified in a way that the identity of the speaker remains hidden. While this technology could help to protect the privacy of individuals around the globe, current research restricts this by focusing almost exclusively on English data. In this study, we extend a state-of-the-art anonymization system to nine languages by transforming language-dependent components to their multilingual counterparts. Experiments testing the robustness of the anonymized speech against privacy attacks and speech deterioration show an overall success of this system for all languages. The results suggest that speaker embeddings trained on English data can be applied across languages, and that the anonymization performance for a language is mainly affected by the quality of the speech synthesis component used for it.

SDSep 17, 2024
High-Resolution Speech Restoration with Latent Diffusion Model

Tushar Dhyani, Florian Lux, Michele Mancusi et al.

Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-instrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.

SDFeb 18
How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Yixuan Xiao, Florian Lux, Alejandro Pérez-González-de-Martos et al.

Since Text-to-Speech systems typically don't produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.

CLMay 4, 2020Code
ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents

Chia-Yu Li, Daniel Ortega, Dirk Väth et al.

We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically experienced users, such as machine learning researchers, but also for less technically experienced users, such as linguists or cognitive scientists, thereby providing a flexible platform for collaborative research. Link to open-source code: https://github.com/DigitalPhonetics/adviser

ASJun 30, 2025
Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis

Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos et al.

While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.

CLJun 10, 2024
Controlling Emotion in Text-to-Speech with Natural Language Prompts

Thomas Bott, Florian Lux, Ngoc Thang Vu

In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.

CLJun 10, 2024
Meta Learning Text-to-Speech Synthesis in over 7000 Languages

Florian Lux, Sarina Meyer, Lyonel Behringer et al.

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

ASFeb 25, 2021
Meta-Learning for improving rare word recognition in end-to-end ASR

Florian Lux, Ngoc Thang Vu

We propose a new method of generating meaningful embeddings for speech, changes to four commonly used meta learning approaches to enable them to perform keyword spotting in continuous signals and an approach of combining their outcomes into an end-to-end automatic speech recognition system to improve rare word recognition. We verify the functionality of each of our three contributions in two experiments exploring their performance for different amounts of classes (N-way) and examples per class (k-shot) in a few-shot setting. We find that the speech embeddings work well and the changes to the meta learning approaches also clearly enable them to perform continuous signal spotting. Despite the interface between keyword spotting and speech recognition being very simple, we are able to consistently improve word error rate by up to 5%.