SD · CL · AS · Jul 3, 2023

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

arXiv:2307.00729v1 · 2 citations · h-index: 43
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of generating realistic fake human voices for applications such as audio deepfake detection, but it is incremental because it combines existing methods rather than introducing new ones.

The paper tackled the problem of synthetic speech generation by building an end-to-end multi-module model, achieving first place in the ADD 2023 challenge Track 1.1 with a weighted deception success rate of 44.97%.

The task of synthetic speech generation is to generate language content from a given text and then simulate a fake human voice. The key factors that determine the quality of synthetic speech generation include the speed of generation, the accuracy of word segmentation, and the naturalness of the synthesized speech. This paper builds an end-to-end multi-module synthetic speech generation model consisting of a speaker encoder, a synthesizer based on Tacotron2, and a vocoder based on WaveRNN. In addition, we perform extensive comparative experiments on different datasets and various model structures. Finally, we won first place in the ADD 2023 challenge Track 1.1 with a weighted deception success rate (WDSR) of 44.97%.
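
The three-module layout described in the abstract can be pictured as a simple data flow: reference audio goes to a speaker embedding, text plus that embedding goes to mel-spectrogram frames, and the frames go to a waveform. The following is a minimal sketch of that flow only. The `Pipeline` class, the `toy_*` functions, and the `HOP` constant are hypothetical placeholders invented for illustration; they are not the authors' code and do not implement Tacotron2 or WaveRNN.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
HOP = 4  # hypothetical number of waveform samples produced per mel frame


@dataclass
class Pipeline:
    """Chains the three stages named in the abstract (all stand-ins here)."""
    speaker_encoder: Callable[[Vector], Vector]          # reference audio -> embedding
    synthesizer: Callable[[str, Vector], List[Vector]]   # text + embedding -> mel frames
    vocoder: Callable[[List[Vector]], Vector]            # mel frames -> waveform

    def generate(self, text: str, reference_audio: Vector) -> Vector:
        embedding = self.speaker_encoder(reference_audio)
        mel_frames = self.synthesizer(text, embedding)
        return self.vocoder(mel_frames)


def toy_speaker_encoder(audio: Vector) -> Vector:
    # Placeholder: summarise the reference audio as [mean, peak amplitude].
    if not audio:
        return [0.0, 0.0]
    return [sum(audio) / len(audio), max(abs(s) for s in audio)]


def toy_synthesizer(text: str, embedding: Vector) -> List[Vector]:
    # Placeholder: one "mel frame" per character, conditioned on the embedding.
    return [[ord(ch) / 256.0 + embedding[0], embedding[1]] for ch in text]


def toy_vocoder(mel_frames: List[Vector]) -> Vector:
    # Placeholder: expand every frame into HOP identical samples.
    return [frame[0] for frame in mel_frames for _ in range(HOP)]


pipeline = Pipeline(toy_speaker_encoder, toy_synthesizer, toy_vocoder)
waveform = pipeline.generate("hi", [0.0, 0.5, -0.5, 1.0])
print(len(waveform))
```

Keeping each stage behind a plain callable mirrors the modular design the abstract describes: any one module could be swapped for a different model, which is one way to run comparative experiments across model structures.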

