SD LG AS · Sep 22, 2023

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira, Cláudio E. C. Campelo

arXiv:2309.12802v1 · 1 citation · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity for speech-to-text transcription, particularly for non-English languages, though it appears incremental as it builds on existing deepfake and transcription models.

The authors tackled the problem of limited labeled speech data for training robust transcription models, especially for less common languages, by proposing a deepfake audio data augmentation framework. They validated the framework using existing deepfake and transcription models on an Indian-accented English dataset, showing it can be used to train speech-to-text models in various scenarios.

Training transcription models that produce robust results requires a large and diverse labeled dataset. Finding data with the necessary characteristics is challenging, especially for languages less popular than English, and producing it takes significant effort and often considerable cost. Data augmentation is one strategy for mitigating this problem. In this work, we propose a data augmentation framework based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and an English-language dataset recorded by Indian speakers were selected, ensuring that a single accent is present in the dataset. The augmented data was then used to train speech-to-text models in various scenarios.
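The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance` type, the `augment_with_cloned_voices` function, and the `clone_voice` callable are all hypothetical names, and the stand-in cloner below only returns silence so that the example runs without any speech model. In practice, `clone_voice` would wrap whichever voice-cloning model is chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Utterance:
    """One labeled training example (hypothetical structure)."""
    audio: List[float]       # mono waveform samples
    transcript: str          # ground-truth text for the audio
    speaker_id: str
    synthetic: bool = False  # True when produced by the voice cloner


# Assumed cloner interface: (reference audio, text to speak) -> new waveform.
VoiceCloner = Callable[[Sequence[float], str], List[float]]


def augment_with_cloned_voices(
    real: Sequence[Utterance],
    new_sentences: Sequence[str],
    clone_voice: VoiceCloner,
) -> List[Utterance]:
    """Return the real utterances plus one cloned utterance per
    (reference speaker recording, new sentence) pair."""
    augmented = list(real)
    for ref in real:
        for sentence in new_sentences:
            fake_audio = clone_voice(ref.audio, sentence)
            augmented.append(
                Utterance(fake_audio, sentence, ref.speaker_id, synthetic=True)
            )
    return augmented


def dummy_cloner(reference_audio: Sequence[float], text: str) -> List[float]:
    """Placeholder cloner: emits one silent sample per character of text."""
    return [0.0] * len(text)


real_data = [Utterance([0.1, 0.2, 0.3], "hello", "spk1")]
augmented_data = augment_with_cloned_voices(
    real_data, ["good morning", "thank you"], dummy_cloner
)
print(len(augmented_data))  # 1 real + 2 synthetic utterances
```

Keeping the `synthetic` flag on each example makes it straightforward to train on real-only, synthetic-only, or mixed subsets, which is one way to set up the "various scenarios" the abstract mentions.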
