SDAIASJan 31, 2025

Deepfake Detection of Singing Voices With Whisper Encodings

arXiv:2501.18919v13 citationsh-index: 1ICASSP
Originality Synthesis-oriented
AI Analysis

This addresses the problem of deepfake singing vocals for artists in the music industry, representing an incremental improvement.

The paper tackled deepfake detection of singing voices by using noise-variant encodings from OpenAI's Whisper model, achieving performance evaluated in %EER with varying model sizes and classifiers under different testing conditions.

The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system, which uses noise-variant encodings of open-AI's Whisper model. As counter-intuitive as it may sound, even though the Whisper model is known to be noise-robust, the encodings are rich in non-speech information, and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. Therefore, in this work, the SVDD task is performed on vocals and mixtures, and the performance is evaluated in \%EER over varying Whisper model sizes and two classifiers- CNN and ResNet34, under different testing conditions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes