Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
This work provides a cost-effective method for detecting synthetic words in speech, which helps both general users and researchers combat deepfake audio.
This paper addresses the detection of deepfake words within speech utterances by fine-tuning a pre-trained Whisper model for next-token prediction. The method achieves low synthetic-word and transcription error rates on in-domain data, and performs comparably to a dedicated ResNet-based model on out-of-domain data.
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized with speech-generative models. Although a dedicated synthetic-word detector could be built, we developed a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, which reduces the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
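
The idea of detecting synthetic words while transcribing can be illustrated with a minimal sketch of how a fine-tuning target might be formed and how a predicted transcript might be decoded back into word-level labels. The abstract does not specify the tagging scheme, so the marker strings `<fake>` and `</fake>` and the helper names `build_target` and `parse_prediction` are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical marker strings: the source does not specify the tagging scheme.
FAKE_OPEN = "<fake>"
FAKE_CLOSE = "</fake>"


def build_target(words: List[str], is_synthetic: List[bool]) -> str:
    """Build a target transcript in which each synthetic word is wrapped in markers."""
    if len(words) != len(is_synthetic):
        raise ValueError("words and is_synthetic must have the same length")
    tagged = [
        f"{FAKE_OPEN}{word}{FAKE_CLOSE}" if fake else word
        for word, fake in zip(words, is_synthetic)
    ]
    return " ".join(tagged)


def parse_prediction(text: str) -> Tuple[List[str], List[bool]]:
    """Recover plain words and per-word synthetic flags from a tagged transcript."""
    words: List[str] = []
    flags: List[bool] = []
    for token in text.split():
        fake = token.startswith(FAKE_OPEN) and token.endswith(FAKE_CLOSE)
        if fake:
            # Strip the markers so the plain word can be scored for transcription.
            token = token[len(FAKE_OPEN):-len(FAKE_CLOSE)]
        words.append(token)
        flags.append(fake)
    return words, flags


if __name__ == "__main__":
    target = build_target(["the", "cat", "sat"], [False, True, False])
    print(target)
    print(parse_prediction(target))
```

Under this assumed scheme, a single decoded sequence carries both outputs: the plain words can be scored for transcription errors, and the per-word flags can be scored for synthetic-word detection errors.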