SD CR LG AS · Jun 8, 2025

Towards Generalized Source Tracing for Codec-Based Deepfake Speech

Xuanjun Chen, I-Ming Lin, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

arXiv:2506.07294v39.34 citations · h-index: 13

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of tracing the source of deepfake speech for security and verification purposes. The advance is incremental: it builds on existing methods by combining them in a hybrid approach.

The paper tackles source tracing for codec-based deepfake speech, where models trained on simulated data overfit and generalize poorly to real audio. It introduces SASTNet, which achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset.

Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. However, how to train source tracing models using simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
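The abstract says SASTNet jointly uses semantic and acoustic encoders but does not specify how their outputs are combined. As a rough illustration only, the sketch below assumes one simple option: mean-pool each encoder's frame-level features, concatenate them, and apply a linear classifier over candidate source codecs. The function names, feature dimensions, class count, and random stand-in features are all hypothetical; no real Whisper, Wav2vec2, or AudioMAE model is called, and this is not the authors' implementation.

```python
import math
import random


def mean_pool(frames):
    """Average a list of per-frame feature vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(semantic, acoustic_a, acoustic_b, weights, bias):
    """Concatenate pooled features from three encoders, then apply a linear classifier.

    Assumption: fusion by concatenation. The abstract does not state the fusion method.
    """
    fused = mean_pool(semantic) + mean_pool(acoustic_a) + mean_pool(acoustic_b)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)


def fake_frames(n_frames, dim):
    """Random stand-in for an encoder's frame-level output (hypothetical)."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]


# Stand-ins for the three encoders; frame counts and dimensions are arbitrary.
semantic = fake_frames(5, 4)     # placeholder for semantic (Whisper-style) features
acoustic_a = fake_frames(7, 3)   # placeholder for Wav2vec2-style features
acoustic_b = fake_frames(6, 2)   # placeholder for AudioMAE-style features

n_classes = 3                    # hypothetical number of candidate source codecs
fused_dim = 4 + 3 + 2
weights = [[rng.gauss(0, 0.1) for _ in range(fused_dim)] for _ in range(n_classes)]
bias = [0.0] * n_classes

probs = fuse_and_classify(semantic, acoustic_a, acoustic_b, weights, bias)
print(probs)  # one probability per candidate source
```

In practice the classifier weights would be learned rather than random, and the pooled vectors would come from the pretrained encoders named in the abstract.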

