Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech
For researchers and practitioners in deepfake speech detection, this work improves generalization to unseen generative models. The contribution is incremental, since it builds on existing self-supervised learning (SSL) and augmentation techniques.
The paper addresses the domain gap between proxy data (codec-resynthesized speech) and real-world deepfake speech. The proposed Domain-Shift Feature Augmentation (DSFA) method, combined with a post-trained SSL backbone, achieves state-of-the-art performance on the CoSG Eval and CoSG ExtEval datasets, the latter of which contains 40 unseen generative models.
Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec-resynthesized speech (CoRS) as proxy data improves performance, models trained on it often generalize poorly. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval dataset (from CodecFake+), featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
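The abstract describes DSFA only as turning deterministic feature statistics into stochastic distributions during fine-tuning. As a rough illustration of that general idea, and not the authors' implementation, the sketch below perturbs the per-utterance mean and standard deviation of a feature batch with Gaussian noise whose scale is estimated across the batch. The function name `stochastic_feature_stats`, the `(batch, time, channels)` layout, the batch-level spread estimate, and the parameters `p` and `eps` are all assumptions made for this example.

```python
import numpy as np


def stochastic_feature_stats(x, rng, p=0.5, eps=1e-6):
    """Hypothetical sketch: resample per-utterance feature statistics.

    x   : array of shape (batch, time, channels), e.g. SSL hidden features.
    rng : numpy.random.Generator used for all sampling.
    p   : probability of applying the perturbation to this batch (assumed).
    """
    if rng.random() >= p:
        return x  # leave this batch unchanged

    # Deterministic statistics: per-utterance, per-channel mean/std over time.
    mu = x.mean(axis=1, keepdims=True)                    # (B, 1, C)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)   # (B, 1, C)

    # Assumed uncertainty estimate: spread of those statistics across the batch.
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)        # (1, 1, C)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)  # (1, 1, C)

    # Stochastic statistics: sample around the deterministic values.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread

    # Normalize with the original statistics, then re-style with sampled ones.
    return (x - mu) / sigma * new_sigma + new_mu


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 50, 8))  # toy batch standing in for SSL features
augmented = stochastic_feature_stats(features, rng, p=1.0)
```

Since the abstract places the augmentation in fine-tuning, a sketch like this would presumably be applied only to training batches and skipped at inference, for example by calling it with `p=0.0`.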