Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection
For persons-of-interest (POIs) like public figures, this work provides a personalized, interpretable defense against speech deepfakes, addressing the limitation of generic black-box models.
The paper tackles speaker-specific deepfake detection by proposing Phoneme-based Voice Profiling (PVP), which models each speaker's phonetic realizations with Gaussian Mixture Models (GMMs) fit only to bona fide speech. In POI spoofing scenarios, PVP significantly outperforms state-of-the-art generic detectors, achieving substantial Equal Error Rate (EER) reductions.
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human speech, posing significant threats to persons-of-interest (POIs) such as public figures. Current detection systems rely primarily on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
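To make the per-phoneme profiling idea concrete, the following is a minimal sketch and not the authors' implementation (see the repository above for that). It assumes acoustic frames have already been extracted and aligned to phoneme labels by some external tool, fits one small diagonal-covariance GMM per phoneme using only bona fide frames, and scores a test utterance by the mean per-phoneme average log-likelihood. The function names (`fit_diag_gmm`, `build_profile`, `score_utterance`), the number of components, and the scoring rule are all illustrative assumptions.

```python
import numpy as np


def _component_log_probs(X, weights, means, variances):
    """Log of weight * diagonal Gaussian density, shape (n_frames, n_components)."""
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.log(2.0 * np.pi * variances)[None] + diff ** 2 / variances[None]
    ).sum(axis=2)
    return np.log(weights)[None, :] + log_gauss


def fit_diag_gmm(X, n_components=2, n_iter=50, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM to frames X (n_frames, n_dims) with plain EM."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialise means from random frames, variances from the global variance.
    means = X[rng.choice(n, size=n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        log_p = _component_log_probs(X, weights, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means, and floored variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances


def avg_log_likelihood(X, gmm):
    """Average per-frame log-likelihood of frames X under a fitted GMM."""
    log_p = _component_log_probs(X, *gmm)
    return float(np.logaddexp.reduce(log_p, axis=1).mean())


def build_profile(frames_by_phoneme, n_components=2):
    """Hypothetical profile: one GMM per phoneme, fit on bona fide frames only."""
    return {
        phoneme: fit_diag_gmm(frames, n_components=n_components)
        for phoneme, frames in frames_by_phoneme.items()
    }


def score_utterance(profile, segments):
    """Score (phoneme, frames) segments; higher means closer to the POI profile.

    Also usable per phoneme for inspection: each segment's own average
    log-likelihood shows which phonemes deviate from the profile.
    """
    scores = [
        avg_log_likelihood(frames, profile[phoneme])
        for phoneme, frames in segments
        if phoneme in profile  # skip phonemes unseen in the reference speech
    ]
    if not scores:
        raise ValueError("no segment matches a phoneme in the profile")
    return float(np.mean(scores))
```

In this sketch a detection decision would come from thresholding the utterance score, with the threshold chosen on held-out bona fide speech; no spoofed audio is needed to build the profile, which mirrors the bona-fide-only design described in the abstract.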