Self-supervised pre-training with acoustic configurations for replay spoofing detection
This work addresses the problem of replay spoofing detection for security applications, offering an incremental improvement by leveraging existing datasets to overcome data scarcity.
The paper tackles the scarcity of replay spoofing detection data by proposing a self-supervised pre-training framework that learns acoustic configurations from datasets published for other tasks, reporting a 30% improvement over the baseline approach on the ASVspoof 2019 physical access dataset.
Constructing a dataset for replay spoofing detection requires physically playing back an utterance and re-recording it, which makes large-scale data collection difficult. In this study, we propose a self-supervised framework for pre-training on acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the environmental factors introduced during voice recording, rather than the voice itself, including microphone type, recording location, and ambient noise level. Specifically, we select pairs of segments from utterances and train deep neural networks to determine whether the acoustic configurations of the two segments are identical. We validate the effectiveness of the proposed method on the ASVspoof 2019 physical access dataset using two well-performing systems. The experimental results demonstrate that the proposed method outperforms the baseline approach by 30%.
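The pair-selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_pair`, the parameters `segment_len` and `p_same`, and the labeling rule are all assumptions made for illustration. In particular, the sketch assumes that two segments cropped from the same utterance share an acoustic configuration (label 1) and that segments from different utterances do not (label 0); the abstract does not state how pairs are labeled.

```python
import random


def sample_pair(utterances, segment_len, rng, p_same=0.5):
    """Draw one training pair for the same/different configuration task.

    utterances  : list of waveforms (each a sequence of samples), every one
                  at least segment_len samples long
    segment_len : number of samples per cropped segment
    rng         : random.Random instance, passed in for reproducibility
    p_same      : probability of drawing a same-utterance pair (hypothetical)

    Returns (segment_a, segment_b, label). Label 1 means "same acoustic
    configuration" under the ASSUMPTION that one utterance is recorded under
    a single configuration; label 0 assumes different utterances differ.
    """

    def crop(wave):
        # Pick a random start so the crop fits entirely inside the waveform.
        start = rng.randrange(len(wave) - segment_len + 1)
        return wave[start:start + segment_len]

    if rng.random() < p_same:
        # Positive pair: two independent crops of one utterance.
        utt = rng.choice(utterances)
        return crop(utt), crop(utt), 1

    # Negative pair: crops from two distinct utterances.
    utt_a, utt_b = rng.sample(utterances, 2)
    return crop(utt_a), crop(utt_b), 0
```

A network pre-trained on such pairs would take both segments as input and be optimized with a binary objective on the label. In a real corpus, distinct utterances can share a microphone and room, so the negative label here is a noisy proxy rather than ground truth.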