End-to-end spoofing detection with raw waveform CLDNNs
This work addresses malicious spoofed-speech attacks on speaker verification systems, offering an incremental improvement in detection accuracy.
The paper tackles spoofing detection in speaker verification by proposing a raw waveform-based deep model that jointly acts as feature extractor and classifier. It achieves a half total error rate (HTER) of 0.82% on the BTAS2016 dataset, improving on the previous best published result of 1.26%.
Although recent progress in speaker verification has produced powerful models, malicious attacks in the form of spoofed speech are generally not handled. Recent results in the ASVSpoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. The most successful methods in both challenges focus on spoof-aware features rather than on a powerful classifier. In this paper we present a novel raw waveform-based deep model for spoofing detection that jointly acts as feature extractor and classifier, allowing it to classify speech signals directly. This approach can be considered an end-to-end classifier: it removes the need for any pre- or post-processing of the data, which streamlines training and evaluation and consumes less time than other neural-network-based approaches. Experiments on the BTAS2016 dataset show that the proposed raw waveform convolutional long short-term memory deep neural network (CLDNN) significantly improves system performance, from the previous best published 1.26% half total error rate (HTER) to 0.82% HTER. Moreover, the proposed system also performs well under the unknown (RE-PH2-PH3, RE-LPPH2-PH3) conditions.
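
The headline numbers above are half total error rates. As a reading aid, the following minimal sketch shows how HTER is commonly computed: the mean of the false acceptance rate (FAR) and the false rejection rate (FRR) at a fixed decision threshold. The function names, the toy scores, the threshold value, and the convention that a higher score means "more likely genuine" are all assumptions made for illustration; they are not taken from the paper or from the BTAS2016 evaluation tooling.

```python
def far_frr(genuine_scores, spoof_scores, threshold):
    """Return (FAR, FRR), assuming a higher score means 'more likely genuine'."""
    # A spoofed sample is falsely accepted when its score reaches the threshold.
    false_accepts = sum(1 for s in spoof_scores if s >= threshold)
    # A genuine sample is falsely rejected when its score falls below the threshold.
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return false_accepts / len(spoof_scores), false_rejects / len(genuine_scores)


def hter(genuine_scores, spoof_scores, threshold):
    """Half total error rate: the mean of FAR and FRR at the given threshold."""
    far, frr = far_frr(genuine_scores, spoof_scores, threshold)
    return 0.5 * (far + frr)


# Hypothetical toy scores, not data from the paper.
genuine = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.6, 0.3]
print(hter(genuine, spoof, threshold=0.5))  # prints 0.25
```

In practice the threshold is usually fixed on a development set and then applied unchanged to the test set, so that the reported HTER reflects a decision rule chosen without access to test labels.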