Hemlata Tak

AS
h-index: 13
16 papers
1,793 citations
Novelty: 44%
AI Score: 45

16 Papers

AS Aug 16, 2024
ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Xin Wang, Héctor Delgado, Hemlata Tak et al.

ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.

AS Jun 13, 2023
Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems

Michele Panariello, Wanying Ge, Hemlata Tak et al.

We present Malafide, a universal adversarial attack against automatic speaker verification (ASV) spoofing countermeasures (CMs). By introducing convolutional noise using an optimised linear time-invariant filter, Malafide attacks can be used to compromise CM reliability while preserving other speech attributes such as quality and the speaker's voice. In contrast to other adversarial attacks proposed recently, Malafide filters are optimised independently of the input utterance and duration, are tuned instead to the underlying spoofing attack, and require the optimisation of only a small number of filter coefficients. Even so, they degrade CM performance estimates by an order of magnitude, even in black-box settings, and can also be configured to overcome integrated CM and ASV subsystems. Integrated solutions that use self-supervised learning CMs, however, are more robust, under both black-box and white-box settings.
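The idea of convolutive noise from a fixed linear time-invariant filter can be illustrated with a minimal sketch. This is not the Malafide implementation: the function name `apply_convolutive_noise` is hypothetical, and the filter taps here are random placeholders rather than coefficients optimised against a countermeasure. The sketch only shows that one short filter can be applied to utterances of any length.

```python
import numpy as np


def apply_convolutive_noise(waveform: np.ndarray, fir_taps: np.ndarray) -> np.ndarray:
    """Filter a waveform with a fixed FIR (linear time-invariant) filter.

    mode="same" keeps the output the same length as the input, so the
    same taps can be reused for utterances of any duration.
    """
    return np.convolve(waveform, fir_taps, mode="same")


rng = np.random.default_rng(0)

# Placeholder taps (assumption): a real attack would optimise these.
fir_taps = rng.standard_normal(33) * 0.05
fir_taps[16] += 1.0  # keep most energy on the direct path

# Two "utterances" of different durations, filtered with the same taps.
short_utt = rng.standard_normal(8000)
long_utt = rng.standard_normal(48000)
short_out = apply_convolutive_noise(short_utt, fir_taps)
long_out = apply_convolutive_noise(long_utt, fir_taps)

# A centred unit impulse is the identity filter and leaves the signal unchanged.
identity_taps = np.zeros(5)
identity_taps[2] = 1.0
unchanged = apply_convolutive_noise(short_utt, identity_taps)
```

Because the filter does not depend on the input, only the handful of tap values would need to be optimised, which is the property the abstract highlights.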

CR Sep 21, 2023
t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Tomi Kinnunen, Kong Aik Lee, Hemlata Tak et al.

Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariant to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.
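For context, the conventional single-system equal error rate (EER), which the t-EER extends to the tandem case, can be sketched as follows. This is only the standard EER for one detector, not the t-EER or the concurrent t-EER; the function name `compute_eer` and the convention that higher scores mean "more bona fide" are assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a single detector.

    Assumes higher scores indicate bona fide input. The EER is taken at
    the candidate threshold where miss and false alarm rates are closest.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # Miss rate: bona fide trials rejected (score below threshold).
    miss = np.array([(bonafide < t).mean() for t in thresholds])
    # False alarm rate: spoofed trials accepted (score at or above threshold).
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(miss - false_alarm)))
    eer = (miss[idx] + false_alarm[idx]) / 2.0
    return eer, thresholds[idx]


# Perfectly separated toy scores give an EER of zero.
eer_separated, _ = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])

# Identical score distributions give an EER of one half.
eer_overlap, _ = compute_eer([0.0, 1.0], [0.0, 1.0])
```

With two sub-systems there are two thresholds rather than one, which is why the abstract speaks of a set of equal-error operating points instead of a single value.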

SP Mar 26
ASVspoof 5: Evaluation of Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

Xin Wang, Héctor Delgado, Nicholas Evans et al.

ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake detection solutions. A significant change from previous challenge editions is a new crowdsourced database collected from a substantially greater number of speakers under diverse recording conditions, and a mix of cutting-edge and legacy generative speech technology. With the new database described elsewhere, we provide in this paper an overview of the ASVspoof 5 challenge results for the submissions of 53 participating teams. While many solutions perform well, performance degrades under adversarial attacks and the application of neural encoding/compression schemes. Together with a review of post-challenge results, we also report a study of calibration in addition to other principal challenges and outline a road-map for the future of ASVspoof.

AS Jul 27, 2021 Code
End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Hemlata Tak, Jee-weon Jung, Jose Patino et al.

Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific subbands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open source implementation.

CV Aug 11, 2025
Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Nicholas Klein, Hemlata Tak, James Fullwood et al.

The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

AS Jun 8, 2024
To what extent can ASV systems naturally defend against spoofing attacks?

Jee-weon Jung, Xin Wang, Nicholas Evans et al.

The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies.

SD Jun 5, 2024
Harder or Different? Understanding Generalization of Audio Deepfake Detection

Nicolas M. Müller, Nicholas Evans, Hemlata Tak et al.

Recent research has highlighted a key issue in speech deepfake detection: models trained on one set of deepfakes perform poorly on others. The question arises: is this due to the continuously improving quality of Text-to-Speech (TTS) models, i.e., are newer deepfakes just 'harder' to detect? Or is it because deepfakes generated with one model are fundamentally different to those generated using another model? We answer this question by decomposing the performance gap between in-domain and out-of-domain test data into 'hardness' and 'difference' components. Experiments performed using ASVspoof databases indicate that the hardness component is practically negligible, with the performance gap being attributed primarily to the difference component. This has direct implications for real-world deepfake detection, highlighting that merely increasing model capacity, the currently-dominant research trend, may not effectively address the generalization challenge.

AS May 30, 2023
Towards single integrated spoofing-aware speaker verification embeddings

Sung Hwan Mun, Hye-jin Shim, Hemlata Tak et al.

This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two requirements. First, they should reject both non-target speakers' inputs and target speakers' spoofed inputs. Second, they should demonstrate competitive performance compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. Our analysis attributes the inferior performance of single SASV embeddings to an insufficient amount of training data and the distinct nature of the ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

AS Feb 24, 2022
Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Hemlata Tak, Massimiliano Todisco, Xin Wang et al.

The performance of spoofing countermeasure systems depends fundamentally upon the use of sufficiently representative training data. With this usually being limited, current solutions typically lack generalisation to attacks encountered in the wild. Strategies to improve reliability in the face of uncontrolled, unpredictable attacks are hence needed. We report in this paper our efforts to use self-supervised learning in the form of a wav2vec 2.0 front-end with fine tuning. Despite initial base representations being learned using only bona fide data and no spoofed data, we obtain the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases. When combined with data augmentation, these results correspond to an improvement of almost 90% relative to our baseline system.

SD Jan 25, 2022
SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

Jee-weon Jung, Hemlata Tak, Hye-jin Shim et al.

ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by a different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasures), much less how both systems should be jointly optimised. The goal of the first SASV (spoofing-aware speaker verification) challenge, a special session at ISCA INTERSPEECH 2022, is to promote development of integrated systems that can perform ASV and CM simultaneously.

AS Nov 8, 2021
RawBoost: A Raw Data Boosting and Augmentation Method applied to Automatic Speaker Verification Anti-Spoofing

Hemlata Tak, Madhu Kamble, Jose Patino et al.

This paper introduces RawBoost, a data boosting and augmentation method for the design of more reliable spoofing detection solutions which operate directly upon raw waveform inputs. While RawBoost requires no additional data sources, e.g., noise recordings or impulse responses, and is data, application and model agnostic, it is designed for telephony scenarios. Based upon the combination of linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise, RawBoost models nuisance variability stemming from, e.g., encoding, transmission, microphones and amplifiers, and both linear and non-linear distortion. Experiments performed using the ASVspoof 2021 logical access database show that RawBoost improves the performance of a state-of-the-art raw end-to-end baseline system by 27% relative and is only outperformed by solutions that either depend on external data or that require additional intervention at the model level.
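Of the noise types named above, stationary signal-independent additive noise is the simplest to illustrate. The sketch below is not the RawBoost algorithm: it adds plain white Gaussian noise at a chosen signal-to-noise ratio, and the function name `add_stationary_noise` and the `snr_db` parameter are hypothetical. It shows only that such augmentation needs no external noise recordings.

```python
import numpy as np


def add_stationary_noise(waveform: np.ndarray, snr_db: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR in dB.

    The noise is generated independently of the signal content; only its
    overall level is scaled relative to the signal power.
    """
    signal_power = np.mean(waveform ** 2)
    noise = rng.standard_normal(waveform.shape)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that signal_power / noise_power matches the target.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return waveform + noise


rng = np.random.default_rng(0)

# Toy input (assumption): one second of a 440 Hz tone sampled at 16 kHz.
time_axis = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * time_axis)

noisy = add_stationary_noise(clean, snr_db=20.0, rng=rng)

# Recover the SNR that was actually applied.
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

In an augmentation pipeline the SNR would typically be drawn at random for each training utterance, so that the detector sees a range of noise levels.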

AS Oct 4, 2021
AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks

Jee-weon Jung, Hee-Soo Heo, Hemlata Tak et al.

Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains. Their reliable detection usually depends upon computationally demanding ensemble systems where each subsystem is tuned to some specific artefacts. We seek to develop an efficient, single system that can detect a broad range of different spoofing attacks without score-level ensembles. We propose a novel heterogeneous stacking graph attention layer which models artefacts spanning heterogeneous temporal and spectral domains with a heterogeneous attention mechanism and a stack node. With a new max graph operation that involves a competitive mechanism and an extended readout scheme, our approach, named AASIST, outperforms the current state-of-the-art by 20% relative. Even a lightweight variant, AASIST-L, with only 85K parameters, outperforms all competing systems.

AS Apr 8, 2021
Graph Attention Networks for Anti-Spoofing

Hemlata Tak, Jee-weon Jung, Jose Patino et al.

The cues needed to detect spoofing attacks against automatic speaker verification are often located in specific spectral sub-bands or temporal segments. Previous works show the potential to learn these using either spectral or temporal self-attention mechanisms but not the relationships between neighbouring sub-bands or segments. This paper reports our use of graph attention networks (GATs) to model these relationships and to improve spoofing detection performance. GATs leverage a self-attention mechanism over graph structured data to model the data manifold and the relationships between nodes. Our graph is constructed from representations produced by a ResNet. Nodes in the graph represent information either in specific sub-bands or temporal segments. Experiments performed on the ASVspoof 2019 logical access database show that our GAT-based model with temporal attention outperforms all of our baseline single systems. Furthermore, GAT-based systems are complementary to a set of existing systems. The fusion of GAT-based models with more conventional countermeasures delivers a 47% relative improvement in performance compared to the best performing single GAT system.
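The self-attention over graph nodes described above can be sketched with a single-head graph attention layer in the commonly used formulation (LeakyReLU-scored pairs, softmax over neighbours). This is an illustration under assumptions, not the model from the paper: the graph is taken to be fully connected, the weights are random, and the names `graph_attention_layer`, `weight`, and `attn_vec` are hypothetical.

```python
import numpy as np


def graph_attention_layer(nodes, weight, attn_vec, negative_slope=0.2):
    """Single-head graph attention over a fully connected graph.

    nodes:    (N, F_in) node features, e.g. one node per sub-band or segment
    weight:   (F_in, F_out) shared linear projection
    attn_vec: (2 * F_out,) attention parameters
    Returns the updated node features (N, F_out) and the attention map (N, N).
    """
    projected = nodes @ weight
    f_out = projected.shape[1]

    # a^T [Wh_i || Wh_j] decomposes into a term for node i plus a term for node j.
    source_term = projected @ attn_vec[:f_out]
    target_term = projected @ attn_vec[f_out:]
    logits = source_term[:, None] + target_term[None, :]

    # LeakyReLU, then a numerically stable softmax over each node's neighbours.
    logits = np.where(logits > 0, logits, negative_slope * logits)
    logits = logits - logits.max(axis=1, keepdims=True)
    attention = np.exp(logits)
    attention = attention / attention.sum(axis=1, keepdims=True)

    # Each node becomes an attention-weighted sum of projected neighbours.
    return attention @ projected, attention


rng = np.random.default_rng(0)
num_nodes, f_in, f_out = 12, 32, 16  # toy sizes (assumption)
node_features = rng.standard_normal((num_nodes, f_in))
weight = rng.standard_normal((f_in, f_out)) * 0.1
attn_vec = rng.standard_normal(2 * f_out) * 0.1

updated_nodes, attention_map = graph_attention_layer(node_features, weight, attn_vec)
```

Row i of `attention_map` shows how strongly node i attends to every other node, which is the kind of relationship between sub-bands or segments that the abstract refers to.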

AS May 20, 2020
Spoofing Attack Detection using the Non-linear Fusion of Sub-band Classifiers

Hemlata Tak, Jose Patino, Andreas Nautsch et al.

The threat of spoofing can pose a risk to the reliability of automatic speaker verification. Results from the biennial ASVspoof evaluations show that effective countermeasures demand front-ends designed specifically for the detection of spoofing artefacts. Given the diversity in spoofing attacks, ensemble methods are particularly effective. The work in this paper shows that a bank of very simple classifiers, each with a front-end tuned to the detection of different spoofing attacks and combined at the score level through non-linear fusion, can deliver better performance than more sophisticated ensemble solutions that rely upon complex neural network architectures. Our comparatively simple approach outperforms all but 2 of the 48 systems submitted to the logical access condition of the most recent ASVspoof 2019 challenge.

AS Apr 14, 2020
An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification

Hemlata Tak, Jose Patino, Andreas Nautsch et al.

Anti-spoofing for automatic speaker verification is now a well established area of research, with three competitive challenges having been held in the last 6 years. A great deal of research effort over this time has been invested into the development of front-end representations tailored to the spoofing detection task. One such representation, constant Q cepstral coefficients (CQCCs), has been shown to be especially effective in detecting attacks implemented with a unit selection based speech synthesis algorithm. Despite their success, CQCCs largely fail in detecting other forms of spoofing attack where more traditional front-end representations give substantially better results. Similar differences were also observed in the most recent, 2019 edition of the ASVspoof challenge series. This paper reports our attempts to help explain these observations. The explanation is shown to lie in the level of attention paid by each front-end to different sub-band components of the spectrum. Thus far, surprisingly little has been learned about what artefacts are being detected by spoofing countermeasures. Our work hence aims to shed light upon signal or spectrum level artefacts that serve to distinguish different forms of spoofing attack from genuine, bona fide speech. With a better understanding of these artefacts we will be better positioned to design more reliable countermeasures.