Hafiz Malik

CR
h-index: 15
13 papers
385 citations
Novelty: 36%
AI Score: 48

13 Papers

AS · Aug 27, 2024
Is Audio Spoof Detection Robust to Laundering Attacks?

Hashim Ali, Surya Subramani, Shefali Sudhir et al.

Voice-cloning (VC) systems have seen an exceptional increase in the realism of synthesized speech in recent years. The high quality of synthesized speech and the availability of low-cost VC services have given rise to many potential abuses of this technology. Several detection methodologies have been proposed over the years that can detect voice spoofs with reasonably good accuracy. However, these methodologies are mostly evaluated on clean audio databases, such as ASVSpoof 2019. This paper evaluates SOTA Audio Spoof Detection approaches in the presence of laundering attacks. In that regard, a new laundering attack database, called the ASVSpoof Laundering Database, is created. This database is based on the ASVSpoof 2019 (LA) eval database comprising a total of 1388.22 hours of audio recordings. Seven SOTA audio spoof detection approaches are evaluated on this laundered database. The results indicate that SOTA systems perform poorly in the presence of aggressive laundering attacks, especially reverberation and additive noise attacks. This suggests the need for robust audio spoof detection.
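
The additive-noise laundering attack mentioned above can be illustrated with a minimal sketch. It assumes white Gaussian noise mixed in at a chosen signal-to-noise ratio; the function names, the 10 dB setting, and the test tone are illustrative assumptions, not the procedure used to build the ASVSpoof Laundering Database.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy` relative to `clean`, treating the difference as noise."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# A 440 Hz tone sampled at 16 kHz stands in for a speech recording (assumption).
clean = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
laundered = add_noise_at_snr(clean, snr_db=10.0)
```

Lowering `snr_db` gives a more aggressive attack; a detector's error rate can then be tracked as a function of that single parameter.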

CV · Sep 2, 2022
Distilling Facial Knowledge With Teacher-Tasks: Semantic-Segmentation-Features For Pose-Invariant Face-Recognition

Ali Hassani, Zaid El Shair, Rafi Ud Daula Refat et al.

This paper demonstrates a novel approach to improving face-recognition pose-invariance using semantic-segmentation features. The proposed Seg-Distilled-ID network jointly learns identification and semantic-segmentation tasks, where the segmentation task is then "distilled" (MobileNet encoder). Performance is benchmarked against three state-of-the-art encoders on a publicly available dataset emphasizing head-pose variations. Experimental evaluations show that the Seg-Distilled-ID network offers notable robustness benefits, achieving 99.9% test accuracy compared with 81.6% on ResNet-101, 96.1% on VGG-19, and 96.3% on InceptionV3. This is achieved using approximately one-tenth of the top encoder's inference parameters. These results demonstrate that distilling semantic-segmentation features can efficiently address face-recognition pose-invariance.

AS · Mar 2
A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Hashim Ali, Nithin Sai Adupa, Surya Subramani et al.

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.

AS · Apr 28
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Jaskirat Sudan, Hashim Ali, Surya Subramani et al.

Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
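
The similarity choice studied above can be made concrete with a small sketch of the standard SupCon loss in which the similarity function is swappable. The angular mapping used here (1 - 2θ/π), the temperature, and the toy embeddings are assumptions for illustration; the abstract does not give the paper's exact formulation, and the cross-batch queue is omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular(u, v):
    # Assumed mapping: angle theta in [0, pi] -> similarity in [-1, 1].
    c = max(-1.0, min(1.0, cosine(u, v)))
    return 1.0 - 2.0 * math.acos(c) / math.pi


def supcon_loss(embeddings, labels, similarity=cosine, temperature=0.1):
    """Mean SupCon loss over anchors that have at least one positive."""
    losses = []
    n = len(embeddings)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: similarity(embeddings[i], embeddings[j]) / temperature for j in others}
        # Log-sum-exp over every non-anchor sample, stabilised by the max logit.
        m = max(logits.values())
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        losses.append(-sum(logits[p] - log_denominator for p in positives) / len(positives))
    return sum(losses) / len(losses)


# Toy 2-D embeddings: label 1 = bona fide, label 0 = spoof (hypothetical data).
labels = [1, 1, 0, 0]
separated = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.9, -0.2]]
mixed = [[1.0, 0.1], [-1.0, 0.1], [0.9, 0.2], [-0.9, -0.2]]
```

Calling `supcon_loss(separated, labels, similarity=angular)` swaps the similarity without touching the rest of the loss, which is the kind of controlled comparison the abstract describes.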

SD · Jan 12
LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing

Surya Subramani, Hashim Ali, Hafiz Malik

Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans one speaker (including studio-quality recordings), 30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and more than 3 million utterances. This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.

AS · Aug 28, 2025
Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System

Hashim Ali, Surya Subramani, Lekha Bollinani et al.

The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach incorporates a WavLM Large front-end with RawBoost augmentation, trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. Through extensive experimentation with different SSL front-ends, three training data versions, and two audio lengths, we achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness.

CR · Sep 24, 2020
Graph-Based Intrusion Detection System for Controller Area Networks

Riadul Islam, Rafi Ud Daula Refat, Sai Manikanta Yerram et al.

The controller area network (CAN) is the most widely used intra-vehicular communication network in the automotive industry. Because of its simplicity in design, it lacks most of the requirements needed for a security-proven communication protocol. However, a safe and secure environment is imperative for autonomous as well as connected vehicles. Therefore, CAN security is considered one of the important topics in the automotive research community. In this paper, we propose a four-stage intrusion detection system that uses the chi-squared method and can detect any kind of strong or weak cyber attack in a CAN. This work is the first-ever graph-based defense system proposed for the CAN. Our experimental results show very low misclassification rates: 5.26% for denial-of-service (DoS) attacks, 10% for fuzzy attacks, 4.76% for replay attacks, and no misclassification for spoofing attacks. In addition, the proposed methodology exhibits up to 13.73% better accuracy compared to existing ID sequence-based methods.
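
One way a graph representation and a chi-squared statistic can be combined is sketched below. The abstract does not say which graph properties the four stages test, so this sketch assumes a transition graph over consecutive arbitration IDs and compares its edge counts with a reference window; the IDs, window sizes, and the 0.5 smoothing floor are all hypothetical.

```python
from collections import Counter


def transition_graph(can_ids):
    """Count directed edges between consecutive arbitration IDs in a window."""
    return Counter(zip(can_ids, can_ids[1:]))


def chi_squared(observed, expected):
    """Pearson chi-squared statistic of observed edge counts against a reference."""
    total_obs = sum(observed.values())
    total_exp = sum(expected.values())
    statistic = 0.0
    for edge in set(observed) | set(expected):
        # Scale reference counts to the observed window size; the 0.5 floor
        # (an arbitrary smoothing choice) avoids dividing by zero for unseen edges.
        e = max(expected.get(edge, 0) * total_obs / total_exp, 0.5)
        o = observed.get(edge, 0)
        statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical traffic: three IDs cycle in normal operation; a DoS burst
# floods the bus with the highest-priority ID 0x000.
reference = transition_graph([0x100, 0x200, 0x300] * 40)
normal = transition_graph([0x100, 0x200, 0x300] * 20)
attack = transition_graph([0x100, 0x000, 0x200, 0x000, 0x300, 0x000] * 10)

normal_score = chi_squared(normal, reference)
attack_score = chi_squared(attack, reference)
```

A window would be flagged when its score exceeds a critical value chosen for the desired false-alarm rate; selecting that value is outside the scope of this sketch.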

SD · Sep 3, 2019
Voice Spoofing Detection Corpus for Single and Multi-order Audio Replays

Roland Baumann, Khalid Mahmood Malik, Ali Javed et al.

The evolution of modern voice-controlled devices (VCDs) in recent years has revolutionized the Internet of Things (IoT) and increased the realization of smart homes, personalization, and home automation through voice commands. The introduction of VCDs in IoT is expected to give rise to a new subfield of IoT, called the Multimedia of Things (MoT). These VCDs can be exploited in IoT-driven environments to generate various spoofing attacks, including replays. Replay attacks are generated by replaying the recorded audio of a legitimate human speaker with the intent of deceiving VCDs that have a speaker verification system. The connectivity among VCDs can easily be exploited in IoT-driven environments to generate a chain of replay attacks (multi-order replay attacks). Existing spoofing detection datasets such as ASVspoof and ReMASC contain only first-order replay recordings against the bonafide audio samples. These datasets cannot be used to evaluate anti-spoofing algorithms capable of detecting multi-order replay attacks. Additionally, these datasets do not capture the characteristics of microphone arrays, which are an important component of modern VCDs. A diverse replay spoofing detection corpus that includes multi-order replay recordings against the bonafide voice samples is therefore needed. This paper presents a novel voice spoofing detection corpus (VSDC) to evaluate the performance of multi-order replay anti-spoofing methods. The proposed VSDC consists of first- and second-order replay samples against the bonafide audio recordings. Additionally, the proposed VSDC can be used to evaluate the performance of speaker verification systems, as the corpus includes audio samples from fifteen different speakers. To the best of our knowledge, this is the first publicly available replay spoofing detection corpus comprising first-order and second-order replay samples.

CR · Apr 13, 2019
Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay

Khalid Mahmood Malik, Hafiz Malik, Roland Baumann

Fake audio detection is expected to become an important research area in the field of smart speakers such as Google Home and Amazon Echo, and of chatbots developed for these platforms. This paper presents the replay attack vulnerability of voice-driven interfaces and proposes a countermeasure to detect replay attacks on these platforms. It presents a novel framework to model replay attack distortion and then uses a non-learning-based method for replay attack detection on smart speakers. The replay attack distortion is modeled as a higher-order nonlinearity in the replay attack audio. Higher-order spectral analysis (HOSA) is used to capture characteristic distortions in the replay audio. The effectiveness of the proposed countermeasure is evaluated on original speech as well as the corresponding replayed recordings. The replay attack recordings are successfully injected into the Google Home device via Amazon Alexa using the drop-in conferencing feature.

AS · Feb 18, 2019
Securing Voice-driven Interfaces against Fake (Cloned) Audio Attacks

Hafiz Malik

Voice cloning technologies have found applications in a variety of areas ranging from personalized speech interfaces to advertisement, robotics, and so on. Existing voice cloning systems are capable of learning speaker characteristics and use trained models to synthesize a person's voice from only a few audio samples. Advances in cloned speech generation technologies make it possible to generate speech that is perceptually indistinguishable from bona fide speech. These advances pose new security and privacy threats to voice-driven interfaces and speech-based access control systems. State-of-the-art speech synthesis technologies use trained or tuned generative models for cloned speech generation. Trained generative models rely on linear operations, learned weights, and an excitation source for cloned speech synthesis. These systems leave characteristic artifacts in the synthesized speech. Higher-order spectral analysis is used to capture differentiating attributes between bona fide and cloned audio. Specifically, quadratic phase coupling (QPC) in the estimated bicoherence, Gaussianity test statistics, and linearity test statistics are used to capture generative model artifacts. Performance of the proposed method is evaluated on cloned audio generated using speaker adaptation- and speaker encoding-based approaches. Experimental results for a dataset consisting of 126 cloned speech and 8 bona fide speech samples indicate that the proposed method is capable of distinguishing bona fide from cloned audio with a close-to-perfect detection rate.
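
As a rough illustration of how bicoherence exposes phase coupling (QPC), the sketch below estimates the magnitude bicoherence at one bin pair from a set of frames. The synthetic three-tone frames, bin indices, and frame count are assumptions chosen to make the effect visible; this is not the feature extraction used in the paper.

```python
import cmath
import math
import random


def dft_bin(frame, k):
    """Single DFT coefficient X[k] of a real-valued frame (direct summation)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))


def bicoherence(frames, k1, k2):
    """Magnitude bicoherence at bin pair (k1, k2), averaged over frames."""
    triple = 0j
    power_pair = 0.0
    power_sum = 0.0
    for frame in frames:
        x1, x2, x3 = dft_bin(frame, k1), dft_bin(frame, k2), dft_bin(frame, k1 + k2)
        triple += x1 * x2 * x3.conjugate()
        power_pair += abs(x1 * x2) ** 2
        power_sum += abs(x3) ** 2
    return abs(triple) / math.sqrt(power_pair * power_sum)


def make_frames(coupled, count=64, length=64, k1=5, k2=9, seed=0):
    """Frames with tones at bins k1, k2, k1+k2; the third phase is coupled or random."""
    rng = random.Random(seed)
    frames = []
    for _ in range(count):
        p1 = rng.uniform(0.0, 2.0 * math.pi)
        p2 = rng.uniform(0.0, 2.0 * math.pi)
        p3 = p1 + p2 if coupled else rng.uniform(0.0, 2.0 * math.pi)
        frames.append([
            math.cos(2.0 * math.pi * k1 * t / length + p1)
            + math.cos(2.0 * math.pi * k2 * t / length + p2)
            + math.cos(2.0 * math.pi * (k1 + k2) * t / length + p3)
            for t in range(length)
        ])
    return frames


coupled_score = bicoherence(make_frames(coupled=True), 5, 9)
uncoupled_score = bicoherence(make_frames(coupled=False), 5, 9)
```

When the third tone's phase equals the sum of the other two, the triple products add coherently and the score approaches 1; with an independent phase they largely cancel, so the score stays small.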

CR · Feb 5, 2018
State-of-the-Art Survey on In-Vehicle Network Communication (CAN-Bus) Security and Vulnerabilities

Omid Avatefipour, Hafiz Malik

With the help of advanced technology, modern vehicles are no longer made up of mechanical devices alone; they also contain highly complex electronic devices and connections to the outside world. A modern vehicle has around 70 Electronic Control Units (ECUs), which communicate with each other over the standard communication protocol known as the Controller Area Network (CAN-Bus), which provides a communication rate of up to 1 Mbps. There are different types of in-vehicle network protocols and bus systems, namely Controller Area Network (CAN), Local Interconnect Network (LIN), Media Oriented Systems Transport (MOST), and FlexRay. Even though CAN-Bus is considered the de facto standard for in-vehicle network communication, it inherently lacks fundamental security features, such as message authentication, by design. This security limitation has paved the way for adversaries to penetrate the vehicle network and carry out malicious activities that can endanger both driver and passengers. In particular, vehicular networks are no longer closed systems; they are open to the outside world through external interfaces such as Bluetooth and GPS. This creates new opportunities for attackers to remotely take full control of the vehicle. The objective of this research is to survey the current limitations of the CAN-Bus protocol in terms of secure communication and the solutions that researchers in the automotive community have proposed to overcome these limitations at different layers.

CR · Jan 27, 2018
Linking Received Packet to the Transmitter Through Physical-Fingerprinting of Controller Area Network

Omid Avatefipour, Azeem Hafeez, Muhammad Tayyab et al.

The Controller Area Network (CAN) bus serves as a legacy protocol for in-vehicle data communication. Simplicity, robustness, and suitability for real-time systems are the salient features of the CAN bus protocol. However, it lacks basic security features such as message authentication, which makes it vulnerable to spoofing attacks. In a CAN network, linking a CAN packet to its sender node is a challenging task. This paper aims to address this issue by developing a framework to link each CAN packet to its source. To achieve this goal, the framework considers physical signal attributes of the received packet, which reflect both the channel and the node (or device) and contain unique artifacts. Material and design imperfections in the physical channel and digital device, which are the main contributing factors behind the device- and channel-specific unique artifacts, are leveraged to link the received electrical signal to the transmitter. Generally, the inimitable signal patterns of each ECU persist over time, which supports the stability of the proposed method. The uniqueness of the channel- and device-specific attributes is also investigated in the time and frequency domains. A feature vector made up of both time- and frequency-domain physical attributes is then employed to train a neural-network-based classifier. The performance of the proposed fingerprinting method is evaluated using a dataset collected from 16 different channels and four identical ECUs transmitting the same message. Experimental results indicate that the proposed method achieves correct detection rates of 95.2% and 98.3% for channel and ECU classification, respectively.

CR · Nov 26, 2014
Audio Splicing Detection and Localization Using Environmental Signature

Hong Zhao, Yifan Chen, Rui Wang et al.

Audio splicing is one of the most common manipulation techniques in the area of audio forensics. In this paper, the magnitudes of the acoustic channel impulse response and ambient noise are proposed as the environmental signature. Specifically, spliced audio segments are detected according to the magnitude correlation between the query frames and reference frames via a statistically optimal threshold. The detection accuracy is further refined by comparing adjacent frames. The effectiveness of the proposed method is tested on two data sets: one generated from the TIMIT database, and the other recorded in four acoustic environments using commercial-grade microphones. Experimental results show that the proposed method not only detects the presence of spliced frames, but also localizes the forged segments with near-perfect accuracy. Comparison results show that the identification accuracy of the proposed scheme is higher than that of previous schemes. In addition, experimental results show that the proposed scheme is robust to MP3 compression attacks, where it also outperforms previous work.
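
The correlation-and-threshold step described above can be sketched as follows. The fixed threshold of 0.8, the majority-vote refinement over neighbouring frames, and the four-band signatures are assumptions made for illustration; the paper derives its threshold statistically and does not specify the refinement rule in the abstract.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length signatures."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def flag_spliced(frame_signatures, reference, threshold=0.8):
    """Mark frames whose signature correlates weakly with the reference."""
    flags = [pearson(sig, reference) < threshold for sig in frame_signatures]
    # Refinement (assumed rule): a frame takes the majority label of
    # itself and its two neighbours, which removes isolated decisions.
    refined = list(flags)
    for i in range(1, len(flags) - 1):
        refined[i] = (flags[i - 1] + flags[i] + flags[i + 1]) >= 2
    return refined


# Hypothetical magnitude signatures for two rooms (four spectral bands each).
room_a = [1.0, 0.8, 0.5, 0.2]
room_b = [0.2, 0.9, 0.3, 1.0]
frames = [room_a] * 4 + [room_b] * 3 + [room_a] * 3
spliced = flag_spliced(frames, reference=room_a)
```

Here the three frames carrying the second room's signature are the ones flagged, so the run of `True` values also localizes the inserted segment.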