CRJan 23, 2019

SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems

arXiv:1901.07846v2156 citations
Originality Incremental advance
AI Analysis

This addresses security risks for users of end-to-end acoustic systems like speech recognition, though it is incremental as it builds on existing adversarial attack methods.

The paper tackles the vulnerability of deep learning-based acoustic systems to adversarial attacks by introducing SirenAttack, which generates adversarial audios that deceive systems under various settings, achieving a 99.45% attack success rate on the IEMOCAP dataset against a ResNet18 model.

Despite their immense popularity, deep learning-based acoustic systems are inherently vulnerable to adversarial attacks, wherein maliciously crafted audios trigger target systems to misbehave. In this paper, we present SirenAttack, a new class of attacks to generate adversarial audios. Compared with existing attacks, SirenAttack highlights with a set of significant features: (i) versatile -- it is able to deceive a range of end-to-end acoustic systems under both white-box and black-box settings; (ii) effective -- it is able to generate adversarial audios that can be recognized as specific phrases by target acoustic systems; and (iii) stealthy -- it is able to generate adversarial audios indistinguishable from their benign counterparts to human perception. We empirically evaluate SirenAttack on a set of state-of-the-art deep learning-based acoustic systems (including speech command recognition, speaker recognition and sound event classification), with results showing the versatility, effectiveness, and stealthiness of SirenAttack. For instance, it achieves 99.45% attack success rate on the IEMOCAP dataset against the ResNet18 model, while the generated adversarial audios are also misinterpreted by multiple popular ASR platforms, including Google Cloud Speech, Microsoft Bing Voice, and IBM Speech-to-Text. We further evaluate three potential defense methods to mitigate such attacks, including adversarial training, audio downsampling, and moving average filtering, which leads to promising directions for further research.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes