Joseph Szurley

SD
6 papers
109 citations
Novelty 50%
AI Score 25

6 Papers

AS Feb 17, 2022
Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu et al.

In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without accessing gradient information of a targeted ASR model directly. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which could model latent distributions against adaptive adversarial perturbation with divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems with the proposed adversarial detection system. Leveraging the proposed BNN based detection system, we improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets compared to the current model enhancement methods against the adversarial speech examples.
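The abstract mentions zeroth-order estimation as one way to craft adversarial noise without gradient access. The following is a minimal sketch of that general idea, not the authors' attack: it estimates a gradient from function evaluations alone, using symmetric finite differences along random Gaussian directions. The names `zeroth_order_gradient` and `toy_loss`, and the values of `mu` and `num_samples`, are assumptions chosen for illustration.

```python
import random


def zeroth_order_gradient(loss, x, mu=1e-3, num_samples=2000, seed=0):
    """Estimate the gradient of `loss` at `x` from function values only.

    Averages symmetric finite differences taken along random Gaussian
    directions; no gradient of `loss` is ever requested.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random probing direction u ~ N(0, I).
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional slope measured with two queries to the black box.
        slope = (loss(x_plus) - loss(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def toy_loss(x):
    # Hypothetical stand-in for a closed model's score; a real attack
    # would query the ASR system here instead.
    return sum(xi * xi for xi in x)


point = [1.0, -2.0, 0.5]
estimate = zeroth_order_gradient(toy_loss, point)
true_gradient = [2.0 * xi for xi in point]
print(estimate, true_gradient)
```

The estimate is noisy and only approaches the true gradient as the number of queries grows, which is one reason query budgets matter in this setting.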

CR Oct 31, 2019
Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

Juncheng B. Li, Shuhui Qu, Xinjian Li et al.

Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on wake-word detection to respond to people's commands, which could potentially be vulnerable to audio adversarial examples. In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present. We implemented an emulated wake-word detection system of Amazon Alexa based on recent publications. We validated our models against the real Alexa in terms of wake-word detection accuracy. Then we computed our audio adversaries with consideration of expectation over transform and we implemented our audio adversary with a differentiable synthesizer. Next, we verified our audio adversaries digitally on hundreds of samples of utterances collected from the real world. Our experiments show that we can effectively reduce the recognition F1 score of our emulated model from 93.4% to 11.0%. Finally, we tested our audio adversary over the air, and verified it works effectively against Alexa, reducing its F1 score from 92.5% to 11.0%. We also verified that non-adversarial music does not disable Alexa as effectively as our music at the same sound level. To the best of our knowledge, this is the first real-world adversarial attack against a commercial-grade VA wake-word detection system. Our code and demo videos can be accessed at https://www.junchengbillyli.com/AdversarialMusic

AS Jun 14, 2019
Perceptual Based Adversarial Audio Attacks

Joseph Szurley, J. Zico Kolter

Recent work has shown the possibility of adversarial attacks on automatic speech recognition (ASR) systems. However, in the vast majority of work in this area, the attacks have been executed only in the digital space, or have involved short phrases and static room settings. In this paper, we demonstrate a physically realizable audio adversarial attack. We base our approach specifically on a psychoacoustic-property-based loss function, and automated generation of room impulse responses, to create adversarial attacks that are robust when played over a speaker in multiple environments. We show that such attacks are possible even while being virtually imperceptible to listeners.
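To illustrate why room impulse responses matter for over-the-air playback, here is a minimal sketch of simulating a room by convolving a signal with an impulse response. It is not the authors' generator: `synthetic_rir` is a hypothetical toy response (a direct path followed by exponentially decaying echoes), and `convolve` is a plain direct-form convolution written for clarity rather than speed.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def synthetic_rir(length, decay=0.5):
    # Hypothetical toy impulse response: tap k has amplitude decay**k.
    # Real responses depend on room geometry and materials.
    return [decay ** k for k in range(length)]


# A single click played into the toy room comes back with trailing echoes.
click = [1.0, 0.0, 0.0]
received = convolve(click, synthetic_rir(3))
print(received)
```

An attack meant to survive playback would need to remain effective after this kind of filtering, across many different responses rather than a single one.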

SD Dec 27, 2017
A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging

Juncheng Li, Yun Wang, Joseph Szurley et al.

The lack of strong labels has severely limited the scaling of state-of-the-art fully supervised audio tagging systems to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, but they are inevitably resource hungry and require a long time to train. In this work, we propose a light-weight, multimodal framework for environmental audio tagging. The audio branch of the framework is a convolutional and recurrent neural network (CRNN) based on multiple instance learning (MIL). It is trained with the audio tracks of a large collection of weakly labeled YouTube video excerpts; the video branch uses pretrained state-of-the-art image recognition networks and word embeddings to extract information from the video track and to map visual objects to sound events. Experiments on the audio tagging task of the DCASE 2017 challenge show that the incorporation of video information improves a strong baseline audio tagging system by 5.3% absolute in terms of F1 score. The entire system can be trained within 6 hours on a single GPU, and can be easily carried over to other audio tasks such as speech sentiment analysis.

SD Dec 27, 2017
Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Shao-Yen Tseng, Juncheng Li, Yun Wang et al.

State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as input features. We show that by using audio embeddings the MIL framework can be implemented using a simple DNN with performance comparable to recurrent neural networks. We evaluate our approach by training an audio tagging system using a subset of AudioSet, which is a large collection of weakly labeled YouTube video excerpts. Combined with a late-fusion approach, we improve the F1 score of a baseline audio tagging system by 17%. We show that audio embeddings extracted by the convolutional neural networks significantly boost the performance of all MIL models. This framework reduces the model complexity of the AED system and is suitable for applications where computational resources are limited.
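As a rough illustration of multiple instance learning with weak labels, the sketch below treats a clip as a bag of per-segment class probabilities and derives one clip-level probability per class. Max pooling is assumed here as the aggregation rule because it is a common MIL choice; the abstract does not say which pooling the authors use, and `max_pool_bag` is a hypothetical helper name.

```python
def max_pool_bag(instance_probs):
    """Aggregate instance-level probabilities into bag-level ones.

    `instance_probs` is a list of segments, each a list with one
    probability per class. Under the MIL assumption, a clip is positive
    for a class if at least one of its segments is, so the bag score is
    taken as the maximum over segments.
    """
    num_classes = len(instance_probs[0])
    return [
        max(segment[c] for segment in instance_probs)
        for c in range(num_classes)
    ]


# Three segments, two classes; the values are made up for illustration.
segments = [
    [0.1, 0.8],
    [0.3, 0.2],
    [0.9, 0.1],
]
clip_probs = max_pool_bag(segments)
print(clip_probs)
```

Because only the clip-level output is compared with the weak label during training, no segment-level annotation is required.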

SD Dec 27, 2017
Eventness: Object Detection on Spectrograms for Temporal Localization of Audio Events

Phuong Pham, Juncheng Li, Joseph Szurley et al.

In this paper, we introduce the concept of Eventness for audio event detection, which can, in part, be thought of as an analogue to Objectness from computer vision. The key observation behind the eventness concept is that audio events reveal themselves as 2-dimensional time-frequency patterns with specific textures and geometric structures in spectrograms. These time-frequency patterns can then be viewed analogously to objects occurring in natural images (with the exception that scaling and rotation invariance properties do not apply). With this key observation in mind, we pose the problem of detecting monophonic or polyphonic audio events as an equivalent visual object(s) detection problem under partial occlusion and clutter in spectrograms. We adapt a state-of-the-art visual object detection model to evaluate the audio event detection task on publicly available datasets. The proposed network has comparable results with a state-of-the-art baseline and is more robust on minority events. Provided large-scale datasets, we hope that our proposed conceptual model of eventness will be beneficial to the audio signal processing community towards improving performance of audio event detection.