ASApr 2, 2021
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature RepresentationSiyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez et al.
This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method in the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus show that our approach outperforms state-of-the-art AUD in both normalized mutual information (NMI) and F-score. The multilingual ASR improved upon the monolingual ASR in providing OOD phone labels and in estimating the phone boundaries. A comparison of our systems with and without knowing the ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that the current approach can significantly benefit from improved phone boundary estimation.
ASJan 22, 2021
Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition SystemsSonal Joshi, Jesús Villalba, Piotr Żelasko et al.
Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose defenses that can protect the systems against these attacks. Addressing these concerns, this paper at first investigates how state-of-the-art x-vector based SR systems are affected by white-box adversarial attacks, i.e., when the adversary has full knowledge of the system. x-Vector based SR systems are evaluated against white-box adversarial attacks common in the literature like fast gradient sign method (FGSM), basic iterative method (BIM)--a.k.a. iterative-FGSM--, projected gradient descent (PGD), and Carlini-Wagner (CW) attack. To mitigate against these attacks, the paper proposes four pre-processing defenses. It evaluates them against powerful adaptive white-box adversarial attacks, i.e., when the adversary has full knowledge of the system, including the defense. The four pre-processing defenses--viz. randomized smoothing, DefenseGAN, variational autoencoder (VAE), and Parallel WaveGAN vocoder (PWG) are compared against the baseline defense of adversarial training. Conclusions indicate that SR systems were extremely vulnerable under BIM, PGD, and CW attacks. Among the proposed pre-processing defenses, PWG combined with randomized smoothing offers the most protection against the attacks, with accuracy averaging 93% compared to 52% in the undefended system and an absolute improvement >90% for BIM attacks with $L_\infty>0.001$ and CW attack.
ASOct 22, 2020
How Phonotactics Affect Multilingual and Zero-shot ASR PerformanceSiyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez et al.
The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.
ASMay 16, 2020
That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across LanguagesPiotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson et al.
Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are getting improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages - an encouraging result for the low-resource speech community.