Ludwig Kürzinger

AS
8papers
123citations
Novelty44%
AI Score25

8 Papers

SDDec 17, 2021Code
JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki et al.

In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-size speech corpora, open-sourced such corpora for languages other than English have not yet been established. In this paper, we describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.

CRFeb 3, 2022
The Wiretap Channel for Capacitive PUF-Based Security Enclosures

Kathrin Garb, Marvin Xhemrishi, Ludwig Kürzinger et al.

In order to protect devices from physical manipulations, protective security enclosures were developed. However, these battery-backed solutions come with a reduced lifetime, and have to be actively and continuously monitored. In order to overcome these drawbacks, batteryless capacitive enclosures based on Physical Unclonable Functions (PUFs) have been developed that generate a key-encryption-key (KEK) for decryption of the key chain. In order to reproduce the PUF-key reliably and to compensate the effect of noise and environmental influences, the key generation includes error correction codes. However, drilling attacks that aim at partially destroying the enclosure also alter the PUF-response and are subjected to the same error correction procedures. Correcting attack effects, however, is highly undesirable as it would destroy the security concept of the enclosure. In general, designing error correction codes such that they provide tamper-sensitivity to attacks, while still correcting noise and environmental effects is a challenging task. We tackle this problem by first analyzing the behavior of the PUF-response under external influences and different post-processing parameters. From this, we derive a system model of the PUF-based enclosure, and construct a wiretap channel implementation from q-ary polar codes. We verify the obtained error correction scheme in a Monte Carlo simulation and demonstrate that our wiretap channel implementation achieves a physical layer security of 100 bits for 306 bits of entropy for the PUF-secret. Through this, we further develop capacitive PUF-based security enclosures and bring them one step closer to their commercial deployment.

ITDec 4, 2021
Analysis of Communication Channels Related to Physical Unclonable Functions

Georg Maringer, Marvin Xhemrishi, Sven Puchinger et al.

Cryptographic algorithms rely on the secrecy of their corresponding keys. On embedded systems with standard CMOS chips, where secure permanent memory such as flash is not available as a key storage, the secret key can be derived from Physical Unclonable Functions (PUFs) that make use of minuscule manufacturing variations of, for instance, SRAM cells. Since PUFs are affected by environmental changes, the reliable reproduction of the PUF key requires error correction. For silicon PUFs with binary output, errors occur in the form of bitflips within the PUFs response. Modelling the channel as a Binary Symmetric Channel (BSC) with fixed crossover probability $p$ is only a first-order approximation of the real behavior of the PUF response. We propose a more realistic channel model, refered to as the Varying Binary Symmetric Channel (VBSC), which takes into account that the reliability of different PUF response bits may not be equal. We investigate its channel capacity for various scenarios which differ in the channel state information (CSI) present at encoder and decoder. We compare the capacity results for the VBSC for the different CSI cases with reference to the distribution of the bitflip probability according a work by Maes et al.

ASOct 15, 2020
Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz et al.

Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate the human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. For this, we propose Lightweight Sinc-Convolutions (LSC) that integrate Sinc-convolutions with depthwise convolutions as a low-parameter machine-learnable feature extraction for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in time-domain. We also discuss filter-level improvements, such as using log-compression as activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9%, but only has 21% of its model size.

ASJul 25, 2020
MP3 Compression To Diminish Adversarial Noise in End-to-End Speech Recognition

Iustina Andronic, Ludwig Kürzinger, Edgar Ricardo Chavez Rosas et al.

Audio Adversarial Examples (AAE) represent specially created inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. Our method is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on uncompressed, as well as MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, giving further evidence that MP3 encoding is effective in diminishing only AN.

ASJul 21, 2020
Audio Adversarial Examples for Robust Hybrid CTC/Attention Speech Recognition

Ludwig Kürzinger, Edgar Ricardo Chavez Rosas, Lujun Li et al.

Recent advances in Automatic Speech Recognition (ASR) demonstrated how end-to-end systems are able to achieve state-of-the-art performance. There is a trend towards deeper neural networks, however those ASR models are also more complex and prone against specially crafted noisy data. Those Audio Adversarial Examples (AAE) were previously demonstrated on ASR systems that use Connectionist Temporal Classification (CTC), as well as attention-based encoder-decoder architectures. Following the idea of the hybrid CTC/attention ASR system, this work proposes algorithms to generate AAEs to combine both approaches into a joint CTC-attention gradient method. Evaluation is performed using a hybrid CTC/attention end-to-end ASR model on two reference sentences as case study, as well as the TEDlium v2 speech recognition task. We then demonstrate the application of this algorithm for adversarial training to obtain a more robust ASR model.

ASJun 15, 2020
Regularized Forward-Backward Decoder for Attention Models

Tobias Watzel, Ludwig Kürzinger, Lujun Li et al.

Nowadays, attention models are one of the popular candidates for speech recognition. So far, many studies mainly focus on the encoder structure or the attention module to enhance the performance of these models. However, mostly ignore the decoder. In this paper, we propose a novel regularization technique incorporating a second decoder during the training phase. This decoder is optimized on time-reversed target labels beforehand and supports the standard decoder during training by adding knowledge from future context. Since it is only added during training, we are not changing the basic structure of the network or adding complexity during decoding. We evaluate our approach on the smaller TEDLIUMv2 and the larger LibriSpeech dataset, achieving consistent improvements on both of them.

ASNov 5, 2019
Small-Footprint Keyword Spotting on Raw Audio Data with Sinc-Convolutions

Simon Mittermaier, Ludwig Kürzinger, Bernd Waschneck et al.

Keyword Spotting (KWS) enables speech-based user interaction on smart devices. Always-on and battery-powered application scenarios for smart devices put constraints on hardware resources and power consumption, while also demanding high accuracy as well as real-time capability. Previous architectures first extracted acoustic features and then applied a neural network to classify keyword probabilities, optimizing towards memory footprint and execution time. Compared to previous publications, we took additional steps to reduce power and memory consumption without reducing classification accuracy. Power-consuming audio preprocessing and data transfer steps are eliminated by directly classifying from raw audio. For this, our end-to-end architecture extracts spectral features using parametrized Sinc-convolutions. Its memory footprint is further reduced by grouping depthwise separable convolutions. Our network achieves the competitive accuracy of 96.4% on Google's Speech Commands test set with only 62k parameters.