ASApr 7, 2022
Detecting Vocal Fatigue with Neural EmbeddingsSebastian P. Bayerl, Dominik Wagner, Ilja Baumann et al.
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that neural embeddings capture information about the change in vocal characteristics of a speaker during prolonged voice usage. We show that vocal fatigue can be reliably predicted using all three kinds of neural embeddings after only 50 minutes of continuous speaking when temporal smoothing and normalization are applied to the extracted embeddings. We employ support vector machines for classification and achieve accuracy scores of 81% using x-vectors, 85% using ECAPA-TDNN embeddings, and 82% using wav2vec 2.0 embeddings as input features. We obtain an accuracy score of 76%, when the trained system is applied to a different speaker and recording environment without any adaptation.
ASJun 16, 2022
Nonwords Pronunciation Classification in Language Development Tests for Preschool ChildrenIlja Baumann, Dominik Wagner, Sebastian Bayerl et al.
This work aims to automatically evaluate whether the language development of children is age-appropriate. Validated speech and language tests are used for this purpose to test the auditory memory. In this work, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches that are motivated to model specific language structures: Low-level features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings (wav2vec 2.0), and phonetic embeddings in form of senones (ASR acoustic model). Each of the approaches provides input for VGG-like 5-layer CNN classifiers. We also examine the adaptation per nonword. The evaluation of the proposed systems was performed using recordings from different kindergartens of spoken nonwords. ECAPA-TDNN and low-level FFT features do not explicitly model phonetic information; wav2vec2.0 is trained on grapheme labels, our ASR acoustic model features contain (sub-)phonetic information. We found that the more granular the phonetic modeling is, the higher are the achieved recognition rates. The best system trained on ASR acoustic model features with VTLN achieved an accuracy of 89.4% and an area under the ROC (Receiver Operating Characteristic) curve (AUC) of 0.923. This corresponds to an improvement in accuracy of 20.2% and AUC of 0.309 relative compared to the FFT-baseline.
SDFeb 23, 2024
A Survey of Music Generation in the Context of InteractionIsmael Agchar, Ilja Baumann, Franziska Braun et al.
In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (eg. generating a Bach-style chorale) or style transfer (eg. classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straight-forward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, neither is clear, how such models and resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.
LGJan 13, 2025
Digital Operating Mode Classification of Real-World Amateur Radio TransmissionsMaximilian Bundscherer, Thomas H. Schmitt, Ilja Baumann et al.
This study presents an ML approach for classifying digital radio operating modes evaluated on real-world transmissions. We generated 98 different parameterized radio signals from 17 digital operating modes, transmitted each of them on the 70 cm (UHF) amateur radio band, and recorded our transmissions with two different architectures of SDR receivers. Three lightweight ML models were trained exclusively on spectrograms of limited non-transmitted signals with random characters as payloads. This training involved an online data augmentation pipeline to simulate various radio channel impairments. Our best model, EfficientNetB0, achieved an accuracy of 93.80% across the 17 operating modes and 85.47% across all 98 parameterized radio signals, evaluated on our real-world transmissions with Wikipedia articles as payloads. Furthermore, we analyzed the impact of varying signal durations & the number of FFT bins on classification, assessed the effectiveness of our simulated channel impairments, and tested our models across multiple simulated SNRs.
SDJun 16, 2024
Large Language Models for Dysfluency Detection in Stuttered SpeechDominik Wagner, Sebastian P. Bayerl, Ilja Baumann et al.
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypotheses candidates generated with an automatic speech recognition system and acoustic representations extracted from an audio encoder model to an LLM, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.
LGJun 16, 2024
Optimized Speculative Sampling for GPU Hardware AcceleratorsDominik Wagner, Seanie Lee, Ilja Baumann et al.
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
ASMay 30, 2023
A Stutter Seldom Comes Alone -- Cross-Corpus Stuttering Detection as a Multi-label ProblemSebastian P. Bayerl, Dominik Wagner, Ilja Baumann et al.
Most stuttering detection and classification research has viewed stuttering as a multi-class classification problem or a binary detection task for each dysfluency type; however, this does not match the nature of stuttering, in which one dysfluency seldom comes alone but rather co-occurs with others. This paper explores multi-language and cross-corpus end-to-end stuttering detection as a multi-label problem using a modified wav2vec 2.0 system with an attention-based classification head and multi-task learning. We evaluate the method using combinations of three datasets containing English and German stuttered speech, one containing speech modified by fluency shaping. The experimental results and an error analysis show that multi-label stuttering detection systems trained on cross-corpus and multi-language data achieve competitive results but performance on samples with multiple labels stays below over-all detection results.