CL Oct 26, 2022
Monotonic segmental attention for automatic speech recognition
Albert Zeyer, Robin Schmitt, Wei Zhou et al.
We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental model generalizes much better to long sequences of up to several minutes.
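As a rough illustration of what restricting decoder attention to segments means, the sketch below normalises attention scores over one segment of encoder frames per output label instead of over all frames. It is a minimal sketch under stated assumptions, not the authors' formulation: the function names are hypothetical, the attention scores are taken as given, and the segment boundaries are supplied directly, whereas the abstract describes a length model that predicts them.

```python
import math


def attention_weights(scores, start=0, end=None):
    """Softmax over scores[start:end]; frames outside the segment get weight 0.

    With the defaults this is ordinary global attention over all frames.
    """
    end = len(scores) if end is None else end
    if not 0 <= start < end <= len(scores):
        raise ValueError("segment must be a non-empty range inside the frames")
    segment = scores[start:end]
    peak = max(segment)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in segment]
    total = sum(exps)
    weights = [0.0] * len(scores)
    for offset, e in enumerate(exps):
        weights[start + offset] = e / total
    return weights


def segmental_attention(scores_per_label, boundaries):
    """Attention rows for consecutive, non-overlapping (monotonic) segments.

    boundaries[i] is the exclusive end frame of label i's segment
    (assumed to be given here; hypothetical interface).
    """
    rows = []
    start = 0
    for scores, end in zip(scores_per_label, boundaries):
        rows.append(attention_weights(scores, start, end))
        start = end  # the next segment begins where this one ended
    return rows


# Two labels over six encoder frames, with uniform scores for readability.
rows = segmental_attention([[0.0] * 6, [0.0] * 6], boundaries=[2, 6])
```

When the segments partition the encoder frames, the total number of attention weights computed across all labels equals the number of frames, whereas global attention computes one weight per frame for every label.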
LG Mar 11, 2025
Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study
Rafael R. Oexner, Robin Schmitt, Hyunchan Ahn et al.
Predictive modelling is vital to guide preventive efforts. Whilst large-scale prospective cohort studies and a diverse toolkit of available machine learning (ML) algorithms have facilitated such survival task efforts, choosing the best-performing algorithm remains challenging. Benchmarking studies to date focus on relatively small-scale datasets, and it is unclear how well such findings translate to large datasets that combine omics and clinical features. We sought to benchmark eight distinct survival task implementations, ranging from linear to deep learning (DL) models, within the large-scale prospective cohort study UK Biobank (UKB). We compared discrimination and computational requirements across heterogeneous predictor matrices and endpoints. Finally, we assessed how well different architectures scale with sample sizes ranging from n = 5,000 to n = 250,000 individuals. Our results show that discriminative performance across a multitude of metrics depends on endpoint frequency and predictor matrix properties, with very robust performance of (penalised) Cox proportional hazards (Cox-PH) models. Of note, certain scenarios favour more complex frameworks, specifically when working with larger numbers of observations and relatively simple predictor matrices. The observed computational requirements differed vastly, and we provide solutions in cases where current implementations were impracticable. In conclusion, this work delineates how optimal model choice depends on a variety of factors, including sample size, endpoint frequency and predictor matrix properties, thus constituting an informative resource for researchers working on similar datasets. Furthermore, we showcase how linear models remain a highly effective and scalable platform for risk modelling at scale and suggest that they be reported alongside non-linear ML models.
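Discrimination in survival models is commonly summarised with a concordance index, which checks whether subjects with earlier events were assigned higher predicted risk. The abstract does not name the metrics used, so the following is only an illustrative sketch of Harrell's C-index; the function name is hypothetical, and pairs with tied event times are skipped as a simplification.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times:       observed follow-up time per subject
    events:      True/1 if the event was observed, False/0 if censored
    risk_scores: model output, where higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                # Subject i had the event before j's observed time.
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third subject is censored at time 3.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [4, 3, 2, 1])
reversed_ranking = concordance_index(times, events, [1, 2, 3, 4])
uninformative = concordance_index(times, events, [1, 1, 1, 1])
```

The pairwise loop is quadratic in the number of subjects, so this form is only practical for small samples; at the cohort sizes mentioned in the abstract, an optimised library implementation would be the more realistic choice.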
CL Nov 27, 2025
Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
Nick Rossenbach, Robin Schmitt, Tina Raissi et al.
The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Using these additional resources, we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable case study for a variety of common challenges in ASR.