Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen
This work addresses a gap in ASR evaluation for the German medical domain, providing a benchmark that can help reduce documentation workload for healthcare professionals, though it is incremental as it applies existing methods to new data.
The authors tackled the lack of German medical ASR benchmarks by evaluating 29 models on a curated dataset of simulated doctor-patient conversations, finding significant performance differences with the best systems achieving Word Error Rates below 3% while others struggled with medical terminology and dialects.
Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The test field encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families as well as commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER) of partly below 3%, the error rates of other models, especially concerning medical terminology or dialect-influenced variations, are considerably higher.