The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
This work provides domain-specific language models and adaptation strategies for German medical NLP, addressing the gap in resources for this language and domain.
The authors present ChristBERT, a family of German RoBERTa-based models trained on a 13.5GB medical corpus, and show they outperform existing models on four of five clinical NLP benchmarks, establishing a new state of the art for German clinical language modeling.
Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet existing German biomedical language models are limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on less specialized medical texts. All models are publicly released to support future research and applications in German medical NLP.
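As a rough illustration of why domain-specific vocabulary adaptation can matter, the sketch below counts how many subword pieces a German medical term splits into under a general vocabulary and under a medically extended one. It is a toy only: the greedy longest-match tokenizer stands in for a real subword tokenizer and is not the tokenizer used for ChristBERT, and the vocabularies, the example sentence, and the function names `greedy_tokenize` and `pieces_per_word` are all hypothetical.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, scanning left to right.

    Toy stand-in for a subword tokenizer (assumption: not the tokenizer
    used by the paper's models). Characters that match no vocabulary
    entry are emitted as single-character pieces.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces


def pieces_per_word(words, vocab):
    """Average number of subword pieces per word (lower means less fragmentation)."""
    total = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical vocabularies: the medical one adds a whole-word clinical term.
general_vocab = {"der", "patient", "hat", "einen", "herz", "in", "farkt"}
medical_vocab = general_vocab | {"herzinfarkt"}

# Hypothetical example sentence: "der patient hat einen herzinfarkt".
words = ["der", "patient", "hat", "einen", "herzinfarkt"]

print(greedy_tokenize("herzinfarkt", general_vocab))
print(greedy_tokenize("herzinfarkt", medical_vocab))
print(pieces_per_word(words, general_vocab))
print(pieces_per_word(words, medical_vocab))
```

Under the general vocabulary the clinical term is broken into several fragments, while the extended vocabulary keeps it whole. Whether that reduced fragmentation helps on a given downstream task is an empirical question, in line with the abstract's finding that the best strategy is task-dependent.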