CL · Jun 2

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

arXiv:2606.03250 · 6.53 citations
AI Analysis

This work provides domain-specific language models and adaptation strategies for German medical NLP, addressing the gap in resources for this language and domain.

The authors present ChristBERT, a family of German RoBERTa-based models trained on a 13.5GB medical corpus, and show they outperform existing models on four of five clinical NLP benchmarks, establishing a new state of the art for German clinical language modeling.

Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.
