ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP
This work provides essential open resources for researchers working on Spanish clinical NLP, enabling further advances in healthcare applications.
The authors address the lack of publicly available resources for Spanish clinical NLP by creating ClinText-SP, the largest open clinical corpus in Spanish, and RigoBERTa Clinical, a domain-adapted encoder language model that significantly outperforms existing models on multiple clinical NLP benchmarks.
We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available Spanish clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, yielding a rich dataset of material that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this corpus, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to give the research community robust resources that can drive further advances in clinical NLP and ultimately contribute to improved healthcare applications.