CL · May 13, 2022

PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain

arXiv:2205.06885v1 · 30 citations · h-index: 23
Originality: Synthesis-oriented
AI Analysis

This paper addresses inadequate text mining of pathology data. The contribution is incremental, since it adapts existing methods to a new domain.

The authors address the lack of a pathology-specific language model for text mining by pre-training a transformer on 347,173 histopathology reports. The resulting model outperforms general models on NLU and breast cancer diagnosis classification.

Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research such as similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language model exists to support rapid data-mining development in the pathology domain. In the literature, a few approaches fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to nonspecific language models.
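The abstract's point about keeping the original tokenizer can be illustrated with a minimal sketch of greedy WordPiece-style splitting, the subword scheme used by BERT. Both vocabularies below are tiny, hypothetical sets invented for illustration; they are not the actual BERT or PathologyBERT vocabularies, and the function names are assumptions rather than any library's API.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Whitespace-split the text, then apply WordPiece to each word."""
    return [p for w in text.lower().split() for p in wordpiece(w, vocab)]


# Hypothetical toy vocabularies (assumptions for illustration only).
GENERAL_VOCAB = {"car", "##cin", "##oma", "duct", "##al", "in", "##vas", "##ive"}
PATHOLOGY_VOCAB = GENERAL_VOCAB | {"carcinoma", "ductal", "invasive"}

report = "invasive ductal carcinoma"
general = tokenize(report, GENERAL_VOCAB)
domain = tokenize(report, PATHOLOGY_VOCAB)
print(general)
print(domain)
```

With the toy general vocabulary, each clinical term is broken into several fragments, while the toy domain vocabulary keeps each term as a single token. This is one plausible reading of why a reused general-purpose tokenizer may struggle with specialized terminology; it is a sketch of the effect, not a reproduction of the paper's experiments.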

