LG, Aug 21, 2025

A Robust BERT-Based Deep Learning Model for Automated Cancer Type Extraction from Unstructured Pathology Reports

Minh Tran, Jeffery C. Chan, Min Li Huang, Maya Kansara, John P. Grady, Christine E. Napier, Subotheni Thavaneswaran, Mandy L. Ballinger, David M. Thomas, Frank P. Lin

arXiv:2508.15149v14.11 citations, h-index: 14

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for automated clinical information extraction to support precision oncology research, though it is incremental as it applies an existing fine-tuning method to a specific domain.

The authors tackled the problem of extracting cancer types from unstructured pathology reports by developing a fine-tuned RoBERTa model, which achieved an F1_Bertscore of 0.98 and an exact match of 80.61%, outperforming baseline and Mistral 7B models.

The accurate extraction of clinical information from electronic medical records is critical to clinical research but requires substantial trained expertise and manual labor. In this study, we developed a robust system for automated extraction of specific cancer types from pathology reports using a fine-tuned RoBERTa model, with the aim of supporting precision oncology research. This model significantly outperformed the baseline model and a large language model, Mistral 7B, achieving an F1_Bertscore of 0.98 and an overall exact match of 80.61%. This fine-tuning approach demonstrates potential for scalability and for seamless integration into the molecular tumour board process. Fine-tuning domain-specific models for precision tasks in oncology may pave the way for more efficient and accurate clinical information extraction.
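The two reported metrics measure different things: exact match counts a prediction only when it equals the reference label string, whereas BERTScore compares contextual embeddings and so rewards near-synonymous output. That is one plausible reason the F1_Bertscore (0.98) is much higher than the exact match rate (80.61%). The sketch below shows how an exact match rate could be computed. It is an illustration, not the authors' evaluation code: the normalization step (lowercasing and whitespace collapsing), the function names, and the example labels are all assumptions.

```python
import re


def normalize(label: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", label.strip().lower())


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical labels for illustration only; not data from the paper.
preds = [
    "Invasive ductal carcinoma",
    "lung  adenocarcinoma",
    "Osteosarcoma",
    "melanoma",
]
refs = [
    "invasive ductal carcinoma",
    "Lung adenocarcinoma",
    "Ewing sarcoma",
    "Melanoma",
]

score = exact_match_rate(preds, refs)
print(f"Exact match: {score:.2%}")
```

Under this definition, a prediction such as "Osteosarcoma" against the reference "Ewing sarcoma" scores zero, even though an embedding-based metric would still assign it partial similarity.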
