CLMay 13, 2022
PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology DomainThiago Santos, Amara Tariq, Susmita Das et al.
Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role to advance 'big data' cancer research like similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is a growing interest in developing language models for more specific clinical domains, no pathology-specific language space exist to support the rapid data-mining development in pathology space. In literature, a few approaches fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT - a pre-trained masked language model which was trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training of transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnose Classification when compared to nonspecific language models.
CVNov 2, 2023
Hierarchical Classification System for Breast Cancer Specimen Report (HCSBC) -- an end-to-end model for characterizing severity and diagnosisThiago Santos, Harish Kamath, Christopher R. McAdams et al.
Automated classification of cancer pathology reports can extract information from unstructured reports and categorize each report into structured diagnosis and severity categories. Thus, such system can reduce the burden for populating tumor registries, help registration for clinical trial as well as developing large dataset for deep learning model development using true pathologic ground truth. However, the content of breast pathology reports can be difficult for categorize due to the high linguistic variability in content and wide variety of potential diagnoses >50. Existing NLP models are primarily focused on developing classifier for primary breast cancer types (e.g. IDC, DCIS, ILC) and tumor characteristics, and ignore the rare diagnosis of cancer subtypes. We then developed a hierarchical hybrid transformer-based pipeline (59 labels) - Hierarchical Classification System for Breast Cancer Specimen Report (HCSBC), which utilizes the potential of the transformer context-preserving NLP technique and compared our model to several state of the art ML and DL models. We trained the model on the EUH data and evaluated our model's performance on two external datasets - MGH and Mayo Clinic. We publicly release the code and a live application under Huggingface spaces repository
IVFeb 8, 2022
The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.5M Screening and Diagnostic MammogramsJiwoong J. Jeong, Brianna L. Vey, Ananth Reddy et al.
Developing and validating artificial intelligence models in medical imaging requires datasets that are large, granular, and diverse. To date, the majority of publicly available breast imaging datasets lack in one or more of these areas. Models trained on these data may therefore underperform on patient populations or pathologies that have not previously been encountered. The EMory BrEast imaging Dataset (EMBED) addresses these gaps by providing 3650,000 2D and DBT screening and diagnostic mammograms for 116,000 women divided equally between White and African American patients. The dataset also contains 40,000 annotated lesions linked to structured imaging descriptors and 61 ground truth pathologic outcomes grouped into six severity classes. Our goal is to share this dataset with research partners to aid in development and validation of breast AI models that will serve all patients fairly and help decrease bias in medical AI.
NEOct 29, 2018
A Convergence indicator for Multi-Objective Optimisation AlgorithmsThiago Santos, Sebastiao Xavier
The algorithms of multi-objective optimisation had a relative growth in the last years. Thereby, it's requires some way of comparing the results of these. In this sense, performance measures play a key role. In general, it's considered some properties of these algorithms such as capacity, convergence, diversity or convergence-diversity. There are some known measures such as generational distance (GD), inverted generational distance (IGD), hypervolume (HV), Spread($Δ$), Averaged Hausdorff distance ($Δ_p$), R2-indicator, among others. In this paper, we focuses on proposing a new indicator to measure convergence based on the traditional formula for Shannon entropy. The main features about this measure are: 1) It does not require tho know the true Pareto set and 2) Medium computational cost when compared with Hypervolume.