Improving Predictions of Tail-end Labels using Concatenated BioMed-Transformers for Long Medical Documents
This work addresses long-tailed label distributions in medical text classification. Better predictions for rare but impactful medical codes can improve the understanding of patients and their care, although the contribution is incremental because it builds on existing transformer methods.
The paper tackles the problem of improving predictions for infrequent, tail-end labels in multi-label classification of long medical documents. It reports new state-of-the-art results on the MIMIC-III database, with higher F1 scores than standard transformers and lower training times than existing transformer-based solutions for long input sequences.
Multi-label learning predicts a subset of labels from a given label set for an unseen instance while considering label correlations. A known challenge in multi-label classification is the long-tailed distribution of labels. Many studies focus on improving the overall predictions of the model and thus do not prioritise tail-end labels. Improving tail-end label predictions in multi-label classification of medical text makes it possible to understand patients better and improve care, because the knowledge gained from one or more infrequent labels can affect the course of medical decisions and treatment plans. This research presents variations of concatenated domain-specific language models, including multi-BioMed-Transformers, to achieve two primary goals: first, to improve the F1 scores of infrequent labels across multi-label problems, especially those with long-tail labels; second, to handle long medical text and multi-sourced electronic health records (EHRs), a challenging task for standard transformers, which are designed to work on short input sequences. A key contribution of this research is a set of new state-of-the-art (SOTA) results obtained using TransformerXL for predicting medical codes. A variety of experiments are performed on the Medical Information Mart for Intensive Care (MIMIC-III) database. Results show that concatenated BioMed-Transformers outperform standard transformers in terms of overall micro and macro F1 scores and individual F1 scores of tail-end labels, while incurring lower training times than existing transformer-based solutions for long input sequences.
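To illustrate why micro F1, macro F1, and per-label F1 are reported together, the following minimal sketch computes all three for a multi-label problem. The toy label matrix, the three-label setup, and the function names `per_label_f1`, `macro_f1`, and `micro_f1` are illustrative assumptions and are not taken from the experiments on MIMIC-III. The sketch uses the standard definitions: micro F1 pools true positives, false positives, and false negatives over all labels, while macro F1 averages the per-label F1 scores with equal weight.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]  # rows = documents, columns = labels (0/1)


def _counts(y_true: Matrix, y_pred: Matrix, j: int):
    """Return (tp, fp, fn) for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_label_f1(y_true: Matrix, y_pred: Matrix) -> List[float]:
    """F1 score of each label, computed independently."""
    n_labels = len(y_true[0])
    return [_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)]


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of per-label F1: rare labels count as much as frequent ones."""
    scores = per_label_f1(y_true, y_pred)
    return sum(scores) / len(scores)


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 over counts pooled across labels: dominated by frequent labels."""
    n_labels = len(y_true[0])
    tp = fp = fn = 0
    for j in range(n_labels):
        c_tp, c_fp, c_fn = _counts(y_true, y_pred, j)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return _f1(tp, fp, fn)


# Hypothetical toy data: label 0 is frequent, label 2 is a tail-end label
# that occurs once and is never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]

label_scores = per_label_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
print(label_scores, macro, micro)
```

In this toy case the tail-end label has an F1 of 0, yet micro F1 remains high because the frequent label dominates the pooled counts. Macro F1 and the individual per-label scores expose the failure, which is why they are the more informative measures when the goal is to improve tail-end predictions.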