Medical Coding with Biomedical Transformer Ensembles and Zero/Few-shot Learning
This work addresses the problem of efficient and accurate medical coding for healthcare data management. Its contribution is incremental, as it builds on existing transformer and few-shot learning methods.
The paper tackles the challenge of automating medical coding, that is, matching free-text reported terms to standardized medical codes. It introduces a novel approach called xTARS that combines BERT-based classification with zero/few-shot learning. xTARS outperforms strong baselines, especially in few-shot scenarios, and has been deployed at Bayer since November 2021.
Medical coding (MC) is an essential prerequisite for reliable data retrieval and reporting. Given a free-text reported term (RT) such as "pain of right thigh to the knee", the task is to identify the matching lowest-level term (LLT), in this case "unilateral leg pain", from a very large and continuously growing repository of standardized medical terms. However, automating this task is challenging due to the large number of LLT codes (over 80,000 at the time of writing), the limited availability of training data for long-tail and emerging classes, and the generally high accuracy demands of the medical domain. In this paper, we introduce the MC task, discuss its challenges, and present a novel approach called xTARS that combines traditional BERT-based classification with a recent zero/few-shot learning approach (TARS). We present extensive experiments showing that our combined approach outperforms strong baselines, especially in the few-shot regime. The approach was developed and deployed at Bayer and has been live since November 2021. Because we believe our approach is potentially promising beyond MC, and to ensure reproducibility, we release the code to the research community.
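The task framing can be illustrated with a minimal Python sketch. This is not the xTARS implementation: it only mirrors the idea of scoring (label, text) pairs, which is how TARS-style zero/few-shot models treat classification. The scorer here is a toy Jaccard token overlap standing in for a trained transformer. The function names (`overlap_score`, `rank_llts`) and the candidate labels "headache" and "knee swelling" are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(label: str, text: str) -> float:
    """Toy stand-in for a learned (label, text) pair scorer.

    ASSUMPTION: a real system would use a trained transformer here;
    Jaccard token overlap is used only to keep the sketch runnable.
    """
    a, b = tokens(label), tokens(text)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_llts(
    reported_term: str,
    llts: List[str],
    scorer: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate LLT against the reported term, best first."""
    scored = [(llt, scorer(llt, reported_term)) for llt in llts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical candidate set; a real repository holds over 80,000 LLTs.
candidate_llts = ["unilateral leg pain", "headache", "knee swelling"]
ranking = rank_llts("pain of right thigh to the knee", candidate_llts)
for llt, score in ranking:
    print(f"{score:.3f}  {llt}")
```

With this toy scorer, "knee swelling" (overlap 1/8) ranks above the correct "unilateral leg pain" (overlap 1/9). Surface overlap alone is therefore not enough for this example, which is the kind of gap a learned pair scorer is meant to close.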