Generalizing over Long Tail Concepts for Medical Term Normalization
This work addresses the challenge of generalizing to rare or unseen medical concepts in healthcare NLP applications. It represents a strong, specific gain rather than a foundational advance.
The paper tackled the problem of medical term normalization with limited annotated data and a long tail of concepts by introducing a learning strategy that leverages hierarchical ontology structure. The result was state-of-the-art performance on seen concepts and consistent improvements on unseen ones, enabling efficient zero-shot transfer across text types and datasets.
Medical term normalization consists of mapping a piece of text to one of a large number of output classes. Given the small size of the annotated datasets and the extremely long-tailed distribution of the concepts, it is of utmost importance to develop models that are capable of generalizing to scarce or unseen concepts. An important attribute of most target ontologies is their hierarchical structure. In this paper we introduce a simple and effective learning strategy that leverages such information to enhance the generalizability of both discriminative and generative models. The evaluation shows that the proposed strategy produces state-of-the-art performance on seen concepts and consistent improvements on unseen ones, while also allowing for efficient zero-shot knowledge transfer across text typologies and datasets.
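The abstract does not spell out the learning strategy, so the following is only a minimal sketch of why hierarchical structure can help with unseen concepts, not the paper's method. It assumes a hypothetical toy ontology stored as a child-to-parent map. The concept names and the helper functions `ancestors` and `shared_ancestors` are invented for illustration. The point is that a concept absent from the training data can still share ancestors with concepts that were seen, which gives a model a signal to generalize from.

```python
from typing import Dict, List, Optional

# Hypothetical toy ontology: child -> parent (None marks the root).
# These concepts are illustrative and not taken from any real ontology.
PARENT: Dict[str, Optional[str]] = {
    "disorder": None,
    "pain": "disorder",
    "headache": "pain",
    "migraine": "headache",
    "tension headache": "headache",
}


def ancestors(concept: str) -> List[str]:
    """Return the ancestors of a concept, from its parent up to the root."""
    path: List[str] = []
    node = PARENT[concept]
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def shared_ancestors(a: str, b: str) -> List[str]:
    """Return the ancestors of `a` that are also ancestors of `b`, nearest first."""
    b_ancestors = set(ancestors(b))
    return [c for c in ancestors(a) if c in b_ancestors]


if __name__ == "__main__":
    # Suppose "migraine" appears in training data and "tension headache" does not.
    print(ancestors("migraine"))
    print(shared_ancestors("tension headache", "migraine"))
```

In this toy setting, a model trained with supervision on the ancestors of "migraine" has already seen every ancestor of "tension headache", even though the leaf concept itself is unseen.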