CLOct 31, 2018

Effective Feature Representation for Clinical Text Concept Extraction

Yifeng Tao, Bruno Godefroy, Guillaume Genthial, Christopher Potts

arXiv:1811.00070v232.01092 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of limited annotated data in healthcare NLP, offering a method to improve concept extraction for applications like clinical documentation and research.

The authors tackled the problem of extracting healthcare concepts from diverse clinical texts by developing an LSTM-CRF model that combines unsupervised word representations with hand-built features from ontologies, achieving superior performance on five datasets.

Crucial information about the practice of healthcare is recorded only in free-form text, which creates an enormous opportunity for high-impact NLP. However, annotated healthcare datasets tend to be small and expensive to obtain, which raises the question of how to make maximally efficient uses of the available data. To this end, we develop an LSTM-CRF model for combining unsupervised word representations and hand-built feature representations derived from publicly available healthcare ontologies. We show that this combined model yields superior performance on five datasets of diverse kinds of healthcare text (clinical, social, scientific, commercial). Each involves the labeling of complex, multi-word spans that pick out different healthcare concepts. We also introduce a new labeled dataset for identifying the treatment relations between drugs and diseases.

View on arXiv PDF

Similar