CL IR LG

Mar 16, 2020

Parallel sequence tagging for concept recognition

Lenz Furrer, Joseph Cornelius, Fabio Rinaldi

arXiv:2003.07424v20.511 citations

Has Code

Originality: Incremental advance

AI Analysis

This work addresses error propagation in biomedical text-mining pipelines. It is an incremental advance because it builds on existing sequence-labeling methods.

The authors tackled error propagation in biomedical concept recognition by proposing a parallel architecture for Named Entity Recognition and Normalization, which outperformed the baseline pipeline system on all 20 annotation sets of the CRAFT corpus.

Background: Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence.

Results: We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task 2019.

Conclusions: Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts).

Availability and Implementation: Source code freely available for download at https://github.com/OntoGene/craft-st. Supplementary data are available at arXiv online.
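The abstract does not spell out the harmonisation strategies, so the following is only a minimal sketch of what merging two parallel taggers could look like, not the authors' implementation. It assumes a span tagger that emits IOB tags, a concept tagger that emits one concept ID (or "O") per token, and a dictionary lookup for mention strings. The function names, the `prefer` switch, and the concept IDs are all hypothetical.

```python
from typing import Callable, List, Optional

OUTSIDE = "O"  # label both hypothetical taggers use for "no concept here"


def ids_from_spans(
    tokens: List[str],
    span_tags: List[str],
    lookup: Callable[[str], Optional[str]],
) -> List[str]:
    """Turn IOB span tags into per-token concept IDs via a dictionary lookup."""
    out = [OUTSIDE] * len(tokens)
    start = None
    # The extra OUTSIDE tag is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(span_tags + [OUTSIDE]):
        if start is not None and tag != "I":
            concept = lookup(" ".join(tokens[start:i]))
            if concept is not None:
                out[start:i] = [concept] * (i - start)
            start = None
        if tag == "B":
            start = i
    return out


def harmonise(
    tokens: List[str],
    span_tags: List[str],
    id_tags: List[str],
    lookup: Callable[[str], Optional[str]],
    prefer: str = "ids",
) -> List[str]:
    """Merge both predictions; `prefer` (assumed name) picks which tagger wins."""
    span_ids = ids_from_spans(tokens, span_tags, lookup)
    first, second = (id_tags, span_ids) if prefer == "ids" else (span_ids, id_tags)
    # Per token: take the preferred tagger's label, fall back to the other one.
    return [a if a != OUTSIDE else b for a, b in zip(first, second)]


# Toy input with made-up concept IDs, for illustration only.
tokens = ["the", "p53", "protein", "binds", "DNA"]
span_tags = ["O", "B", "I", "O", "B"]
id_tags = ["O", "X:9", "X:9", "O", "O"]
dictionary = {"p53 protein": "X:1", "DNA": "X:2"}

merged_ids_first = harmonise(tokens, span_tags, id_tags, dictionary.get, prefer="ids")
merged_spans_first = harmonise(tokens, span_tags, id_tags, dictionary.get, prefer="spans")
```

In this sketch, the per-annotation-set calibration mentioned in the conclusions would amount to choosing the value of `prefer` that scores best on a development set. The real system may use different or finer-grained strategies.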
