CL | Oct 31, 2023

Integrating curation into scientific publishing to train AI models

Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Cassie S. Mitchell, Thomas Lemberger

arXiv:2310.20440v2 | h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the need for high-quality annotated biomedical data to train AI models. The advance is incremental because it builds on existing curation and NLP methods.

The authors tackled the problem of extracting structured data from academic articles for AI training by embedding multimodal curation into the publishing process, resulting in a dataset with over 620,000 annotated biomedical entities from 18,689 figures in 3,223 articles.

High-throughput extraction and structured labeling of data from academic articles is critical to enable downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotation included eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experiment designs and methodologies. The resultant dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task assessing whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of our dataset in performing a multimodal task for segmenting figures into panel images and their corresponding captions.
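
To make the annotation scheme in the abstract concrete, the following is a minimal sketch of how a caption with entity and role annotations might be turned into token-level tags for named-entity recognition and for the role task. Everything in it is an assumption for illustration: the label strings, the `Entity` class, the `to_iob` and `role_tags` helpers, and the example caption are hypothetical and are not taken from the SourceData-NLP dataset or its tooling.

```python
from dataclasses import dataclass
from typing import List, Optional

# The eight bioentity classes listed in the abstract.
# The label strings themselves are hypothetical, chosen for this sketch.
ENTITY_CLASSES = {
    "SMALL_MOLECULE", "GENE_PRODUCT", "SUBCELLULAR", "CELL_LINE",
    "CELL_TYPE", "TISSUE", "ORGANISM", "DISEASE",
}

# Hypothetical role labels: controlled intervention target vs. measurement object.
ROLES = {"CONTROLLED", "MEASURED"}


@dataclass
class Entity:
    start: int                  # index of the first token (inclusive)
    end: int                    # index after the last token (exclusive)
    label: str                  # one of ENTITY_CLASSES
    role: Optional[str] = None  # one of ROLES, or None if no role is annotated


def _check(ent: Entity, n_tokens: int) -> None:
    """Validate one annotation against the token sequence length."""
    if ent.label not in ENTITY_CLASSES:
        raise ValueError(f"unknown entity class: {ent.label}")
    if ent.role is not None and ent.role not in ROLES:
        raise ValueError(f"unknown role: {ent.role}")
    if not 0 <= ent.start < ent.end <= n_tokens:
        raise ValueError(f"span out of range: {ent.start}-{ent.end}")


def to_iob(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Return IOB entity tags (B-/I-/O), one per token."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        _check(ent, len(tokens))
        tags[ent.start] = f"B-{ent.label}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{ent.label}"
    return tags


def role_tags(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Return IOB role tags; entities without a role stay 'O'."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        _check(ent, len(tokens))
        if ent.role is None:
            continue
        tags[ent.start] = f"B-{ent.role}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{ent.role}"
    return tags


# Invented example caption, pre-tokenized on whitespace.
tokens = "Western blot of p53 in HeLa cells treated with nutlin-3".split()
entities = [
    Entity(3, 4, "GENE_PRODUCT", role="MEASURED"),
    Entity(5, 6, "CELL_LINE"),
    Entity(9, 10, "SMALL_MOLECULE", role="CONTROLLED"),
]
entity_tags = to_iob(tokens, entities)
roles = role_tags(tokens, entities)
```

Keeping entity class and role as two separate tag sequences mirrors the distinction the abstract draws between recognizing an entity and judging its context-dependent role in the experiment. A real pipeline would likely also need subword tokenization and panel boundaries, which this sketch leaves out.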
