CL | Nov 22, 2022

A Large-Scale Dataset for Biomedical Keyphrase Generation

arXiv:2211.12124v1291 citations | h-index: 24 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a data scarcity problem for biomedical NLP researchers, though the contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackled the lack of large-scale datasets for biomedical keyphrase generation by introducing kp-biomed, a dataset with over 5M documents from PubMed abstracts, and showed that using it significantly improves performance for present and absent keyphrase generation.

Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and they are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
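The abstract distinguishes present keyphrases (those that appear in the document text) from absent ones (those that do not). A minimal sketch of that split follows. It assumes a simple lowercase, alphanumeric tokenization with no stemming, which is a simplification and not necessarily the matching procedure used by the authors. The function names and the sample document are hypothetical, chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def split_present_absent(document, keyphrases):
    """Partition keyphrases into those found in the document and those not found."""
    doc_tokens = tokenize(document)
    present, absent = [], []
    for kp in keyphrases:
        if contains_sequence(doc_tokens, tokenize(kp)):
            present.append(kp)
        else:
            absent.append(kp)
    return present, absent


# Hypothetical sample document and reference keyphrases, for illustration only.
document = "Keyphrase generation models were trained on PubMed abstracts."
keyphrases = ["keyphrase generation", "PubMed", "biomedical NLP"]

present, absent = split_present_absent(document, keyphrases)
print("present:", present)
print("absent:", absent)
```

Matching on token sequences instead of raw substrings avoids counting a keyphrase as present when it only occurs inside a longer word.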

Code Implementations: 1 repo