CL, DL | Oct 13, 2025

A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

arXiv:2510.21762v1 | h-index: 1

Originality: Synthesis-oriented

AI Analysis

The dataset provides a resource for researchers in scientific literature mining, though the contribution is incremental because it builds on existing open-access corpora.

The authors created a dataset of 833,000 paragraphs from open-access scientific publications, classified into four categories (acknowledgments, data mentions, software/code mentions, and clinical trial mentions) and annotated with language and scientific domain, to support text classification and named entity recognition tasks.

We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with its language (identified using fastText) and its scientific domain (taken from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables the training of text classification models and the development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace (https://doi.org/10.57967/hf/6679) under a CC-BY license.
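As a rough illustration of the record structure the abstract describes (a paragraph with a category, a language, and a scientific domain), the following minimal Python sketch models one record and two simple operations on it. The field names, the category label strings, and the toy records are all assumptions made for illustration; they are not taken from the published dataset, whose actual schema should be checked on its HuggingFace page.

```python
from collections import Counter
from dataclasses import dataclass

# The four categories named in the abstract. The label strings are assumed,
# not the dataset's actual label values.
CATEGORIES = {"acknowledgment", "data", "software", "clinical_trial"}


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical record layout: one classified paragraph."""
    text: str
    category: str  # one of CATEGORIES
    language: str  # language code, as a language identifier would assign
    domain: str    # scientific domain label

    def __post_init__(self):
        # Reject labels outside the four categories.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def select(paragraphs, category, language=None):
    """Return paragraphs of one category, optionally restricted to a language."""
    return [
        p for p in paragraphs
        if p.category == category and (language is None or p.language == language)
    ]


def count_by(paragraphs, field):
    """Count paragraphs per value of the given field name."""
    return Counter(getattr(p, field) for p in paragraphs)


# Made-up toy records, for illustration only (not real dataset content).
SAMPLE = [
    Paragraph("We thank the reviewers for their comments.",
              "acknowledgment", "en", "Physical Sciences"),
    Paragraph("Analyses were run with a custom Python script.",
              "software", "en", "Life Sciences"),
    Paragraph("Le code source est disponible en ligne.",
              "software", "fr", "Social Sciences"),
    Paragraph("The trial was registered before enrolment.",
              "clinical_trial", "en", "Health Sciences"),
]

if __name__ == "__main__":
    print(len(select(SAMPLE, "software", language="fr")))
    print(count_by(SAMPLE, "language"))
```

Filtering by category and language in this way is the kind of step that would precede training a per-category or per-language text classifier on such records.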
