Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement
The shared task provides a new benchmark for citation modeling in scientific document understanding, but its contribution is incremental because it builds on existing datasets and tasks.
The SCIDOCA 2025 Shared Task addresses citation prediction and discovery in scientific documents. It introduces a large-scale dataset built from S2ORC with over 60,000 annotated paragraphs and reports performance metrics for the three submitted systems across its three subtasks.
We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding. The dataset and task materials are publicly available at https://github.com/daotuanan/scidoca2025-shared-task.
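To make the Citation Discovery setup concrete, the following is a minimal sketch of how one test item and a per-paragraph score might be represented. The abstract does not name the evaluation metrics or the released file schema, so everything here is an assumption for illustration: the `DiscoveryExample` layout, the field names, the reference IDs, and the choice of set-based precision, recall, and F1 are hypothetical and may differ from the official task materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryExample:
    """One Citation Discovery item (hypothetical layout, not the released schema)."""
    paragraph: str
    candidates: frozenset[str]  # ground-truth citations plus distractor candidates
    gold: frozenset[str]        # ground-truth citations only


def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for a single paragraph.

    Assumed metric: the abstract only says that performance metrics are reported.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy item with made-up reference IDs: two gold citations, two distractors.
example = DiscoveryExample(
    paragraph="(paragraph text with citation markers removed)",
    candidates=frozenset({"ref_a", "ref_b", "ref_c", "ref_d"}),
    gold=frozenset({"ref_a", "ref_b"}),
)

# A system that returns one correct reference and one distractor.
system_output = {"ref_a", "ref_c"}
precision, recall, f1 = prf1(system_output, set(example.gold))
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the system recovers one of the two gold references and adds one distractor, so precision, recall, and F1 all come out to 0.5. The two prediction subtasks select a single reference per masked slot or per cited sentence, so under the same assumptions they could be scored with plain accuracy instead.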