IR · Jan 16, 2020

Assigning credit to scientific datasets using article citation networks

Tong Zeng, Longfeng Wu, Sarah Bratt, Daniel E. Acuna

arXiv:2001.05917v14.322 citations

Originality: Incremental advance

AI Analysis

The paper addresses the problem of under-credited datasets in science. The advance is incremental: it builds on existing citation analysis methods and extends them to cover datasets as well as publications.

The authors tackled the problem of datasets not receiving appropriate credit in scientific citation networks by developing DataRank, a network flow measure that assigns relative value to nodes based on citation flow, differentiating between publications and datasets. They showed that DataRank better predicts real dataset usage, such as web visits to GenBank and downloads from Figshare, compared to alternatives.

A citation is a well-established mechanism for connecting scientific artifacts. Citation networks are used by citation analysis for a variety of reasons, prominently to give credit to scientists' work. However, because of current citation practices, scientists tend to cite only publications, leaving out other types of artifacts such as datasets. Datasets then do not get appropriate credit even though they are increasingly reused and experimented with. We develop a network flow measure, called DataRank, aimed at solving this gap. DataRank assigns a relative value to each node in the network based on how citations flow through the graph, differentiating publication and dataset flow rates. We evaluate the quality of DataRank by estimating its accuracy at predicting the usage of real datasets: web visits to GenBank and downloads of Figshare datasets. We show that DataRank is better at predicting this usage compared to alternatives while offering additional interpretable outcomes. We discuss improvements to citation behavior and algorithms to properly track and assign credit to datasets.
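To make the idea of credit flowing through a citation graph at type-dependent rates more concrete, here is a minimal sketch. It is not the authors' DataRank definition, which the abstract does not spell out. The function name `flow_credit`, the `pass_rate` parameter (the fraction of credit a node forwards to what it cites), the uniform starting credit, and the toy node names are all assumptions made for illustration.

```python
def flow_credit(citations, node_type, pass_rate, n_iter=50):
    """Toy credit flow over a citation graph (illustrative, not DataRank itself).

    citations: dict mapping each node to the list of nodes it cites
    node_type: dict mapping each node to "publication" or "dataset"
    pass_rate: dict mapping each node type to the fraction of credit
               that a node of that type forwards to the nodes it cites
    """
    nodes = list(node_type)
    # Assumption: every node starts with the same base credit.
    base = {n: 1.0 / len(nodes) for n in nodes}
    credit = dict(base)
    for _ in range(n_iter):
        # Each round: base credit plus whatever flows in from citing nodes.
        nxt = dict(base)
        for src, targets in citations.items():
            if not targets:
                continue
            # The citing node forwards part of its credit, split evenly.
            share = pass_rate[node_type[src]] * credit[src] / len(targets)
            for tgt in targets:
                nxt[tgt] += share
        credit = nxt
    return credit


# Hypothetical graph: paper p2 cites paper p1 and dataset d1; p1 cites d1.
citations = {"p2": ["p1", "d1"], "p1": ["d1"], "d1": []}
node_type = {"p1": "publication", "p2": "publication", "d1": "dataset"}
pass_rate = {"publication": 0.5, "dataset": 0.5}

credit = flow_credit(citations, node_type, pass_rate)
ranking = sorted(credit, key=credit.get, reverse=True)
```

In this toy graph the dataset ends up ranked above both papers, because it collects credit from each of them directly and again through the chain from `p2` to `p1`. Setting different `pass_rate` values for publications and datasets changes how much credit each type retains versus forwards, which is the kind of distinction the abstract describes.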
