On the Use of ArXiv as a Dataset
This work provides a standardized dataset that machine learning and AI researchers can use to benchmark models; the contribution is incremental, since it focuses on data extraction rather than novel methods.
The authors tackled the problem of standardizing access to arXiv's rich multi-modal data by providing a pipeline to extract and analyze a 6.7 million edge citation graph and an 11 billion word corpus, enabling benchmarking for next-generation models.
The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure (created by citation, affiliation, and co-authorship), make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, along with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.
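To make the idea of a citation graph over full text concrete, here is a minimal sketch of one plausible way to derive directed edges: scan each article's text for arXiv-style identifiers and keep those that point to other articles in the corpus. This is an illustration under stated assumptions, not the pipeline described above. The function name `build_citation_graph`, the identifier pattern, and the toy corpus with its made-up identifiers are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern for new-style identifiers (YYMM.NNNN or YYMM.NNNNN),
# with an optional version suffix such as "v2" that is discarded.
ARXIV_ID = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")


def build_citation_graph(fulltexts):
    """Map each article ID to the set of in-corpus article IDs found in its text."""
    graph = defaultdict(set)
    for source, text in fulltexts.items():
        for target in ARXIV_ID.findall(text):
            # Skip self-references and identifiers outside the corpus.
            if target != source and target in fulltexts:
                graph[source].add(target)
    return dict(graph)


# Toy corpus with made-up identifiers, for illustration only.
toy_corpus = {
    "2001.00001": "We build on arXiv:2001.00002 and arXiv:2001.00003v2.",
    "2001.00002": "Our optimizer follows arXiv:2001.00003.",
    "2001.00003": "This text mentions no other articles.",
}

graph = build_citation_graph(toy_corpus)
edge_count = sum(len(targets) for targets in graph.values())
```

Storing the graph as an adjacency mapping of sets keeps duplicate mentions from producing duplicate edges. A real corpus would also need handling for old-style identifiers and for citations that appear without an arXiv identifier, which this sketch ignores.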