ArGoT: A Glossary of Terms extracted from the arXiv
This work addresses the challenge of aligning natural mathematical text with formally verified libraries for researchers in mathematics and interactive theorem proving, though it is incremental as it builds on existing extraction and embedding methods.
The authors tackled the problem of extracting and organizing mathematical terms from arXiv articles, resulting in the ArGoT dataset which includes a comprehensive vocabulary and dependency graph of terms, and demonstrated that hyperbolic and standard word embeddings capture entailment relations among these concepts.
We introduce ArGoT, a data set of mathematical terms extracted from the articles hosted on the arXiv website. A term is any mathematical concept defined in an article. Using labels in the article's source code and examples from other popular math websites, we mine all the terms in the arXiv data and compile a comprehensive vocabulary of mathematical terms. Each term can be then organized in a dependency graph by using the term's definitions and the arXiv's metadata. Using both hyperbolic and standard word embeddings, we demonstrate how this structure is reflected in the text's vector representation and how they capture relations of entailment in mathematical concepts. This data set is part of an ongoing effort to align natural mathematical text with existing Interactive Theorem Prover Libraries (ITPs) of formally verified statements.