IRMar 3, 2023
Discovery and Recognition of Formula Concepts using Machine LearningPhilipp Scharpf, Moritz Schubotz, Howard S. Cohl et al.
Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as Plagiarism Detection or Literature Recommender Systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula notation to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply this generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). While FCD aims at the definition and exploration of a 'Formula Concept' that names bundled equivalent representations of a formula, FCR is designed to match a given formula to a prior assigned unique mathematical concept identifier. We present machine learning-based approaches to address the FCD and FCR tasks. We then evaluate these approaches on a standardized test collection (NTCIR arXiv dataset). Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering as well as document similarity assessments for plagiarism detection or recommender systems.
SENov 8, 2022
Caching and Reproducibility: Making Data Science experiments faster and FAIRerMoritz Schubotz, Ankit Satpute, Andre Greiner-Petter et al.
Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, suppose the ad-hoc research software fails during often long-running computationally expensive experiments. In that case, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as propriety dependence, speed, etc. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendation in a research software development will make the data related to that software FAIRer for both machines and humans. We exhibit the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
21.0NIApr 14
Large-Scale Measurement of NAT Traversal for the Decentralized Web: A Case Study of DCUtR in IPFSDennis Trautwein, Cornelius Ihle, Moritz Schubotz et al.
The promise of decentralized peer-to-peer (P2P) systems is fundamentally gated by the challenge of Network Address Translation (NAT) traversal, with existing solutions often reintroducing the very centralization they seek to avoid. This paper presents the first large-scale measurement study of a fully decentralized NAT traversal protocol, Direct Connection Upgrade through Relay (DCUtR), within the production libp2p-based InterPlanetary File System (IPFS) network. Drawing on over 4.4 million traversal attempts from 85,000+ distinct networks across 167 countries, we provide an empirical analysis of modern P2P connectivity. We establish a conditional success rate of $70\% \pm 7.1\%$ for the hole-punching stage, given that prerequisite relay reservation and public address discovery succeed, providing a crucial new benchmark for the field. Critically, we empirically challenge the long-held belief of UDP's superiority for NAT traversal, demonstrating that DCUtR's high-precision, RTT-based synchronization yields statistically indistinguishable success rates for both TCP and QUIC ($\sim70\%$). Our analysis further validates the protocol's design for permissionless environments by showing that success is independent of relay characteristics and that the mechanism is highly efficient, with $97.6\%$ of successful connections established on the first attempt. Building on this analysis, we propose a concrete roadmap of protocol enhancements aimed at achieving universal connectivity and contribute our complete dataset to foster further research in this domain.
50.3DLApr 1
LLM-supported document separation for printed reviews from zbMATH OpenIvan Pluzhnikov, Ankit Satpute, Moritz Schubotz et al.
This paper presents a specialized methodology for digitizing and segmenting mathematical documents from zbMATH Open, a comprehensive database of mathematical literature, to enhance machine processing capabilities. Currently, approximately 831,000 documents exist only in scanned volumes, which makes them not machine-processable. Furthermore, these scans often span multiple pages or share pages with other documents and incorporate diverse typesetting techniques, posing challenges for automated processing. To address these issues, we evaluate various Optical Character Recognition (OCR) tools and document separation techniques, proposing an optimized pipeline that outperforms existing approaches. Our study identifies Mathpix as the most effective OCR tool for LaTeX conversion, demonstrating superior performance based on BLEU and Edit Distance metrics. For document separation, we fine-tune generative Large Language Models (LLMs) and integrate them into a Majority Voting framework, achieving 97.5% accuracy when providing the text of the document. Additionally, our method identifies the start and end indexes for 90.6% of the test dataset, with an accuracy of 98.4% on applicable cases, resulting in an overall accuracy of 89.1% on the entire dataset. This approach surpasses traditional baselines, including regular expressions, ChatGPT-4o, and computer vision-based techniques. As a practical outcome, we process 810,977 mathematical documents into machine-readable text and extract precise document boundaries for 721,288 documents in LaTeX format. These contributions significantly improve accessibility for mathematical information retrieval systems, machine learning models, and related applications.
CLMar 30, 2024Code
Can LLMs Master Math? Investigating Large Language Models on Math Stack ExchangeAnkit Satpute, Noah Giessing, Andre Greiner-Petter et al.
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our Case analysis indicates that while the GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{https://github.com/gipplab/LLM-Investig-MathStackExchange}
IRJun 28, 2019Code
Introducing MathQA -- A Math-Aware Question Answering SystemMoritz Schubotz, Philipp Scharpf, Kaushal Dudhat et al.
We present an open source math-aware Question Answering System based on Ask Platypus. Our system returns as a single mathematical formula for a natural language question in English or Hindi. This formulae originate from the knowledge-base Wikidata. We translate these formulae to computable data by integrating the calculation engine sympy into our system. This way, users can enter numeric values for the variables occurring in the formula. Moreover, the system loads numeric values for constants occurring in the formula from Wikidata. In a user study, our system outperformed a commercial computational mathematical knowledge engine by 13%. However, the performance of our system heavily depends on the size and quality of the formula data available in Wikidata. Since only a few items in Wikidata contained formulae when we started the project, we facilitated the import process by suggesting formula edits to Wikidata editors. With the simple heuristic that the first formula is significant for the article, 80% of the suggestions were correct.
HCJul 12, 2017Code
VMEXT: A Visualization Tool for Mathematical Expression TreesMoritz Schubotz, Norman Meuschke, Thomas Hepp et al.
Mathematical expressions can be represented as a tree consisting of terminal symbols, such as identifiers or numbers (leaf nodes), and functions or operators (non-leaf nodes). Expression trees are an important mechanism for storing and processing mathematical expressions as well as the most frequently used visualization of the structure of mathematical expressions. Typically, researchers and practitioners manually visualize expression trees using general-purpose tools. This approach is laborious, redundant, and error-prone. Manual visualizations represent a user's notion of what the markup of an expression should be, but not necessarily what the actual markup is. This paper presents VMEXT - a free and open source tool to directly visualize expression trees from parallel MathML. VMEXT simultaneously visualizes the presentation elements and the semantic structure of mathematical expressions to enable users to quickly spot deficiencies in the Content MathML markup that does not affect the presentation of the expression. Identifying such discrepancies previously required reading the verbose and complex MathML markup. VMEXT also allows one to visualize similar and identical elements of two expressions. Visualizing expression similarity can support support developers in designing retrieval approaches and enable improved interaction concepts for users of mathematical information retrieval systems. We demonstrate VMEXT's visualizations in two web-based applications. The first application presents the visualizations alone. The second application shows a possible integration of the visualizations in systems for mathematical knowledge management and mathematical information retrieval. The application converts LaTeX input to parallel MathML, computes basic similarity measures for mathematical expressions, and visualizes the results using VMEXT.
42.6IRMay 5
Aspect-Aware Content-Based Recommendations for Mathematical Research PapersAnkit Satpute, André Greiner-Petter, Noah Gießing et al.
Content-based research paper recommendation (CbRPR) has seen advances in computer science and biomedicine, but remains unexplored for mathematics, where paper relatedness is more conceptual than explicit textual or citation-based similarity. Mathematics papers may be connected through shared proof techniques, logical implications, or natural generalizations, yet exhibit minimal textual or citation overlap, rendering existing CbRPR ineffective. To address this gap, we first conduct an expert-driven study characterizing mathematical recommendations, revealing that relevance is inherently \textit{aspect}-driven. Grounded in this insight, we introduce GoldRiM (small, expert-annotated) and SilverRiM (large, automatically derived), the first datasets for \textit{aspect}-aware CbRPR in mathematics. Recognizing that LLM embeddings of mathematical content alone yield suboptimal representation, we propose AchGNN, an \textit{aspect}-conditioned heterogeneous GNN that jointly models textual semantics, citation structure, and author lineage. Across GoldRiM and SilverRiM, AchGNN consistently outperforms prior \textit{aspect}-based CbRPR methods, achieving substantial gains across all evaluated \textit{aspects}. We conduct ablation studies to analyze the contributions of individual \textit{aspect} supervision, authorship lineage, and graph-structural signals to AchGNN's performance. To assess domain generality, we further evaluate AchGNN on the \textit{Papers with Code} dataset of machine learning publications, demonstrating that our \textit{aspect}-aware approach effectively transfers beyond mathematics. We deploy our system on the MaRDI platform to help mathematicians with recommendations and release datasets and code publicly for reproducibility.
DLOct 26, 2025
Fraudulent Publishing in Mathematics: A European Call to Action and How Information Infrastructure Can HelpMoritz Schubotz, Jan Philip SoloveJ
The IMU-ICIAM working group's new report on Fraudulent Publishing in the Mathematical Sciences documents how gaming of bibliometrics, predatory outlets and paper-mill activity are eroding trust in research, mathematics included. This short EMS note brings that analysis home to Europe. We urge readers to recognise the warning signs of fraudulent publishing, to report serious irregularities so that they can be investigated and sanctioned, and to reflect critically on their own editorial and reviewing practices. We then sketch why Europe is well placed to lead a structural response: a decade of policy development on open science; mature infrastructures for data, software and scholarly communication; and new capacity for community-led diamond open access. Finally, we outline developments towards non-print contributions across member countries including the growth of formal proofs (e.g. with Lean and Isabelle) and we highlight the role of zbMATH Open as a European quality signal that can help editors, reviewers and authors steer clear of problematic venues.
CLMay 25, 2023
Neural Machine Translation for Mathematical FormulaeFelix Petersen, Moritz Schubotz, Andre Greiner-Petter et al.
We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages. Compared to neural machine translation on natural language, mathematical formulae have a much smaller vocabulary and much longer sequences of symbols, while their translation requires extreme precision to satisfy mathematical information needs. In this work, we perform the tasks of translating from LaTeX to Mathematica as well as from LaTeX to semantic LaTeX. While recurrent, recursive, and transformer networks struggle with preserving all contained information, we find that convolutional sequence-to-sequence networks achieve 95.1% and 90.7% exact matches, respectively.
CLNov 18, 2021
Detecting Cross-Language Plagiarism using Open Knowledge GraphsJohannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke et al.
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Opposed to other methods, CL-OSA does not require computationally expensive machine translation, nor pre-training using comparable or parallel corpora. It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than factor two. The code and data of our study are openly available.
IRSep 17, 2021
Semantic Preserving Bijective Mappings of Mathematical Formulae between Document Preparation Systems and Computer Algebra SystemsHoward S. Cohl, Moritz Schubotz, Abdou Youssef et al.
Document preparation systems like LaTeX offer the ability to render mathematical expressions as one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National Institute of Standards (NIST) Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup (semantic LaTeX) is achieved by using a semantic macro set. Computer algebra systems (CAS) such as Maple and Mathematica use alternative markup to represent mathematical expressions. By taking advantage of Youssef's Part-of-Math tagger and CAS internal representations, we develop algorithms to translate mathematical expressions represented in semantic LaTeX to corresponding CAS representations and vice versa. We have also developed tools for translating the entire Wolfram Encoding Continued Fraction Knowledge and University of Antwerp Continued Fractions for Special Functions datasets, for use in the NIST Digital Repository of Mathematical Formulae. The overall goal of these efforts is to provide semantically enriched standard conforming MathML representations to the public for formulae in digital mathematics libraries. These representations include presentation MathML, content MathML, generic LaTeX, semantic LaTeX, and now CAS representations as well.
IRSep 2, 2021
Towards Explaining STEM Document Classification using Mathematical Entity LinkingPhilipp Scharpf, Moritz Schubotz, Bela Gipp
Document subject classification is essential for structuring (digital) libraries and allowing readers to search within a specific field. Currently, the classification is typically made by human domain experts. Semi-supervised Machine Learning algorithms can support them by exploiting the labeled data to predict subject classes for unclassified new documents. However, while humans partly do, machines mostly do not explain the reasons for their decisions. Recently, explainable AI research to address the problem of Machine Learning decisions being a black box has increasingly gained interest. Explainer models have already been applied to the classification of natural language texts, such as legal or medical documents. Documents from Science, Technology, Engineering, and Mathematics (STEM) disciplines are more difficult to analyze, since they contain both textual and mathematical formula content. In this paper, we present first advances towards STEM document classification explainability using classical and mathematical Entity Linking. We examine relationships between textual and mathematical subject classes and entities, mining a collection of documents from the arXiv preprint repository (NTCIR and zbMATH dataset). The results indicate that mathematical entities have the potential to provide high explainability as they are a crucial part of a STEM document.
DLMay 25, 2020
AutoMSC: Automatic Assignment of Mathematics Subject Classification LabelsMoritz Schubotz, Philipp Scharpf, Olaf Teschke et al.
Authors of research papers in the fields of mathematics, and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital Libraries in Mathematics, as well as reviewing services, such as zbMATH and Mathematical Reviews (MR) rely on these MSC labels in their workflows to organize the abstracting and reviewing process. Especially, the coarse-grained classification determines the subject editor who is responsible for the actual reviewing process. In this paper, we investigate the feasibility of automatically assigning a coarse-grained primary classification using the MSC scheme, by regarding the problem as a multi-class classification machine learning task. We find that our method achieves an (F_1)-score of over 77%, which is remarkably close to the agreement of zbMATH and MR ((F_1)-score of 81%). Moreover, we find that the method's confidence score allows for reducing the effort by 86% compared to the manual coarse-grained classification effort while maintaining a precision of 81% for automatically classified articles.
CRMay 23, 2020
A First Step Towards Content Protecting Plagiarism DetectionCornelius Ihle, Moritz Schubotz, Norman Meuschke et al.
Plagiarism detection systems are essential tools for safeguarding academic and educational integrity. However, today's systems require disclosing the full content of the input documents and the document collection to which the input documents are compared. Moreover, the systems are centralized and under the control of individual, typically commercial providers. This situation raises procedural and legal concerns regarding the confidentiality of sensitive data, which can limit or prohibit the use of plagiarism detection services. To eliminate these weaknesses of current systems, we seek to devise a plagiarism detection approach that does not require a centralized provider nor exposing any content as cleartext. This paper presents the initial results of our research. Specifically, we employ Private Set Intersection to devise a content-protecting variant of the citation-based similarity measure Bibliographic Coupling implemented in our plagiarism detection system HyPlag. Our evaluation shows that the content-protecting method achieves the same detection effectiveness as the original method while making common attacks to disclose the protected content practically infeasible. Our future work will extend this successful proof-of-concept by devising plagiarism detection methods that can analyze the entire content of documents without disclosing it as cleartext.
DLMay 22, 2020
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical LanguagePhilipp Scharpf, Moritz Schubotz, Abdou Youssef et al.
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$ (number of clusters equals number of classes), and $99.9\%$ (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
DLMar 22, 2020
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia ArticlesMalte Ostendorff, Terry Ruas, Moritz Schubotz et al.
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
DLMar 20, 2020
Mathematical Formulae in Wikimedia Projects 2020Moritz Schubotz, André Greiner-Petter, Norman Meuschke et al.
This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as course-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects further.
DLFeb 7, 2020
Discovering Mathematical Objects of Interest -- A Study of Mathematical NotationsAndre Greiner-Petter, Moritz Schubotz, Fabian Mueller et al.
Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases. For example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_{n}^{(α, β)}\!\left(x\right)$ with `Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data.
IRSep 23, 2019
NewsDeps: Visualizing the Origin of Information in News ArticlesFelix Hamborg, Philipp Meschenmoser, Moritz Schubotz et al.
In scientific publications, citations allow readers to assess the authenticity of the presented information and verify it in the original context. News articles, however, do not contain citations and only rarely refer readers to further sources. Readers often cannot assess the authenticity of the presented information as its origin is unclear. We present NewsDeps, the first approach that analyzes and visualizes where information in news articles stems from. NewsDeps employs methods from natural language processing and plagiarism detection to measure article similarity. We devise a temporal-force-directed graph that places articles as nodes chronologically. The graph connects articles by edges varying in width depending on the articles' similarity. We demonstrate our approach in a case study with two real-world scenarios. We find that NewsDeps increases efficiency and transparency in news consumption by revealing which previously published articles are the primary sources of each given article.
DLJun 27, 2019
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and CitationsNorman Meuschke, Vincent Stange, Moritz Schubotz et al.
Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines.
DLMay 20, 2019
Why Machines Cannot Learn Mathematics, YetAndré Greiner-Petter, Terry Ruas, Moritz Schubotz et al.
Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we also investigate the missing aspects that would allow mathematics to be learned by computers.
LGNov 10, 2018
Towards Formula Translation using Recursive Neural NetworksFelix Petersen, Moritz Schubotz, Bela Gipp
While it has become common to perform automated translations on natural language, performing translations between different representations of mathematical formulae has thus far not been possible. We implemented the first translator for mathematical formulae based on recursive neural networks. We chose recursive neural networks because mathematical formulae inherently include a structural encoding. In our implementation, we developed new techniques and topologies for recursive tree-to-tree neural networks based on multi-variate multi-valued Long Short-Term Memory cells. We propose a novel approach for mini-batch training that utilizes clustering and tree traversal. We evaluate our translator and analyze the behavior of our proposed topologies and techniques based on a translation from generic LaTeX to the semantic LaTeX notation. We use the semantic LaTeX notation from the Digital Library for Mathematical Formulae and the Digital Repository for Mathematical Formulae at the National Institute for Standards and Technology. We find that a simple heuristics-based clustering algorithm outperforms the conventional clustering algorithms on the task of clustering binary trees of mathematical formulae with respect to their topology. Furthermore, we find a mask for the loss function, which can prevent the neural network from finding a local minimum of the loss function. Given our preliminary results, a complete translation from formula to formula is not yet possible. However, we achieved a prediction accuracy of 47.05% for predicting symbols at the correct position and an accuracy of 92.3% when ignoring the predicted position. Concluding, our work advances the field of recursive neural networks by improving the training speed and quality of training. In the future, we will work towards a complete translation allowing a machine-interpretation of LaTeX formulae.
DLApr 13, 2018
Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual ContextMoritz Schubotz, Andre Greiner-Petter, Philipp Scharpf et al.
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.
DLMay 6, 2015
Growing the Digital Repository of Mathematical Formulae with Generic LaTeX SourcesHoward S. Cohl, Moritz Schubotz, Marjorie A. McClain et al.
One initial goal for the DRMF is to seed our digital compendium with fundamental orthogonal polynomial formulae. We had used the data from the NIST Digital Library of Mathematical Functions (DLMF) as initial seed for our DRMF project. The DLMF input LaTeX source already contains some semantic information encoded using a highly customized set of semantic LaTeX macros. Those macros could be converted to content MathML using LaTeXML. During that conversion the semantics were translated to an implicit DLMF content dictionary. This year, we have developed a semantic enrichment process whose goal is to infer semantic information from generic LaTeX sources. The generated context-free semantic information is used to build DRMF formula home pages for each individual formula. We demonstrate this process using selected chapters from the book "Hypergeometric Orthogonal Polynomials and their $q$-Analogues" (2010) by Koekoek, Lesky and Swarttouw (KLS) as well as an actively maintained addendum to this book by Koornwinder (KLSadd). The generic input KLS and KLSadd LaTeX sources describe the printed representation of the formulae, but does not contain explicit semantic information. See http://drmf.wmflabs.org.
DLJul 1, 2014
Mathematical Language Processing ProjectRobert Pagael, Moritz Schubotz
In natural language, words and phrases themselves imply the semantics. In contrast, the meaning of identifiers in mathematical formulae is undefined. Thus scientists must study the context to decode the meaning. The Mathematical Language Processing (MLP) project aims to support that process. In this paper, we compare two approaches to discover identifier-definition tuples. At first we use a simple pattern matching approach. Second, we present the MLP approach that uses part-of-speech tag based distances as well as sentence positions to calculate identifier-definition probabilities. The evaluation of our prototypical system, applied on the Wikipedia text corpus, shows that our approach augments the user experience substantially. While hovering the identifiers in the formula, tool-tips with the most probable definitions occur. Tests with random samples show that the displayed definitions provide a good match with the actual meaning of the identifiers.