Disambiguating Symbolic Expressions in Informal Documents
This work addresses the challenge of interpreting ambiguous mathematical notation in informal documents, a problem relevant to researchers and educators, though it appears incremental because it builds on existing transformer methods.
The paper tackles the problem of determining the precise semantics and abstract syntax trees of symbolic expressions in informal STEM documents by framing it as a neural machine translation task, and presents a transformer-based model that yields promising results despite a small dataset of roughly 33,000 entries.
We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files (that is, determining their precise semantics and abstract syntax tree) as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. Several baseline models evaluated on this dataset failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using several dedicated techniques that take the syntax and semantics of symbolic expressions into account.
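To make the task concrete, the following is a minimal sketch of what "disambiguation" could look like for a single expression. It is an illustration under stated assumptions, not the paper's data format or model: the node types `Sym` and `App`, the macro names `multiplication` and `groupop`, and the helpers `render` and `braces_balanced` are all hypothetical. The brace check is only a toy proxy for the kind of syntactic validity the baseline models reportedly failed to reach.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical AST node types, for illustration only (not the paper's format).
@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class App:
    head: str                  # hypothetical semantic macro name
    args: Tuple["Node", ...]   # sub-expressions the macro is applied to


Node = Union[Sym, App]


def render(node: Node) -> str:
    """Serialize an AST as LaTeX-like text with explicit semantic macros."""
    if isinstance(node, Sym):
        return node.name
    return "\\" + node.head + "".join("{" + render(a) + "}" for a in node.args)


def braces_balanced(text: str) -> bool:
    """Toy validity proxy: every '{' has a matching '}'.

    Escaped braces such as '\\{' are not handled; this is not a LaTeX parser.
    """
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# One informal surface form, two candidate readings (macro names are made up).
surface = r"a \cdot b"
as_scalar_product = App("multiplication", (Sym("a"), Sym("b")))
as_group_operation = App("groupop", (Sym("a"), Sym("b")))

for reading in (as_scalar_product, as_group_operation):
    target = render(reading)
    print(surface, "->", target, "| balanced:", braces_balanced(target))
```

Under these assumptions, a translation model would receive the surface form together with its surrounding context and be expected to emit one of the rendered targets. An evaluation could then check both that the output is well-formed and that it matches the intended reading.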