CL | Nov 1, 2024

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

arXiv:2411.00387v3 | 5 citations | h-index: 9 | Has Code | ACL
Originality: Synthesis-oriented
AI Analysis

This paper addresses a limitation of LLMs that matters to researchers and practitioners working with math-rich scientific text. Its contribution is incremental and takes the form of a new benchmark.

The paper evaluates how well large language models understand abstract mathematical symbols in STEM documents. It introduces the STEM-PoM benchmark dataset and finds that state-of-the-art LLMs reach only 20-60% accuracy on symbol classification under in-context learning and 50-60% with fine-tuning.

Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs' reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world arXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our extensive experiments demonstrate that state-of-the-art LLMs achieve an average accuracy of 20-60% under in-context learning and 50-60% with fine-tuning, highlighting a substantial gap in their ability to classify mathematical symbols. By improving LLMs' mathematical symbol classification, STEM-PoM further enhances models' downstream mathematical reasoning capabilities. The code and data are available at https://github.com/jiaruzouu/STEM-PoM.
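
The two-level label scheme described in the abstract can be pictured as a small data structure. The following is a minimal Python sketch, not the schema used in the STEM-PoM repository: the names `MainAttribute`, `SUB_ATTRIBUTES`, `SymbolLabel`, and `main_attribute_accuracy` are hypothetical, and the assumption that unit descriptors carry no sub-attribute is inferred only from the abstract not listing any.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class MainAttribute(Enum):
    """The four main attributes named in the abstract."""
    VARIABLE = "variable"
    CONSTANT = "constant"
    OPERATOR = "operator"
    UNIT_DESCRIPTOR = "unit descriptor"


# Sub-attributes per main attribute, as listed in the abstract.
# Assumption: unit descriptors have none, since the abstract mentions none.
SUB_ATTRIBUTES = {
    MainAttribute.VARIABLE: {"scalar", "vector", "matrix"},
    MainAttribute.CONSTANT: {"local", "global", "discipline-specific"},
    MainAttribute.OPERATOR: {"local", "global", "discipline-specific"},
    MainAttribute.UNIT_DESCRIPTOR: set(),
}


@dataclass(frozen=True)
class SymbolLabel:
    """Hypothetical label for one math symbol: main attribute plus optional sub-attribute."""
    main: MainAttribute
    sub: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject sub-attributes that do not belong to the chosen main attribute.
        if self.sub is not None and self.sub not in SUB_ATTRIBUTES[self.main]:
            raise ValueError(
                f"{self.sub!r} is not a sub-attribute of {self.main.value!r}"
            )


def main_attribute_accuracy(
    gold: Sequence[SymbolLabel], predicted: Sequence[SymbolLabel]
) -> float:
    """Fraction of symbols whose predicted main attribute matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g.main == p.main for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative (made-up) labels, not data from the benchmark.
gold = [
    SymbolLabel(MainAttribute.VARIABLE, "vector"),
    SymbolLabel(MainAttribute.CONSTANT, "global"),
]
predicted = [
    SymbolLabel(MainAttribute.VARIABLE, "scalar"),
    SymbolLabel(MainAttribute.OPERATOR, "local"),
]
print(main_attribute_accuracy(gold, predicted))  # 0.5
```

Scoring only the main attribute is one possible choice for this sketch. The abstract does not say whether the reported accuracies are computed over main attributes, sub-attributes, or both.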

