AI · Dec 6, 2024

Neuro-Symbolic Data Generation for Math Reasoning

Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, Xiaoxing Ma

Microsoft

arXiv:2412.04857v1 · 22.330 citations · h-index: 38 · Advances in Neural Information Processing Systems 37

Originality: Highly original

AI Analysis

This work addresses the data bottleneck that limits mathematical reasoning in LLMs, though it is an incremental advance in data generation methods.

The authors ask whether LLMs' deficiencies in mathematical reasoning stem from insufficient high-quality data. To test this, they develop an automated neuro-symbolic method for generating supervised mathematical datasets. LLaMA-2 and Mistral models realigned with the generated data surpass their state-of-the-art counterparts.

A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework that combines the intuitive informalization strengths of LLMs with the precise symbolic reasoning of math solvers, along with projected Markov chain Monte Carlo sampling in the highly irregular symbolic space. Empirical experiments demonstrate the high quality of the data generated by the proposed method and show that LLMs, specifically LLaMA-2 and Mistral, surpass their state-of-the-art counterparts when realigned with the generated data.
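
The mutate, project, and accept loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a deliberately tiny problem family (linear equations a*x + b = c over the integers), a hand-written divisibility check standing in for a symbolic solver, a string template standing in for LLM informalization, and a hypothetical target density. All function names are invented for illustration. Because the projection step makes the proposal asymmetric, the acceptance rule is Metropolis-style rather than an exact sampler.

```python
import math
import random


def is_valid(a, b, c):
    """Stand-in for a symbolic solver: a*x + b = c has an integer solution."""
    return a != 0 and (c - b) % a == 0


def project(a, b, c):
    """Map an arbitrary triple to a nearby valid one (assumed projection)."""
    if a == 0:
        a = 1
    c -= (c - b) % a  # afterwards (c - b) is a multiple of a
    return a, b, c


def log_target(a, b, c):
    """Hypothetical unnormalized log-density that keeps coefficients moderate."""
    return -(a * a + b * b + c * c) / 50.0


def informalize(a, b, c):
    """Stand-in for LLM informalization: render the problem and its answer."""
    return f"Solve for x: {a}*x + ({b}) = {c}", (c - b) // a


def sample_problems(seed_problem, steps, rng):
    """Random-walk mutation, projection onto valid problems, then accept/reject."""
    current = project(*seed_problem)
    samples = []
    for _ in range(steps):
        # Mutate every coefficient by a small random step.
        proposal = tuple(v + rng.randint(-2, 2) for v in current)
        # Project back so the mutated problem is guaranteed to be solvable.
        proposal = project(*proposal)
        log_ratio = log_target(*proposal) - log_target(*current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        samples.append(current)
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    problems = sample_problems((3, 1, 7), steps=200, rng=rng)
    for triple in sorted(set(problems))[:5]:
        text, answer = informalize(*triple)
        print(text, "| answer:", answer)
```

Every sampled triple passes the validity check by construction, which is the point of projecting before accepting: invalid mutations are repaired rather than discarded.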
