CL, AI | Mar 16, 2023

How well do Large Language Models perform in Arithmetic tasks?

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang

Tsinghua

arXiv:2304.02015v1 | 20.7167 citations | h-index: 40 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in evaluating the arithmetic skills of large language models, which matters to AI researchers and developers. The contribution is incremental, since it focuses on one specific aspect of model performance.

The authors evaluate the arithmetic ability of large language models with a new dataset, MATH 401. They find that models such as GPT-4 and ChatGPT differ in their performance on arithmetic tasks, and the paper's analysis reports accuracy scores for each model.

Large language models have emergent abilities, including chain-of-thought reasoning, that let them answer math word problems step by step. Solving math word problems requires not only decomposing a problem via chain-of-thought but also correctly calculating the arithmetic expression in each step. To the best of our knowledge, no prior work focuses on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset, MATH 401, to test the latest large language models, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, on various arithmetic expressions, and we provide a detailed analysis of their arithmetic ability. MATH 401 and the evaluation code are released at https://github.com/GanjinZero/math401-llm.
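
To make the idea of testing a model on arithmetic expressions concrete, here is a minimal sketch of how such scoring could work. It is not the authors' released evaluation code. The example expressions, the `toy_model` stand-in, the last-number extraction rule, and the relative tolerance are all assumptions made for illustration, and none of the items come from MATH 401.

```python
import math
import re

# Hypothetical (expression, expected value) pairs; NOT items from MATH 401.
EXAMPLES = [
    ("3 + 4 * 2", 11.0),
    ("2 ** 10", 1024.0),
    ("7 / 2", 3.5),
]

# Matches integers, decimals, and simple scientific notation.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def last_number(text):
    """Return the last number in a response as a float, or None if absent.

    Assumption: the final number in the response is the model's answer.
    Commas are dropped so that "1,024" parses as 1024.
    """
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(response, target, rel_tol=1e-3):
    """Compare the extracted answer to the target within a relative tolerance."""
    value = last_number(response)
    return value is not None and math.isclose(
        value, target, rel_tol=rel_tol, abs_tol=1e-9
    )


def accuracy(answer_fn, examples=EXAMPLES):
    """Fraction of examples that answer_fn (expression -> text) gets right."""
    hits = sum(is_correct(answer_fn(expr), target) for expr, target in examples)
    return hits / len(examples)


def toy_model(expression):
    """Hypothetical stand-in for a real model call; returns canned text."""
    canned = {
        "3 + 4 * 2": "3 + 4 * 2 = 11",
        "2 ** 10": "The answer is 1,024.",
        "7 / 2": "About 4",
    }
    return canned.get(expression, "")


if __name__ == "__main__":
    print(f"accuracy: {accuracy(toy_model):.3f}")
```

Replacing `toy_model` with a function that queries an actual model would give a per-model accuracy figure. Exact-match scoring and tolerance-based scoring can produce different numbers, so the comparison rule should be stated alongside any reported result.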
