CL | AI | Mar 16, 2023

How well do Large Language Models perform in Arithmetic tasks?

Tsinghua
arXiv:2304.02015v1 | 167 citations | h-index: 40 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating the arithmetic skills of large language models, which matters to AI researchers and developers, though the contribution is incremental because it focuses on one specific aspect of model performance.

The authors evaluate the arithmetic ability of large language models with a new dataset, MATH 401, and find that performance on arithmetic tasks varies across models such as GPT-4 and ChatGPT; per-model accuracy scores are reported in the paper's analysis.

Large language models have emergent abilities, including chain-of-thought reasoning, that let them answer math word problems step by step. Solving math word problems requires not only decomposing the problem via chain-of-thought but also calculating the arithmetic expression at each step correctly. To the best of our knowledge, no prior work focuses on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset, MATH 401, to test the latest large language models, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, on various arithmetic expressions, and we provide a detailed analysis of their arithmetic ability. MATH 401 and the evaluation code are released at https://github.com/GanjinZero/math401-llm.
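
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how model answers to arithmetic expressions might be scored. It is not the released evaluation code: the function names, the rule of taking the last number in a response, and the relative tolerance are all assumptions made for illustration, and the sample responses are invented rather than taken from MATH 401.

```python
import math
import re

# Matches integers and decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_last_number(response):
    """Return the last number in a model response, or None if there is none.

    Assumption: the final number in the text is the model's answer.
    Thousands separators such as "1,000" are removed before matching.
    """
    matches = NUMBER.findall(response.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(responses, targets, rel_tol=1e-3):
    """Fraction of responses whose extracted answer is close to the target.

    rel_tol is a hypothetical tolerance, not a value from the paper.
    """
    if not targets:
        return 0.0
    correct = 0
    for response, target in zip(responses, targets):
        predicted = extract_last_number(response)
        if predicted is not None and math.isclose(
            predicted, target, rel_tol=rel_tol, abs_tol=1e-9
        ):
            correct += 1
    return correct / len(targets)


if __name__ == "__main__":
    # Invented responses for illustration only.
    sample_responses = ["3 + 4 = 7", "The answer is 144.", "7 / 2 = 3"]
    sample_targets = [7.0, 144.0, 3.5]
    print(accuracy(sample_responses, sample_targets))
```

A tolerance-based comparison is used here because answers to division or irrational expressions are rarely exact; a stricter or looser criterion could be substituted depending on the expression type.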

Code Implementations: 1 repo
