CL · May 27, 2023

FERMAT: An Alternative to Accuracy for Numerical Reasoning

arXiv:2305.17491v1225 citations
Originality: Incremental advance
AI Analysis

FERMAT gives researchers working on numerical reasoning in NLP a more nuanced evaluation tool, though the contribution is incremental because it builds on existing evaluation frameworks such as CheckList.

The authors tackled the problem of evaluating numerical reasoning in language models by introducing FERMAT, a multi-view evaluation set that assesses aspects like number understanding and mathematical operations, enabling systematic generation of large datasets for training and evaluation.

While pre-trained language models achieve impressive performance on various NLP benchmarks, they still struggle with tasks that require numerical reasoning. Recent advances in improving numerical reasoning are mostly achieved using very large language models that contain billions of parameters and are not accessible to everyone. In addition, numerical reasoning is measured using a single score on existing datasets. As a result, we do not have a clear understanding of the strengths and shortcomings of existing models on different numerical reasoning aspects and, therefore, of potential ways to improve them apart from scaling them up. Inspired by CheckList (Ribeiro et al., 2020), we introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT. Instead of reporting a single score on a whole dataset, FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency. Apart from providing a comprehensive evaluation of models on different numerical reasoning aspects, FERMAT enables a systematic and automated generation of an arbitrarily large training or evaluation set for each aspect. The datasets and code are publicly available and can be used to generate further multi-view data for additional tasks and languages.
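The two ideas in the abstract, generating examples per aspect from a template and reporting one score per aspect rather than a single aggregate, can be illustrated with a minimal sketch. This is not the FERMAT code or data: the template, the aspect names (`small_integers`, `large_integers`, `one_decimal_place`), and the helpers `generate`, `per_aspect_accuracy`, and `toy_predict` are all hypothetical names chosen for illustration.

```python
import random
import re
from collections import defaultdict

# Hypothetical question template; not taken from the FERMAT dataset.
TEMPLATE = "A shop has {a} apples and receives {b} more. How many apples does it have now?"

# Hypothetical aspects: each one samples operands of a different number type.
ASPECTS = {
    "small_integers": lambda rng: rng.randint(1, 99),
    "large_integers": lambda rng: rng.randint(1000, 999999),
    "one_decimal_place": lambda rng: round(rng.uniform(1, 99), 1),
}

def generate(aspect, n, seed=0):
    """Fill the template n times with operands drawn for the given aspect."""
    rng = random.Random(seed)
    sample = ASPECTS[aspect]
    examples = []
    for _ in range(n):
        a, b = sample(rng), sample(rng)
        examples.append({
            "aspect": aspect,
            "question": TEMPLATE.format(a=a, b=b),
            "answer": round(a + b, 1),
        })
    return examples

def per_aspect_accuracy(examples, predict, tol=1e-6):
    """Return one accuracy per aspect instead of a single overall score."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["aspect"]] += 1
        if abs(predict(ex["question"]) - ex["answer"]) < tol:
            correct[ex["aspect"]] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}

def toy_predict(question):
    """Stand-in 'model' that adds the numbers but drops any decimal part."""
    numbers = re.findall(r"\d+(?:\.\d+)?", question)
    return sum(int(float(tok)) for tok in numbers)

if __name__ == "__main__":
    data = [ex for aspect in ASPECTS for ex in generate(aspect, 50)]
    for aspect, acc in per_aspect_accuracy(data, toy_predict).items():
        print(f"{aspect}: {acc:.2f}")
```

Because the toy predictor truncates decimals, it answers the integer aspects correctly and fails on the decimal one. An aggregate accuracy over all 150 examples would hide that pattern, which is the kind of weakness a per-aspect report is meant to expose.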

Code Implementations: 1 repo
