CL, LG
Nov 3, 2023

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

arXiv:2311.02216v1
150 citations
h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of numerical reasoning in language models for domains like finance and science, but it is incremental as it builds on existing benchmarks and focuses on a specific task.

The paper evaluates the numerical reasoning capabilities of language models by proposing a hierarchical taxonomy and conducting a comprehensive analysis on tabular data. It finds that no model consistently excels across all reasoning types, with FlanT5 and GPT-3.5 showing strong overall skills.

Numbers are crucial for various real-world domains such as finance, economics, and science. Thus, understanding and reasoning with numbers are essential skills for language models to solve different tasks. While different numerical benchmarks have been introduced in recent years, they are mostly limited to specific numerical aspects. In this paper, we propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels: representation, number sense, manipulation, and complex reasoning. We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them. To this end, we develop a diverse set of numerical probes employing a semi-automated approach. We focus on the tabular Natural Language Inference (TNLI) task as a case study and measure models' performance shifts. Our results show that no model consistently excels across all numerical reasoning types. Among the probed models, FlanT5 (few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical reasoning skills compared to other models. Label-flipping probes indicate that models often exploit dataset artifacts to predict the correct labels.
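To make the idea of a label-flipping probe and a performance shift concrete, the following is a minimal sketch, not the paper's implementation. The `TNLIExample` structure, the "greater than" hypothesis template, the `flip_probe` helper, and the toy `always_entailed` model are all hypothetical and exist only for illustration; the abstract does not describe how the actual probes are built beyond calling the approach semi-automated.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

# The four taxonomy levels named in the abstract. The reasoning types under
# each level are not listed there, so none are filled in here.
TAXONOMY_LEVELS = ("representation", "number sense", "manipulation", "complex reasoning")


@dataclass(frozen=True)
class TNLIExample:
    """Hypothetical tabular NLI item: a one-row table plus a numeric hypothesis."""
    table: Dict[str, float]  # column name -> value
    column: str              # column the hypothesis refers to
    threshold: float         # number mentioned in the hypothesis
    label: str               # "entailed" or "refuted"

    @property
    def hypothesis(self) -> str:
        return f"The {self.column} is greater than {self.threshold:g}."


def gold_label(table: Dict[str, float], column: str, threshold: float) -> str:
    """Recompute the label from the table so probes stay correctly annotated."""
    return "entailed" if table[column] > threshold else "refuted"


def flip_probe(ex: TNLIExample) -> TNLIExample:
    """Change only the number in the hypothesis so that the gold label flips."""
    value = ex.table[ex.column]
    new_threshold = value + 1 if ex.label == "entailed" else value - 1
    return replace(ex, threshold=new_threshold,
                   label=gold_label(ex.table, ex.column, new_threshold))


def accuracy(model: Callable[[TNLIExample], str], data: List[TNLIExample]) -> float:
    return sum(model(ex) == ex.label for ex in data) / len(data)


def performance_shift(model, originals, probes) -> float:
    """Accuracy on probes minus accuracy on originals (negative means a drop)."""
    return accuracy(model, probes) - accuracy(model, originals)


# Toy model that ignores the numbers entirely, standing in for a model that
# relies on a dataset artifact rather than numerical reasoning.
def always_entailed(ex: TNLIExample) -> str:
    return "entailed"


originals = [
    TNLIExample({"revenue": 120.0}, "revenue", 100.0, "entailed"),
    TNLIExample({"profit": 30.0}, "profit", 10.0, "entailed"),
]
probes = [flip_probe(ex) for ex in originals]
shift = performance_shift(always_entailed, originals, probes)
```

In this toy setup the number-blind model scores perfectly on the original items and fails on every flipped probe, which is the kind of performance shift that would suggest a model is predicting from artifacts rather than from the numbers.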
