CL, AI | Oct 31, 2022

Lila: A Unified Benchmark for Mathematical Reasoning

arXiv:2210.17517v2 | 322 citations | h-index: 29
Originality: Incremental advance
AI Analysis

This work addresses the need for comprehensive evaluation of mathematical reasoning in AI systems, a skill that matters for tasks ranging from everyday applications to scientific modeling. The advance is incremental: the benchmark is built by extending existing datasets rather than by introducing a new task.

The authors introduce LILA, a unified benchmark for evaluating mathematical reasoning in AI systems across 23 diverse tasks. They find that multi-task training yields an average relative improvement of 21.83% in F1 score over single-task models, while the best-performing model reaches only 60.40% F1.

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best-performing model only obtains 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
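To make two ideas from the abstract concrete, here is a minimal sketch of what a solution expressed as a Python program might look like, and of how a relative (as opposed to absolute) improvement is computed. The word problem, the function names `solution` and `relative_improvement`, and the scores passed in are all hypothetical and chosen for illustration; none of them come from the LILA dataset or the paper's results.

```python
def solution():
    # Hypothetical word problem (not from LILA):
    # "A shopper buys 3 apples at $0.50 each and 2 loaves of bread
    #  at $2.25 each. What is the total cost in dollars?"
    apples_cost = 3 * 0.50
    bread_cost = 2 * 2.25
    # Each intermediate step is visible, so the program explains the answer.
    return apples_cost + bread_cost


def relative_improvement(multi_task_f1, single_task_f1):
    """Percentage gain of the multi-task score relative to the single-task score."""
    return 100.0 * (multi_task_f1 - single_task_f1) / single_task_f1


answer = solution()

# Made-up F1 scores (in percent), purely for illustration.
gain = relative_improvement(50.0, 40.0)
print(answer, gain)
```

Read this way, the abstract's 21.83% is a ratio against the single-task baseline and not a gain of 21.83 F1 points, which is consistent with the best model still reaching only 60.40%.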

Code Implementations: 1 repo
