CL · Nov 11, 2023

BizBench: A Quantitative Reasoning Benchmark for Business and Finance

Rik Koncel-Kedziorski, Michael Krumdick, Viet Lai, Varshini Reddy, Charles Lovering, Chris Tanner

arXiv:2311.06602v213.042 citations · h-index: 10 · Has Code

Originality: Synthesis-oriented

AI Analysis

BizBench gives researchers and practitioners a challenging benchmark for assessing and improving LLMs in business and finance, though the contribution is incremental in that it builds on existing benchmarking approaches.

The authors address the challenge of evaluating large language models' quantitative reasoning in business and finance by introducing BizBench, a benchmark of eight tasks built around question answering via program synthesis, including three code-generation tasks. They find that current models are limited mainly by their business and financial understanding.

Answering questions within business and finance requires reasoning, precision, and a wide breadth of technical knowledge. Together, these requirements make this domain difficult for large language models (LLMs). We introduce BizBench, a benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises eight quantitative reasoning tasks, focusing on question answering (QA) over financial data via program synthesis. We include three financially themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate the reasoning capabilities required for financial QA: reading comprehension of financial text and tables for extracting intermediate values, and understanding financial concepts and formulas needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to parse financial documents, and capacity to solve problems with code. We conduct an in-depth evaluation of open-source and commercial LLMs, comparing and contrasting the behavior of code-focused and language-focused models. We demonstrate that the current bottleneck in performance is due to LLMs' limited business and financial understanding, highlighting the value of a challenging benchmark for quantitative reasoning within this domain.
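
To make "QA over financial data via program synthesis" concrete, the following is a minimal sketch of how such an item might be scored: the model emits a short program, the program is executed, and its result is compared numerically with a gold answer. This is an illustration under stated assumptions, not BizBench's actual evaluation harness. The names `run_program`, `is_correct`, and `execution_accuracy`, the convention that a program binds its result to `answer`, the tolerance values, and the sample figures are all hypothetical.

```python
import math


def run_program(source: str) -> float:
    """Execute a generated program and return the value it binds to `answer`.

    Assumption: programs store their final result in a variable named `answer`.
    Note: emptying `__builtins__` is NOT a security sandbox; real harnesses
    should isolate untrusted code in a separate process or container.
    """
    namespace: dict = {}
    exec(source, {"__builtins__": {}}, namespace)
    return float(namespace["answer"])


def is_correct(predicted: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Compare numerically with a tolerance, since float answers rarely match exactly."""
    return math.isclose(predicted, gold, rel_tol=rel_tol, abs_tol=1e-9)


def execution_accuracy(programs: list[str], golds: list[float]) -> float:
    """Fraction of programs whose executed result matches the gold answer.

    A program that raises any exception is counted as incorrect.
    """
    hits = 0
    for source, gold in zip(programs, golds):
        try:
            hits += is_correct(run_program(source), gold)
        except Exception:
            pass  # crashed or missing `answer`: scored as wrong
    return hits / len(programs)


# Hypothetical model output for "What was year-over-year revenue growth, in percent?"
# The revenue figures are made up for illustration.
generated = """
revenue_2022 = 1250.0
revenue_2021 = 1000.0
answer = (revenue_2022 - revenue_2021) / revenue_2021 * 100
"""

if __name__ == "__main__":
    print(run_program(generated))                    # 25.0
    print(execution_accuracy([generated], [25.0]))   # 1.0
```

Scoring the executed result rather than the program text means two differently written programs that compute the same value are both counted as correct, which fits a benchmark whose answers are numeric.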
