CL · Nov 16, 2023

FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains

arXiv:2311.09797v2 · 43 citations · h-index: 28
AI Analysis

This work addresses the challenge of assessing LLMs in domain-specific reasoning tasks for finance, but it is incremental as it builds on existing benchmarking and prompting methods.

The authors address the evaluation of LLMs on knowledge-intensive math reasoning in finance by introducing FinanceMath, a benchmark of 1,200 problems that require college-level finance knowledge. The best-performing system (GPT-4o) reaches only 60.9% accuracy with Chain-of-Thought prompting, well below the estimated human expert performance of 92%.

We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks.
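To make the idea of a solution reference "in Python program format" concrete, here is a minimal sketch. The bond-pricing problem, the function names `solution`, `is_correct`, and `accuracy`, and the 1% relative tolerance are all illustrative assumptions; none of them is taken from the benchmark or its evaluation protocol.

```python
import math


def solution() -> float:
    """Hypothetical problem (not from FinanceMath): price a 3-year bond
    with a 1,000 face value and a 5% annual coupon, discounted at a 4%
    market rate."""
    face_value = 1000.0
    coupon_rate = 0.05
    market_rate = 0.04
    years = 3

    coupon = face_value * coupon_rate
    # Present value of each annual coupon payment.
    pv_coupons = sum(
        coupon / (1 + market_rate) ** t for t in range(1, years + 1)
    )
    # Present value of the face value repaid at maturity.
    pv_face = face_value / (1 + market_rate) ** years
    return pv_coupons + pv_face


def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Compare a model's numeric answer with the reference answer.
    The 1% relative tolerance is an assumption made for this sketch."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


def accuracy(predictions: list[float], references: list[float]) -> float:
    """Fraction of predictions that match their reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    reference = solution()
    print(f"reference answer: {reference:.2f}")
    print(f"accuracy: {accuracy([1027.75, 10.0], [reference, 20.0]):.2f}")
```

Under Program-of-Thought prompting, the model would be asked to emit a program of this shape, and the executed result would be scored against the reference. Under Chain-of-Thought prompting, only the final number extracted from the model's reasoning would be compared.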

Code Implementations: 1 repo
