AI · Mar 7

FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

Jan Ravnik, Matjaž Ličen, Felix Bührmann, Bithiah Yuan, Felix Stinson, Tanvi Singh

arXiv:2603.07316v1

Predicted impact: top 80% in AI · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the lack of real-world financial spreadsheet datasets for benchmarking LLMs, which is a critical problem for financial professionals seeking to automate due diligence tasks. It is an incremental step towards robust LLM application in finance.

This paper introduces FinSheet-Bench, a synthetic financial spreadsheet benchmark, to evaluate LLMs on extracting and reasoning over structured tabular data. The evaluation of ten LLM configurations shows that no model achieves error rates low enough for unsupervised use: the best model, Gemini 3.1 Pro, reaches 82.4% accuracy, and average accuracy across all models drops to 48.6% on the largest spreadsheet.

While Large Language Models (LLMs) can accelerate text-heavy tasks in alternative investment due diligence, a gap remains in their ability to accurately extract and reason over structured tabular data from complex financial spreadsheets. Progress is held back by the lack of real industry fund portfolio datasets for benchmarking, as private equity data rooms are confidential. To address this, we introduce FinSheet-Bench, a benchmark of synthetic financial portfolio data modeled on real private equity fund structures, designed to evaluate LLM performance on text-serialized spreadsheet question answering and numeric reasoning tasks. Our evaluation of ten model configurations from OpenAI, Google, and Anthropic on financial spreadsheets, including complex layouts, fund dividers, and multi-line column names, reveals that no standalone model achieves error rates low enough for unsupervised use in professional finance applications. The best-performing model, Gemini 3.1 Pro, achieves 82.4% accuracy across twenty-four evaluation files of varying complexity and structural layout (approximately 1 error per 6 questions), followed by GPT-5.2 with reasoning at 80.4%, Claude Opus 4.6 with thinking at 80.2%, and Gemini 3 Pro at 80.2%. Performance degrades substantially on larger, more complex spreadsheets: the largest spreadsheet (152 companies, 8 funds) yields an average accuracy of just 48.6% across all models, compared to 86.2% on the easiest evaluation file. These difficulty patterns are consistent across all ten models, indicating that they reflect LLM limitations rather than idiosyncratic model weaknesses. Reliable financial spreadsheet extraction will likely require architectural approaches that separate document understanding from deterministic computation.
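The closing recommendation, separating document understanding from deterministic computation, can be illustrated with a minimal Python sketch. This is not the authors' method. The sheet contents, the column names, the `spec` format, and the helpers `parse_sheet` and `run_query` are all hypothetical and chosen only for illustration. The idea is that a model would only translate a question into a small structured query, while the filtering and arithmetic are done by ordinary code.

```python
import csv
import io

# Hypothetical text-serialized portfolio sheet; funds, companies, and figures are invented.
SHEET = """fund,company,invested_capital
Fund I,Alpha Co,10.5
Fund I,Beta Co,4.5
Fund II,Gamma Co,7.0
"""

def parse_sheet(text):
    """Parse the serialized sheet into rows with a numeric invested_capital field."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["invested_capital"] = float(row["invested_capital"])
        rows.append(row)
    return rows

def run_query(rows, spec):
    """Execute a structured query spec with plain code rather than with a model."""
    values = [r[spec["column"]] for r in rows if r["fund"] == spec["fund"]]
    if spec["op"] == "sum":
        return sum(values)
    if spec["op"] == "count":
        return len(values)
    if spec["op"] == "max":
        return max(values)
    raise ValueError(f"unsupported op: {spec['op']}")

# Assumption: in a real pipeline a model would produce this spec from a question
# such as "What is the total invested capital in Fund I?". Here it is hard-coded.
spec = {"fund": "Fund I", "column": "invested_capital", "op": "sum"}
rows = parse_sheet(SHEET)
result = run_query(rows, spec)
print(result)
```

Under this split, a model error can only affect the query spec, which is small and easy to validate, while the numeric result is reproducible for a given spec and sheet.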
