CL | Jun 18, 2025

FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning

Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang

arXiv:2506.21591v36.72 citations | h-index: 4 | Has Code | Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing

Originality: Incremental advance

AI Analysis

This work addresses the lack of adequate evaluation benchmarks for financial reasoning in LLMs. It offers a domain-specific framework whose approach to decoupling capabilities is an incremental advance.

The authors address the challenge of evaluating large language models (LLMs) on financial reasoning by introducing FinEval-KR, a framework that decouples knowledge and reasoning abilities and quantifies each independently. Their results indicate that reasoning ability and higher-order cognitive skills are the core factors influencing accuracy, and that even top models still struggle with knowledge application.

Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks that require both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short: they do not decouple indicators of these capabilities from single-task performance, and they lack root-cause analysis of task failures. To address this, we introduce FinEval-KR, a novel evaluation framework that decouples LLMs' knowledge and reasoning abilities and quantifies each independently through distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom's taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advances in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also find that even top models still face a bottleneck in knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general-purpose large models across multiple metrics.
