FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models
FinEval provides a comprehensive benchmark for assessing LLMs in the Chinese financial domain, addressing a gap in the evaluation of security knowledge and practical abilities. Its contribution is incremental, since it focuses on evaluation rather than on new methods.
The paper addresses the lack of evaluation of large language models in the financial domain by introducing FinEval, a benchmark of 8,351 questions across four areas, and finds that Claude 3.5-Sonnet achieves the highest weighted average score, 72.9, in the zero-shot setting.
Large language models have demonstrated outstanding performance in various natural language processing tasks, but their security capabilities in the financial domain have not been explored, and their performance on complex tasks such as financial agent tasks remains unknown. This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Financial Academic Knowledge comprises 4,661 multiple-choice questions spanning 34 subjects such as finance and economics. Financial Industry Knowledge contains 1,434 questions covering practical scenarios such as investment research. Financial Security Knowledge assesses models through 1,640 questions on topics such as application security and cryptography. Financial Agent evaluates tool usage and complex reasoning with 616 questions. FinEval supports multiple evaluation settings, including zero-shot and five-shot with chain-of-thought prompting, and assesses model performance using both objective and subjective criteria. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under the zero-shot setting. Our work provides a comprehensive benchmark closely aligned with the Chinese financial domain.
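To make the "weighted average score" concrete, the following is a minimal sketch of how a single score could be aggregated from per-category results. It assumes each category is weighted by its number of questions, which the abstract does not state. The function name `weighted_average` and the per-category scores are hypothetical and used only for illustration; only the question counts come from the abstract.

```python
# Question counts per category, as given in the abstract.
QUESTION_COUNTS = {
    "Financial Academic Knowledge": 4661,
    "Financial Industry Knowledge": 1434,
    "Financial Security Knowledge": 1640,
    "Financial Agent": 616,
}


def weighted_average(scores, weights=QUESTION_COUNTS):
    """Average per-category scores, weighting each by its question count.

    Weighting by question count is an assumption made for this sketch,
    not a scheme confirmed by the paper.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


# Hypothetical per-category scores (NOT results reported in the paper).
example_scores = {
    "Financial Academic Knowledge": 90.0,
    "Financial Industry Knowledge": 60.0,
    "Financial Security Knowledge": 70.0,
    "Financial Agent": 40.0,
}

total_questions = sum(QUESTION_COUNTS.values())  # the four counts add up to 8,351
overall = weighted_average(example_scores)
print(f"{total_questions} questions, weighted average {overall:.1f}")
```

Under this assumed scheme, the largest category (Financial Academic Knowledge, more than half of all questions) dominates the overall score, so a model's ranking could differ noticeably from one based on an unweighted mean of the four categories.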