CL, Jan 13

QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models

Zhaolu Kang, Junhao Gong, Wenqing Hu, Shuo Yin, Kehan Jiang, Zhicheng Fang, Yingjie He, Chunlei Meng, Rong Fu, Dongyang Chen, Leqi Zheng, Eric Hanchen Jiang

arXiv:2601.08689v12.12 citations, h-index: 3, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation of LLMs in quantitative finance, facilitating research and practical adoption in trading workflows, though it is incremental as it builds on existing benchmarking approaches.

The authors introduce QuantEval, a benchmark that assesses large language models (LLMs) on three kinds of financial quantitative tasks: knowledge-based QA, mathematical reasoning, and strategy coding. Their evaluation reveals substantial performance gaps relative to human experts.

Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
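To make the idea of executing a generated strategy and scoring it with financial performance metrics more concrete, here is a minimal sketch in Python. It is not QuantEval's framework: the function names, the momentum rule, the proportional cost model, and the choice of Sharpe ratio and maximum drawdown as metrics are all assumptions made for illustration, and the risk-free rate is assumed to be zero.

```python
import math


def momentum_positions(prices, lookback=3):
    """Hypothetical strategy: long (+1) if price rose over `lookback` bars, else short (-1).

    The position at bar t only uses prices up to and including bar t,
    so the backtest below has no look-ahead.
    """
    positions = []
    for t in range(len(prices)):
        if t < lookback:
            positions.append(0.0)  # not enough history yet: stay flat
        else:
            positions.append(1.0 if prices[t] > prices[t - lookback] else -1.0)
    return positions


def backtest(prices, positions, cost_per_turnover=0.0005):
    """Return net per-period returns; positions[t] is held from bar t to bar t+1.

    Costs are an assumed proportional charge on the change in position.
    """
    returns = []
    prev_pos = 0.0
    for t in range(len(prices) - 1):
        asset_ret = prices[t + 1] / prices[t] - 1.0
        cost = cost_per_turnover * abs(positions[t] - prev_pos)
        returns.append(positions[t] * asset_ret - cost)
        prev_pos = positions[t]
    return returns


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate (sample standard deviation)."""
    if len(returns) < 2:
        return 0.0
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    std = math.sqrt(var)
    return 0.0 if std == 0 else mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


if __name__ == "__main__":
    prices = [100, 101, 103, 102, 105, 107, 104, 108]  # toy data, not from the paper
    net = backtest(prices, momentum_positions(prices))
    print("Sharpe:", sharpe_ratio(net), "Max drawdown:", max_drawdown(net))
```

In a benchmark setting, the strategy function would be the model-generated code, while the price data, cost model, and metric definitions stay fixed so that scores are comparable across models.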

