QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models
This work addresses the need for better evaluation of LLMs in quantitative finance, supporting both research and practical adoption in trading workflows. The contribution is incremental, since it builds on existing benchmarking approaches.
The authors introduce QuantEval, a benchmark that assesses large language models (LLMs) on financial quantitative tasks across knowledge-based QA, mathematical reasoning, and strategy coding. Their evaluation reveals substantial performance gaps between current models and human experts.
Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate state-of-the-art open-source and proprietary LLMs and observe substantial gaps relative to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We also release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
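To make the idea of executing a strategy and scoring it with financial performance metrics concrete, the following is a minimal sketch of a deterministic single-asset backtest. It is not QuantEval's actual framework: the function name `backtest`, the proportional cost model (`cost_rate` charged on each change in position), the choice of metrics (total return, annualized Sharpe ratio, maximum drawdown), and the 252-period annualization are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def backtest(prices, positions, cost_rate=0.0005, periods_per_year=252):
    """Illustrative single-asset backtest (hypothetical, not QuantEval's API).

    prices:    closing prices, length n
    positions: position held over each interval, length n - 1;
               positions[t] is set at prices[t] and earns the return
               from prices[t] to prices[t + 1]
    cost_rate: assumed proportional cost per unit of position change
    """
    if len(positions) != len(prices) - 1:
        raise ValueError("positions must have one fewer element than prices")

    # Per-period net returns: position times asset return, minus trading cost.
    net_returns = []
    prev_pos = 0.0
    for t, pos in enumerate(positions):
        asset_ret = prices[t + 1] / prices[t] - 1.0
        cost = cost_rate * abs(pos - prev_pos)
        net_returns.append(pos * asset_ret - cost)
        prev_pos = pos

    # Compound the net returns into an equity curve starting at 1.0.
    equity = [1.0]
    for r in net_returns:
        equity.append(equity[-1] * (1.0 + r))

    # Maximum drawdown: largest fractional drop from a running peak.
    peak = equity[0]
    max_drawdown = 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, 1.0 - value / peak)

    # Annualized Sharpe ratio with a zero risk-free rate (an assumption).
    vol = pstdev(net_returns)
    sharpe = mean(net_returns) / vol * math.sqrt(periods_per_year) if vol > 0 else 0.0

    return {
        "total_return": equity[-1] - 1.0,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
        "equity": equity,
    }
```

A model-generated strategy would supply the `positions` sequence. Because the function has no randomness and takes the cost model as an explicit parameter, repeated runs on the same inputs give identical metrics, which is the property a released deterministic configuration is meant to guarantee.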