KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding
The work addresses the problem of evaluating LLMs for high-stakes financial applications in Korean, though the contribution is incremental: it adapts existing benchmark concepts to a new domain.
The authors address the lack of Korean financial benchmarks for LLMs by introducing KFinEval-Pilot, a suite of over 1,000 questions, and report notable performance differences and safety trade-offs across models.
We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences, including trade-offs between task accuracy and output safety across model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.
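As a rough illustration of how results over the three areas might be summarized per model, the following minimal sketch computes a pass rate for each area. It is not the authors' scoring code: the category labels, the `Result` record, the `per_category_rate` helper, and the toy data are all hypothetical, and the abstract does not specify how accuracy or safety is actually scored.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels mirroring the three areas named in the abstract.
CATEGORIES = ("financial_knowledge", "legal_reasoning", "financial_toxicity")


@dataclass
class Result:
    category: str  # one of CATEGORIES
    passed: bool   # assumed meaning: correct answer, or safe response for toxicity items


def per_category_rate(results):
    """Return the fraction of passed items for each category that appears."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.category not in CATEGORIES:
            raise ValueError(f"unknown category: {r.category}")
        totals[r.category] += 1
        passes[r.category] += int(r.passed)
    return {c: passes[c] / totals[c] for c in totals}


# Toy data for illustration only; not taken from the benchmark.
demo = [
    Result("financial_knowledge", True),
    Result("financial_knowledge", False),
    Result("legal_reasoning", True),
    Result("financial_toxicity", True),
    Result("financial_toxicity", True),
]
rates = per_category_rate(demo)
```

Reporting the areas separately, rather than as one pooled score, keeps an accuracy gain in one area from masking a safety regression in another, which is the kind of trade-off the abstract describes.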