CE · AI · CL · Mar 29, 2025

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

arXiv:2503.22968v4 · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The work addresses reproducibility issues for researchers developing Korean LLMs, though the advance is incremental because it builds on existing evaluation methods.

The paper tackles inconsistent evaluation protocols for Korean large language models, which cause performance gaps of up to 10 percentage points, by introducing HRET, an open-source framework that unifies benchmarks and includes Korean-focused analyses such as morphology-aware TTR and keyword-omission detection.
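The two Korean-focused analyses named above can be illustrated with a minimal Python sketch. It is not HRET's implementation: the function names, the `toy_morphemes` splitter, and its two-particle list are hypothetical stand-ins chosen for illustration. A real setup would plug in a proper Korean morphological analyzer, and the paper's keyword matching may be more elaborate than the plain substring test assumed here.

```python
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny particle list; a real morphological
# analyzer would replace this toy logic entirely.
_TOY_PARTICLES = ("가", "를")


def whitespace_tokenize(text: str) -> List[str]:
    """Baseline tokenizer: splits on whitespace only (not morphology-aware)."""
    return text.split()


def toy_morphemes(text: str) -> List[str]:
    """Toy splitter: detach a trailing particle from each whitespace token."""
    morphemes: List[str] = []
    for token in text.split():
        if len(token) > 1 and token.endswith(_TOY_PARTICLES):
            morphemes.extend([token[:-1], token[-1]])
        else:
            morphemes.append(token)
    return morphemes


def type_token_ratio(
    text: str, tokenize: Callable[[str], List[str]] = whitespace_tokenize
) -> float:
    """TTR = number of distinct tokens / total number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def missing_keywords(output: str, keywords: Iterable[str]) -> List[str]:
    """Return the expected keywords that never appear in the model output
    (assumption: plain substring matching)."""
    return [kw for kw in keywords if kw not in output]


if __name__ == "__main__":
    sentence = "고양이가 고양이를 본다"
    print(type_token_ratio(sentence))                 # whitespace-level TTR
    print(type_token_ratio(sentence, toy_morphemes))  # morpheme-level TTR
    print(missing_keywords("서울은 한국의 수도이다", ["서울", "수도", "인구"]))
```

Because Korean attaches particles to nouns, whitespace tokens such as 고양이가 and 고양이를 count as different types even though they share a stem. Splitting into morphemes before computing TTR avoids overstating lexical diversity in this way.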

Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause performance gaps of up to 10 percentage points across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates two Korean-focused output analyses, morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity and systematic keyword-omission detection for identifying missing concepts, to provide diagnostic insights into language-specific behaviors. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.

