CL · AI · Dec 15, 2025

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

arXiv:2512.13330v1 · h-index: 13 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for standardized evaluation tools for Finnish language models. The contribution is incremental: it builds upon and improves previous benchmarks.

The authors introduce FIN-bench-v2, a unified benchmark suite for evaluating Finnish large language models. It consolidates and expands existing Finnish benchmarks, covers both multiple-choice and generative tasks, retains only tasks that pass a set of robustness criteria, and makes all resources publicly available.

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
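The abstract names four task-selection criteria but does not define them here, so the following is only a rough sketch of what such checks could look like on per-checkpoint learning curves. Every function name, formula, and threshold below is an assumption made for illustration (for example, monotonicity as a Kendall-style pairwise score, and noise as the spread over the last few checkpoints); none of it is taken from the paper or its repositories.

```python
from itertools import combinations
from statistics import mean, pstdev


def monotonicity(scores):
    """Kendall-style score in [-1, 1]: +1 if every later checkpoint beats
    every earlier one, -1 if the curve strictly decreases (assumed metric)."""
    pairs = list(combinations(scores, 2))  # (earlier, later) in order
    return sum((b > a) - (b < a) for a, b in pairs) / len(pairs)


def signal_to_noise(scores, chance, k=3):
    """Gap between the mean of the last k checkpoints and chance level,
    divided by the spread of those k checkpoints (assumed definition)."""
    tail = scores[-k:]
    noise = pstdev(tail)
    signal = abs(mean(tail) - chance)
    return float("inf") if noise == 0 else signal / noise


def is_non_random(scores, chance, margin=0.02):
    """True if the final checkpoint beats chance by a hypothetical margin."""
    return scores[-1] - chance > margin


def ordering_consistency(curves):
    """Fraction of checkpoints whose model ranking matches the ranking at
    the final checkpoint. `curves` maps model name -> list of scores."""
    models = sorted(curves)
    steps = len(next(iter(curves.values())))

    def ranking(t):
        return tuple(sorted(models, key=lambda m: curves[m][t], reverse=True))

    final = ranking(steps - 1)
    return sum(ranking(t) == final for t in range(steps)) / steps


def keep_task(curves, chance, min_mono=0.5, min_snr=2.0, min_order=0.5):
    """Retain a task only if every model's curve and the cross-model
    ordering pass all four checks (thresholds are illustrative)."""
    per_model_ok = all(
        monotonicity(s) >= min_mono
        and signal_to_noise(s, chance) >= min_snr
        and is_non_random(s, chance)
        for s in curves.values()
    )
    return per_model_ok and ordering_consistency(curves) >= min_order
```

A filter of this kind would be applied per task, with `chance` set from the number of answer options for multiple-choice tasks; the actual criteria and thresholds should be taken from the paper and the released evaluation configurations.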

