CL · May 22, 2023

DUMB: A Benchmark for Smart Evaluation of Dutch Models

arXiv:2305.13026v2 · 135 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for standardized evaluation of Dutch language models. The contribution is incremental: it builds on existing benchmarking practices, but introduces new tasks and metrics for Dutch specifically.

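The authors introduce DUMB, a benchmark for evaluating Dutch language models across nine tasks, four of which were previously unavailable in Dutch. Rather than averaging scores across tasks, they propose Relative Error Reduction (RER), which compares each model against a strong baseline so that results remain comparable across different sets of models. Their evaluation of 14 pre-trained models shows that current Dutch monolingual models underperform; the highest performance is achieved by DeBERTaV3 (large), XLM-R (large), and mDeBERTaV3 (base). They recommend training larger Dutch models with other architectures and pre-training objectives.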

We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total set of nine tasks includes four tasks that were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of language models to a strong baseline which can be referred to in the future even when assessing different sets of language models. Through a comparison of 14 pre-trained language models (mono- and multi-lingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models under-perform and suggest training larger Dutch models with other architectures and pre-training objectives. At present, the highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). In addition to highlighting best strategies for training larger Dutch models, DUMB will foster further research on Dutch. A public leaderboard is available at https://dumbench.nl.
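The abstract names Relative Error Reduction (RER) but does not spell out a formula. The sketch below assumes the common definition of relative error reduction, in which a score on a 0-100 scale is converted to an error (100 minus the score) and the model's error is compared with the baseline's error. The function names, the task names, and all numbers are hypothetical and serve only as illustration; the paper's exact formulation and baseline may differ.

```python
from statistics import mean


def relative_error_reduction(score, baseline_score, max_score=100.0):
    """Return the percentage of the baseline's error that the model removes.

    Assumed definition (not taken from the abstract):
        RER = (baseline_error - model_error) / baseline_error * 100
    where error = max_score - score.
    """
    baseline_error = max_score - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    model_error = max_score - score
    return (baseline_error - model_error) / baseline_error * 100.0


def mean_rer(scores, baseline_scores):
    """Average RER over every task present in baseline_scores."""
    return mean(
        relative_error_reduction(scores[task], baseline_scores[task])
        for task in baseline_scores
    )


# Hypothetical scores for two made-up tasks (not results from the paper).
baseline = {"task_a": 80.0, "task_b": 80.0}
model = {"task_a": 90.0, "task_b": 80.0}

print(relative_error_reduction(90.0, 80.0))  # halves the baseline error
print(mean_rer(model, baseline))
```

Under this definition a positive value means the model makes fewer errors than the baseline, zero means it matches the baseline, and a negative value means it does worse. Because every model is scored against the same fixed baseline, the value for one model does not change when other models are added to or removed from the comparison.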
