CL · Dec 15, 2024

Reliable, Reproducible, and Really Fast Leaderboards with Evalica

arXiv:2412.11314v1 · 20 citations · h-index: 1 · Has Code · COLING
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of inconsistent and non-reproducible evaluations faced by NLP researchers and practitioners, though its contribution is incremental because it builds on existing evaluation concepts.

The paper tackles the need for modern evaluation protocols in NLP by introducing Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards, demonstrating its usability through web, command-line, and Python interfaces.

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.

Code Implementations: 1 repo