CL Dec 15, 2024

Reliable, Reproducible, and Really Fast Leaderboards with Evalica

arXiv:2412.11314v112.220 citations h-index: 1 Has Code COLING

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of inconsistent and non-reproducible evaluations faced by NLP researchers and practitioners, though its contribution is incremental because it builds on existing evaluation concepts.

The paper tackles the need for modern evaluation protocols in NLP by introducing Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards, demonstrating its usability through web, command-line, and Python interfaces.

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
