CL, AI
May 21, 2021

Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

arXiv:2106.06052v1
67 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more comprehensive and reproducible benchmarking in NLP, particularly for practitioners. It is an incremental advance because it builds on existing platforms such as Dynabench.

The authors address unreliable and limited benchmarking in NLP by introducing Dynaboard, an evaluation-as-a-service platform that evaluates submitted models directly in the cloud. This enables a holistic comparison that includes metrics such as memory use and robustness, along with a customizable ranking score called Dynascore.

We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which -- despite their importance to practitioners -- have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences, placing more/less weight on a particular axis of evaluation or dataset. As state-of-the-art NLP models push the limits of traditional benchmarks, Dynaboard offers a standardized solution for a more diverse and comprehensive evaluation of model quality.
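To illustrate how user-chosen weights can change a ranking, here is a minimal sketch. It is not the paper's Dynascore: the abstract gives no formula, so a plain weighted average stands in for the utility-based aggregation. The model names, metric values, and the assumption that every metric is already normalized to [0, 1] with higher meaning better are all hypothetical.

```python
from typing import Dict, List


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metrics assumed normalized to [0, 1], higher is better.

    A simplified stand-in, NOT the Dynascore formula from the paper.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * metrics[name] for name in weights) / total


def rank_models(models: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> List[str]:
    """Return model names sorted from best to worst under the given weights."""
    return sorted(models,
                  key=lambda name: weighted_score(models[name], weights),
                  reverse=True)


# Hypothetical, already-normalized metric values for two made-up models.
models = {
    "model_a": {"accuracy": 0.95, "throughput": 0.20, "memory": 0.30, "robustness": 0.80},
    "model_b": {"accuracy": 0.80, "throughput": 0.90, "memory": 0.80, "robustness": 0.75},
}

# Two user preference profiles: accuracy only, and all axes weighted equally.
accuracy_only = {"accuracy": 1.0, "throughput": 0.0, "memory": 0.0, "robustness": 0.0}
equal = {"accuracy": 1.0, "throughput": 1.0, "memory": 1.0, "robustness": 1.0}

ranking_accuracy_only = rank_models(models, accuracy_only)
ranking_equal = rank_models(models, equal)
```

Under the accuracy-only weights the more accurate model ranks first, while equal weights favor the faster, lighter model. This is the kind of preference-dependent reordering that a customizable score is meant to expose.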

