CL · Jun 11, 2021

A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation

arXiv:2106.06292v2 · 0.54 citations

Originality: Synthesis-oriented

AI Analysis

The paper speaks to NLP researchers and practitioners by highlighting risks in current evaluation practices, though its contribution is incremental because it builds on existing discussions.

The paper tackles the problem of over-reliance on single accuracy metrics in NLP leaderboards, particularly for machine translation, and suggests ways to develop more practical leaderboards that better reflect real-world utility.

Recent advances in AI and ML applications have benefited from rapid progress in NLP research. Leaderboards have emerged as a popular mechanism to track and accelerate progress in NLP through competitive model development. While this has increased interest and participation, the over-reliance on single, accuracy-based metrics has shifted focus away from other important metrics that might be equally pertinent in real-world contexts. In this paper, we offer a preliminary discussion of the risks associated with focusing exclusively on accuracy metrics and draw on recent discussions to highlight prescriptive suggestions on how to develop more practical and effective leaderboards that can better reflect the real-world utility of models.
