CL · HC · Apr 20

Pearmut: Human Evaluation of Translation Made Trivial

ETH Zurich
arXiv:2601.02933 · 90.63 citations · h-index: 15
Predicted impact top 14% in CL · last 90 days · Originality: Synthesis-oriented
AI Analysis

For NLP researchers and practitioners, Pearmut removes barriers to human evaluation, making it practical as a routine part of model development rather than an occasional effort.

Pearmut is a lightweight platform that makes human evaluation of machine translation as easy as automatic evaluation, supporting standard protocols like DA, ESA, and MQM, and enabling routine use in model development.

Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and replaced with automatic metrics, because existing tools make it notoriously complex and slow to set up and impose substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is extensible to new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and dynamic assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
