LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing
This toolkit addresses the need for efficient, reproducible human evaluation in dialogue system research, although the contribution is incremental because it builds on existing crowdsourcing methods.
The paper tackles the challenge of evaluating dialogue systems by introducing LEGOEval, an open-source toolkit that simplifies human evaluation via Amazon Mechanical Turk, enabling researchers to reproduce results quickly and consistently with a flexible Python API.
We present LEGOEval, an open-source toolkit that enables researchers to evaluate dialogue systems in a few lines of code using the online crowdsourcing platform Amazon Mechanical Turk. Compared to existing toolkits, LEGOEval offers flexible task design by providing a Python API that maps to commonly used React.js interface components. Researchers can customize their evaluation procedures with our built-in pages as if playing with LEGO blocks. Thus, LEGOEval provides a fast, consistent method for reproducing human evaluation results. Beyond flexible task design, LEGOEval also offers a simple API for reviewing collected data.
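
To make the "LEGO blocks" idea concrete, the following is a minimal sketch of how an evaluation task might be composed from reusable pages in Python and serialized for a front end to render. It is not LEGOEval's actual API: the class names `Page` and `Task`, the component names ("Instruction", "Chat", "Survey"), and all property keys are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical building block: one page shown to a crowd worker."""
    component: str                      # assumed name of the front-end component
    props: dict = field(default_factory=dict)


@dataclass
class Task:
    """Hypothetical ordered sequence of pages that makes up one evaluation."""
    pages: list = field(default_factory=list)

    def add(self, page: Page) -> "Task":
        self.pages.append(page)
        return self                     # returning self allows chained calls

    def to_json(self) -> str:
        # Serialize the page sequence so a front end could render it in order.
        return json.dumps(
            [{"component": p.component, "props": p.props} for p in self.pages]
        )


# Compose a task from three illustrative pages, block by block.
task = (
    Task()
    .add(Page("Instruction", {"text": "Chat with the bot, then rate it."}))
    .add(Page("Chat", {"max_turns": 5}))
    .add(Page("Survey", {"questions": ["Was the bot coherent?"], "scale": [1, 5]}))
)
spec = task.to_json()
```

Under these assumptions, reordering, removing, or swapping a page changes the evaluation procedure without touching any front-end code, which is the kind of flexibility the abstract describes.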