Reproducible Subjective Evaluation
This work addresses the need for more accessible and reproducible subjective evaluation across multiple domains, though the contribution is incremental because it builds on existing evaluation methods.
The paper tackles the problem that human perceptual studies in machine learning, linguistics, and psychology are time-consuming and poorly reproducible. It proposes ReSEval, an open-source framework for quickly deploying crowdsourced subjective evaluations from Python, with the aim of making subjective evaluation as easy to run as objective evaluation.
Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performed, they are often not reported with sufficient detail to ensure reproducibility. We propose Reproducible Subjective Evaluation (ReSEval), an open-source framework for quickly deploying crowdsourced subjective evaluations directly from Python. ReSEval lets researchers launch A/B, ABX, Mean Opinion Score (MOS) and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests on audio, image, text, or video data from a command-line interface or using one line of Python, making it as easy to run as objective evaluation. With ReSEval, researchers can reproduce each other's subjective evaluations by sharing a configuration file and the audio, image, text, or video files.
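To make the reproducibility idea concrete, the following is a minimal sketch of how a shareable test configuration and a Mean Opinion Score summary might look. It does not use ReSEval's actual API or configuration schema: the `EvaluationConfig` class, its fields, the helper functions, and the ratings are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass
from statistics import fmean

# Test types and data types named in the abstract.
VALID_TESTS = {"ab", "abx", "mos", "mushra"}
VALID_DATATYPES = {"audio", "image", "text", "video"}


@dataclass(frozen=True)
class EvaluationConfig:
    """Hypothetical test description; NOT ReSEval's real schema."""
    name: str
    test: str
    datatype: str
    participants: int
    random_seed: int

    def __post_init__(self):
        # Reject settings outside the options listed in the abstract.
        if self.test not in VALID_TESTS:
            raise ValueError(f"unsupported test type: {self.test}")
        if self.datatype not in VALID_DATATYPES:
            raise ValueError(f"unsupported data type: {self.datatype}")


def dump_config(config):
    """Serialize the configuration to text that can be shared."""
    return json.dumps(asdict(config), sort_keys=True)


def load_config(text):
    """Rebuild the configuration from shared text."""
    return EvaluationConfig(**json.loads(text))


def mean_opinion_scores(ratings):
    """Average the listener ratings for each condition."""
    return {condition: fmean(scores) for condition, scores in ratings.items()}


config = EvaluationConfig(
    name="example-mos",
    test="mos",
    datatype="audio",
    participants=10,
    random_seed=0,
)

# A second researcher who receives this text recovers the same settings.
shared = dump_config(config)
restored = load_config(shared)

# Made-up 1-5 ratings, for illustration only.
ratings = {"baseline": [3, 4, 3, 2], "proposed": [4, 5, 4, 5]}
scores = mean_opinion_scores(ratings)
```

The point of the sketch is that every setting affecting the outcome lives in one serializable object, so sharing that object together with the evaluated files is enough for someone else to rerun the same test.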