Shades of BLEU, Flavours of Success: The Case of MultiWOZ
This work addresses benchmarking problems faced by researchers using the MultiWOZ dataset; its contribution is incremental in that it focuses on evaluation rather than on new methods.
The authors identified inconsistencies in data preprocessing and metric reporting for the MultiWOZ dataset, re-evaluated 13 models to show their scores are not directly comparable, and released standardized scripts for future benchmarking.
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking the context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and in the reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out several problems of the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future work.
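To illustrate why preprocessing differences can make reported BLEU scores incomparable, the following is a minimal sketch that scores the same system response against the same reference under two tokenization choices. It is not the released evaluation script: the BLEU implementation is a plain unsmoothed corpus-level variant with a single reference, and the function names (`corpus_bleu`, `whitespace_tokens`, `normalized_tokens`) and the example sentences are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def whitespace_tokens(text):
    """Hypothetical setup A: split on whitespace only, keep casing."""
    return text.split()


def normalized_tokens(text):
    """Hypothetical setup B: lowercase and split punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Illustrative strings, not taken from MultiWOZ.
hypothesis = "The hotel is in the north. Would you like me to book it?"
reference = "the hotel is in the north . would you like me to book it ?"

raw_score = corpus_bleu([whitespace_tokens(hypothesis)],
                        [whitespace_tokens(reference)])
normalized_score = corpus_bleu([normalized_tokens(hypothesis)],
                               [normalized_tokens(reference)])

print(f"whitespace tokenization: {raw_score:.1f}")
print(f"normalized tokenization: {normalized_score:.1f}")
```

Under the normalized setup the two strings become identical and the score is 100, whereas whitespace-only tokenization penalizes the casing and attached punctuation and yields a much lower score for the very same response. This is the kind of gap that a shared, standardized evaluation script is meant to remove.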