Evaluating for Diversity in Question Generation over Text
The paper addresses the need for better evaluation of question generation, a task with widespread applications, but its contribution is incremental because it builds on existing metrics and models.
The authors tackle the problem of evaluating diversity in question generation over text. They propose a scheme that extends conventional metrics to reflect diversity, together with a variational encoder-decoder model, and show through automatic and human evaluation that the model improves diversity without loss of quality.
Generating diverse and relevant questions over text is a task with widespread applications. We argue that commonly used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions, and propose a scheme for extending conventional metrics to reflect diversity. We furthermore propose a variational encoder-decoder model for this task. We show through automatic and human evaluation that our variational model improves diversity without loss of quality, and demonstrate how our evaluation scheme reflects this improvement.
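
The abstract does not spell out the proposed evaluation scheme, so the following is only a minimal sketch of the intuition behind it, not the authors' method. It assumes a simple unigram F1 as a stand-in for BLEU or METEOR, and the names `unigram_f1`, `quality`, and `coverage` are hypothetical. The idea illustrated is that scoring each generated question against its best-matching reference cannot tell a repetitive system from a diverse one, while also scoring each reference against its best-matching generated question can.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram F1 overlap; a simplified stand-in for BLEU or METEOR."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def quality(candidates: list[str], references: list[str]) -> float:
    """Mean over generated questions of the best match to any reference."""
    return sum(
        max(unigram_f1(c, r) for r in references) for c in candidates
    ) / len(candidates)


def coverage(candidates: list[str], references: list[str]) -> float:
    """Mean over references of the best match to any generated question."""
    return sum(
        max(unigram_f1(c, r) for c in candidates) for r in references
    ) / len(references)


# Made-up toy data: two reference questions for the same passage.
references = ["who wrote the novel", "when was the novel published"]
repetitive = ["who wrote the novel", "who wrote the novel"]
diverse = ["who wrote the novel", "when was the novel published"]

# Both systems get a perfect quality score of 1.0 ...
print(quality(repetitive, references), quality(diverse, references))
# ... but only the diverse system covers both references.
print(coverage(repetitive, references), coverage(diverse, references))
```

In this toy case the quality score alone rewards the repetitive system as much as the diverse one; the reference-side score is what separates them. A real evaluation would substitute an established metric implementation for `unigram_f1`.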