CL Nov 21, 2019

How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions

Zewei Chu, Mingda Chen, Jing Chen, Miaosen Wang, Kevin Gimpel, Manaal Faruqui, Xiance Si

arXiv:1911.09247v12.422 citations Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of improving question quality for natural language processing applications, though it is incremental as it builds on existing datasets and methods.

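The authors address the problem of rewriting ill-formed natural language questions into well-formed ones by building a large-scale multi-domain dataset from Stack Exchange edit histories. In human annotations on a subset of the data, well-formed questions score an average of 45 points higher in quality than their ill-formed counterparts across three aspects. Sequence-to-sequence models trained on the dataset achieve a 13.2% BLEU-4 improvement over baselines built from other data resources.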

We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting (MQR) dataset is constructed from human-contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources. We release the MQR dataset to encourage research on the problem of question rewriting.
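
Since the reported gain is measured in BLEU-4, it may help to see what that metric computes. The following is a minimal sketch of sentence-level BLEU-4 with a single reference, whitespace tokenization, and no smoothing. These are simplifying assumptions and may differ from the evaluation setup used for MQR; the function names and the example questions are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4, single reference, no smoothing (assumed setup)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, one empty n-gram order zeroes the score.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four modified n-gram precisions.
    return bp * math.exp(sum(log_precisions) / 4)


# Hypothetical question pair, for illustration only.
rewritten = "how do i sort a list"
reference = "how do i sort a list in python"
score = bleu4(rewritten, reference)
```

Here every n-gram of the candidate appears in the reference, so the score falls below 1 only because of the brevity penalty. Corpus-level BLEU, which is the more common reporting convention, aggregates the n-gram counts over all sentence pairs before taking the precisions rather than averaging per-sentence scores.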
