Identifying Well-formed Natural Language Questions
This work addresses the challenge of handling poorly formed user queries in search systems, but its contribution is incremental, since it builds on existing natural language processing tasks and datasets.
The paper tackles the problem of distinguishing well-formed natural language questions from poorly formed ones in order to improve query understanding. It introduces a dataset of 25,100 questions, reports a classification accuracy of 70.7% on the test set, and shows that the resulting classifier improves neural sequence-to-sequence models that generate questions for reading comprehension.
Understanding search queries is a hard problem as it involves dealing with "word salad" text ubiquitously issued by users. However, if a query resembles a well-formed question, a natural language processing pipeline is able to perform more accurate interpretation, thus reducing downstream compounding errors. Hence, identifying whether or not a query is well formed can enhance query understanding. Here, we introduce a new task of identifying a well-formed natural language question. We construct and release a dataset of 25,100 publicly available questions classified into well-formed and non-well-formed categories and report an accuracy of 70.7% on the test set. We also show that our classifier can be used to improve the performance of neural sequence-to-sequence models for generating questions for reading comprehension.
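To make the task framing concrete, the sketch below treats well-formedness as binary classification over query strings and scores a classifier by accuracy. It is only an illustration under stated assumptions: the rule in `looks_well_formed`, the `QUESTION_WORDS` set, and the four example queries are all hypothetical and are not the paper's classifier or data.

```python
from typing import Callable, List, Tuple

# Hypothetical set of question-opening words; not taken from the paper.
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which",
                  "is", "are", "do", "does", "can"}


def looks_well_formed(query: str) -> bool:
    """Toy stand-in rule: opens with a question word and ends with '?'."""
    text = query.strip()
    tokens = text.lower().split()
    if not tokens:
        return False
    return tokens[0] in QUESTION_WORDS and text.endswith("?")


def accuracy(examples: List[Tuple[str, bool]],
             classify: Callable[[str], bool]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(1 for text, label in examples if classify(text) == label)
    return correct / len(examples)


# Made-up labeled queries (True = well-formed), for illustration only.
EXAMPLES = [
    ("what is the capital of france?", True),
    ("capital france", False),
    ("how do airplanes fly?", True),
    ("airplanes fly how", False),
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(EXAMPLES, looks_well_formed):.3f}")
```

A surface rule like this would misjudge many real queries, which is part of why a learned classifier and a labeled dataset are needed for the task.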