Dataset for a Neural Natural Language Interface for Databases (NNLIDB)
This work addresses the lack of a large corpus for training and comparing data-driven approaches to natural language interfaces to databases (NLIDBs), a gap that affects both database users and researchers.
The authors introduce a large dataset extracted from the Stack Exchange Data Explorer to serve as a standard benchmark for natural language interfaces to databases, and report baseline results using an attention-based sequence-to-sequence neural network.
Progress in natural language interfaces to databases (NLIDBs) has been slow, mainly due to linguistic issues (such as language ambiguity) and limited domain portability. Moreover, the lack of a large corpus to serve as a standard benchmark has made data-driven approaches difficult to develop and compare. In this paper, we revisit the problem of NLIDBs and recast it as a sequence translation problem. To this end, we introduce a large dataset extracted from the Stack Exchange Data Explorer website, which can be used for training neural natural language interfaces for databases. We also report encouraging baseline results on a smaller, manually annotated test corpus, obtained using an attention-based sequence-to-sequence neural network.