CL, Jul 11, 2017

Dataset for a Neural Natural Language Interface for Databases (NNLIDB)

arXiv:1707.03172v11094 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of a large corpus for training and comparing data-driven approaches to natural language interfaces to databases (NLIDBs), a gap that affects both database users and NLIDB researchers.

The authors introduce a large dataset extracted from the Stack Exchange Data Explorer to serve as a standard benchmark for NLIDBs, and report baseline results using an attention-based sequence-to-sequence neural network.

Progress in natural language interfaces to databases (NLIDB) has been slow mainly due to linguistic issues (such as language ambiguity) and domain portability. Moreover, the lack of a large corpus to be used as a standard benchmark has made data-driven approaches difficult to develop and compare. In this paper, we revisit the problem of NLIDBs and recast it as a sequence translation problem. To this end, we introduce a large dataset extracted from the Stack Exchange Data Explorer website, which can be used for training neural natural language interfaces for databases. We also report encouraging baseline results on a smaller manually annotated test corpus, obtained using an attention-based sequence-to-sequence neural network.
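To make the "sequence translation" framing concrete, here is a minimal sketch of how a question and its SQL query might be turned into paired token-id sequences, the usual input format for a sequence-to-sequence model. The example pair, the regex tokenizer, and the special tokens are illustrative assumptions, not taken from the dataset or from the authors' preprocessing.

```python
import re

# Hypothetical (question, SQL) pair for illustration; not from the dataset.
PAIRS = [
    ("how many users are there", "SELECT COUNT(*) FROM Users"),
]

# Assumed special tokens: padding, start, end, unknown.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def tokenize(text):
    """Split text into word-like tokens and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(sequences):
    """Map each distinct token to an integer id, after the special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Wrap a token list in <sos>/<eos> and convert it to ids."""
    unk = vocab["<unk>"]
    return [vocab["<sos>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["<eos>"]]


# Source side: lowercased questions. Target side: SQL, case preserved.
src_tokens = [tokenize(q.lower()) for q, _ in PAIRS]
tgt_tokens = [tokenize(s) for _, s in PAIRS]

# Separate vocabularies for the source and target languages.
src_vocab = build_vocab(src_tokens)
tgt_vocab = build_vocab(tgt_tokens)

src_ids = [encode(t, src_vocab) for t in src_tokens]
tgt_ids = [encode(t, tgt_vocab) for t in tgt_tokens]
```

An encoder would read `src_ids` and an attention-based decoder would be trained to emit `tgt_ids` one token at a time; the model itself is out of scope for this sketch.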

Code Implementations: 1 repo
