CL · AI · Oct 6, 2020

Semantic Evaluation for Text-to-SQL with Distilled Test Suites

arXiv:2010.02840v1 · 1017 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for more reliable evaluation metrics in the Text-to-SQL domain, though the advance is incremental because it builds on existing evaluation frameworks.

The paper tackles the problem of evaluating semantic accuracy in Text-to-SQL models. It proposes test suite accuracy, which runs predicted queries against a small distilled suite of databases to compute a tight upper bound on semantic accuracy efficiently. Across 21 Spider submissions, the current Spider metric produces false negatives at a rate of 2.5% on average and 8.1% in the worst case, whereas the proposed metric was correct on all 100 manually verified examples.

We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models. Our method distills a small test suite of databases that achieves high code coverage for the gold query from a large number of randomly generated databases. At evaluation time, it computes the denotation accuracy of the predicted queries on the distilled test suite, hence calculating a tight upper-bound for semantic accuracy efficiently. We use our proposed method to evaluate 21 models submitted to the Spider leaderboard and manually verify that our method is always correct on 100 examples. In contrast, the current Spider metric leads to a 2.5% false negative rate on average and 8.1% in the worst case, indicating that test suite accuracy is needed. Our implementation, along with distilled test suites for eleven Text-to-SQL datasets, is publicly available.

Code Implementations: 3 repos
