DB, AI, LG
Apr 12, 2020

Complaint-driven Training Data Debugging for Query 2.0

arXiv:2004.05722v1
50 citations
Originality: Highly original
AI Analysis

This paper addresses a critical challenge for database providers and users who integrate ML into SQL queries, offering a novel way to improve the reliability of commercial applications.

The paper tackles training data bugs in Query 2.0 by proposing Rain, a complaint-driven system that identifies a minimal set of training examples whose removal resolves the user's complaints. Rain achieves the highest recall@k among all baselines while maintaining interactive performance.
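The recall@k metric mentioned above can be illustrated with a minimal sketch. It assumes the common definition (the fraction of truly corrupted training examples that appear in the top k of a ranking); the paper's exact definition may differ, and the function and argument names are hypothetical.

```python
def recall_at_k(ranked_ids, corrupted_ids, k):
    """Fraction of truly corrupted training examples found in the top-k of a ranking.

    ranked_ids:    training-example ids, most suspicious first (hypothetical input)
    corrupted_ids: ids known to be corrupted (ground truth, hypothetical input)
    """
    corrupted = set(corrupted_ids)
    if not corrupted:
        raise ValueError("corrupted_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & corrupted) / len(corrupted)
```

For example, if examples 1 and 2 are corrupted and a debugger ranks the training set as `[3, 1, 4, 2]`, only one of the two corrupted examples is in the top 2.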

As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging, since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples such that, if they were removed, the complaints would be resolved. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions, both of which require only a linear number of retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.
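To make the influence-function idea concrete, here is a minimal sketch in Python with NumPy. It is not Rain's implementation, and the abstract does not describe the two proposed approaches in detail. The sketch assumes an L2-regularised logistic regression model, a query of the form `SELECT COUNT(*) ... WHERE predict(x) = 1` relaxed to a sum of predicted probabilities, and a complaint that the count is too high or too low. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=1e-2, iters=25):
    """L2-regularised logistic regression fitted with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + l2 * theta
        hess = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta

def rank_by_complaint_influence(X, y, X_query, theta, l2=1e-2, count_too_high=True):
    """Rank training examples by the estimated effect of removing each one
    on a relaxed COUNT query (assumption: count ~ sum of probabilities)."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    # Hessian of the regularised training objective at theta.
    hess = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
    # Gradient of the relaxed query result with respect to theta.
    pq = sigmoid(X_query @ theta)
    grad_q = X_query.T @ (pq * (1 - pq))
    # Per-example loss gradients: (p_i - y_i) * x_i.
    per_example_grads = (p - y)[:, None] * X
    # First-order estimate: removing example i shifts theta by about
    # (1/n) * H^{-1} * grad L_i, so the query changes by grad_q . delta_theta.
    s = np.linalg.solve(hess, grad_q)  # H is symmetric
    delta_q = per_example_grads @ s / n
    # If the count is too high, examples whose removal lowers it most come first.
    order = np.argsort(delta_q) if count_too_high else np.argsort(-delta_q)
    return order, delta_q
```

The ranking needs one Hessian solve rather than one retraining per candidate subset, which is the reason influence functions avoid the exponential cost of the naive approach. A debugger built on this sketch would remove the top-ranked examples, retrain, and re-check the complaint.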

Code Implementations: 1 repo
