Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes
This addresses the difficulty of multivariate search queries in databases for users lacking domain expertise, though it appears incremental as it builds on existing probabilistic methods.
The paper tackles the problem of extracting relevant data from structured databases without requiring extensive domain knowledge by introducing a probabilistic search approach using probabilistic programming and nonparametric Bayes. The result is a flexible technique integrated into BayesDB, which human evaluators often preferred over a standard baseline in applications like US colleges and macroeconomic indicators.
Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching structured data based on probabilistic programming and nonparametric Bayes. Users specify queries in a probabilistic language that combines standard SQL database search operators with an information theoretic ranking function called predictive relevance. Predictive relevance can be calculated by a fast sparse matrix algorithm based on posterior samples from CrossCat, a nonparametric Bayesian model for high-dimensional, heterogeneously-typed data tables. The result is a flexible search technique that applies to a broad class of information retrieval problems, which we integrate into BayesDB, a probabilistic programming platform for probabilistic data analysis. This paper demonstrates applications to databases of US colleges, global macroeconomic indicators of public health, and classic cars. We found that human evaluators often prefer the results from probabilistic search to results from a standard baseline.