Computing Multi-Relational Sufficient Statistics for Large Databases
This work addresses the challenge of statistical analysis over large-scale relational databases, offering data scientists a scalable way to handle complex queries involving negative relationships.
The paper addresses the problem of efficiently computing sufficient statistics for both positive and negative relationships in large databases. Naive enumeration is feasible only for small databases, so the paper introduces a dynamic programming algorithm that scales to over 1 million tuples. The resulting statistics can be exploited in tasks such as feature selection and Bayesian network learning.
Databases contain information about which relationships do and do not hold among entities. Making this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, in which the requisite counts are computed without materializing join tables. Contingency table algebra, a new extension of relational algebra, facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation on seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning.
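
To illustrate how counts for a negative relationship can be obtained without enumerating non-existent links, the following minimal Python sketch handles the simplest case of a single relationship. It is not the paper's Möbius Join implementation, only an instance of the underlying subtraction idea: the count for "relationship false" equals the count with the relationship left unspecified minus the count for "relationship true". The toy schema (`students`, `courses`, `registered`) and all function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy database (not taken from the paper).
students = {"s1": "high", "s2": "low", "s3": "high"}     # student -> intelligence
courses = {"c1": "hard", "c2": "easy"}                    # course -> difficulty
registered = {("s1", "c1"), ("s2", "c2"), ("s3", "c1")}  # stored (positive) links only


def positive_counts(students, courses, registered):
    """Counts for Registered = True: one pass over the stored link table."""
    return Counter((students[s], courses[c]) for s, c in registered)


def unconstrained_counts(students, courses):
    """Counts with the relationship left unspecified.

    The two entity tables are independent of each other, so each count is a
    product of marginal counts; the cross product is never materialized.
    """
    by_intel = Counter(students.values())
    by_diff = Counter(courses.values())
    return Counter({(i, d): by_intel[i] * by_diff[d]
                    for i, d in product(by_intel, by_diff)})


def contingency_table(students, courses, registered):
    """Return {(intelligence, difficulty, is_registered): count}."""
    pos = positive_counts(students, courses, registered)
    free = unconstrained_counts(students, courses)
    table = {}
    for key, total in free.items():
        table[key + (True,)] = pos[key]
        # Negative count obtained by subtraction rather than enumeration.
        table[key + (False,)] = total - pos[key]
    return table


def naive_table(students, courses, registered):
    """Reference check: enumerate every student-course pair."""
    counts = Counter()
    for s, c in product(students, courses):
        counts[(students[s], courses[c], (s, c) in registered)] += 1
    return counts


table = contingency_table(students, courses, registered)
```

In this sketch the cost of `contingency_table` depends on the number of stored links and distinct attribute values, whereas `naive_table` visits every student-course pair. With several relationships, the same subtraction has to be applied repeatedly over combinations of relationships, which is the situation the dynamic programming algorithm in the paper is designed to handle.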