Alice and the Caterpillar: A more descriptive null model for assessing data mining results
This work addresses the need for more accurate statistical hypothesis testing in data mining, particularly for binary transactional and sequence datasets, though it appears incremental as it builds on existing null models.
The authors tackled the problem of assessing data mining results by introducing novel null models that preserve more dataset properties, such as the Bipartite Joint Degree Matrix and the number of caterpillars, and developed Alice, a fast and scalable MCMC algorithm for sampling from these models, which found different significant results compared to existing methods.
We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.