CC May 5
Optimal Union Probability Interval Is NP-Hard
Petteri Kaski, Heikki Mannila, Chandra Kanta Mohapatra
A problem dating back to Boole [Laws of Thought, Walton & Maberly, 1854] is what can be computed about the probability of a finite union of events when given as input the probabilities of intersections of some of the events. The modern geometric study of the problem can be traced back to Hailperin [Amer. Math. Monthly 72 (1965) 343–359], who phrased the problem in the language of linear programming and generalized it to logical formulas of the events other than disjunction, heralding a substantial body of work in probabilistic logic [Nilsson, Artif. Intell. 28 (1986) 71–87], including the probabilistic satisfiability problem of Georgakopoulos, Kavvadias, and Papadimitriou [J. Complexity 4 (1988) 1–11], as well as fundamental connections to the geometry of metrics via cut and correlation polytopes [Deza and Laurent, Geometry of Cuts and Metrics, Springer, 1997] and to the study of marginal polytopes in graphical models of machine learning [Wainwright and Jordan, Found. Trends Mach. Learn. 1 (2008) 1–305]. This paper (i) describes the pertinent geometry of Boole's problem via coordinate projections of an elementary polytope arising essentially from Hailperin's linear program on the atoms of a Venn diagram, and (ii) shows that computing the optimal interval for the union probability is NP-hard, resolving an apparent gap in the literature highlighted by Pitowsky [Math. Programming 50 (1991) 395–414] and Boros et al. [Math. Oper. Res. 39 (2014) 1311–1329 and 51 (2026) 134–148].
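To make the notion of a union probability interval concrete, the sketch below covers only the easy special case in which the input consists of the single-event probabilities alone, where the classical Fréchet bounds are sharp. It also includes the second-order Bonferroni lower bound for inputs with pairwise intersections, which is valid but generally not optimal. This is not the paper's construction or Hailperin's linear program; the function names are hypothetical.

```python
from itertools import combinations

def union_interval_from_singletons(probs):
    """Sharp interval for P(A_1 or ... or A_n) when only each P(A_i) is known.

    Lower end: the events may be nested, so the union can be as small as max P(A_i).
    Upper end: the events may be as disjoint as possible, capped at 1.
    """
    return max(probs), min(1.0, sum(probs))

def bonferroni_lower_bound(probs, pair_probs):
    """Second-order Bonferroni bound: P(union) >= sum P(A_i) - sum P(A_i and A_j).

    `pair_probs` maps each index pair (i, j) with i < j to P(A_i and A_j).
    The bound is valid but in general not the best possible one.
    """
    n = len(probs)
    pairs = sum(pair_probs[(i, j)] for i, j in combinations(range(n), 2))
    return sum(probs) - pairs

low, high = union_interval_from_singletons([0.3, 0.4, 0.2])
bonf = bonferroni_lower_bound([0.5, 0.5], {(0, 1): 0.25})
```

With only two events and the pairwise intersection given, the Bonferroni bound coincides with inclusion-exclusion; the hardness result in the abstract concerns general collections of intersections.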
LG Mar 27, 2025
Fair PCA, One Component at a Time
Antonis Matakos, Martino Ciaperoni, Heikki Mannila
The Min-Max Fair PCA problem seeks a low-rank representation of multi-group data such that the approximation error is as balanced as possible across groups. Existing approaches to this problem return a rank-$d$ fair subspace, but lack the fundamental containment property of standard PCA: each rank-$d$ PCA subspace should contain all lower-rank PCA subspaces. To fill this gap, we define fair principal components as directions that minimize the maximum group-wise reconstruction error, subject to orthogonality with previously selected components, and we introduce an iterative method to compute them. This approach preserves the containment property of standard PCA, and reduces to standard PCA for data with a single group. We analyze the theoretical properties of our method and show empirically that it outperforms existing approaches to Min-Max Fair PCA.
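The definition of a fair principal component can be illustrated with a toy sketch for 2-D data. It assumes the group-wise reconstruction error is the plain mean squared distance to the chosen direction (the paper may normalize differently) and finds the first component by brute-force grid search over angles, which is not the authors' iterative method. All names are hypothetical.

```python
import math

def reconstruction_error(points, angle):
    """Mean squared distance from 2-D points to the line through the origin at `angle`."""
    vx, vy = math.cos(angle), math.sin(angle)
    total = 0.0
    for x, y in points:
        proj = x * vx + y * vy                   # coordinate along the direction
        total += (x * x + y * y) - proj * proj   # squared residual after projection
    return total / len(points)

def first_fair_direction(groups, grid=180):
    """Grid search for the angle minimizing the maximum group-wise error."""
    best_angle, best_err = None, float("inf")
    for k in range(grid):
        angle = math.pi * k / grid
        worst = max(reconstruction_error(g, angle) for g in groups)
        if worst < best_err:
            best_angle, best_err = angle, worst
    return best_angle, best_err

# Assumed toy data: one group varies along the x-axis, the other along the y-axis.
group_a = [(2.0, 0.0), (-2.0, 0.0)]
group_b = [(0.0, 2.0), (0.0, -2.0)]
angle, err = first_fair_direction([group_a, group_b])
single_angle, single_err = first_fair_direction([group_a])
```

For the two-group toy data the search settles on a diagonal direction that balances the two errors, while with a single group it recovers that group's own principal direction. In two dimensions the second component is then fixed by orthogonality; in higher dimensions each further component would be searched within the orthogonal complement of the earlier ones.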
LG Jun 16, 2020
Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining
Sami Hanhijärvi, Markus Ojala, Niko Vuokko et al.
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, clustering can give an indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets whose values of the given statistics are similar to those of the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
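The idea of Metropolis sampling with local swaps can be sketched for a 0/1 matrix as follows. The sketch assumes a swap move that preserves row and column sums and an acceptance rule that penalizes deviation of one chosen statistic (here a co-occurrence count) from its original value. The `weight` parameter, the energy function, and all names are illustrative assumptions, not the paper's exact procedure.

```python
import math
import random

def cooccurrence(m, a, b):
    """Number of rows in which columns a and b are both 1."""
    return sum(1 for row in m if row[a] and row[b])

def metropolis_swap(matrix, stat, steps=2000, weight=2.0, seed=0):
    """Sample a 0/1 matrix with the same margins and a similar value of `stat`."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]            # work on a copy
    target = stat(matrix)
    energy = abs(stat(m) - target)
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(steps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A local swap is possible only on a [[1, 0], [0, 1]] submatrix.
        if m[r1][c1] and m[r2][c2] and not m[r1][c2] and not m[r2][c1]:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
            new_energy = abs(stat(m) - target)
            # Metropolis rule: always accept improvements, otherwise accept
            # with probability exp(-weight * increase in energy).
            if rng.random() < math.exp(-weight * (new_energy - energy)):
                energy = new_energy
            else:                              # reject: undo the swap
                m[r1][c1] = m[r2][c2] = 1
                m[r1][c2] = m[r2][c1] = 0
    return m

data = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
sample = metropolis_swap(data, lambda m: cooccurrence(m, 0, 1))
```

Every accepted swap leaves all row and column sums unchanged, so the margins act as a hard constraint while the chosen statistic is only softly constrained through the acceptance probability.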
LG Feb 4, 2019
What is the dimension of your binary data?
Nikolaj Tatti, Taneli Mielikainen, Aristides Gionis et al.
Many 0/1 datasets have a very large number of variables; on the other hand, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, the fractal dimension as such is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset $D$, its normalized fractal dimension is the number of columns in a dataset $D'$ with independent columns and having the same (unnormalized) fractal dimension as $D$. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against baseline measures such as PCA. We also study the relationship between the dimension of the whole dataset and the dimensions of subgroups formed by clustering. The results indicate interesting differences between and within datasets.
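One common way to adapt fractal dimension to point sets is the correlation dimension: the slope of the log-fraction of point pairs within distance r against log r. The sketch below assumes Hamming distance between rows and a two-point slope estimate; the abstract does not specify these details, so they are assumptions, and the normalization step (matching against a dataset with independent columns) is not shown. Function names are hypothetical.

```python
import math
from itertools import combinations, product

def hamming(u, v):
    """Number of positions in which two 0/1 rows differ."""
    return sum(a != b for a, b in zip(u, v))

def pair_fraction(data, r):
    """Fraction of unordered row pairs at Hamming distance at most r."""
    pairs = list(combinations(data, 2))
    return sum(1 for u, v in pairs if hamming(u, v) <= r) / len(pairs)

def correlation_dimension(data, r1, r2):
    """Slope of log pair_fraction versus log r between radii r1 < r2.

    Requires at least one pair within distance r1, so that the log is defined.
    """
    c1, c2 = pair_fraction(data, r1), pair_fraction(data, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

# Assumed toy data: all 16 binary rows of length 4.
cube = [list(bits) for bits in product([0, 1], repeat=4)]
dim = correlation_dimension(cube, 1, 4)
```

On real data one would typically fit the slope over a range of radii rather than two points, and sample pairs instead of enumerating all of them.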
AI Jan 16, 2013
Probabilistic Models for Query Approximation with Large Sparse Binary Datasets
Dmitry Y. Pavlov, Heikki Mannila, Padhraic Smyth
Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint. To alleviate the computational requirements we show how one can apply bucket elimination and clique tree approaches to take advantage of structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
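As a minimal illustration of query selectivity estimation, the sketch below implements only the independence model mentioned as a baseline: the estimated count for a conjunctive query is the number of rows times the product of the per-attribute marginal probabilities. The MRF, maximum entropy, and Chow-Liu models are not shown; the data and function names are made up for the example.

```python
def marginal(rows, attr, value):
    """Fraction of rows in which attribute `attr` equals `value`."""
    return sum(1 for row in rows if row[attr] == value) / len(rows)

def independence_estimate(rows, query):
    """Estimated number of rows matching `query`, assuming independent attributes.

    `query` maps attribute indices to required 0/1 values (a conjunction).
    """
    p = 1.0
    for attr, value in query.items():
        p *= marginal(rows, attr, value)
    return len(rows) * p

def exact_count(rows, query):
    """True number of matching rows, computed by a full scan for comparison."""
    return sum(1 for row in rows if all(row[a] == v for a, v in query.items()))

rows = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
query = {0: 1, 1: 1}                       # attribute 0 = 1 AND attribute 1 = 1
estimate = independence_estimate(rows, query)
truth = exact_count(rows, query)
```

In a real-time setting the marginals would be precomputed once, so answering a query costs only a few multiplications. The gap between `estimate` and `truth` on this toy data reflects the dependence between attributes 0 and 1 that the independence model ignores.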