LG CO | Sep 24, 2023

A Probabilistic Model for Data Redundancy in the Feature Domain

arXiv:2309.13657v1 | h-index: 9

Originality: Synthesis-oriented

AI Analysis

This work addresses data redundancy for researchers and practitioners in machine learning and statistics. It offers a theoretical framework, though it appears incremental because it builds on existing probabilistic methods.

The paper tackles the problem of estimating the number of uncorrelated features in a large dataset. It develops a probabilistic model that accounts for both pairwise feature correlation (collinearity) and interdependency among multiple features (multicollinearity), and it derives upper and lower bounds of the same order on the size of a feature set with low collinearity and low multicollinearity.

In this paper, we use a probabilistic model to estimate the number of uncorrelated features in a large dataset. Our model allows for both pairwise feature correlation (collinearity) and interdependency of multiple features (multicollinearity), and we use the probabilistic method to obtain upper and lower bounds of the same order for the size of a feature set that exhibits low collinearity and low multicollinearity. We also prove an auxiliary result regarding mutually good constrained sets that is of independent interest.
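
To make the distinction between collinearity and multicollinearity concrete, here is a minimal NumPy sketch. It is not the paper's probabilistic model or its proof technique; it is only an empirical illustration on synthetic data. The function names `greedy_low_collinearity` and `max_r_squared`, the 0.7 correlation threshold, and the synthetic dataset are all assumptions made for this example.

```python
import numpy as np


def greedy_low_collinearity(X, max_abs_corr=0.7):
    """Greedily keep column indices whose pairwise |correlation| with
    every previously kept column stays below max_abs_corr (hypothetical helper)."""
    corr = np.corrcoef(X, rowvar=False)
    chosen = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < max_abs_corr for k in chosen):
            chosen.append(j)
    return chosen


def max_r_squared(X, cols):
    """Largest R^2 obtained by regressing each selected column on the
    other selected columns; values near 1 signal multicollinearity."""
    Z = X[:, cols] - X[:, cols].mean(axis=0)
    worst = 0.0
    for i in range(Z.shape[1]):
        y = Z[:, i]
        others = np.delete(Z, i, axis=1)
        if others.shape[1] == 0:
            continue
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        worst = max(worst, 1.0 - (resid @ resid) / (y @ y))
    return worst


# Synthetic data (an assumption for illustration): four independent
# features plus a fifth that is their sum with a little noise.
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 4))
combo = base.sum(axis=1) + 0.01 * rng.standard_normal(2000)
X = np.column_stack([base, combo])

chosen = greedy_low_collinearity(X)          # pairwise screen only
r2_all = max_r_squared(X, chosen)            # multicollinearity check
r2_base = max_r_squared(X, [0, 1, 2, 3])     # independent features only
print(chosen, r2_all, r2_base)
```

In this setup the fifth feature has a pairwise correlation of roughly 0.5 with each of the other four, so a pairwise screen alone keeps it, yet it is almost perfectly predicted by the other four jointly. This is why a feature set can exhibit low collinearity while still failing a low-multicollinearity requirement.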
