LGMar 12, 2012

On the Necessity of Irrelevant Variables

arXiv:1203.2557v32 citations

Originality Incremental advance

AI Analysis

This addresses a fundamental issue in machine learning for practitioners dealing with high-dimensional data where variable relevance is uncertain.

The paper tackles the problem of classifier accuracy when irrelevant boolean variables are present, showing that algorithms using irrelevant variables can achieve error probabilities approaching zero, while those limiting their use have errors bounded away from zero.

This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.

View on arXiv PDF

Similar