ML · Feb 18, 2016

What is the distribution of the number of unique original items in a bootstrap sample?

arXiv:1602.05822v1 · 11 citations
Originality: Synthesis-oriented
AI Analysis

The report addresses a foundational issue, helping machine learning researchers better control resampling techniques, though its contribution is incremental since it clarifies existing statistical properties.

This report examines the distribution of the number of unique original items in a bootstrap sample, a quantity that matters for machine learning techniques such as bagging and for validation schemes. It presents the distribution's key characteristics and derives a heuristic for when a normal approximation is permissible.

Sampling with replacement occurs in many settings in machine learning, notably in the bagging ensemble technique and the .632+ validation scheme. The number of unique original items in a bootstrap sample can have an important role in the behaviour of prediction models learned on it. Indeed, there are uncontrived examples where duplicate items have no effect. The purpose of this report is to present the distribution of the number of unique original items in a bootstrap sample clearly and concisely, with a view to enabling other machine learning researchers to understand and control this quantity in existing and future resampling techniques. We describe the key characteristics of this distribution along with the generalisation for the case where items come from distinct categories, as in classification. In both cases we discuss the normal limit, and conduct an empirical investigation to derive a heuristic for when a normal approximation is permissible.
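
To make the quantity concrete, the following is a minimal sketch, not taken from the report, of the simplest case: n draws with replacement from n original items. It assumes the standard occupancy argument, in which each draw either repeats one of the k items already seen (probability k/n) or adds a new one (probability (n - k)/n). Under that assumption the mean number of unique items is n(1 - (1 - 1/n)^n), which approaches (1 - 1/e)n, roughly 0.632n, as n grows. The function names `unique_count_pmf` and `simulate_unique_count` are hypothetical and chosen only for illustration.

```python
import random
from fractions import Fraction


def unique_count_pmf(n):
    """Exact pmf of the number of unique items in n draws with replacement
    from n items, built one draw at a time (illustrative sketch)."""
    pmf = {0: Fraction(1)}  # before any draw, zero unique items for certain
    for _ in range(n):
        nxt = {}
        for k, p in pmf.items():
            if k > 0:
                # The draw repeats one of the k items already seen.
                nxt[k] = nxt.get(k, 0) + p * Fraction(k, n)
            if k < n:
                # The draw hits one of the n - k items not yet seen.
                nxt[k + 1] = nxt.get(k + 1, 0) + p * Fraction(n - k, n)
        pmf = nxt
    return pmf


def simulate_unique_count(n, rng):
    """Draw one bootstrap sample of size n and count its unique items."""
    return len({rng.randrange(n) for _ in range(n)})


if __name__ == "__main__":
    n = 50
    pmf = unique_count_pmf(n)
    exact_mean = sum(k * p for k, p in pmf.items())
    rng = random.Random(0)  # fixed seed, for repeatability only
    draws = [simulate_unique_count(n, rng) for _ in range(10_000)]
    print("exact mean fraction unique:", float(exact_mean) / n)
    print("simulated mean fraction unique:", sum(draws) / len(draws) / n)
```

The exact recursion uses `Fraction` so that the probabilities sum to exactly 1, which is practical only for small n; for larger n the simulation, or a normal approximation of the kind the report investigates, would be the more realistic choice. The generalisation to items from distinct categories is not covered by this sketch.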

