IT DS LG ST Mar 27, 2015

Competitive Distribution Estimation

arXiv:1503.07940v1 1.2

Originality: Incremental advance

AI Analysis

This work addresses distribution estimation for statisticians and machine learning practitioners. It takes a competitive approach, measuring regret against oracles of limited power, and is an incremental advance that refines oracle-based regret bounds.

The paper studies the problem of estimating an unknown distribution from samples by measuring competitive regret relative to oracles with limited knowledge. It shows that this regret reduces to min(k/n, Õ(1/√n)), which is bounded uniformly over all alphabet sizes, and it provides a linear-time estimator with competitive regret Õ(min(k/n, 1/√n)).

Estimating an unknown distribution from its samples is a fundamental problem in statistics. The common min-max formulation of this goal considers the performance of the best estimator over all distributions in a class. It shows that with $n$ samples, distributions over $k$ symbols can be learned to a KL divergence that decreases to zero with the sample size $n$, but grows unboundedly with the alphabet size $k$. Min-max performance can be viewed as regret relative to an oracle that knows the underlying distribution. We consider two natural and modest limits on the oracle's power: in the first, it knows the underlying distribution only up to symbol permutations; in the second, it knows the exact distribution but is restricted to natural estimators, which assign the same probability to symbols that appeared equally many times in the sample. We show that in both cases the competitive regret reduces to $\min(k/n,\tilde{\mathcal{O}}(1/\sqrt n))$, a quantity upper bounded uniformly for every alphabet size. This shows that distributions can be estimated nearly as well as when they are essentially known in advance, and nearly as well as when they are completely known in advance but need to be estimated via a natural estimator. We also provide an estimator that runs in linear time and incurs competitive regret of $\tilde{\mathcal{O}}(\min(k/n,1/\sqrt n))$, and show that for natural estimators this competitive regret is inevitable. We also demonstrate the effectiveness of competitive estimators using simulations.
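To make the notions of a natural estimator and competitive regret concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's estimator. The function names are hypothetical, and the add-constant estimator is only a simple stand-in for a natural estimator. The oracle uses the fact that, among estimators constant on each group of symbols seen equally often, KL divergence is minimized by spreading the group's total true mass evenly over the group.

```python
import math
import random
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) in nats; symbols with p[x] == 0 contribute nothing."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def multiplicities(sample, k):
    """Number of times each symbol 0..k-1 appears in the sample."""
    counts = Counter(sample)
    return [counts.get(x, 0) for x in range(k)]


def natural_oracle(p, mult):
    """Best natural estimator when p is known (illustrative helper).

    Symbols that appeared t times share their total true mass equally,
    so every symbol with the same multiplicity gets the same probability.
    """
    mass = Counter()  # total true probability of symbols seen t times
    size = Counter()  # number of symbols seen t times
    for px, t in zip(p, mult):
        mass[t] += px
        size[t] += 1
    return [mass[t] / size[t] for t in mult]


def add_constant(mult, beta=0.5):
    """Add-beta estimator: a simple natural estimator used as a stand-in."""
    n, k = sum(mult), len(mult)
    return [(t + beta) / (n + beta * k) for t in mult]


def competitive_regret(p, n, rng, beta=0.5):
    """Excess KL loss of the add-beta estimator over the natural oracle
    on one random sample of size n drawn from p."""
    k = len(p)
    sample = rng.choices(range(k), weights=p, k=n)
    mult = multiplicities(sample, k)
    return kl_divergence(p, add_constant(mult, beta)) - kl_divergence(
        p, natural_oracle(p, mult)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
    for n in (10, 100, 1000):
        print(n, competitive_regret(p, n, rng))
```

Because the add-constant estimator is itself natural, its regret against this oracle is nonnegative on every sample, up to floating-point error. Averaging `competitive_regret` over many samples would give an empirical estimate of the expected regret for a fixed distribution.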
