Persi Diaconis

2 Papers

93.8 ST Apr 27

Estimating the size of a set using cascading exclusion

Sourav Chatterjee, Persi Diaconis, Susan Holmes

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$ without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^{1/2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question "Is this new observation a typical pick from a large prespecified population?" We also treat regression-style predictors. A general theorem gives non-parametric finite-$n$ error bounds in all cases.
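The two extremes mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's cascading exclusion method. The function names `max_estimate` and `collision_estimate` are hypothetical, and the collision-based estimator is one standard way to use the birthday problem, assumed here for illustration: with $n$ draws the expected number of repeated pairs is $\binom{n}{2}/|S|$, so dividing $\binom{n}{2}$ by the observed number of repeated pairs gives a rough estimate of $|S|$.

```python
import random
from collections import Counter


def max_estimate(sample):
    """Estimate |S| when S = {1, ..., |S|}: sample maximum scaled by n/(n-1).

    The factor n/(n-1) is the one quoted in the abstract.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return max(sample) * n / (n - 1)


def collision_estimate(sample):
    """Birthday-style estimate of |S| from repeated values (illustrative assumption).

    Returns None when the sample contains no repeats, in which case
    this estimator carries no information about |S|.
    """
    n = len(sample)
    counts = Counter(sample)
    # Number of unordered pairs (i, j) with equal values.
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None
    return (n * (n - 1) / 2) / collisions


if __name__ == "__main__":
    rng = random.Random(0)
    size = 10_000  # true |S|, chosen only for this demo
    sample = [rng.randint(1, size) for _ in range(500)]
    print("max-based estimate:      ", max_estimate(sample))
    print("collision-based estimate:", collision_estimate(sample))
```

With a sample of 500 from a set of 10,000 elements, only a handful of repeats are expected, so the collision-based estimate is noisy, while the max-based estimate exploits the ordered structure of $S$ and is typically much closer.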

PR Jun 21, 2017

The sample size required in importance sampling

Sourav Chatterjee, Persi Diaconis

The goal of importance sampling is to estimate the expected value of a given function with respect to a probability measure $\nu$ using a random sample of size $n$ drawn from a different probability measure $\mu$. If the two measures $\mu$ and $\nu$ are nearly singular with respect to each other, which is often the case in practice, the sample size required for accurate estimation is large. In this article it is shown that in a fairly general setting, a sample of size approximately $\exp(D(\nu\|\mu))$ is necessary and sufficient for accurate estimation by importance sampling, where $D(\nu\|\mu)$ is the Kullback-Leibler divergence of $\mu$ from $\nu$. In particular, the required sample size exhibits a kind of cut-off in the logarithmic scale. The theory is applied to obtain a general formula for the sample size required in importance sampling for one-parameter exponential families (Gibbs measures).
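As a rough illustration of the sample-size heuristic, the following Python sketch uses a toy setting that is an assumption of this example, not taken from the paper: the target $\nu$ is $N(\theta,1)$ and the proposal $\mu$ is $N(0,1)$, for which $D(\nu\|\mu)=\theta^2/2$ and the density ratio is $d\nu/d\mu(x)=\exp(\theta x-\theta^2/2)$. The function names are hypothetical.

```python
import math
import random


def kl_gaussian_shift(theta):
    """D(nu || mu) for nu = N(theta, 1) and mu = N(0, 1)."""
    return theta ** 2 / 2


def density_ratio(x, theta):
    """d(nu)/d(mu) at x for nu = N(theta, 1) and mu = N(0, 1)."""
    return math.exp(theta * x - theta ** 2 / 2)


def importance_estimate(f, theta, n, rng):
    """Estimate E_nu[f] by averaging f(X) * d(nu)/d(mu)(X) over n draws X ~ mu."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # draw from the proposal mu
        total += f(x) * density_ratio(x, theta)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    for theta in (0.5, 2.0, 4.0):
        threshold = math.exp(kl_gaussian_shift(theta))
        est = importance_estimate(lambda x: x, theta, 20_000, rng)
        print(f"theta={theta}: exp(KL) ~ {threshold:.1f}, estimate of E_nu[X] = {est:.3f}")
```

In this toy setting the true value of $E_\nu[X]$ is $\theta$. The quantity $\exp(\theta^2/2)$ grows quickly with $\theta$, so a fixed budget of draws that comfortably exceeds it for small shifts can be of comparable size or smaller for larger ones, which is where the estimate would be expected to degrade.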