Art Owen

ST Feb 15, 2013

Guaranteed Conservative Fixed Width Confidence Intervals Via Monte Carlo Sampling

Fred J. Hickernell, Lan Jiang, Yuewei Liu et al.

Monte Carlo methods are used to approximate the means, $\mu$, of random variables $Y$ whose distributions are not known explicitly. The key idea is that the average of a random sample, $Y_1, \ldots, Y_n$, tends to $\mu$ as $n$ tends to infinity. This article explores how one can reliably construct a confidence interval for $\mu$ with a prescribed half-width (or error tolerance) $\varepsilon$. Our proposed two-stage algorithm assumes that the kurtosis of $Y$ does not exceed some user-specified bound. An initial independent and identically distributed (IID) sample is used to confidently estimate the variance of $Y$. A Berry-Esseen inequality then makes it possible to determine the size of the IID sample required to construct the desired confidence interval for $\mu$. We discuss the important case where $Y=f(\boldsymbol{X})$ and $\boldsymbol{X}$ is a random $d$-vector with probability density function $\rho$. In this case $\mu$ can be interpreted as the integral $\int_{\mathbb{R}^d} f(\boldsymbol{x}) \rho(\boldsymbol{x}) \, \mathrm{d}\boldsymbol{x}$, and the Monte Carlo method becomes a method for multidimensional cubature.
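The two-stage idea in this abstract can be illustrated with a simplified sketch. It is not the authors' algorithm: the kurtosis-based variance bound is replaced by a fixed, hypothetical inflation factor, and the Berry-Esseen sample size is replaced by the usual central-limit (normal quantile) sample size, so the result carries no guarantee. The names `two_stage_mean`, `sample_y`, `n_pilot`, and `inflation` are assumptions made for illustration.

```python
import math
import random
import statistics


def two_stage_mean(sample_y, eps, alpha=0.05, n_pilot=1000, inflation=1.5, rng=None):
    """Estimate E[Y] with target half-width eps (simplified, CLT-based sketch).

    sample_y(rng) must return one IID draw of Y.
    """
    if rng is None:
        rng = random.Random()

    # Stage 1: a pilot sample estimates the variance of Y. The estimate is
    # inflated by a fixed factor (an assumption standing in for the
    # kurtosis-based bound described in the abstract).
    pilot = [sample_y(rng) for _ in range(n_pilot)]
    var_upper = inflation * statistics.variance(pilot)

    # Stage 2: choose n so that z * sqrt(var_upper / n) <= eps, using a normal
    # quantile in place of a Berry-Esseen bound.
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    n = max(1, math.ceil(var_upper * (z / eps) ** 2))

    # Use a fresh sample, independent of the pilot, for the final average.
    mu_hat = math.fsum(sample_y(rng) for _ in range(n)) / n
    return mu_hat, n


# Example: Y = U^2 with U uniform on [0, 1], so the true mean is 1/3.
mu_hat, n = two_stage_mean(lambda r: r.random() ** 2, eps=0.01, rng=random.Random(0))
print(mu_hat, n)
```

The reported interval would be `mu_hat ± eps`. Drawing the second sample independently of the pilot keeps the sample size from being correlated with the values being averaged.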

LG Aug 6, 2012

One Permutation Hashing for Efficient Search and Learning

Ping Li, Art Owen, Cun-Hui Zhang

Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost, as the method requires applying many permutations (e.g., k = 200 to 500) to the data. The testing time can also be expensive if a new data point (e.g., a new document or image) has not been processed, which might be a significant issue in user-facing applications. We develop a very simple solution based on one permutation hashing. Conceptually, given a massive binary data matrix, we permute the columns only once and divide the permuted columns evenly into k bins; we then simply store, for each data vector, the smallest nonzero location in each bin. The probability analysis (which is validated by experiments) reveals that our one permutation scheme should perform very similarly to the original (k-permutation) minwise hashing. In fact, the one permutation scheme can be even slightly more accurate, due to the "sample-without-replacement" effect. Our experiments with training linear SVM and logistic regression on the webspam dataset demonstrate that this one permutation hashing scheme can achieve the same (or even slightly better) accuracies compared to the original k-permutation scheme. To test the robustness of our method, we also experiment with the small news20 dataset, which is very sparse and has on average only 500 nonzeros in each data vector. Interestingly, our one permutation scheme noticeably outperforms the k-permutation scheme when k is not too small on the news20 dataset. In summary, our method can achieve at least the same accuracy as the original k-permutation scheme, at merely 1/k of the original preprocessing cost.
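The binning step described in this abstract can be sketched as follows. The code assumes each binary vector is given as a set of nonzero column indices and that k divides the number of columns. The function names are hypothetical. The similarity estimator (matching bins divided by bins that are non-empty in at least one of the two vectors) is an assumption chosen for illustration; the abstract itself does not state an estimator.

```python
import random


def one_perm_hash(nonzeros, dim, k, perm):
    """Sketch a binary vector using a single column permutation.

    nonzeros: iterable of nonzero column indices in range(dim)
    perm:     perm[c] is the position of column c after permutation
    Returns a list of k entries: the smallest nonzero offset in each bin,
    or None for an empty bin.
    """
    bin_size = dim // k  # assumes k divides dim evenly
    sketch = [None] * k
    for col in nonzeros:
        b, offset = divmod(perm[col], bin_size)
        if sketch[b] is None or offset < sketch[b]:
            sketch[b] = offset
    return sketch


def estimate_resemblance(s1, s2):
    """Fraction of matching bins among bins non-empty in at least one sketch."""
    matched = sum(1 for a, b in zip(s1, s2) if a is not None and a == b)
    jointly_empty = sum(1 for a, b in zip(s1, s2) if a is None and b is None)
    used = len(s1) - jointly_empty
    return matched / used if used else 0.0


# Tiny deterministic example with the identity permutation:
# columns 1, 4, 5 in 8 columns split into 4 bins of size 2.
print(one_perm_hash({1, 4, 5}, dim=8, k=4, perm=list(range(8))))  # [1, None, 0, None]

# Two overlapping sets under one shared random permutation.
rng = random.Random(1)
perm = list(range(1000))
rng.shuffle(perm)
a, b = set(range(0, 300)), set(range(150, 450))
print(estimate_resemblance(one_perm_hash(a, 1000, 50, perm),
                           one_perm_hash(b, 1000, 50, perm)))
```

Both vectors must be sketched with the same permutation and the same k, otherwise the per-bin minima are not comparable.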

2 Papers