ML · LG · Feb 19, 2020

Simultaneous Inference for Massive Data: Distributed Bootstrap

arXiv:2002.08443v16.717 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of statistical inference for large-scale distributed data, offering a practical solution for data scientists and researchers dealing with massive datasets.

The paper tackles simultaneous inference on massive data distributed across many machines. It proposes a computationally efficient bootstrap method that avoids over-resampling and, per the authors' theory, achieves optimal statistical efficiency with minimal communication. Simulations support the theoretical results.

In this paper, we propose a bootstrap method for massive data processed in a distributed fashion across a large number of machines. The new method is computationally efficient in that we bootstrap on the master machine without the over-resampling typically required by existing methods (Kleiner et al., 2014; Sengupta et al., 2016), while provably achieving optimal statistical efficiency with minimal communication. Our method does not require repeatedly re-fitting the model; it only applies a multiplier bootstrap on the master machine to the gradients received from the worker machines. Simulations validate our theory.
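To make the idea of "multiplier bootstrap on the gradients received from the worker machines" concrete, here is a minimal sketch rather than the authors' exact procedure. The abstract does not specify the model, the multiplier distribution, or how the inverse Hessian is obtained, so everything below is an assumption for illustration: a least-squares model, standard Gaussian multipliers, an inverse Hessian estimated from the first worker's data (standing in for the master), and a pooled OLS fit standing in for whatever communication-efficient estimator would be used in practice. The function name `multiplier_bootstrap_ci` and all parameter names are hypothetical.

```python
import numpy as np


def multiplier_bootstrap_ci(theta_hat, worker_grads, hessian_inv, n_per_worker,
                            alpha=0.05, n_boot=500, rng=None):
    """Simultaneous confidence band from per-worker gradients (illustrative sketch).

    worker_grads: (k, d) array, one averaged gradient per worker, evaluated at theta_hat.
    hessian_inv:  (d, d) estimate of the inverse Hessian (assumed available on the master).
    """
    rng = np.random.default_rng() if rng is None else rng
    G = np.asarray(worker_grads, dtype=float)
    k, _ = G.shape

    # Center the worker gradients and rescale so each row has O(1) variance.
    centered = np.sqrt(n_per_worker) * (G - G.mean(axis=0))       # (k, d)

    # Assumption: i.i.d. standard normal multipliers, one per worker per draw.
    eps = rng.standard_normal((n_boot, k))                        # (B, k)

    # Each bootstrap draw is a randomly weighted sum of centered gradients;
    # no model is re-fitted and no data is resampled.
    draws = eps @ centered / np.sqrt(k)                           # (B, d)

    # Map gradient noise to parameter scale and take the sup-norm per draw.
    stats = np.abs(draws @ hessian_inv.T).max(axis=1)             # (B,)

    c = np.quantile(stats, 1 - alpha)
    half_width = c / np.sqrt(n_per_worker * k)
    return theta_hat - half_width, theta_hat + half_width, c


# Synthetic example: k workers, n samples each, d coefficients.
rng = np.random.default_rng(0)
k, n, d = 50, 200, 5
theta_true = np.arange(1, d + 1, dtype=float)
X = rng.standard_normal((k, n, d))
y = X @ theta_true + rng.standard_normal((k, n))

# Stand-in estimator: pooled OLS (a real deployment would avoid pooling the data).
theta_hat = np.linalg.lstsq(X.reshape(-1, d), y.reshape(-1), rcond=None)[0]

# Each worker sends only its d-dimensional averaged gradient at theta_hat.
resid = X @ theta_hat - y                                         # (k, n)
grads = np.einsum("kni,kn->ki", X, resid) / n                     # (k, d)

# Inverse Hessian from the first worker's data, playing the role of the master.
hessian_inv = np.linalg.inv(X[0].T @ X[0] / n)

lower, upper, c = multiplier_bootstrap_ci(theta_hat, grads, hessian_inv, n, rng=rng)
print("critical value:", c)
print("simultaneous 95% band:", np.column_stack([lower, upper]))
```

In this sketch the communication cost is one d-dimensional vector per worker, and all bootstrap draws are cheap matrix products on the master, which is the property the abstract emphasizes.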
