ML · LG · Feb 19, 2020

Simultaneous Inference for Massive Data: Distributed Bootstrap

arXiv:2002.08443v1 · 17 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of statistical inference for large-scale distributed data, offering a practical solution for data scientists and researchers dealing with massive datasets.

The paper tackles simultaneous inference on massive data distributed across many machines. It proposes a computationally efficient bootstrap method that avoids over-resampling and achieves optimal statistical efficiency with minimal communication; simulations validate the theory.

In this paper, we propose a bootstrap method for massive data distributed across a large number of machines. The method is computationally efficient because the bootstrap runs on the master machine without the over-resampling typically required by existing methods (Kleiner et al., 2014; Sengupta et al., 2016), while provably achieving optimal statistical efficiency with minimal communication. It does not require repeatedly re-fitting the model; it only applies a multiplier bootstrap on the master machine to the gradients received from the worker machines. Simulations validate our theory.
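
The idea of bootstrapping on the master from worker gradients can be illustrated with a minimal sketch. It is not the paper's exact procedure: it assumes a linear regression with squared loss, uses a pooled least-squares fit as a stand-in for a communication-efficient estimator, and estimates the Hessian from one worker's data. The function name `multiplier_bootstrap_band` and all parameter names are hypothetical.

```python
import numpy as np

def multiplier_bootstrap_band(grads, hessian, n_local, n_boot=500, alpha=0.05, rng=None):
    """Half-width of a simultaneous confidence band from k worker gradients.

    grads:   (k, d) array, one averaged local gradient per worker
    hessian: (d, d) Hessian estimate available on the master
    n_local: number of samples held by each worker
    """
    rng = np.random.default_rng(rng)
    k, _ = grads.shape
    centered = grads - grads.mean(axis=0)           # center gradients across workers
    eps = rng.standard_normal((n_boot, k))          # Gaussian multipliers, one per worker
    draws = np.sqrt(n_local / k) * (eps @ centered) # (n_boot, d) perturbed gradient sums
    scaled = np.linalg.solve(hessian, draws.T).T    # map to parameter scale via H^{-1}
    stats = np.abs(scaled).max(axis=1)              # sup-norm statistic per bootstrap draw
    c = np.quantile(stats, 1 - alpha)
    return c / np.sqrt(n_local * k)                 # rescale by sqrt(total sample size)

# Simulated setup (assumption): k workers, n samples each, d coefficients.
data_rng = np.random.default_rng(0)
k, n, d = 50, 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = data_rng.standard_normal((k, n, d))
y = X @ theta_true + data_rng.standard_normal((k, n))

# Stand-in estimator: pooled least squares (a real system would avoid pooling raw data).
theta_hat = np.linalg.lstsq(X.reshape(-1, d), y.reshape(-1), rcond=None)[0]

# Each worker sends only its averaged squared-loss gradient at theta_hat.
resid = X @ theta_hat - y
grads = np.einsum("jni,jn->ji", X, resid) / n

# The master estimates the Hessian from its own data (here, worker 0).
hessian = X[0].T @ X[0] / n

half_width = multiplier_bootstrap_band(grads, hessian, n, rng=1)
lower, upper = theta_hat - half_width, theta_hat + half_width
```

The model is fitted once; every bootstrap replicate only reweights the k gradient vectors already on the master, so the resampling cost does not grow with the total sample size and each worker communicates a single d-dimensional vector.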

