Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
This work addresses scalable statistical inference for high-dimensional data in distributed computing environments, with an emphasis on communication efficiency.
The authors tackle simultaneous inference on massive high-dimensional data distributed across many machines by proposing a distributed bootstrap method that produces an ℓ∞-norm confidence region. Their theoretical guarantees show that the required number of communication rounds increases only logarithmically with the number of workers and the intrinsic dimensionality.
We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed across many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound $\tau_{\min}$ on the number of communication rounds that guarantees statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ increases only logarithmically with the number of workers and the intrinsic dimensionality, while remaining nearly invariant to the nominal dimensionality. We test our theory with extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available on GitHub: https://github.com/skchao74/Distributed-bootstrap.
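To make the idea of an $\ell_\infty$-norm confidence region concrete, the following is a minimal toy sketch and not the authors' procedure: it replaces the de-biased lasso with a plain mean vector, assumes equally sized workers that each send only their local mean to a master node, and uses a Gaussian multiplier bootstrap over the worker summaries. The function name `linf_confidence_radius` and all of its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def linf_confidence_radius(local_means, n_per_worker, alpha=0.05,
                           n_boot=2000, seed=0):
    """Toy l_inf confidence region for a mean vector from worker summaries.

    local_means: array of shape (k, d), one local sample mean per worker.
    n_per_worker: number of observations on each worker (assumed equal).
    Returns (center, radius) so that the region is
    {theta : max_j |center_j - theta_j| <= radius}.
    """
    rng = np.random.default_rng(seed)
    local_means = np.asarray(local_means, dtype=float)
    k, _ = local_means.shape
    total_n = k * n_per_worker

    # With equal worker sizes, the average of local means is the global mean.
    center = local_means.mean(axis=0)

    # Scaled, centered worker contributions: sqrt(n) * (local mean - center).
    dev = np.sqrt(n_per_worker) * (local_means - center)

    # One standard normal multiplier per worker and bootstrap draw.
    eps = rng.standard_normal((n_boot, k))

    # Bootstrap analog of sqrt(N) * (center - theta), shape (n_boot, d).
    boot = eps @ dev / np.sqrt(k)

    # The max-norm over coordinates gives simultaneous coverage.
    stat = np.abs(boot).max(axis=1)
    crit = np.quantile(stat, 1.0 - alpha)
    return center, crit / np.sqrt(total_n)

# Usage on simulated data: 20 workers, 50 observations each, 5 coordinates.
rng = np.random.default_rng(1)
data = rng.standard_normal((20, 50, 5))
center, radius = linf_confidence_radius(data.mean(axis=1), n_per_worker=50)
```

Because the region is defined through the maximum over coordinates, a single radius covers all coordinates at once, which is what distinguishes simultaneous inference from coordinate-wise intervals. In this toy version the master needs only one round of communication; the multi-round behavior described by $\tau_{\min}$ is specific to the de-biased lasso setting of the paper and is not reproduced here.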