Accumulation of Sub-Sampling Matrices with Applications to Statistical Computation
This work addresses computational bottlenecks in statistical inference for researchers and practitioners, but it is incremental as it builds on existing random projection frameworks.
The paper tackles the problem of expensive optimal sampling probabilities in sampling-based random projection for large-scale statistical methods by proposing accumulative sub-sampling, which improves statistical efficiency and reduces computational costs, as demonstrated through numerical experiments showing consistent efficiency gains over baselines.
With appropriately chosen sampling probabilities, sampling-based random projection can be used to implement large-scale statistical methods, substantially reducing computational cost while maintaining low statistical error. However, computing optimal sampling probabilities is often itself expensive, and in practice one typically resorts to suboptimal schemes. This generally leads to increased time and space costs, as more subsamples are required and the resulting projection matrices become larger, thereby making the inference procedure more computationally demanding. In this paper, we extend the framework of sampling-based random projection and propose a new projection method, \emph{accumulative sub-sampling}. By carefully accumulating multiple such projections, accumulative sub-sampling improves statistical efficiency while controlling the effective matrix size throughout the statistical computation. On the theoretical side, we quantify how the quality of the subsampling scheme affects the error in approximating matrix products and positive semidefinite matrices, and show how the proposed accumulation strategy mitigates this effect. Moreover, we apply our method to statistical models involving intensive matrix operations, such as eigendecomposition in spectral clustering and matrix inversion in kernel ridge regression, and demonstrate that reducing the effective matrix size leads to substantial computational savings. Numerical experiments across a range of problems further show that our approach consistently improves computational efficiency compared to existing random projection baselines under suboptimal sampling schemes.