Distributed sequential method for analyzing massive data
This work addresses efficient statistical analysis of large-scale data in fields such as environmental monitoring and energy management. It builds on existing sequential estimation and divide-and-conquer approaches, so the contribution appears incremental.
The authors address the analysis of massive data sets with lengthy variables by proposing a distributed sequential method that integrates parallel estimation procedures with adaptive sample selection. They support the method with theoretical justification and validate it on synthetic and real-world data sets, including appliance energy use and particulate matter concentration.
To analyze a very large data set containing lengthy variables, we adopt a sequential estimation idea and propose a parallel divide-and-conquer method. We conduct several conventional sequential estimation procedures separately, and properly integrate their results while maintaining the desired statistical properties. Additionally, using a criterion from statistical experimental design, we adopt an adaptive sample selection scheme, together with an adaptive shrinkage estimation method, to simultaneously accelerate the estimation procedure and identify the effective variables. We confirm the cogency of our methods through theoretical justifications and numerical results derived from synthesized data sets. We then apply the proposed method to three real data sets, including those pertaining to appliance energy use and particulate matter concentration.
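The divide-and-conquer idea described above can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: it estimates a scalar mean instead of regression coefficients, omits the adaptive sample selection and shrinkage steps, and uses a textbook fixed-width stopping rule. The function names (`sequential_mean`, `distributed_sequential_mean`), the pilot size `n0`, and the sample-size weighting used to combine block results are all assumptions made for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def sequential_mean(block, half_width, n0=30, z=Z_95):
    """Hypothetical helper: read observations one at a time until the
    estimated interval half-width z * s / sqrt(n) is at most `half_width`,
    or the block is exhausted. Returns (estimate, sample size used)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in block:
        # Welford's online update of the mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= n0:  # check the stopping rule only after a pilot sample
            s = math.sqrt(m2 / (n - 1))
            if z * s / math.sqrt(n) <= half_width:
                break
    return mean, n


def distributed_sequential_mean(data, n_blocks, block_half_width):
    """Hypothetical helper: run the sequential procedure on each block
    independently, then pool the block estimates weighted by the number
    of observations each block actually used (an assumed combination rule)."""
    blocks = [data[i::n_blocks] for i in range(n_blocks)]
    # Threads stand in for separate workers or machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        results = list(
            pool.map(lambda b: sequential_mean(b, block_half_width), blocks)
        )
    total_n = sum(n for _, n in results)
    pooled = sum(est * n for est, n in results) / total_n
    return pooled, total_n


if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(5.0, 2.0) for _ in range(200_000)]
    estimate, used = distributed_sequential_mean(data, 4, 0.1)
    print(f"pooled estimate {estimate:.3f} using {used} of {len(data)} observations")
```

Each block stops as soon as its own precision target is met, so only a fraction of the full data set is read. Under independence, pooling K blocks of similar size shrinks the half-width by roughly a factor of sqrt(K) relative to a single block, which is one way to choose the per-block target in a sketch like this.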