Communication-Efficient l_0 Penalized Least Square
This work addresses communication bottlenecks and privacy concerns in distributed machine learning, a problem relevant to both researchers and practitioners, though the contribution is incremental because it builds on existing methods.
The paper tackles high-dimensional sparse linear regression with massive distributed data by proposing a communication-efficient algorithm (CESDAR) that avoids raw data transmission, enhancing privacy and speed while achieving the same statistical accuracy as a global estimator.
In this paper, we propose a communication-efficient penalized regression algorithm for high-dimensional sparse linear regression models with massive data. The algorithm, named CESDAR, is a communication-optimized distributed algorithm based on the Enhanced Support Detection and Root finding algorithm. CESDAR uses data distributed across multiple machines to compute and update the active set, and it introduces the communication-efficient surrogate likelihood framework to approximate the full-sample optimal solution on that active set. Because no raw data is transmitted, the approach enhances privacy and data security while significantly improving execution speed and substantially reducing communication costs. Notably, it achieves the same statistical accuracy as the global estimator. We also explore an extended version of CESDAR, which improves algorithmic speed, and an adaptive version, which optimizes parameter selection. Simulations and real-data benchmark experiments demonstrate the efficiency and accuracy of the CESDAR algorithm.
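
The two steps described in the abstract, updating the active set from distributed data and then solving a surrogate problem on that set, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a squared-error loss, a known sparsity level T, a plain support-detection rule without the enhancements the paper mentions, and that the first machine acts as the central machine. The function names local_gradient and cesdar_sketch are hypothetical. Each round, every machine sends only a gradient vector of length p, never its rows of data.

```python
import numpy as np

def local_gradient(X, y, beta):
    """Gradient of the local loss (1/2n)||y - X beta||^2 at beta."""
    return -X.T @ (y - X @ beta) / len(y)

def cesdar_sketch(machines, T, n_iter=50, tol=1e-10):
    """Illustrative sketch only (hypothetical name, not the paper's code).

    machines: list of (X_k, y_k) pairs; machines[0] plays the central machine.
    T: assumed sparsity level (size of the active set).
    """
    X1, y1 = machines[0]
    n1, p = X1.shape
    sizes = np.array([len(y) for _, y in machines], dtype=float)
    beta = np.zeros(p)
    active = np.array([], dtype=int)
    for _ in range(n_iter):
        # Communication round: each machine contributes one p-vector.
        grads = [local_gradient(X, y, beta) for X, y in machines]
        g_global = np.average(grads, axis=0, weights=sizes)
        g_local = grads[0]
        # Support detection: keep the T largest entries of |beta + d|,
        # where d = -g_global is the full-sample dual variable.
        new_active = np.sort(np.argsort(-np.abs(beta - g_global))[:T])
        # Root finding on the active set with a surrogate loss:
        # local loss minus the linear term <g_local - g_global, b>.
        XA = X1[:, new_active]
        H = XA.T @ XA / n1
        rhs = XA.T @ y1 / n1 + (g_local - g_global)[new_active]
        new_beta = np.zeros(p)
        new_beta[new_active] = np.linalg.solve(H, rhs)
        converged = (np.array_equal(new_active, active)
                     and np.max(np.abs(new_beta - beta)) < tol)
        beta, active = new_beta, new_active
        if converged:
            break
    return beta, active
```

In this sketch the surrogate step uses only the central machine's local Gram matrix together with the aggregated gradient, so the per-round communication is one vector of length p per machine. How closely this matches the update rules and stopping criteria in the paper is an assumption.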