ML | LG | SP | Jan 22, 2021

Linear Regression with Distributed Learning: A Generalization Error Perspective

arXiv:2101.09001v3 | 10 citations
Originality: Incremental advance
AI Analysis

This work addresses the gap in generalization performance between distributed and centralized learning for large-scale linear regression. The advance is incremental but important for practitioners scaling ML systems.

The paper investigates the generalization error of distributed linear regression, showing that it can be substantially higher than that of the centralized solution even when the training error is similar. It provides high-probability bounds for isotropic Gaussian, correlated Gaussian, and sub-Gaussian data.

Distributed learning provides an attractive framework for scaling the learning task by sharing the computational load over multiple nodes in a network. Here, we investigate the performance of distributed learning for large-scale linear regression where the model parameters, i.e., the unknowns, are distributed over the network. We adopt a statistical learning approach. In contrast to works that focus on the performance on the training data, we focus on the generalization error, i.e., the performance on unseen data. We provide high-probability bounds on the generalization error for both isotropic and correlated Gaussian data, as well as for sub-Gaussian data. These results reveal the dependence of the generalization performance on the partitioning of the model over the network. In particular, our results show that the generalization error of the distributed solution can be substantially higher than that of the centralized solution even when the error on the training data is at the same level for both the centralized and distributed approaches. Our numerical results illustrate the performance with both real-world image data and synthetic data.
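
The setting in the abstract can be illustrated with a small synthetic experiment. The sketch below is only an assumption-laden illustration, not the paper's algorithm: the abstract does not name the distributed method, so a simple block-averaged least-squares update stands in for it here. The function names, the column-wise partitioning via `np.array_split`, and the use of the squared parameter error as a proxy for generalization error (valid for isotropic, unit-variance features and excluding the noise floor) are all choices made for this example.

```python
import numpy as np


def centralized_solution(A, y):
    """Minimum-norm least-squares solution using all unknowns at once."""
    return np.linalg.pinv(A) @ y


def distributed_solution(A, y, num_nodes, num_iters=50):
    """Hypothetical distributed scheme: the unknowns (columns of A) are split
    over `num_nodes` nodes. In each round every node solves a local
    least-squares problem against the shared residual, and the local
    updates are averaged (scaled by 1 / num_nodes)."""
    n, p = A.shape
    blocks = np.array_split(np.arange(p), num_nodes)
    # Each node only ever needs the pseudoinverse of its own column block.
    local_pinvs = [np.linalg.pinv(A[:, idx]) for idx in blocks]
    x = np.zeros(p)
    for _ in range(num_iters):
        residual = y - A @ x  # shared residual, same for all nodes this round
        for idx, pinv_k in zip(blocks, local_pinvs):
            x[idx] += (pinv_k @ residual) / num_nodes
    return x


def generalization_gap(x_hat, x_true):
    """Expected squared prediction error on a fresh isotropic, unit-variance
    sample, excluding observation noise: E[(a^T (x_hat - x_true))^2]."""
    return float(np.sum((x_hat - x_true) ** 2))


def demo(seed=0, n=50, p=100, num_nodes=2, noise_std=0.1):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))
    x_true = rng.standard_normal(p)
    y = A @ x_true + noise_std * rng.standard_normal(n)

    for name, x_hat in [
        ("centralized", centralized_solution(A, y)),
        ("distributed", distributed_solution(A, y, num_nodes)),
    ]:
        train_err = np.mean((y - A @ x_hat) ** 2)
        print(f"{name}: train MSE = {train_err:.3e}, "
              f"generalization gap = {generalization_gap(x_hat, x_true):.3f}")


if __name__ == "__main__":
    demo()
```

Varying `n`, `p`, and `num_nodes` in `demo` is one way to explore how the partitioning of the unknowns affects the error on unseen data relative to the training error. The numbers produced by this toy setup are not a reproduction of the paper's results.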

