OCDCLGMLMay 24, 2023

Error Feedback Shines when Features are Rare

arXiv:2305.15264v12 citations
Originality Highly original
AI Analysis

This solves a long-standing open problem in distributed machine learning since 2014, offering improved communication efficiency for scenarios with rare features, such as federated learning with sparse data.

The paper tackles the problem of distributed optimization with gradient sparsification and error feedback, proving for the first time that this approach can achieve better communication complexity than vanilla gradient descent when features are rare, i.e., each feature is present in only a small number of clients, with a specific complexity bound of O((L + r√(c/n min(c/n max_i L_i^2, (1/n)∑ L_i^2))) / ε) bits per worker.

We provide the first proof that gradient descent $\left({\color{green}\sf GD}\right)$ with greedy sparsification $\left({\color{green}\sf TopK}\right)$ and error feedback $\left({\color{green}\sf EF}\right)$ can obtain better communication complexity than vanilla ${\color{green}\sf GD}$ when solving the distributed optimization problem $\min_{x\in \mathbb{R}^d} {f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)}$, where $n$ = # of clients, $d$ = # of features, and $f_1,\dots,f_n$ are smooth nonconvex functions. Despite intensive research since 2014 when ${\color{green}\sf EF}$ was first proposed by Seide et al., this problem remained open until now. We show that ${\color{green}\sf EF}$ shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\lVert {\nabla f(\hat{x})} \rVert^2 \leq \varepsilon$ in expectation, ${\color{green}\sf GD}$ with the ${\color{green}\sf Top1}$ sparsifier and ${\color{green}\sf EF}$ requires ${\cal O} \left(\left( L+{\color{blue}r} \sqrt{ \frac{{\color{red}c}}{n} \min \left( \frac{{\color{red}c}}{n} \max_i L_i^2, \frac{1}{n}\sum_{i=1}^n L_i^2 \right) }\right) \frac{1}{\varepsilon} \right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, ${\color{red}c}$ is the maximal number of clients owning any feature ($1\leq {\color{red}c} \leq n$), and ${\color{blue}r}$ is the maximal number of features owned by any client ($1\leq {\color{blue}r} \leq d$). Clearly, the communication complexity improves as ${\color{red}c}$ decreases (i.e., as features become more rare), and can be much better than the ${\cal O}({\color{blue}r} L \frac{1}{\varepsilon})$ communication complexity of ${\color{green}\sf GD}$ in the same regime.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes