Mikhail Rudakov

IT
Semantic Scholar Profile
h-index 22
3 papers
7 citations
Novelty 57%
AI Score 40

3 Papers

OC | Feb 17
Exploring New Frontiers in Vertical Federated Learning: the Role of Saddle Point Reformulation

Aleksandr Beznosikov, Georgiy Kormakov, Alexander Grigorievskiy et al.

The objective of Vertical Federated Learning (VFL) is to collectively train a model using features that are held on different devices but describe the same users. This paper focuses on the saddle point reformulation of the VFL problem via the classical Lagrangian function. We first demonstrate how this formulation can be solved using deterministic methods. More importantly, we explore various stochastic modifications to adapt to practical scenarios, such as employing compression techniques for efficient information transmission, enabling partial participation for asynchronous communication, and utilizing coordinate selection for faster local computation. We show that the saddle point reformulation plays a key role and makes it possible to use the mentioned extensions, which seem to be impossible in the standard minimization formulation. Convergence estimates are provided for each algorithm, demonstrating their effectiveness in addressing the VFL problem. Additionally, alternative reformulations are investigated, and numerical experiments are conducted to validate the performance and effectiveness of the proposed approach.
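
As a rough illustration of what a Lagrangian saddle point reformulation of VFL can look like, consider a linear model in which device $i$ holds a feature block $X_i$ and weights $w_i$, with a loss $\ell$, labels $y$, regularizers $r_i$, and an auxiliary variable $z$. All of this notation is assumed here for illustration and is not taken from the abstract; the paper's exact formulation may differ.

```latex
% Constrained form: z collects the sum of the per-device predictions.
\min_{w_1,\dots,w_n,\; z} \;\; \ell(z, y) + \sum_{i=1}^{n} r_i(w_i)
\quad \text{s.t.} \quad \sum_{i=1}^{n} X_i w_i = z

% Saddle point form: the constraint is moved into the objective
% through a Lagrange multiplier \lambda.
\min_{w_1,\dots,w_n,\; z} \; \max_{\lambda} \;\;
\ell(z, y) + \sum_{i=1}^{n} r_i(w_i)
+ \lambda^{\top} \Bigl( \sum_{i=1}^{n} X_i w_i - z \Bigr)
```

In this sketch the coupling between devices enters only through the term $\lambda^{\top} \sum_i X_i w_i$, which is linear in each $w_i$. That separable structure is one plausible reason why compression, partial participation, and coordinate selection are easier to apply here than in the plain minimization form.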

18.9 | IT | Mar 26
List Estimation

Nikola Zlatanov, Amin Gohari, Farzad Shahrivari et al.

Classical estimation outputs a single point estimate of an unknown $d$-dimensional vector from an observation. In this paper, we study \emph{$k$-list estimation}, in which a single observation is used to produce a list of $k$ candidate estimates and performance is measured by the expected squared distance from the true vector to the closest candidate. We compare this centralized setting with a symmetric decentralized MMSE benchmark in which $k$ agents observe conditionally i.i.d.\ measurements and each agent outputs its own MMSE estimate. On the centralized side, we show that optimal $k$-list estimation is equivalent to fixed-rate $k$-point vector quantization of the posterior distribution and, under standard regularity conditions, admits an exact high-rate asymptotic expansion with explicit constants and decay rate $k^{-2/d}$. On the decentralized side, we derive lower bounds in terms of the small-ball behavior of the single-agent MMSE error; in particular, when the conditional error density is bounded near the origin, the benchmark distortion cannot decay faster than order $k^{-2/d}$. We further show that if the error density vanishes at the origin, then the decentralized benchmark is provably unable to match the centralized $k^{-2/d}$ exponent, whereas the centralized estimator retains that scaling. Gaussian specializations yield explicit formulas and numerical experiments corroborate the predicted asymptotic behavior. Overall, the results show that, in the scaling with $k$, one observation combined with $k$ carefully chosen candidates can be asymptotically as effective as -- and in some regimes strictly better than -- this MMSE-based decentralized benchmark with $k$ independent observations.
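
The equivalence between $k$-list estimation and $k$-point quantization of the posterior can be illustrated numerically. The sketch below is a minimal example and not the paper's method: it assumes posterior samples are available and runs Lloyd's algorithm, which finds only a locally optimal quantizer. The function name `lloyd_list_estimate` and its parameters are hypothetical.

```python
import numpy as np

def lloyd_list_estimate(samples, k, iters=50, seed=0):
    """Approximate a k-list estimate from posterior samples (hypothetical helper).

    Runs Lloyd's algorithm on the samples and returns the k candidate
    points together with the empirical expected squared distance from a
    sample to its closest candidate.
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]  # treat a 1-D input as n scalar samples

    # Initialize the candidates with k distinct samples.
    centers = samples[rng.choice(len(samples), size=k, replace=False)].copy()

    for _ in range(iters):
        # Squared distance from every sample to every candidate.
        d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Move each candidate to the mean of the samples assigned to it.
        for j in range(k):
            members = samples[nearest == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    distortion = d2.min(axis=1).mean()
    return centers, distortion

# Assumed toy setting: the posterior is a standard scalar Gaussian.
posterior_samples = np.random.default_rng(1).normal(size=5000)
for k in (1, 2, 4, 8):
    _, dist = lloyd_list_estimate(posterior_samples, k)
    print(k, dist)
```

With `k=1` the single candidate is the sample mean, so the distortion reduces to the empirical posterior variance, which matches the usual MMSE point estimate. Larger `k` should lower the distortion, although this toy run is not meant to verify the $k^{-2/d}$ rate.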

LG | Jan 15, 2024
Activations and Gradients Compression for Model-Parallel Training

Mikhail Rudakov, Aleksandr Beznosikov, Yaroslav Kholodov et al.

Large neural networks require enormous computational clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information compression can be applied to decrease workers' communication time, as it is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level that does not harm model convergence severely. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK compression stronger than $K=30\%$ worsens model performance significantly.
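
For readers unfamiliar with the compression operators mentioned above, here is a minimal NumPy sketch of TopK compression and of a generic error feedback wrapper. It is an illustration under stated assumptions, not the authors' implementation: the names `topk_compress` and `ErrorFeedbackTopK` are hypothetical, and the wrapper shows plain residual accumulation rather than the AQ-SGD per-batch scheme.

```python
import numpy as np

def topk_compress(x, k_fraction):
    """Keep the k_fraction largest-magnitude entries of x and zero the rest."""
    flat = x.ravel()
    k = max(1, int(round(k_fraction * flat.size)))
    # After partitioning, the last k positions hold the indices of the
    # k largest absolute values (in no particular order).
    idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)

class ErrorFeedbackTopK:
    """Generic error feedback: add back what earlier compressions dropped."""

    def __init__(self, k_fraction):
        self.k_fraction = k_fraction
        self.residual = None

    def compress(self, x):
        if self.residual is None:
            self.residual = np.zeros_like(x)
        corrected = x + self.residual      # re-inject the accumulated error
        sent = topk_compress(corrected, self.k_fraction)
        self.residual = corrected - sent   # remember what was not transmitted
        return sent

# Toy activations or gradients; K = 10% keeps one entry out of ten.
x = np.array([0.1, -3.0, 0.5, 2.0, -0.2, 0.05, 1.0, -0.4, 0.3, 0.0])
compressed = topk_compress(x, 0.10)
ef = ErrorFeedbackTopK(0.10)
first_message = ef.compress(x)
```

In a model-parallel pipeline, a function like `topk_compress` would be applied to activations in the forward pass and to gradients in the backward pass at each worker boundary; the abstract's finding is that gradients tolerate less aggressive compression than activations.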