h-index17
21papers
1,342citations
Novelty58%
AI Score51

21 Papers

LGJun 5, 2023
Improved Stability and Generalization Guarantees of the Decentralized SGD Algorithm

Batiste Le Bars, Aurélien Bellet, Marc Tommasi et al.

This paper presents a new generalization error analysis for Decentralized Stochastic Gradient Descent (D-SGD) based on algorithmic stability. The obtained results overhaul a series of recent works that suggested an increased instability due to decentralization and a detrimental impact of poorly-connected communication graphs on generalization. On the contrary, we show, for convex, strongly convex and non-convex functions, that D-SGD can always recover generalization bounds analogous to those of classical SGD, suggesting that the choice of graph does not matter. We then argue that this result is coming from a worst-case analysis, and we provide a refined optimization-dependent generalization bound for general convex functions. This new bound reveals that the choice of graph can in fact improve the worst-case bound in certain regimes, and that surprisingly, a poorly-connected graph can even be beneficial for generalization.

LGJan 19, 2023
Convergence beyond the over-parameterized regime using Rayleigh quotients

David A. R. Robin, Kevin Scaman, Marc Lelarge

In this paper, we present a new strategy to prove the convergence of deep learning architectures to a zero training (or even testing) loss by gradient flow. Our analysis is centered on the notion of Rayleigh quotients in order to prove Kurdyka-Łojasiewicz inequalities for a broader set of neural network architectures and loss functions. We show that Rayleigh quotients provide a unified view for several convergence analysis techniques in the literature. Our strategy produces a proof of convergence for various examples of parametric learning. In particular, our analysis does not require the number of parameters to tend to infinity, nor the number of samples to be finite, thus extending to test loss minimization and beyond the over-parameterized regime.

LGNov 21, 2022
SIFU: Sequential Informed Federated Unlearning for Efficient and Provable Client Unlearning in Federated Optimization

Yann Fraboni, Martin Van Waerebeke, Kevin Scaman et al.

Machine Unlearning (MU) is an increasingly important topic in machine learning safety, aiming at removing the contribution of a given data point from a training procedure. Federated Unlearning (FU) consists in extending MU to unlearn a given client's contribution from a federated training routine. While several FU methods have been proposed, we currently lack a general approach providing formal unlearning guarantees to the FedAvg routine, while ensuring scalability and generalization beyond the convex assumption on the clients' loss functions. We aim at filling this gap by proposing SIFU (Sequential Informed Federated Unlearning), a new FU method applying to both convex and non-convex optimization regimes. SIFU naturally applies to FedAvg without additional computational cost for the clients and provides formal guarantees on the quality of the unlearning task. We provide a theoretical analysis of the unlearning properties of SIFU, and practically demonstrate its effectiveness as compared to a panel of unlearning methods from the state-of-the-art.

LGSep 13, 2024
In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting

Constantin Philippenko, Kevin Scaman, Laurent Massoulié

We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients, each holding a local dataset $\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$, mathematically, we seek to solve $min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r} } \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$. Considering a power initialization of $\mathbf{V}$, we rewrite the previous smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent potentially requiring a single step of communication at the initialization step. For any client $i$ in $\{1, \dots, N\}$, we obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local variable $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$. We provide a linear rate of convergence of the excess loss which depends on $σ_{\max} / σ_{r}$, where $σ_{r}$ is the $r^{\mathrm{th}}$ singular value of the concatenation $\mathbf{S}$ of the matrices $(\mathbf{S}^i)_{i=1}^N$. This result improves the rates of convergence given in the literature, which depend on $σ_{\max}^2 / σ_{\min}^2$. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.

LGFeb 16
Unbiased Approximate Vector-Jacobian Products for Efficient Backpropagation

Killian Bakong, Laurent Massoulié, Edouard Oyallon et al.

In this work we introduce methods to reduce the computational and memory costs of training deep neural networks. Our approach consists in replacing exact vector-jacobian products by randomized, unbiased approximations thereof during backpropagation. We provide a theoretical analysis of the trade-off between the number of epochs needed to achieve a target precision and the cost reduction for each epoch. We then identify specific unbiased estimates of vector-jacobian products for which we establish desirable optimality properties of minimal variance under sparsity constraints. Finally we provide in-depth experiments on multi-layer perceptrons, BagNets and Visual Transfomers architectures. These validate our theoretical results, and confirm the potential of our proposed unbiased randomized backpropagation approach for reducing the cost of deep learning.

LGJul 10, 2023
Minimax Excess Risk of First-Order Methods for Statistical Learning with Data-Dependent Oracles

Kevin Scaman, Mathieu Even, Batiste Le Bars et al.

In this paper, our aim is to analyse the generalization capabilities of first-order methods for statistical learning in multiple, different yet related, scenarios including supervised learning, transfer learning, robust learning and federated learning. To do so, we provide sharp upper and lower bounds for the minimax excess risk of strongly convex and smooth statistical learning when the gradient is accessed through partial observations given by a data-dependent oracle. This novel class of oracles can query the gradient with any given data distribution, and is thus well suited to scenarios in which the training data distribution does not match the target (or test) distribution. In particular, our upper and lower bounds are proportional to the smallest mean square error achievable by gradient estimators, thus allowing us to easily derive multiple sharp bounds in the aforementioned scenarios using the extensive literature on parameter estimation.

LGFeb 16
Variance-Reduced $(\varepsilon,δ)-$Unlearning using Forget Set Gradients

Martin Van Waerebeke, Marco Lorenzi, Kevin Scaman et al.

In machine unlearning, $(\varepsilon,δ)-$unlearning is a popular framework that provides formal guarantees on the effectiveness of the removal of a subset of training data, the forget set, from a trained model. For strongly convex objectives, existing first-order methods achieve $(\varepsilon,δ)-$unlearning, but they only use the forget set to calibrate injected noise, never as a direct optimization signal. In contrast, efficient empirical heuristics often exploit the forget samples (e.g., via gradient ascent) but come with no formal unlearning guarantees. We bridge this gap by presenting the Variance-Reduced Unlearning (VRU) algorithm. To the best of our knowledge, VRU is the first first-order algorithm that directly includes forget set gradients in its update rule, while provably satisfying ($(\varepsilon,δ)-$unlearning. We establish the convergence of VRU and show that incorporating the forget set yields strictly improved rates, i.e. a better dependence on the achieved error compared to existing first-order $(\varepsilon,δ)-$unlearning methods. Moreover, we prove that, in a low-error regime, VRU asymptotically outperforms any first-order method that ignores the forget set.Experiments corroborate our theory, showing consistent gains over both state-of-the-art certified unlearning methods and over empirical baselines that explicitly leverage the forget set.

MLFeb 24, 2025
When to Forget? Complexity Trade-offs in Machine Unlearning

Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia et al.

Machine Unlearning (MU) aims at removing the influence of specific data points from a trained model, striving to achieve this at a fraction of the cost of full model retraining. In this paper, we analyze the efficiency of unlearning methods and establish the first upper and lower bounds on minimax computation times for this problem, characterizing the performance of the most efficient algorithm against the most difficult objective function. Specifically, for strongly convex objective functions and under the assumption that the forget data is inaccessible to the unlearning method, we provide a phase diagram for the unlearning complexity ratio -- a novel metric that compares the computational cost of the best unlearning method to full model retraining. The phase diagram reveals three distinct regimes: one where unlearning at a reduced cost is infeasible, another where unlearning is trivial because adding noise suffices, and a third where unlearning achieves significant computational advantages over retraining. These findings highlight the critical role of factors such as data dimensionality, the number of samples to forget, and privacy constraints in determining the practical feasibility of unlearning.

MLJul 9, 2025
Adaptive collaboration for online personalized distributed learning with heterogeneous clients

Constantin Philippenko, Batiste Le Bars, Kevin Scaman et al.

We study the problem of online personalized decentralized learning with $N$ statistically heterogeneous clients collaborating to accelerate local training. An important challenge in this setting is to select relevant collaborators to reduce gradient variance while mitigating the introduced bias. To tackle this, we introduce a gradient-based collaboration criterion, allowing each client to dynamically select peers with similar gradients during the optimization process. Our criterion is motivated by a refined and more general theoretical analysis of the All-for-one algorithm, proved to be optimal in Even et al. (2022) for an oracle collaboration scheme. We derive excess loss upper-bounds for smooth objective functions, being either strongly convex, non-convex, or satisfying the Polyak-Lojasiewicz condition; our analysis reveals that the algorithm acts as a variance reduction method where the speed-up depends on a sufficient variance. We put forward two collaboration methods instantiating the proposed general schema; and we show that one variant preserves the optimality of All-for-one. We validate our results with experiments on synthetic and real datasets.

MLJun 2, 2021
Tight High Probability Bounds for Linear Stochastic Approximation with Fixed Stepsize

Alain Durmus, Eric Moulines, Alexey Naumov et al.

This paper provides a non-asymptotic analysis of linear stochastic approximation (LSA) algorithms with fixed stepsize. This family of methods arises in many machine learning tasks and is used to obtain approximate solutions of a linear system $\bar{A}θ= \bar{b}$ for which $\bar{A}$ and $\bar{b}$ can only be accessed through random estimates $\{({\bf A}_n, {\bf b}_n): n \in \mathbb{N}^*\}$. Our analysis is based on new results regarding moments and high probability bounds for products of matrices which are shown to be tight. We derive high probability bounds on the performance of LSA under weaker conditions on the sequence $\{({\bf A}_n, {\bf b}_n): n \in \mathbb{N}^*\}$ than previous works. However, in contrast, we establish polynomial concentration bounds with order depending on the stepsize. We show that our conclusions cannot be improved without additional assumptions on the sequence of random matrices $\{{\bf A}_n: n \in \mathbb{N}^*\}$, and in particular that no Gaussian or exponential high probability bounds can hold. Finally, we pay a particular attention to establishing bounds with sharp order with respect to the number of iterations and the stepsize and whose leading terms contain the covariance matrices appearing in the central limit theorems.

LGMar 8, 2021
Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks

George Dasoulas, Kevin Scaman, Aladin Virmaux

Attention based neural networks are state of the art in a large range of applications. However, their performance tends to degrade when the number of layers increases. In this work, we show that enforcing Lipschitz continuity by normalizing the attention scores can significantly improve the performance of deep attention models. First, we show that, for deep graph attention networks (GAT), gradient explosion appears during training, leading to poor performance of gradient-based training algorithms. To address this issue, we derive a theoretical analysis of the Lipschitz continuity of attention modules and introduce LipschitzNorm, a simple and parameter-free normalization for self-attention mechanisms that enforces the model to be Lipschitz continuous. We then apply LipschitzNorm to GAT and Graph Transformers and show that their performance is substantially improved in the deep setting (10 to 30 layers). More specifically, we show that a deep GAT model with LipschitzNorm achieves state of the art results for node label prediction tasks that exhibit long-range dependencies, while showing consistent improvements over their unnormalized counterparts in benchmark node classification tasks.

LGFeb 17, 2021
Improving Hierarchical Adversarial Robustness of Deep Neural Networks

Avery Ma, Aladin Virmaux, Kevin Scaman et al.

Do all adversarial examples have the same consequences? An autonomous driving system misclassifying a pedestrian as a car may induce a far more dangerous -- and even potentially lethal -- behavior than, for instance, a car as a bus. In order to better tackle this important problematic, we introduce the concept of hierarchical adversarial robustness. Given a dataset whose classes can be grouped into coarse-level labels, we define hierarchical adversarial examples as the ones leading to a misclassification at the coarse level. To improve the resistance of neural networks to hierarchical attacks, we introduce a hierarchical adversarially robust (HAR) network design that decomposes a single classification task into one coarse and multiple fine classification tasks, before being specifically trained by adversarial defense techniques. As an alternative to an end-to-end learning approach, we show that HAR significantly improves the robustness of the network against $\ell_2$ and $\ell_{\infty}$ bounded hierarchical attacks on the CIFAR-10 and CIFAR-100 dataset.

LGFeb 17, 2021
Ego-based Entropy Measures for Structural Representations on Graphs

George Dasoulas, Giannis Nikolentzos, Kevin Scaman et al.

Machine learning on graph-structured data has attracted high research interest due to the emergence of Graph Neural Networks (GNNs). Most of the proposed GNNs are based on the node homophily, i.e neighboring nodes share similar characteristics. However, in many complex networks, nodes that lie to distant parts of the graph share structurally equivalent characteristics and exhibit similar roles (e.g chemical properties of distant atoms in a molecule, type of social network users). A growing literature proposed representations that identify structurally equivalent nodes. However, most of the existing methods require high time and space complexity. In this paper, we propose VNEstruct, a simple approach, based on entropy measures of the neighborhood's topology, for generating low-dimensional structural representations, that is time-efficient and robust to graph perturbations. Empirically, we observe that VNEstruct exhibits robustness on structural role identification tasks. Moreover, VNEstruct can achieve state-of-the-art performance on graph classification, without incorporating the graph structure information in the optimization, in contrast to GNN competitors.

LGMar 1, 2020
Ego-based Entropy Measures for Structural Representations

George Dasoulas, Giannis Nikolentzos, Kevin Scaman et al.

In complex networks, nodes that share similar structural characteristics often exhibit similar roles (e.g type of users in a social network or the hierarchical position of employees in a company). In order to leverage this relationship, a growing literature proposed latent representations that identify structurally equivalent nodes. However, most of the existing methods require high time and space complexity. In this paper, we propose VNEstruct, a simple approach for generating low-dimensional structural node embeddings, that is both time efficient and robust to perturbations of the graph structure. The proposed approach focuses on the local neighborhood of each node and employs the Von Neumann entropy, an information-theoretic tool, to extract features that capture the neighborhood's topology. Moreover, on graph classification tasks, we suggest the utilization of the generated structural embeddings for the transformation of an attributed graph structure into a set of augmented node attributes. Empirically, we observe that the proposed approach exhibits robustness on structural role identification tasks and state-of-the-art performance on graph classification tasks, while maintaining very high computational speed.

LGDec 12, 2019
Coloring graph neural networks for node disambiguation

George Dasoulas, Ludovic Dos Santos, Kevin Scaman et al.

In this paper, we show that a simple coloring scheme can improve, both theoretically and empirically, the expressive power of Message Passing Neural Networks(MPNNs). More specifically, we introduce a graph neural network called Colored Local Iterative Procedure (CLIP) that uses colors to disambiguate identical node attributes, and show that this representation is a universal approximator of continuous functions on graphs with node attributes. Our method relies on separability , a key topological characteristic that allows to extend well-chosen neural networks into universal representations. Finally, we show experimentally that CLIP is capable of capturing structural characteristics that traditional MPNNs fail to distinguish,while being state-of-the-art on benchmark graph classification datasets.

MLOct 11, 2019
Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

Igor Colin, Ludovic Dos Santos, Kevin Scaman

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension. While the convergence rate still obeys a slow $\varepsilon^{-2}$ convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent for problems where the sample size is limited, such as few-shot or adversarial learning.

MLMay 28, 2018
Lipschitz regularity of deep neural networks: analysis and efficient estimation

Kevin Scaman, Aladin Virmaux

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two layer neural networks, the exact computation of this quantity is NP-hard and state-of-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations.

LGMay 25, 2018
KONG: Kernels for ordered-neighborhood graphs

Moez Draief, Konstantin Kutzkov, Kevin Scaman et al.

We present novel graph kernels for graphs with node and edge labels that have ordered neighborhoods, i.e. when neighbor nodes follow an order. Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order. Combining convolutional subgraph kernels and string kernels, we design new scalable algorithms for generation of explicit graph feature maps using sketching techniques. We obtain precise bounds for the approximation accuracy and computational complexity of the proposed approaches and demonstrate their applicability on real datasets. In particular, our experiments demonstrate that neighborhood ordering results in more informative features. For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.

MLSep 15, 2017
A Spectral Method for Activity Shaping in Continuous-Time Information Cascades

Kevin Scaman, Argyris Kalogeratos, Luca Corinzia et al.

Information Cascades Model captures dynamical properties of user activity in a social network. In this work, we develop a novel framework for activity shaping under the Continuous-Time Information Cascades Model which allows the administrator for local control actions by allocating targeted resources that can alter the spread of the process. Our framework employs the optimization of the spectral radius of the Hazard matrix, a quantity that has been shown to drive the maximum influence in a network, while enjoying a simple convex relaxation when used to minimize the influence of the cascade. In addition, use-cases such as quarantine and node immunization are discussed to highlight the generality of the proposed activity shaping framework. Finally, we present the NetShape influence minimization method which is compared favorably to baseline and state-of-the-art approaches through simulations on real social networks.

OCFeb 28, 2017
Optimal algorithms for smooth and strongly convex distributed optimization in networks

Kevin Scaman, Francis Bach, Sébastien Bubeck et al.

In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\varepsilon > 0$ in time $O(\sqrt{κ_g}(1+Δτ)\ln(1/\varepsilon))$, where $κ_g$ is the condition number of the (global) function to optimize, $Δ$ is the diameter of the network, and $τ$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\varepsilon > 0$ in time $O(\sqrt{κ_l}(1+\fracτ{\sqrtγ})\ln(1/\varepsilon))$, where $κ_l$ is the condition number of the local functions and $γ$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.

MLFeb 26, 2016
Multivariate Hawkes Processes for Large-scale Inference

Rémi Lemonnier, Kevin Scaman, Argyris Kalogeratos

In this paper, we present a framework for fitting multivariate Hawkes processes for large-scale problems both in the number of events in the observed history $n$ and the number of event types $d$ (i.e. dimensions). The proposed Low-Rank Hawkes Process (LRHP) framework introduces a low-rank approximation of the kernel matrix that allows to perform the nonparametric learning of the $d^2$ triggering kernels using at most $O(ndr^2)$ operations, where $r$ is the rank of the approximation ($r \ll d,n$). This comes as a major improvement to the existing state-of-the-art inference algorithms that are in $O(nd^2)$. Furthermore, the low-rank approximation allows LRHP to learn representative patterns of interaction between event types, which may be valuable for the analysis of such complex processes in real world datasets. The efficiency and scalability of our approach is illustrated with numerical experiments on simulated as well as real datasets.