LG OC PR Jul 21, 2023

Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

arXiv:2307.11714v3 9.88 citations h-index: 4

Originality: Incremental advance

AI Analysis

The paper provides theoretical validation for behaviour that has so far only been observed empirically, namely convergence of SGD when training with Sliced Wasserstein losses. This addresses a gap for researchers who use optimal transport in neural network training. The contribution is incremental, since it builds on existing convergence frameworks.

The paper tackles the lack of theoretical guarantees for the convergence of Stochastic Gradient Descent (SGD) when training neural networks with Sliced Wasserstein losses, showing that trajectories approach gradient flow equations as step size decreases and, under stricter assumptions, converge to critical points of the loss function.

Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has seen uses for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
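To make the loss discussed in the abstract concrete, here is a minimal sketch of a Monte Carlo estimate of the squared 2-Sliced Wasserstein distance between two point clouds. It is not taken from the paper: the function names (`random_direction`, `sliced_wasserstein_sq`), the number of projections, and the restriction to equal-size clouds with uniform weights are all assumptions made for illustration. The idea is to project both clouds onto random directions, where the one-dimensional optimal transport cost reduces to matching sorted values, and to average the result over directions.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalising a Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def sliced_wasserstein_sq(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of the squared 2-Sliced Wasserstein distance.

    Illustrative assumption: xs and ys are lists of points (lists of floats)
    with the same length and uniform weights.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size point clouds")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        # Project both clouds onto the line spanned by theta and sort.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        # In 1D, the optimal coupling of equal-size uniform samples pairs sorted values.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

In a training setting of the kind the abstract describes, `xs` would be samples produced by the network and `ys` samples from the target distribution. The sorting step is one reason the loss is non-smooth in the network parameters, which is why convergence results for non-smooth, non-convex SGD are relevant here.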

