Projected Forward Gradient-Guided Frank-Wolfe Algorithm via Variance Reduction
This work addresses optimization bottlenecks for deep learning practitioners and represents an incremental improvement to existing methods.
The paper tackles the high computational and memory costs of the Frank-Wolfe algorithm for training deep neural networks by applying a projected forward gradient method with variance reduction, achieving convergence to optimal solutions for convex functions and stationary points for non-convex functions.
This paper aims to enhance the use of the Frank-Wolfe (FW) algorithm for training deep neural networks (DNNs). Like any gradient-based optimization algorithm, FW suffers from high computational and memory costs when computing gradients for DNNs. This paper applies the recently proposed projected forward gradient (Projected-FG) method to the FW framework, offering reduced computational cost similar to backpropagation and low memory utilization akin to forward propagation. Our results show that a naive application of Projected-FG introduces a non-vanishing convergence error, because the stochastic noise of the method leaves a non-vanishing variance in the estimated gradient. To address this, we propose a variance reduction approach that aggregates historical Projected-FG directions. We rigorously demonstrate that this approach ensures convergence to the optimal solution for convex functions and to a stationary point for non-convex functions. These convergence properties are validated through a numerical example that showcases the approach's effectiveness and efficiency.
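To make the idea concrete, the following is a minimal sketch of a Frank-Wolfe loop driven by projected forward gradients and smoothed by averaging past directions. It is an illustration under stated assumptions, not the paper's exact algorithm: the function names, the toy quadratic objective, the L2-ball constraint, and the schedules rho_k = k^(-2/3) and gamma_k = 2/(k+2) are all hypothetical choices, borrowed in spirit from momentum-style stochastic Frank-Wolfe methods. The directional derivative is written in closed form here; for a neural network it would typically come from a single forward-mode automatic differentiation pass.

```python
import numpy as np


def projected_forward_gradient(dir_deriv, x, rng):
    """Estimate the gradient as (grad f(x) . u) u for a random Gaussian u.

    Only the directional derivative along u is needed, never the full gradient.
    The estimate is unbiased but has high variance.
    """
    u = rng.standard_normal(x.shape)
    return dir_deriv(x, u) * u


def lmo_l2_ball(d, radius):
    """Linear minimization oracle: argmin of <d, s> over the L2 ball."""
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return np.zeros_like(d)
    return -radius * d / norm


def fw_projected_fg(dir_deriv, x0, radius, num_iters, seed=0):
    """Frank-Wolfe with averaged projected forward gradient directions.

    The schedules below are assumed for illustration, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = np.zeros_like(x)
    for k in range(1, num_iters + 1):
        rho = 1.0 / k ** (2.0 / 3.0)   # averaging weight (assumed schedule)
        gamma = 2.0 / (k + 2.0)        # FW step size (assumed schedule)
        g_hat = projected_forward_gradient(dir_deriv, x, rng)
        d = (1.0 - rho) * d + rho * g_hat   # aggregate historical directions
        s = lmo_l2_ball(d, radius)
        x = x + gamma * (s - x)             # convex combination stays feasible
    return x


# Toy convex problem (hypothetical): minimize 0.5 * ||x - c||^2 over the unit ball.
c = np.array([2.0, 0.0, 0.0, 0.0, 0.0])


def f(x):
    return 0.5 * np.sum((x - c) ** 2)


def dir_deriv(x, u):
    # Directional derivative of f at x along u, i.e. grad f(x) . u.
    return (x - c) @ u


x_final = fw_projected_fg(dir_deriv, np.zeros(5), radius=1.0, num_iters=3000)
```

Setting rho to 1 at every iteration recovers the naive variant, in which each step uses only the latest noisy estimate. Comparing the two on the same toy problem is one way to observe the non-vanishing error that the averaging is meant to remove.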