Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks
This addresses the challenge of non-convex optimization in deep learning for researchers and practitioners, offering a theoretically grounded alternative to SGD, though it appears incremental as it builds on existing regularization and optimization techniques.
The paper tackles the problem of training deep neural networks by proposing a smooth multi-convex formulation via lifting ReLU into a higher-dimensional space, resulting in a block coordinate descent algorithm with proven global convergence and R-linear rate. In experiments on MNIST, this method consistently achieved better test-set error rates than SGD variants in Caffe.
By lifting the ReLU function into a higher dimensional space, we develop a smooth multi-convex formulation for training feed-forward deep neural networks (DNNs). This allows us to develop a block coordinate descent (BCD) training algorithm consisting of a sequence of numerically well-behaved convex optimizations. Using ideas from proximal point methods in convex analysis, we prove that this BCD algorithm will converge globally to a stationary point with R-linear convergence rate of order one. In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.