ML Mar 22, 2019
Gradient-only line searches: An Alternative to Probabilistic Line Searches
Dominic Kafka, Daniel Wilke
Step sizes in neural network training are largely determined using predetermined rules such as fixed learning rates and learning rate schedules. These require user input or expensive global optimization strategies to determine their functional form and associated hyperparameters. Line searches are capable of adaptively resolving learning rate schedules. However, due to discontinuities induced by mini-batch sub-sampling, they have largely fallen out of favour. Nevertheless, probabilistic line searches, which use statistical surrogates over a limited spatial domain, have recently demonstrated viability in resolving learning rates for stochastic loss functions. This paper introduces Gradient-Only Line Searches that are Inexact (GOLS-I) as an alternative strategy that automatically determines learning rates in stochastic loss functions over a range of 15 orders of magnitude without the use of surrogates. We show that GOLS-I is a competitive strategy for reliably determining step sizes, adding high value in terms of performance while being easy to implement.
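To make the idea of a line search driven only by gradient information concrete, the following is a minimal sketch and not the authors' GOLS-I algorithm. It assumes a simple grow-or-shrink rule: the step size is doubled while the directional derivative along the search direction is still negative, and halved once it is non-negative. The function names, the factor of 2, and the bounds `alpha_min=1e-8` and `alpha_max=1e7` (chosen here only so that the range spans 15 orders of magnitude) are illustrative assumptions.

```python
import numpy as np

def directional_derivative(grad_fn, x, d, alpha):
    """Directional derivative along d, evaluated at the point x + alpha * d."""
    return float(np.dot(grad_fn(x + alpha * d), d))

def gradient_only_step(grad_fn, x, d, alpha0=1.0, factor=2.0,
                       alpha_min=1e-8, alpha_max=1e7, max_iter=100):
    """Hypothetical inexact line search that never evaluates the loss itself.

    Starting from alpha0, the step grows while the directional derivative is
    negative (still descending) and shrinks while it is non-negative
    (overshoot). The bounds and growth factor are assumptions of this sketch.
    """
    alpha = min(max(alpha0, alpha_min), alpha_max)
    if directional_derivative(grad_fn, x, d, alpha) < 0.0:
        # Still descending: grow until the sign becomes non-negative.
        for _ in range(max_iter):
            if alpha * factor > alpha_max:
                break
            alpha *= factor
            if directional_derivative(grad_fn, x, d, alpha) >= 0.0:
                break
    else:
        # Overshot the sign change: shrink until the sign is negative again.
        for _ in range(max_iter):
            if alpha / factor < alpha_min:
                break
            alpha /= factor
            if directional_derivative(grad_fn, x, d, alpha) < 0.0:
                break
    return alpha

# Usage on a deterministic quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
quadratic_grad = lambda x: x
x0 = np.array([4.0])
step = gradient_only_step(quadratic_grad, x0, -quadratic_grad(x0), alpha0=0.1)
print(step)
```

In a stochastic setting, `grad_fn` would return a freshly sampled mini-batch gradient at every call. The sketch only needs the sign of the directional derivative, which is why it requires neither loss values nor a surrogate model.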
ML Mar 20, 2019
Traversing the noise of dynamic mini-batch sub-sampled loss functions: A visual guide
Dominic Kafka, Daniel Wilke
Mini-batch sub-sampling in neural network training is unavoidable, due to growing data demands, memory-limited computational resources such as graphics processing units (GPUs), and the dynamics of on-line learning. In this study we specifically distinguish between static mini-batch sub-sampled loss functions, where mini-batches are intermittently fixed during training, resulting in smooth but biased loss functions; and the dynamic sub-sampling equivalent, where new mini-batches are sampled at every loss evaluation, trading bias for variance in sampling-induced discontinuities. These discontinuities render automated optimization strategies such as minimization line searches ineffective, since critical points may not exist and function minimizers find spurious, discontinuity-induced minima. This paper suggests recasting the optimization problem to find stochastic non-negative associated gradient projection points (SNN-GPPs). We demonstrate that the SNN-GPP optimality criterion is less susceptible to sub-sampling-induced discontinuities than critical points or minimizers. We conduct a visual investigation, comparing local minimum and SNN-GPP optimality criteria in the loss functions of a simple neural network training problem for a variety of popular activation functions. Since SNN-GPPs better approximate the location of true optima, particularly when using smooth activation functions with high curvature characteristics, we postulate that line searches locating SNN-GPPs can contribute significantly to automating neural network training.
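The contrast between minimizers and gradient sign changes under dynamic sub-sampling can be illustrated with a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses a hypothetical one-weight linear regression rather than a neural network, and it reduces the SNN-GPP idea to its simplest 1-D form, a change in the sign of the sampled derivative from negative to non-negative along the direction of increasing `w`. The data, batch size, and grid are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): y = 2x + small noise, fitted by the model y_hat = w * x.
X = rng.uniform(1.0, 2.0, size=1000)
Y = 2.0 * X + 0.1 * rng.standard_normal(1000)

def batch_loss_and_grad(w, idx):
    """Mean squared-error loss and its derivative w.r.t. w on one mini-batch."""
    residual = w * X[idx] - Y[idx]
    return 0.5 * np.mean(residual ** 2), np.mean(residual * X[idx])

def dynamic_eval(w, batch_size=10):
    """Dynamic sub-sampling: a new mini-batch is drawn at every evaluation."""
    idx = rng.choice(X.size, size=batch_size, replace=False)
    return batch_loss_and_grad(w, idx)

# Evaluate the dynamically sub-sampled loss and derivative along a grid of w.
ws = np.linspace(0.0, 4.0, 201)
evals = [dynamic_eval(w) for w in ws]
losses = np.array([e[0] for e in evals])
grads = np.array([e[1] for e in evals])

# Candidate "minimizers": grid points lower than both sampled neighbours.
is_min = (losses[1:-1] < losses[:-2]) & (losses[1:-1] < losses[2:])
local_minima = ws[1:-1][is_min]

# Sign changes of the sampled derivative from negative to non-negative.
is_change = (grads[:-1] < 0.0) & (grads[1:] >= 0.0)
sign_changes = ws[1:][is_change]

print("local minima found at w in [%.2f, %.2f]" % (local_minima.min(), local_minima.max()))
print("sign changes found at w in [%.2f, %.2f]" % (sign_changes.min(), sign_changes.max()))
```

Because each loss value comes from a different mini-batch, neighbouring grid points can differ by more than the underlying trend, which produces apparent minima away from the true optimum. The sign of the sampled derivative depends only on which side of the batch optimum `w` lies, so in this toy setting the sign changes stay close to `w = 2`.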