Diagonal Rescaling For Neural Networks
This work examines why a second-order stochastic gradient algorithm for training neural networks lacks robustness, and draws practical lessons on stepsize scaling for machine learning practitioners.
The paper investigates why a second-order stochastic gradient training algorithm for neural networks lacks robustness. The analysis suggests a new way to scale stepsizes, which clarifies connections to existing methods such as RMSProp and fanin stepsize scaling, and it stresses the importance of handling fast changes in curvature.
We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fanin stepsize scaling. The second insight stresses the practical importance of dealing with fast changes in the curvature of the cost.
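
For readers unfamiliar with the two stepsize scalings named above, the following is a minimal sketch of their commonly used textbook forms. It is not the algorithm defined in this paper. The function names, default hyperparameters, and the `power` argument are illustrative assumptions, and conventions for fanin scaling vary (some authors divide by the fanin, others by its square root).

```python
import math

def rmsprop_step(w, grad, mean_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp-style update on plain lists of floats.

    Each coordinate's gradient is divided by the square root of a running
    average of its squared gradients, which is a diagonal rescaling.
    Returns (new_w, new_mean_sq). Defaults are illustrative assumptions.
    """
    # Update the running average of squared gradients, per coordinate.
    new_mean_sq = [decay * m + (1.0 - decay) * g * g
                   for m, g in zip(mean_sq, grad)]
    # Rescale each gradient coordinate before taking the step.
    new_w = [wi - lr * g / (math.sqrt(m) + eps)
             for wi, g, m in zip(w, grad, new_mean_sq)]
    return new_w, new_mean_sq

def fanin_stepsize(base_lr, fan_in, power=1.0):
    """Per-unit stepsize divided by a power of the unit's fanin.

    power=1.0 divides by the fanin and power=0.5 by its square root.
    Which convention to use is an assumption left to the caller.
    """
    return base_lr / (fan_in ** power)

# Usage: a tenfold change in gradient scale gives (almost) the same step.
w_a, _ = rmsprop_step([1.0], [2.0], [0.0], lr=0.1)
w_b, _ = rmsprop_step([1.0], [20.0], [0.0], lr=0.1)
```

In this sketch the step taken from a zero running average is nearly independent of the gradient's scale, which is the sense in which RMSProp acts as a diagonal rescaling of the stepsizes.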