LG CVOct 12, 2022

Towards Theoretically Inspired Neural Initialization Optimization

Yibo Yang, Hong Wang, Haobo Yuan, Zhouchen Lin

arXiv:2210.05956v110.413 citationsh-index: 79Has Code

Originality Incremental advance

AI Analysis

This work addresses the lack of automated techniques for neural initialization, which is a domain-specific problem for machine learning practitioners, offering an incremental improvement over handcrafted methods.

The paper tackles the problem of automated neural network initialization by proposing GradCosine, a differentiable metric based on sample-wise gradient cosine similarity, and the Neural Initialization Optimization (NIO) algorithm, which improves classification performance on CIFAR-10, CIFAR-100, and ImageNet and enables training large vision Transformers without warmup.

Automated machine learning has been widely explored to reduce human efforts in designing neural architectures and looking for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under gradient norm constraint. Based on this observation, we further propose the neural initialization optimization (NIO) algorithm. Generalized from the sample-wise analysis into the real batch setting, NIO is able to automatically look for a better initialization with negligible cost compared with the training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help to train large vision Transformer architecture without warmup.

View on arXiv PDF Code

Similar