Dmitry Vetrov
National Research University Higher School of Economics

Surprising properties of loss landscape in over-parameterized models

In recent years, several unusual effects observed during the training of modern deep neural networks have been reported, among them double descent, mode connectivity, and minefields in the loss landscape. All of them are related to the over-parameterization of modern DNNs. A deeper understanding of the properties of over-parameterized models may give clues for developing better algorithms for training DNNs. In this talk we will share some intuition and experimental evidence that explain many of the strange effects mentioned above. In particular, we will focus on so-called scale-invariant networks and demonstrate how the choice of hyperparameters affects the training process and reveals some properties of the training loss landscape. We will consider the simplified setting of a fully scale-invariant neural network with weights on the unit sphere, trained by SGD with a constant learning rate. This setting removes the influence of initialization, the weight decay coefficient, and the learning rate schedule, leaving only one hyperparameter (the learning rate) to control training. By training the same network with different constant learning rates, we discover several distinct but stable phases that are related to the properties of the loss landscape and to the generalization ability of the trained network.
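The setting described above (a scale-invariant loss, weights kept on the unit sphere, SGD with a constant learning rate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the model is a toy logistic classifier applied to normalized weights rather than a neural network, the data are synthetic, and the names `loss_and_grad`, `sphere_sgd`, and `make_toy_data` are hypothetical.

```python
import numpy as np


def loss_and_grad(w, X, y):
    """Logistic loss evaluated at u = w / ||w|| and its gradient w.r.t. w.

    Because the loss depends on w only through its direction, it is
    scale-invariant: loss(c * w) == loss(w) for any c > 0.
    """
    norm = np.linalg.norm(w)
    u = w / norm
    margins = y * (X @ u)
    loss = np.mean(np.logaddexp(0.0, -margins))
    # sigmoid(-margins), computed in a numerically stable way
    s = np.exp(-np.logaddexp(0.0, margins))
    grad_u = -(X * (y * s)[:, None]).mean(axis=0)
    # Chain rule through the normalization: project out the radial
    # component and divide by the norm. The result is orthogonal to w.
    grad_w = (grad_u - u * (u @ grad_u)) / norm
    return loss, grad_w


def sphere_sgd(X, y, lr, steps=200, batch_size=32, seed=0):
    """SGD with a constant learning rate, renormalizing w after every step."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    history = []
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        _, g = loss_and_grad(w, X[idx], y[idx])
        w = w - lr * g
        w /= np.linalg.norm(w)  # project back onto the unit sphere
        history.append(loss_and_grad(w, X, y)[0])  # full-batch loss
    return w, np.array(history)


def make_toy_data(n=256, d=10, seed=0):
    """Synthetic linearly separable data with labels in {-1, +1} (assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = np.sign(X @ w_true)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    # The learning rate is the only hyperparameter that is varied.
    for lr in (0.01, 0.1, 1.0):
        w, hist = sphere_sgd(X, y, lr)
        print(f"lr={lr}: final full-batch loss {hist[-1]:.4f}")
```

In this sketch, the weight norm is fixed at one and there is no weight decay or schedule, so the learning rate is the only remaining knob. Whether a toy model of this kind shows the distinct phases discussed in the talk is not something the sketch establishes.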