Zoom
https://us02web.zoom.us/j/83447514190?pwd=cDVsc3Y0dWJOTFIwNlhSQmV5Vkpudz09
Meeting ID: 834 4751 4190
Passcode: 1212
Title
The implicit bias of SGD: Minima stability analysis
Abstract
One of the puzzling phenomena in deep learning is that neural networks tend to generalize well even when they are highly overparameterized. This stands in sharp contrast to classical wisdom, since the training process seemingly has no way to tell apart “good” minima from “bad” ones when they achieve precisely the same loss. Recent works suggested that this behavior results from implicit biases of the algorithms we use to train networks (like SGD). However, existing results fail to explain many of the behaviors observed in practice, such as the dependence of this bias on the step size.
In this work, we analyze the implicit bias of SGD from the standpoint of minima stability. Specifically, it is known that SGD can stably converge only to minima that are sufficiently flat w.r.t. its step size. We show that this property has a remarkable effect on models trained with the quadratic loss, both in terms of the end-to-end function and in terms of the way this function is implemented by the network upon convergence. In particular, for single-hidden-layer univariate ReLU networks, we prove that stable minima correspond to end-to-end functions that get smoother as the learning rate is increased. For deep linear networks, we prove that stable minima correspond to nearly balanced networks, in which the gains of all layers are approximately the same. Experiments indicate that these properties are also characteristic of more complex models trained in practice.
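To make “sufficiently flat w.r.t. the step size” concrete, here is a minimal linearized-stability sketch for plain gradient descent around a minimum; the SGD condition referenced in the abstract is a stochastic refinement of this picture, and the symbols below (loss L, minimum theta*, Hessian H, step size eta) are illustrative notation, not taken from the talk.

% Linearizing gradient descent around a twice-differentiable minimum \theta^\ast of the loss L,
% with Hessian H = \nabla^2 L(\theta^\ast) and step size \eta:
\[
    \theta_{t+1} \;=\; \theta_t - \eta \nabla L(\theta_t)
    \;\approx\; \theta^\ast + (I - \eta H)\,(\theta_t - \theta^\ast),
\]
% so the iterates remain near \theta^\ast only if every eigenvalue of I - \eta H lies in [-1, 1],
% which at a minimum (H positive semi-definite) amounts to
\[
    \lambda_{\max}(H) \;\le\; \frac{2}{\eta}.
\]
% Larger step sizes therefore rule out sharper minima: only sufficiently flat ones are dynamically stable.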