
Why does SGD converge to generalizable solutions? A quantitative explanation via linear stability
Reporter:
Lei Wu, School of Mathematical Sciences, Peking University
Inviter:
Haijun Yu, Professor
Subject:
Why does SGD converge to generalizable solutions? A quantitative explanation via linear stability
Time and place:
16:00-17:00, May 26 (Thursday)
Abstract:

Deep learning models are often trained with far more parameters than training examples. In this over-parameterized regime, there exist many global minima of the training loss, but their test performance can differ dramatically. Remarkably, stochastic gradient descent (SGD) tends to select the good ones without any explicit regularization, suggesting that some "implicit regularization" is at work. This talk provides a quantitative explanation of this striking phenomenon from the perspective of dynamical stability. We prove that if a global minimum is linearly stable for SGD, then its flatness---as measured by the Frobenius norm of the Hessian---must be bounded independently of both the model size and the sample size. Moreover, this flatness can bound the generalization gap of two-layer neural networks. Together, these results show that SGD tends to converge to flat minima, and that flat minima provably generalize well. Both results are made possible by exploiting the particular geometry-aware structure of SGD noise.
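The flatness measure discussed in the abstract is the Frobenius norm of the loss Hessian at a minimum. As a minimal illustration (not from the talk itself; the toy quadratic landscapes here are purely hypothetical), the sketch below estimates this norm by central finite differences and compares a "flat" minimum against a "sharp" one sharing the same location:

```python
import numpy as np

def hessian_fro_norm(loss, w, eps=1e-4):
    """Estimate ||H||_F of `loss` at parameters `w` via central finite differences."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            w_pp = w.copy(); w_pp[i] += eps; w_pp[j] += eps
            w_pm = w.copy(); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = w.copy(); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = w.copy(); w_mm[i] -= eps; w_mm[j] -= eps
            # Standard second-order mixed-difference stencil for d^2 f / dw_i dw_j
            H[i, j] = (loss(w_pp) - loss(w_pm) - loss(w_mp) + loss(w_mm)) / (4 * eps**2)
    return np.linalg.norm(H, 'fro')

# Two toy quadratic "loss landscapes" with the same global minimum at w = 0,
# differing only in curvature (flat vs. sharp):
flat  = lambda w: 0.5 * w @ np.diag([0.1, 0.1]) @ w
sharp = lambda w: 0.5 * w @ np.diag([10.0, 10.0]) @ w

w0 = np.zeros(2)
print(hessian_fro_norm(flat, w0))   # small ||H||_F: flat minimum
print(hessian_fro_norm(sharp, w0))  # large ||H||_F: sharp minimum
```

For a quadratic loss the finite-difference stencil is exact up to floating-point error, so the two printed values recover the Frobenius norms of the two curvature matrices; the talk's result says SGD's linear stability forces this quantity to stay bounded at the minima SGD can converge to.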