Abstract
Recent findings by Cohen et al. (2021) demonstrate that when neural networks are trained using full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes around $2/\eta$. These results have significant implications for convergence and generalization. This stabilization, however, does not occur for mini-batch optimization algorithms, which limits the broader applicability of these findings. We show that mini-batch Stochastic Gradient Descent (SGD) trains in a different regime, which we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, $\lambda_{\max}$, which is generally smaller than Batch Sharpness, is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.
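To make the definition of Batch Sharpness concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It reads the definition as the average over mini-batches $B$ of the Rayleigh quotient $g_B^\top H_B g_B / \|g_B\|^2$, where $g_B$ and $H_B$ are the gradient and Hessian of the mini-batch loss. The least-squares model, the synthetic data, and all function names (`batch_gradient`, `directional_curvature`, `batch_sharpness`, `lambda_max`) are illustrative assumptions chosen so that the Hessian is available in closed form.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def batch_gradient(X, y, w):
    """Gradient of the batch loss (1 / 2b) * sum_i (x_i . w - y_i)^2."""
    b = len(X)
    residuals = [dot(row, w) - t for row, t in zip(X, y)]
    return [sum(r * row[j] for r, row in zip(residuals, X)) / b
            for j in range(len(w))]


def directional_curvature(X, v):
    """Rayleigh quotient v^T H v / v^T v for H = X^T X / b.

    Uses v^T H v = ||X v||^2 / b, so H is never formed explicitly.
    Assumes v is nonzero.
    """
    Xv = [dot(row, v) for row in X]
    return dot(Xv, Xv) / (len(X) * dot(v, v))


def batch_sharpness(X, y, w, batch_size, num_batches=200, seed=0):
    """Monte Carlo estimate: mean curvature of each mini-batch Hessian
    along that same mini-batch's gradient."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_batches):
        idx = rng.sample(range(len(X)), batch_size)
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        total += directional_curvature(Xb, batch_gradient(Xb, yb, w))
    return total / num_batches


def lambda_max(X, iters=500):
    """Largest eigenvalue of the full-batch Hessian X^T X / n (power iteration)."""
    n, d = len(X), len(X[0])
    v = [1.0] * d
    for _ in range(iters):
        Xv = [dot(row, v) for row in X]
        Hv = [sum(t * row[j] for t, row in zip(Xv, X)) / n for j in range(d)]
        norm = dot(Hv, Hv) ** 0.5
        v = [h / norm for h in Hv]
    return directional_curvature(X, v)


# Synthetic data (an assumption for illustration only).
data_rng = random.Random(0)
n, d = 64, 5
X = [[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [data_rng.gauss(0.0, 1.0) for _ in range(n)]
w = [0.5] * d

lam = lambda_max(X)
bs_small = batch_sharpness(X, y, w, batch_size=4)
bs_full = batch_sharpness(X, y, w, batch_size=n, num_batches=1)
print(f"lambda_max={lam:.3f}  batch sharpness (b=4)={bs_small:.3f}  (b=n)={bs_full:.3f}")
```

When the batch is the whole dataset, the estimate reduces to the curvature of the full-batch Hessian along the full-batch gradient, which can never exceed $\lambda_{\max}$. For smaller batches the two quantities are no longer tied in this way, and in a neural network the closed-form Hessian used here would typically be replaced by a Hessian-vector product from automatic differentiation.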