Abstract
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, yet the internal dynamics of these methods remain underexplored. In this paper, we analyze gradient behavior through a signal processing lens, isolating the key factors that influence gradient updates and revealing a critical limitation: momentum techniques lack the flexibility to balance the bias and variance components of the gradient, which leads to inaccurate gradient estimates. To address this issue, we introduce SGDF (SGD with Filter), a novel method based on Wiener filter principles that derives an optimal time-varying gain to refine gradient updates by minimizing the mean squared error of the gradient estimate. The method yields an optimal first-order gradient estimate that balances noise reduction against signal preservation. Furthermore, the approach can be extended to adaptive optimizers, potentially improving their generalization. Empirical results show that SGDF achieves better convergence and generalization than traditional momentum methods and performs competitively with state-of-the-art optimizers.
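To make the idea of an MSE-minimizing gain concrete, the following is a minimal sketch of the textbook scalar case. It is not the SGDF algorithm itself. It assumes a prior estimate (standing in for a momentum term) and a new observation (standing in for the current stochastic gradient) whose errors are zero-mean and independent, with known variances. The function names and the variance values are hypothetical and chosen only for illustration.

```python
def optimal_gain(var_est, var_obs):
    """Gain K that minimizes the MSE of est + K * (obs - est).

    Assumes the errors of `est` and `obs` are zero-mean and independent,
    with variances `var_est` and `var_obs` respectively.
    """
    return var_est / (var_est + var_obs)


def fused_mse(gain, var_est, var_obs):
    """MSE of the fused estimate est + gain * (obs - est) under the same assumptions."""
    return (1.0 - gain) ** 2 * var_est + gain ** 2 * var_obs


def fuse(est, obs, gain):
    """Correct the prior estimate toward the observation by a fraction `gain`."""
    return est + gain * (obs - est)


# Hypothetical variances: the smoothed estimate is less noisy than the raw gradient.
var_momentum = 1.0
var_gradient = 4.0

gain = optimal_gain(var_momentum, var_gradient)
best_mse = fused_mse(gain, var_momentum, var_gradient)

# Compare against a coarse grid of fixed gains between 0 and 1.
grid_mse = [fused_mse(c / 100, var_momentum, var_gradient) for c in range(101)]
```

With a fixed gain, the weighting between the two sources cannot react when their relative noise levels change. Recomputing the gain from the current variance estimates at each step is what makes it time-varying in this simplified setting.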