LG · CV · Jan 16, 2024

GD doesn't make the cut: Three ways that non-differentiability affects neural network training

arXiv:2401.08426v124.65 citations

Originality: Highly original

AI Analysis

This work addresses critical misunderstandings in deep learning optimization theory that matter to both researchers and practitioners. Its contribution is theoretical, and its foundational implications call for a reevaluation of existing results.

This paper tackles the problem of applying gradient methods to non-differentiable functions in neural network training. It shows that such methods have different convergence properties from classical gradient descent, behave paradoxically on $L_{1}$-regularized problems, and exhibit the Edge of Stability phenomenon in a broader class of functions, all of which challenge existing optimization theory.

This paper critically examines the fundamental distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descents (GDs) for differentiable functions, revealing significant gaps in current deep learning optimization theory. We demonstrate that NGDMs exhibit markedly different convergence properties compared to GDs, strongly challenging the applicability of extensive neural network convergence literature based on $L$-smoothness to non-smooth neural networks. Our analysis reveals paradoxical behavior of NGDM solutions for $L_{1}$-regularized problems, where increasing regularization counterintuitively leads to larger $L_{1}$ norms of optimal solutions. This finding calls into question widely adopted $L_{1}$ penalization techniques for network pruning. We further challenge the common assumption that optimization algorithms like RMSProp behave similarly in differentiable and non-differentiable contexts. Expanding on the Edge of Stability phenomenon, we demonstrate its occurrence in a broader class of functions, including Lipschitz continuous convex differentiable functions. This finding raises important questions about its relevance and interpretation in non-convex, non-differentiable neural networks, particularly those using ReLU activations. Our work identifies critical misunderstandings of NGDMs in influential literature, stemming from an overreliance on strong smoothness assumptions. These findings necessitate a reevaluation of optimization dynamics in deep learning, emphasizing the crucial need for more nuanced theoretical foundations in analyzing these complex systems.
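The convergence gap described in the abstract can be illustrated with a toy one-dimensional comparison. The sketch below is not taken from the paper; the functions, the step size of 0.3, and the helper names (`gradient_descent`, `grad_quadratic`, `subgrad_abs`) are assumptions chosen for illustration. It runs the same constant-step update on the smooth function $\frac{1}{2}x^2$ and on the non-differentiable function $|x|$.

```python
def gradient_descent(grad, x0, step, n_steps):
    """Constant-step update x <- x - step * grad(x), repeated n_steps times."""
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x


def grad_quadratic(x):
    # Gradient of the smooth function f(x) = 0.5 * x**2.
    return x


def subgrad_abs(x):
    # A subgradient of f(x) = |x|: sign(x), with 0 chosen at the kink x = 0.
    return (x > 0) - (x < 0)


# Hypothetical settings, chosen only for illustration.
x0, step, n_steps = 1.0, 0.3, 200

x_smooth = gradient_descent(grad_quadratic, x0, step, n_steps)
x_abs = gradient_descent(subgrad_abs, x0, step, n_steps)

print(f"smooth 0.5*x^2: final |x| = {abs(x_smooth):.3e}")
print(f"non-smooth |x|: final |x| = {abs(x_abs):.3e}")
```

On the quadratic, each step shrinks the iterate by a constant factor, so it converges to the minimizer. On $|x|$, the subgradient has magnitude 1 everywhere away from the kink, so the iterate keeps jumping across zero and stays at a distance comparable to the step size rather than converging. This is only the simplest instance of the behavior; the paper's claims concern neural network training, which this sketch does not model.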
