LGFeb 18
Optimizer choice matters for the emergence of Neural CollapseJim Zhao, Tin Sum Cheng, Wojciech Masarczyk et al.
Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
LGNov 4, 2024
Theoretical characterisation of the Gauss-Newton conditioning in Neural NetworksJim Zhao, Sidak Pal Singh, Aurelien Lucchi · eth-zurich
The Gauss-Newton (GN) matrix plays an important role in machine learning, most evident in its use as a preconditioning matrix for a wide family of popular adaptive methods to speed up optimization. Besides, it can also provide key insights into the optimization landscape of neural networks. In the context of deep neural networks, understanding the GN matrix involves studying the interaction between different weight matrices as well as the dependencies introduced by the data, thus rendering its analysis challenging. In this work, we take a first step towards theoretically characterizing the conditioning of the GN matrix in neural networks. We establish tight bounds on the condition number of the GN in deep linear networks of arbitrary depth and width, which we also extend to two-layer ReLU networks. We expand the analysis to further architectural components, such as residual connections and convolutional layers. Finally, we empirically validate the bounds and uncover valuable insights into the influence of the analyzed architectural components.
LGJun 24, 2024
Cubic regularized subspace Newton for non-convex optimizationJim Zhao, Aurelien Lucchi, Nikita Doikov
This paper addresses the optimization problem of minimizing non-convex continuous functions, which is relevant in the context of high-dimensional machine learning applications characterized by over-parametrization. We analyze a randomized coordinate second-order method named SSCN which can be interpreted as applying cubic regularization in random subspaces. This approach effectively reduces the computational complexity associated with utilizing second-order information, rendering it applicable in higher-dimensional scenarios. Theoretically, we establish convergence guarantees for non-convex functions, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation. When increasing subspace size, our complexity matches $\mathcal{O}(ε^{-3/2})$ of the cubic regularization (CR) rate. Additionally, we propose an adaptive sampling scheme ensuring exact convergence rate of $\mathcal{O}(ε^{-3/2}, ε^{-3})$ to a second-order stationary point, even without sampling all coordinates. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods.
CVMar 23, 2024
The Limits of Perception: Analyzing Inconsistencies in Saliency Maps in XAIAnna Stubbin, Thompson Chyrikov, Jim Zhao et al.
Explainable artificial intelligence (XAI) plays an indispensable role in demystifying the decision-making processes of AI, especially within the healthcare industry. Clinicians rely heavily on detailed reasoning when making a diagnosis, often CT scans for specific features that distinguish between benign and malignant lesions. A comprehensive diagnostic approach includes an evaluation of imaging results, patient observations, and clinical tests. The surge in deploying deep learning models as support systems in medical diagnostics has been significant, offering advances that traditional methods could not. However, the complexity and opacity of these models present a double-edged sword. As they operate as "black boxes," with their reasoning obscured and inaccessible, there's an increased risk of misdiagnosis, which can lead to patient harm. Hence, there is a pressing need to cultivate transparency within AI systems, ensuring that the rationale behind an AI's diagnostic recommendations is clear and understandable to medical practitioners. This shift towards transparency is not just beneficial -- it's a critical step towards responsible AI integration in healthcare, ensuring that AI aids rather than hinders medical professionals in their crucial work.