Jeremy Cohen

LG
h-index2
11papers
160citations
Novelty27%
AI Score42

11 Papers

LGJun 21, 2022
On the Maximum Hessian Eigenvalue and Generalization

Simran Kaur, Jeremy Cohen, Zachary C. Lipton

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remains a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $λ_{max}$ and generalization. In this paper, we present findings that call $λ_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $λ_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $λ_{max}$ without affecting generalization; (3) while SAM produces smaller $λ_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $λ_{max}$; and (5) while batch-normalization does not consistently produce smaller $λ_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $λ_{max}$'s ability to explain generalization in neural networks.

MLApr 23
There Will Be a Scientific Theory of Deep Learning

Jamie Simon, Daniel Kunin, Alexander Atanasov et al.

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.

LGMar 5
Non-Euclidean Gradient Descent Operates at the Edge of Stability

Rustem Islamov, Michael Crawshaw, Jeremy Cohen et al.

The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian converges to $2/η$ during training with gradient descent (GD) with a step-size $η$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness Mishkin et al. [2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/η$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.

LGJul 17, 2025
Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability

Kaiqi Jiang, Jeremy Cohen, Yuanzhi Li

The study of Neural Tangent Kernels (NTKs) in deep learning has drawn increasing attention in recent years. NTKs typically actively change during training and are related to feature learning. In parallel, recent work on Gradient Descent (GD) has found a phenomenon called Edge of Stability (EoS), in which the largest eigenvalue of the NTK oscillates around a value inversely proportional to the step size. However, although follow-up works have explored the underlying mechanism of such eigenvalue behavior in depth, the understanding of the behavior of the NTK eigenvectors during EoS is still missing. This paper examines the dynamics of NTK eigenvectors during EoS in detail. Across different architectures, we observe that larger learning rates cause the leading eigenvectors of the final NTK, as well as the full NTK matrix, to have greater alignment with the training target. We then study the underlying mechanism of this phenomenon and provide a theoretical analysis for a two-layer linear network. Our study enhances the understanding of GD training dynamics in deep learning.

SDJul 23, 2021
Multi-Channel Automatic Music Transcription Using Tensor Algebra

Axel Marmoret, Nancy Bertin, Jeremy Cohen

Music is an art, perceived in unique ways by every listener, coming from acoustic signals. In the meantime, standards as musical scores exist to describe it. Even if humans can make this transcription, it is costly in terms of time and efforts, even more with the explosion of information consecutively to the rise of the Internet. In that sense, researches are driven in the direction of Automatic Music Transcription. While this task is considered solved in the case of single notes, it is still open when notes superpose themselves, forming chords. This report aims at developing some of the existing techniques towards Music Transcription, particularly matrix factorization, and introducing the concept of multi-channel automatic music transcription. This concept will be explored with mathematical objects called tensors.

SEApr 4, 2021
Understanding Equity, Diversity and Inclusion Challenges Within the Research Software Community

Neil P. Chue Hong, Jeremy Cohen, Caroline Jay

Research software -- specialist software used to support or undertake research -- is of huge importance to researchers. It contributes to significant advances in the wider world and requires collaboration between people with diverse skills and backgrounds. Analysis of recent survey data provides evidence for a lack of diversity in the Research Software Engineer community. We identify interventions which could address challenges in the wider research software community and highlight areas where the community is becoming more diverse. There are also lessons that are applicable, more generally, to the field of software development around recruitment from other disciplines and the importance of welcoming communities.

SEOct 20, 2020
RSEs in Research? RSEs in IT?: Finding a suitable home for RSEs

Jeremy Cohen, Mark Woodbridge

The term Research Software Engineer (RSE) was first used in the UK research community in 2012 to refer to individuals based in research environments who focus on the development of software to support and undertake research. Since then, the term has gained wide adoption and many RSE groups and teams have been set up at institutions within the UK and around the world. There is no "manual" for establishing an RSE group! These groups are often set up in an ad hoc manner based on what is practical within the context of each institution. Some of these groups are based in a central IT environment, others are based in research groups in academic departments. Is one option better than another? What are the pros and cons of different options? In this position paper we look at some arguments for where RSE teams fit best within academic institutions and research organisations and consider whether there is an ideal "home" for RSEs.

AIOct 19, 2020
Robot Design With Neural Networks, MILP Solvers and Active Learning

Sanjai Narain, Emily Mak, Dana Chee et al.

Central to the design of many robot systems and their controllers is solving a constrained blackbox optimization problem. This paper presents CNMA, a new method of solving this problem that is conservative in the number of potentially expensive blackbox function evaluations; allows specifying complex, even recursive constraints directly rather than as hard-to-design penalty or barrier functions; and is resilient to the non-termination of function evaluations. CNMA leverages the ability of neural networks to approximate any continuous function, their transformation into equivalent mixed integer linear programs (MILPs) and their optimization subject to constraints with industrial strength MILP solvers. A new learning-from-failure step guides the learning to be relevant to solving the constrained optimization problem. Thus, the amount of learning is orders of magnitude smaller than that needed to learn functions over their entire domains. CNMA is illustrated with the design of several robotic systems: wave-energy propelled boat, lunar lander, hexapod, cartpole, acrobot and parallel parking. These range from 6 real-valued dimensions to 36. We show that CNMA surpasses the Nelder-Mead, Gaussian and Random Search optimization methods against the metric of number of function evaluations.

LGOct 18, 2019
Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?

Simran Kaur, Jeremy Cohen, Zachary C. Lipton

For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding supports the hypothesis that perceptually-aligned gradients may be a general property of robust classifiers. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.

SEJul 11, 2018
Building a Sustainable Structure for Research Software Engineering Activities

Jeremy Cohen, Daniel S. Katz, Michelle Barker et al.

The profile of research software engineering has been greatly enhanced by developments at institutions around the world to form groups and communities that can support effective, sustainable development of research software. We observe, however, that there is still a long way to go to build a clear understanding about what approaches provide the best support for research software developers in different contexts, and how such understanding can be used to suggest more formal structures, models or frameworks that can help to further support the growth of research software engineering. This paper sets out some preliminary thoughts and proposes an initial high-level model based on discussions between the authors around the concept of a set of pillars representing key activities and processes that form the core structure of a successful research software engineering offering.

DCSep 4, 2013
Simplifying the Development, Use and Sustainability of HPC Software

Jeremy Cohen, Chris Cantwell, Neil Chue Hong et al.

Developing software to undertake complex, compute-intensive scientific processes requires a challenging combination of both specialist domain knowledge and software development skills to convert this knowledge into efficient code. As computational platforms become increasingly heterogeneous and newer types of platform such as Infrastructure-as-a-Service (IaaS) cloud computing become more widely accepted for HPC computations, scientists require more support from computer scientists and resource providers to develop efficient code and make optimal use of the resources available to them. As part of the libhpc stage 1 and 2 projects we are developing a framework to provide a richer means of job specification and efficient execution of complex scientific software on heterogeneous infrastructure. The use of such frameworks has implications for the sustainability of scientific software. In this paper we set out our developing understanding of these challenges based on work carried out in the libhpc project.