NAAug 1, 2012
Structural Susceptibility and Separation of Time Scales in the van der Pol OscillatorRicky Chachra, Mark K. Transtrum, James P. Sethna
We use an extension of the van der Pol oscillator as an example of a system with multiple time scales to study the susceptibility of its trajectory to polynomial perturbations in the dynamics. A striking feature of many nonlinear, multi-parameter models is an apparently inherent insensitivity to large magnitude variations in certain linear combinations of parameters. This phenomenon of "sloppiness" is quantified by calculating the eigenvalues of the Hessian matrix of the least-squares cost function which typically span many orders of magnitude. The van der Pol system is no exception: Perturbations in its dynamics show that most directions in parameter space weakly affect the limit cycle, whereas only a few directions are stiff. With this study we show that separating the time scales in the van der Pol system leads to a further separation of eigenvalues. Parameter combinations which perturb the slow manifold are stiffer and those which solely affect the transients in the dynamics are sloppier.
STAug 15, 2024
eGAD! double descent is explained by Generalized Aliasing DecompositionMark K. Transtrum, Gus L. W. Hart, Tyler J. Jarvis et al.
A central problem in data science is to use potentially noisy samples of an unknown function to predict values for unseen inputs. In classical statistics, predictive error is understood as a trade-off between the bias and the variance that balances model simplicity with its ability to fit complex functions. However, over-parameterized models exhibit counterintuitive behaviors, such as "double descent" in which models of increasing complexity exhibit decreasing generalization error. Others may exhibit more complicated patterns of predictive error with multiple peaks and valleys. Neither double descent nor multiple descent phenomena are well explained by the bias-variance decomposition. We introduce a novel decomposition that we call the generalized aliasing decomposition (GAD) to explain the relationship between predictive performance and model complexity. The GAD decomposes the predictive error into three parts: 1) model insufficiency, which dominates when the number of parameters is much smaller than the number of data points, 2) data insufficiency, which dominates when the number of parameters is much greater than the number of data points, and 3) generalized aliasing, which dominates between these two extremes. We demonstrate the applicability of the GAD to diverse applications, including random feature models from machine learning, Fourier transforms from signal processing, solution methods for differential equations, and predictive formation enthalpy in materials discovery. Because key components of the GAD can be explicitly calculated from the relationship between model class and samples without seeing any data labels, it can answer questions related to experimental design and model selection before collecting data or performing experiments. We further demonstrate this approach on several examples and discuss implications for predictive modeling and data science.
LGNov 5, 2024
An information-matching approach to optimal experimental design and active learningYonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis et al.
The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
LGMay 13, 2025
An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear ModelsJialin Mao, Itay Griniasty, Yan Sun et al.
Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
MTRL-SCIApr 27, 2025
Composable and adaptive design of machine learning interatomic potentials guided by Fisher-information analysisWeishi Wang, Mark K. Transtrum, Vincenzo Lordi et al.
An adaptive physics-informed model design strategy for machine-learning interatomic potentials (MLIPs) is proposed. This strategy follows an iterative reconfiguration of composite models from single-term models, followed by a unified training procedure. A model evaluation method based on the Fisher information matrix (FIM) and multiple-property error metrics is proposed to guide model reconfiguration and hyperparameter optimization. Combining the model reconfiguration and the model evaluation subroutines, we provide an adaptive MLIP design strategy that balances flexibility and extensibility. In a case study of designing models against a structurally diverse niobium dataset, we managed to obtain an optimal configuration with 75 parameters generated by our framework that achieved a force RMSE of 0.172 eV/Å and an energy RMSE of 0.013 eV/atom.
LGDec 24, 2024
Comparing analytic and data-driven approaches to parameter identifiability: A power systems case studyNikolaos Evangelou, Alexander M. Stankovic, Ioannis G. Kevrekidis et al.
Parameter identifiability refers to the capability of accurately inferring the parameter values of a model from its observations (data). Traditional analysis methods exploit analytical properties of the closed form model, in particular sensitivity analysis, to quantify the response of the model predictions to variations in parameters. Techniques developed to analyze data, specifically manifold learning methods, have the potential to complement, and even extend the scope of the traditional analytical approaches. We report on a study comparing and contrasting analytical and data-driven approaches to quantify parameter identifiability and, importantly, perform parameter reduction tasks. We use the infinite bus synchronous generator model, a well-understood model from the power systems domain, as our benchmark problem. Our traditional analysis methods use the Fisher Information Matrix to quantify parameter identifiability analysis, and the Manifold Boundary Approximation Method to perform parameter reduction. We compare these results to those arrived at through data-driven manifold learning schemes: Output - Diffusion Maps and Geometric Harmonics. For our test case, we find that the two suites of tools (analytical when a model is explicitly available, as well as data-driven when the model is lacking and only measurement data are available) give (correct) comparable results; these results are also in agreement with traditional analysis based on singular perturbation theory. We then discuss the prospects of using data-driven methods for such model analysis.
LGMay 2, 2023
The Training Process of Many Deep Networks Explores the Same Low-Dimensional ManifoldJialin Mao, Itay Griniasty, Han Kheng Teoh et al.
We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures, sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations lie on the same manifold in the prediction space. We study the details of this manifold to find that networks with different architectures follow distinguishable trajectories but other factors have a minimal influence; larger networks train along a similar manifold as that of smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.
DATA-ANMay 2, 2017
Maximizing the information learned from finite data selects a simple modelHenry H. Mattingly, Mark K. Transtrum, Michael C. Abbott et al.
We use the language of uninformative Bayesian prior choice to study the selection of appropriately simple effective models. We advocate for the prior which maximizes the mutual information between parameters and predictions, learning as much as possible from limited data. When many parameters are poorly constrained by the available data, we find that this prior puts weight only on boundaries of the parameter manifold. Thus it selects a lower-dimensional effective theory in a principled way, ignoring irrelevant parameter directions. In the limit where there is sufficient data to tightly constrain any number of parameters, this reduces to Jeffreys prior. But we argue that this limit is pathological when applied to the hyper-ribbon parameter manifolds generic in science, because it leads to dramatic dependence on effects invisible to experiment.