Constructing Effective Machine Learning Models for the Sciences: A Multidisciplinary Perspective
This work addresses the practical challenge of model selection and interpretability for researchers in the natural and social sciences, though its analysis appears incremental.
The paper examines when flexible machine learning models improve upon linear regression with manual feature engineering in scientific applications. It offers guidance on recognizing these cases and on moving toward interpretable models, and it shows that outcomes vary across the natural and social sciences.
Learning from data has led to substantial advances in a multitude of disciplines, including text and multimedia search, speech recognition, and autonomous-vehicle navigation. Can machine learning enable similar leaps in the natural and social sciences? This is certainly the expectation in many scientific fields, and recent years have seen a plethora of applications of non-linear models to a wide range of datasets. However, flexible non-linear solutions will not always improve upon manually adding transforms and interactions between variables to linear regression models. We discuss how to recognize such cases before constructing a data-driven model and how this analysis can help us move to intrinsically interpretable regression models. Furthermore, for a variety of applications in the natural and social sciences, we demonstrate why improvements may be seen with more complex regression models and why they may not.
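The claim about manually added interactions can be illustrated with a minimal sketch. It is not the authors' method or data: the synthetic target, the coefficients, and the helper name `fit_and_score` are all assumptions chosen for illustration. The sketch fits ordinary least squares twice, once on the raw variables and once with a hand-built interaction term, and compares held-out R².

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept; return held-out R^2.

    Hypothetical helper for this illustration only.
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption): the target depends on x1, x2,
# and their product, plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Raw features versus raw features plus a manual interaction term.
X_plain = np.column_stack([x1, x2])
X_engineered = np.column_stack([x1, x2, x1 * x2])

# Simple split: first 400 rows for fitting, last 100 for evaluation.
train, test = slice(0, 400), slice(400, n)
r2_plain = fit_and_score(X_plain[train], y[train], X_plain[test], y[test])
r2_engineered = fit_and_score(
    X_engineered[train], y[train], X_engineered[test], y[test]
)

print(f"held-out R^2, raw features:        {r2_plain:.3f}")
print(f"held-out R^2, with x1*x2 feature:  {r2_engineered:.3f}")
```

In this constructed case the linear model with one engineered feature already captures nearly all of the signal, so a more flexible non-linear model would have little left to gain while being harder to interpret. Real datasets will not reveal their structure this cleanly.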