ML LG Jul 19, 2022

Holistic Robust Data-Driven Decisions

Amine Bennouna, Bart Van Parys, Ryan Lucas

arXiv:2207.09560v4 17.724 citations h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses the challenge of robust out-of-sample performance for practitioners in machine learning and decision-making, offering a comprehensive solution against multiple overfitting sources, though it builds incrementally on existing robust optimization methods.

The paper tackles the problem of overfitting in data-driven machine learning and decision-making by addressing three simultaneous sources: statistical error, data noise, and data misspecification. It proposes a novel distributionally robust optimization formulation that provides holistic protection, showing applications in healthcare neural network training and portfolio selection with real data.

Designing data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. In practice, overfitting typically cannot be attributed to a single cause; several factors contribute simultaneously. We consider three sources of overfitting: (i) statistical error, which results from working with finite sample data; (ii) data noise, which occurs when the data points are measured only with finite precision; and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted. Although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously. We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of a Kullback-Leibler and a Lévy-Prokhorov robust optimization formulation. In the context of classification and regression problems, we show that several popular regularized and robust formulations reduce to particular cases of our proposed formulation. Finally, we apply the proposed holistic robust (HR) formulation to two real-life applications and study it alongside several benchmarks: (1) training neural networks on healthcare data, where we analyze various robustness and generalization properties in the presence of noise, labeling errors, and scarce data; and (2) a portfolio selection problem with real stock data, where we analyze the risk/return tradeoff under the severe distribution shift that arises naturally in this application.
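To make the Kullback-Leibler ingredient concrete, the following is a minimal sketch of the classical KL-ball worst-case expected loss around an empirical distribution, computed through its standard one-dimensional convex dual. It is not the paper's HR formulation: it omits the Lévy-Prokhorov component entirely, the direction of the divergence used here (KL(Q || P_hat)) is an assumption and may differ from the paper's, and the function name `kl_worst_case_loss` and its parameters are hypothetical.

```python
import math


def kl_worst_case_loss(losses, radius, iters=200):
    """Approximate sup over Q with KL(Q || P_hat) <= radius of E_Q[loss],
    where P_hat puts mass 1/n on each observed loss.

    Uses the standard dual: min over lam > 0 of
        lam * radius + lam * log(mean(exp(loss / lam))).
    Any lam gives a valid upper bound; the minimum is tight.
    """
    n = len(losses)
    mean = sum(losses) / n
    top, bottom = max(losses), min(losses)
    spread = top - bottom
    if radius <= 0 or spread == 0:
        return mean  # no ambiguity, or all losses identical

    def dual(log_lam):
        lam = spread * math.exp(log_lam)
        # Log-mean-exp shifted by the largest loss for numerical stability.
        s = sum(math.exp((l - top) / lam) for l in losses) / n
        return lam * radius + top + lam * math.log(s)

    # Golden-section search over log(lam); the dual is convex in lam,
    # hence unimodal after the monotone change of variable.
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(1e-4), math.log(1e4)
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc, fd = dual(c), dual(d)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = dual(c)
        else:
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = dual(d)
    return min(fc, fd)


if __name__ == "__main__":
    sample_losses = [0.2, 0.5, 0.1, 0.9, 0.4]  # illustrative values only
    for r in (0.0, 0.05, 0.2):
        print(r, kl_worst_case_loss(sample_losses, r))
```

With `radius = 0` the function returns the plain empirical mean, and the value grows toward the largest observed loss as the radius increases. A KL ball of this kind is commonly associated with protection against statistical error; robustness to noise and to corrupted samples, which the abstract attributes to the combined formulation, is not captured by this sketch.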
