MLDec 15, 2022
Construction of a Surrogate Model: Multivariate Time Series Prediction with a Hybrid ModelClara Carlier, Arnaud Franju, Matthieu Lerasle et al.
Recent developments of advanced driver-assistance systems necessitate an increasing number of tests to validate new technologies. These tests cannot be carried out on track in a reasonable amount of time and automotive groups rely on simulators to perform most tests. The reliability of these simulators for constantly refined tasks is becoming an issue and, to increase the number of tests, the industry is now developing surrogate models, that should mimic the behavior of the simulator while being much faster to run on specific tasks. In this paper we aim to construct a surrogate model to mimic and replace the simulator. We first test several classical methods such as random forests, ridge regression or convolutional neural networks. Then we build three hybrid models that use all these methods and combine them to obtain an efficient hybrid surrogate model.
STNov 4, 2024
Finite-sample performance of the maximum likelihood estimator in logistic regressionHugo Chardon, Matthieu Lerasle, Jaouad Mourtada
Logistic regression is a classical model for describing the probabilistic dependence of binary responses to multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second that of its accuracy when it exists. These properties depend on both the dimension of covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
STMar 12, 2020
On the robustness of the minimum $\ell_2$ interpolatorGeoffrey Chinot, Matthieu Lerasle
We analyse the interpolator with minimal $\ell_2$-norm $\hatβ$ in a general high dimensional linear regression framework where $\mathbb Y=\mathbb Xβ^*+ξ$ where $\mathbb X$ is a random $n\times p$ matrix with independent $\mathcal N(0,Σ)$ rows and without assumption on the noise vector $ξ\in \mathbb R^n$. We prove that, with high probability, the prediction loss of this estimator is bounded from above by $(\|β^*\|^2_2r_{cn}(Σ)\vee \|ξ\|^2)/n$, where $r_{k}(Σ)=\sum_{i\geq k}λ_i(Σ)$ are the rests of the sum of eigenvalues of $Σ$. These bounds show a transition in the rates. For high signal to noise ratios, the rates $\|β^*\|^2_2r_{cn}(Σ)/n$ broadly improve the existing ones. For low signal to noise ratio, we also provide lower bound holding with large probability. Under assumptions on the sprectrum of $Σ$, this lower bound is of order $\| ξ\|_2^2/n$, matching the upper bound. Consequently, in the large noise regime, we are able to precisely track the prediction error with large probability. This results give new insight when the interpolation can be harmless in high dimensions.
STSep 11, 2019
Aggregated Hold-OutGuillaume Maillard, Sylvain Arlot, Matthieu Lerasle
Aggregated hold-out (Agghoo) is a method which averages learning rules selected by hold-out (that is, cross-validation with a single split). We provide the first theoretical guarantees on Agghoo, ensuring that it can be used safely: Agghoo performs at worst like the hold-out when the risk is convex. The same holds true in classification with the 0-1 risk, with an additional constant factor. For the hold-out, oracle inequalities are known for bounded losses, as in binary classification. We show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems. In particular, we obtain an oracle inequality for regularized kernel regression with a Lip-schitz loss, without requiring that the Y variable or the regressors be bounded. Numerical experiments show that aggregation brings a significant improvement over the hold-out and that Agghoo is competitive with cross-validation.
MLAug 28, 2019
Lecture Notes: Selected topics on robust statistical learning theoryMatthieu Lerasle
These notes gather recent results on robust statistical learning theory. The goal is to stress the main principles underlying the construction and theoretical analysis of these estimators rather than provide an exhaustive account on this rapidly growing field. The notes are the basis of lectures given at the conference StatMathAppli 2019.
MLMay 17, 2019
Pair-Matching: Links Prediction with Adaptive QueriesChristophe Giraud, Yann Issartel, Luc Lehéricy et al.
The pair-matching problem appears in many applications where one wants to discover good matches between pairs of entities or individuals. Formally, the set of individuals is represented by the nodes of a graph where the edges, unobserved at first, represent the good matches. The algorithm queries pairs of nodes and observes the presence/absence of edges. Its goal is to discover as many edges as possible with a fixed budget of queries. Pair-matching is a particular instance of multi-armed bandit problem in which the arms are pairs of individuals and the rewards are edges linking these pairs. This bandit problem is non-standard though, as each arm can only be played once. Given this last constraint, sublinear regret can be expected only if the graph presents some underlying structure. This paper shows that sublinear regret is achievable in the case where the graph is generated according to a Stochastic Block Model (SBM) with two communities. Optimal regret bounds are computed for this pair-matching problem. They exhibit a phase transition related to the Kesten-Stigum threshold for community detection in SBM. The pair-matching problem is considered in the case where each node is constrained to be sampled less than a given amount of times. We show how optimal regret rates depend on this constraint. The paper is concluded by a conjecture regarding the optimal regret when the number of communities is larger than 2. Contrary to the two communities case, we argue that a statistical-computational gap would appear in this problem.
MLFeb 13, 2018
MONK -- Outlier-Robust Mean Embedding Estimation by Median-of-MeansMatthieu Lerasle, Zoltan Szabo, Timothee Mathieu et al.
Mean embeddings provide an extremely flexible and powerful tool in machine learning and statistics to represent probability distributions and define a semi-metric (MMD, maximum mean discrepancy; also called N-distance or energy distance), with numerous successful applications. The representation is constructed as the expectation of the feature map defined by a kernel. As a mean, its classical empirical estimator, however, can be arbitrary severely affected even by a single outlier in case of unbounded features. To the best of our knowledge, unfortunately even the consistency of the existing few techniques trying to alleviate this serious sensitivity bottleneck is unknown. In this paper, we show how the recently emerged principle of median-of-means can be used to design estimators for kernel mean embedding and MMD with excessive resistance properties to outliers, and optimal sub-Gaussian deviation bounds under mild assumptions.
STOct 22, 2012
Choice of V for V-Fold Cross-Validation in Least-Squares Density EstimationSylvain Arlot, Matthieu Lerasle
This paper studies V-fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for V-fold cross-validation and its bias-corrected version (V-fold penalization). In particular, this result implies that V-fold penalization is asymptotically optimal in the nonparametric case. Then, we compute the variance of V-fold cross-validation and related criteria, as well as the variance of key quantities for model selection performance. We show that these variances depend on V like 1+4/(V-1), at least in some particular cases, suggesting that the performance increases much from V=2 to V=5 or 10, and then is almost constant. Overall, this can explain the common advice to take V=5---at least in our setting and when the computational power is limited---, as supported by some simulation experiments. An oracle inequality and exact formulas for the variance are also proved for Monte-Carlo cross-validation, also known as repeated cross-validation, where the parameter V is replaced by the number B of random splits of the data.