MLOct 28, 2021
MMD Aggregated Two-Sample TestAntonin Schrab, Ilmun Kim, Mélisande Albert et al.
We propose two novel nonparametric two-sample kernel tests based on the Maximum Mean Discrepancy (MMD). First, for a fixed kernel, we construct an MMD test using either permutations or a wild bootstrap, two popular numerical procedures to determine the test threshold. We prove that this test controls the probability of type I error non-asymptotically. Hence, it can be used reliably even in settings with small sample sizes as it remains well-calibrated, which differs from previous MMD tests which only guarantee correct test level asymptotically. When the difference in densities lies in a Sobolev ball, we prove minimax optimality of our MMD test with a specific kernel depending on the smoothness parameter of the Sobolev ball. In practice, this parameter is unknown and, hence, the optimal MMD test with this particular kernel cannot be used. To overcome this issue, we construct an aggregated test, called MMDAgg, which is adaptive to the smoothness parameter. The test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We prove that MMDAgg still controls the level non-asymptotically, and achieves the minimax rate over Sobolev balls, up to an iterated logarithmic term. Our guarantees are not restricted to a specific type of kernel, but hold for any product of one-dimensional translation invariant characteristic kernels. We provide a user-friendly parameter-free implementation of MMDAgg using an adaptive collection of bandwidths. We demonstrate that MMDAgg significantly outperforms alternative state-of-the-art MMD-based two-sample tests on synthetic data satisfying the Sobolev smoothness assumption, and that, on real-world image data, MMDAgg closely matches the power of tests leveraging the use of models such as neural networks.
STFeb 11, 2020
Minimax optimal goodness-of-fit testing for densities and multinomials under a local differential privacy constraintJoseph Lam-Weil, Béatrice Laurent, Jean-Michel Loubes
Finding anonymization mechanisms to protect personal data is at the heart of recent machine learning research. Here, we consider the consequences of local differential privacy constraints on goodness-of-fit testing, i.e. the statistical problem assessing whether sample points are generated from a fixed density $f_0$, or not. The observations are kept hidden and replaced by a stochastic transformation satisfying the local differential privacy constraint. In this setting, we propose a testing procedure which is based on an estimation of the quadratic distance between the density $f$ of the unobserved samples and $f_0$. We establish an upper bound on the separation distance associated with this test, and a matching lower bound on the minimax separation rates of testing under non-interactive privacy in the case that $f_0$ is uniform, in discrete and continuous settings. To the best of our knowledge, we provide the first minimax optimal test and associated private transformation under a local differential privacy constraint over Besov balls in the continuous setting, quantifying the price to pay for data privacy. We also present a test that is adaptive to the smoothness parameter of the unknown density and remains minimax optimal up to a logarithmic factor. Finally, we note that our results can be translated to the discrete case, where the treatment of probability vectors is shown to be equivalent to that of piecewise constant densities in our setting. That is why we work with a unified setting for both the continuous and the discrete cases.
CYSep 28, 2018
Wikistat 2.0: Educational Resources for Artificial IntelligencePhilippe Besse, Brendan Guillouet, Béatrice Laurent
Big data, data science, deep learning, artificial intelligence are the key words of intense hype related with a job market in full evolution, that impose to adapt the contents of our university professional trainings. Which artificial intelligence is mostly concerned by the job offers? Which methodologies and technologies should be favored in the training programs? Which objectives, tools and educational resources do we needed to put in place to meet these pressing needs? We answer these questions in describing the contents and operational resources in the Data Science orientation of the specialty Applied Mathematics at INSA Toulouse. We focus on basic mathematics training (Optimization, Probability, Statistics), associated with the practical implementation of the most performing statistical learning algorithms, with the most appropriate technologies and on real examples. Considering the huge volatility of the technologies, it is imperative to train students in seft-training, this will be their technological watch tool when they will be in professional activity. This explains the structuring of the educational site github.com/wikistat into a set of tutorials. Finally, to motivate the thorough practice of these tutorials, a serious game is organized each year in the form of a prediction contest between students of Master degrees in Applied Mathematics for IA.
MLDec 13, 2017
Multiple testing for outlier detection in functional dataClémentine Barreyre, Béatrice Laurent, Jean-Michel Loubes et al.
We propose a novel procedure for outlier detection in functional data, in a semi-supervised framework. As the data is functional, we consider the coefficients obtained after projecting the observations onto orthonormal bases (wavelet, PCA). A multiple testing procedure based on the two-sample test is defined in order to highlight the levels of the coefficients on which the outliers appear as significantly different to the normal data. The selected coefficients are then called features for the outlier detection, on which we compute the Local Outlier Factor to highlight the outliers. This procedure to select the features is applied on simulated data that mimic the behaviour of space telemetries, and compared with existing dimension reduction techniques.