ML | LG | Aug 7, 2023

Partial identification of kernel based two sample tests with mismeasured data

arXiv:2308.03570v1 | 2.3 | h-index: 12

Originality: Incremental advance

AI Analysis

The paper addresses a practical issue for machine learning researchers and practitioners who apply nonparametric two-sample tests to noisy data. It is an incremental advance: it extends existing methods to handle contamination.

The paper tackles the problem of unreliable Maximum Mean Discrepancy (MMD) estimation in two-sample tests when data is mismeasured due to ε-contamination, and it proposes a method that provides sharp bounds on the MMD with faster convergence rates and superior empirical performance on three datasets.

Nonparametric two-sample tests such as the Maximum Mean Discrepancy (MMD) are often used to detect differences between two distributions in machine learning applications. However, the majority of existing literature assumes that error-free samples from the two distributions of interest are available. We relax this assumption and study the estimation of the MMD under $ε$-contamination, where a possibly non-random $ε$ proportion of one distribution is erroneously grouped with the other. We show that under $ε$-contamination, the typical estimate of the MMD is unreliable. Instead, we study partial identification of the MMD, and characterize sharp upper and lower bounds that contain the true, unknown MMD. We propose a method to estimate these bounds, and show that it gives estimates that converge to the sharpest possible bounds on the MMD as sample size increases, with a convergence rate that is faster than alternative approaches. Using three datasets, we empirically validate that our approach is superior to the alternatives: it gives tight bounds with a low false coverage rate.
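
To make the contamination problem concrete, here is a minimal sketch of why the typical MMD estimate becomes unreliable. It is not the paper's bound estimator. It assumes a Gaussian kernel with a fixed bandwidth, the standard biased (V-statistic) estimate of the squared MMD, and a randomly chosen contaminated subset (the abstract also allows non-random contamination). The names `gaussian_kernel`, `mmd2_biased`, and `contaminate` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb round-off.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


def contaminate(x, y, eps, rng):
    """Move an eps proportion of y into the x group.

    Assumption: the mislabelled rows are chosen uniformly at random here;
    the setting described in the abstract also covers non-random choices.
    """
    n_swap = int(eps * len(y))
    idx = rng.choice(len(y), size=n_swap, replace=False)
    x_obs = np.concatenate([x, y[idx]], axis=0)
    y_obs = np.delete(y, idx, axis=0)
    return x_obs, y_obs


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # clean sample from P
y = rng.normal(loc=1.0, scale=1.0, size=(500, 2))  # clean sample from Q

x_obs, y_obs = contaminate(x, y, eps=0.3, rng=rng)

print("MMD^2 on clean samples:       ", mmd2_biased(x, y))
print("MMD^2 on contaminated samples:", mmd2_biased(x_obs, y_obs))
```

In this setup the observed first group is a mixture of the two distributions, so the naive estimate on the contaminated samples understates the discrepancy between them. Other contamination patterns can bias the estimate in other ways, which is the motivation for reporting bounds on the MMD rather than a single point estimate.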
