ME, ML · Oct 7, 2021

A fast and effective kernel two-sample test for large-scale data

arXiv:2110.03118v2 · 3 citations
Originality: Highly original
AI Analysis

This work addresses the computational cost of kernel two-sample testing on large-scale data, offering a faster and more effective test for researchers and practitioners working with high-dimensional datasets.

The authors tackle the computational cost and limited power of existing kernel two-sample tests on high-dimensional, large-scale data. They propose a new test that achieves high power across a wide range of alternatives, is more robust to high dimensions, and does not rely on data splitting to optimize the kernel bandwidth or other parameters.

Kernel two-sample tests have been widely used, and the development of efficient methods for high-dimensional, large-scale data is receiving increasing attention in the big data era. However, existing methods, such as the maximum mean discrepancy (MMD) and recently proposed kernel-based tests for large-scale data, are computationally intensive and/or ineffective for some common alternatives in high-dimensional data. In this paper, we propose a new test that exhibits high power across a wide range of alternatives. Furthermore, the new test is more robust to high dimensions than existing methods and does not require optimization procedures for choosing kernel bandwidth and other parameters through data splitting. Numerical studies demonstrate that the new approach performs well on both synthetic and real-world data.
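For orientation, the MMD baseline that the abstract compares against can be sketched as follows. This is not the test proposed in the paper, whose statistic is not described in this summary. It is a minimal illustration of a standard MMD permutation test, assuming a Gaussian kernel with a median-heuristic bandwidth. The function names (`gaussian_kernel`, `median_bandwidth`, `mmd2_unbiased`, `mmd_permutation_test`) are hypothetical and chosen for this example only. The pooled Gram matrix has quadratic size in the total sample size, which is one reason the abstract can call such tests computationally intensive on large-scale data.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def median_bandwidth(points):
    """Median heuristic (an assumption here): median pairwise Euclidean distance."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    mid = len(dists) // 2
    med = dists[mid] if len(dists) % 2 else 0.5 * (dists[mid - 1] + dists[mid])
    return med if med > 0 else 1.0


def mmd2_unbiased(gram, idx_x, idx_y):
    """Unbiased estimate of squared MMD from a precomputed pooled Gram matrix."""
    m, n = len(idx_x), len(idx_y)
    k_xx = sum(gram[i][j] for i in idx_x for j in idx_x if i != j) / (m * (m - 1))
    k_yy = sum(gram[i][j] for i in idx_y for j in idx_y if i != j) / (n * (n - 1))
    k_xy = sum(gram[i][j] for i in idx_x for j in idx_y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def mmd_permutation_test(xs, ys, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation p-value); each sample needs >= 2 points."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    bandwidth = median_bandwidth(pooled)
    total = len(pooled)
    # The Gram matrix costs O(N^2) kernel evaluations and O(N^2) memory.
    gram = [
        [gaussian_kernel(pooled[i], pooled[j], bandwidth) for j in range(total)]
        for i in range(total)
    ]
    m = len(xs)
    indices = list(range(total))
    observed = mmd2_unbiased(gram, indices[:m], indices[m:])
    exceed = 0
    for _ in range(n_perm):
        # Relabel the pooled points at random to simulate the null hypothesis.
        rng.shuffle(indices)
        if mmd2_unbiased(gram, indices[:m], indices[m:]) >= observed:
            exceed += 1
    # The +1 correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Calling `mmd_permutation_test(xs, ys)` on two lists of equal-length tuples returns the observed statistic and a p-value. A small p-value suggests the two samples come from different distributions. The fixed median-heuristic bandwidth is a common default rather than an optimized choice.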
