LG | AI | Feb 6

Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity

arXiv:2602.07154v1 | h-index: 19 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses robust generalization under data heterogeneity, a problem that matters to machine learning practitioners, particularly in domains such as medical anomaly detection. It offers a novel method for a known bottleneck rather than a foundational breakthrough.

The paper tackles the biased estimators that arise in representation learning when heterogeneous datasets are pooled across domains. It proposes a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The results show that matching outperforms naive pooling and uniform subsampling under asymmetric meta-distributions, with improvements demonstrated in zero-shot medical anomaly detection.

Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. Double robustness and propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, and that these results extend to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available at https://github.com/AyushRoy2001/Beyond-Pooling.
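
The idea of selecting samples relative to an adaptive centroid and refining iteratively can be illustrated with a small toy sketch. This is not the authors' implementation (see their repository for that), and it leaves out the propensity-score and double-robustness components entirely. It assumes Euclidean distance in a fixed embedding space and a fixed keep fraction; the names `match_to_adaptive_centroid`, `keep_fraction`, and `n_iters` are hypothetical and chosen only for illustration.

```python
import numpy as np

def match_to_adaptive_centroid(embeddings, keep_fraction=0.5, n_iters=10, tol=1e-6):
    """Toy matching loop (illustrative assumption, not the paper's algorithm).

    Repeatedly keeps the samples closest to the current centroid, then
    re-estimates the centroid from the kept samples only.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(embeddings))))
    centroid = embeddings.mean(axis=0)  # start from the naive pooled mean
    for _ in range(n_iters):
        # Distance of every sample to the current centroid.
        dists = np.linalg.norm(embeddings - centroid, axis=1)
        keep_idx = np.argsort(dists)[:n_keep]
        new_centroid = embeddings[keep_idx].mean(axis=0)
        converged = np.linalg.norm(new_centroid - centroid) < tol
        centroid = new_centroid
        if converged:
            break
    return np.sort(keep_idx), centroid

# Synthetic asymmetric pool: a large domain near the origin plus a small,
# far-away domain that drags the pooled mean off target.
rng = np.random.default_rng(0)
main_domain = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
shifted_domain = rng.normal(loc=8.0, scale=1.0, size=(60, 2))
pool = np.vstack([main_domain, shifted_domain])

pooled_mean = pool.mean(axis=0)
kept, centroid = match_to_adaptive_centroid(pool, keep_fraction=0.5)

print("pooled mean norm:     ", np.linalg.norm(pooled_mean))
print("matched centroid norm:", np.linalg.norm(centroid))
print("kept samples from shifted domain:", int((kept >= 300).sum()))
```

In this toy setup the pooled mean is pulled toward the small shifted domain, while the matched centroid settles near the dominant domain and the shifted samples are filtered out. Which domains count as confounding in a real setting is a modeling decision that this sketch does not address.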

