Complement Submodular Information Measures for Balanced and Robust Data Selection

arXiv:2605.24779

AI Analysis

For machine learning practitioners who need balanced and robust data selection (e.g., train/validation/test splitting, benchmark construction), this paper offers a principled framework that, in the authors' experiments, outperforms standard submodular objectives on robust hidden-slice-aware subset selection.

This work introduces Complement Submodular Information (CSI), a new class of submodular objectives that explicitly preserve structural information between a selected subset and its complement. Under bounded curvature conditions, CSI objectives are approximately monotone and so admit near-(1-1/e) greedy approximation guarantees. Empirically, they consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection, significantly improving preservation of rare/tail semantic structure while suppressing outliers.

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.
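
The abstract does not state the CSI formula, so the following is only a minimal sketch of what a complement-aware Facility Location objective and its greedy maximization might look like. It assumes the objective takes the standard submodular mutual information form between a subset A and its complement, I_f(A; V \ A) = f(A) + f(V \ A) - f(V), which may differ from the paper's definition. The function names, the RBF similarity, and the toy data (two clusters plus one isolated point) are all hypothetical.

```python
import numpy as np


def facility_location(sim, subset):
    """Classical Facility Location: f(A) = sum_i max_{j in A} sim[i, j], with f(empty) = 0."""
    idx = sorted(subset)
    if not idx:
        return 0.0
    return float(sim[:, idx].max(axis=1).sum())


def complement_information(sim, subset):
    """ASSUMED complement-aware objective: f(A) + f(V minus A) - f(V).

    This is the generic submodular mutual information between A and its
    complement; it is an illustration, not the paper's stated definition.
    """
    ground = set(range(sim.shape[0]))
    rest = ground - set(subset)
    return (
        facility_location(sim, subset)
        + facility_location(sim, rest)
        - facility_location(sim, ground)
    )


def greedy_select(objective, sim, budget):
    """Plain greedy under a cardinality constraint: add the best marginal gain each round."""
    selected = []
    remaining = set(range(sim.shape[0]))
    for _ in range(budget):
        base = objective(sim, selected)
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(remaining),
            key=lambda j: objective(sim, selected + [j]) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: a large cluster, a small cluster, and one isolated point (index 12).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(8, 2)),
    rng.normal(3.0, 0.3, size=(4, 2)),
    [[10.0, 10.0]],
])
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
sim = np.exp(-sq_dists)  # RBF similarity, an arbitrary choice for this sketch

chosen = greedy_select(complement_information, sim, budget=3)
print("selected indices:", chosen)
```

Under this assumed form, an isolated point contributes about as much to f(A) as it removes from f(V \ A), so its marginal gain is close to zero and greedy does not pick it first. That behavior is in line with the outlier suppression the abstract describes, though the paper's own objectives and guarantees should be taken from the paper itself.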
