LG, ML · Oct 4, 2025

From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning

arXiv:2510.03690v2
h-index: 3
Originality: Highly original
AI Analysis

The paper addresses the challenge of learning from graph datasets that mix several underlying populations, and offers an approach that improves both unsupervised and supervised graph learning tasks.

The paper tackles the problem of graph representation learning on datasets with mixtures of underlying generative models by proposing a unified framework that clusters graphs using motif densities to disentangle mixture components. This enables a graphon-mixture-aware mixup (GMAM) for data augmentation and a model-aware contrastive learning method (MGCL), achieving state-of-the-art results with top average rank in unsupervised learning on eight datasets and new SOTA accuracy in 6 out of 7 supervised datasets.
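The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of three motifs (edge, 2-path, triangle), the helper names `motif_densities`, `kmeans`, and `sample_er`, and the use of plain k-means on Erdős–Rényi toy graphs are all assumptions made for illustration.

```python
import numpy as np

def motif_densities(adj):
    """Homomorphism densities of the edge, 2-path, and triangle in a simple graph."""
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    deg = a.sum(axis=1)
    edge = a.sum() / n**2                # hom(K2, G) / n^2
    wedge = (deg**2).sum() / n**3        # hom(P3, G) / n^3
    tri = np.trace(a @ a @ a) / n**3     # hom(K3, G) / n^3
    return np.array([edge, wedge, tri])

def kmeans(x, k, iters=20):
    """Plain Lloyd iterations with farthest-point initialisation (illustrative only)."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # skip empty clusters
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def sample_er(n, p, rng):
    """Symmetric adjacency matrix of an Erdős–Rényi graph (toy stand-in for a graphon)."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(float)

# Toy mixture: 10 sparse graphs and 10 dense graphs.
rng = np.random.default_rng(0)
graphs = [sample_er(40, 0.1, rng) for _ in range(10)]
graphs += [sample_er(40, 0.5, rng) for _ in range(10)]
features = np.stack([motif_densities(g) for g in graphs])
labels = kmeans(features, k=2)
```

In this toy setting the two generative models differ strongly in edge density, so the moment vectors separate cleanly; real datasets would likely need more motifs and a more careful clustering procedure.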

Real-world graph datasets often consist of mixtures of populations, where graphs are generated from multiple distinct underlying distributions. However, modern representation learning approaches, such as graph contrastive learning (GCL) and augmentation methods like Mixup, typically overlook this mixture structure. In this work, we propose a unified framework that explicitly models data as a mixture of underlying probabilistic graph generative models represented by graphons. To characterize these graphons, we leverage graph moments (motif densities) to cluster graphs arising from the same model. This enables us to disentangle the mixture components and identify their distinct generative mechanisms. This model-aware partitioning benefits two key graph learning tasks: 1) It enables a graphon-mixture-aware mixup (GMAM), a data augmentation technique that interpolates in a semantically valid space guided by the estimated graphons, instead of assuming a single graphon per class. 2) For GCL, it enables model-adaptive and principled augmentations. Additionally, by introducing a new model-aware objective, our proposed approach (termed MGCL) improves negative sampling by restricting negatives to graphs from other models. We establish a key theoretical guarantee: a novel, tighter bound showing that graphs sampled from graphons with small cut distance will have similar motif densities with high probability. Extensive experiments on benchmark datasets demonstrate strong empirical performance. In unsupervised learning, MGCL achieves state-of-the-art results, obtaining the top average rank across eight datasets. In supervised learning, GMAM consistently outperforms existing strategies, achieving new state-of-the-art accuracy in 6 out of 7 datasets.
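The idea of restricting negatives to graphs from other models can be sketched as a masked InfoNCE-style loss. This is a hedged illustration, not the MGCL objective from the paper: the function name `model_aware_info_nce`, the temperature `tau`, and the cosine-similarity formulation are assumptions, and the cluster labels are taken as given.

```python
import numpy as np

def model_aware_info_nce(z1, z2, cluster, tau=0.5):
    """InfoNCE over two views, dropping negatives that share the anchor's cluster."""
    cluster = np.asarray(cluster)
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                          # scaled cosine similarities
    keep = cluster[:, None] != cluster[None, :]    # negatives: other clusters only
    np.fill_diagonal(keep, True)                   # the positive pair is always kept
    logits = np.where(keep, sim, -np.inf)
    m = logits.max(axis=1, keepdims=True)          # stabilise the log-sum-exp
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_denom - np.diag(sim)))

# Toy embeddings for two augmented views of six graphs in two estimated clusters.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(6, 8))
z2 = z1 + 0.1 * rng.normal(size=(6, 8))
cluster = np.array([0, 0, 0, 1, 1, 1])

loss_masked = model_aware_info_nce(z1, z2, cluster)
loss_plain = model_aware_info_nce(z1, z2, np.arange(6))   # every other graph is a negative
loss_single = model_aware_info_nce(z1, z2, np.zeros(6, dtype=int))  # no negatives remain
```

Masking removes terms from the denominator, so the masked loss is never larger than the unmasked one on the same embeddings; when all graphs fall in a single cluster no negatives remain and the loss collapses to zero, which is one reason a meaningful partition matters.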
