LG AI | May 8

GAD in the Wild: Benchmarking Graph Anomaly Detection under Realistic Deployment Challenges

Jingjing Zhou, Shiyu Huang, Qing Qing, Zuquan Yuan, Huafei Huang, Ziqi Xu, Mingliang Hou, Xikun Zhang, Renqiang Luo, Ivan Lee

arXiv:2605.07133 | 87.1 | Has Code

AI Analysis

For practitioners deploying GAD in real-world applications (e.g., fraud detection), this benchmark highlights critical gaps between academic evaluation and production robustness, showing that strong lab performance does not guarantee real-world effectiveness.

This paper introduces a multi-dimensional benchmark for Graph Anomaly Detection (GAD) that evaluates models under realistic challenges: million-scale graphs, extreme anomaly scarcity (e.g., 0.1% anomaly ratio), and missing node attributes. Evaluation of nine GAD models reveals that most GNN-based methods fail to scale, detection performance drops sharply (often zero recall) under low anomaly ratios, and reconstruction-based models are sensitive to attribute imputation.

Graph Anomaly Detection (GAD) is a critical task in graph machine learning with vital applications in financial fraud detection and social platform governance. However, existing GAD benchmarks are often restricted to small-scale, curated graphs with relatively balanced anomaly ratios, leaving a substantial gap between academic evaluation and real-world deployment. To bridge this gap, we present a multi-dimensional benchmark that systematically evaluates GAD models under three deployment-relevant challenges: million-scale graphs, extreme anomaly scarcity, and missing node attributes. We derive a family of controlled benchmark variants from five diverse graphs, including two native industrial-scale datasets with over 3.7 million nodes. Our extensive evaluation of nine representative GAD models reveals three major limitations: (1) most GNN-based methods fail to scale to million-node graphs due to prohibitive memory requirements; (2) detection performance drops sharply under realistic anomaly ratios (e.g., 0.1%), often resulting in zero recall; and (3) reconstruction-based models are highly sensitive to attribute imputation strategies. Our findings suggest that strong performance in laboratory settings does not guarantee robustness in production environments. We release this benchmark and empirical evaluation as a diagnostic testbed to promote the development of robust and scalable GAD systems for large-scale, imperfect graphs encountered in practice. Code is available at https://anonymous.4open.science/r/Benchmark_GAD-E7A3.
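The abstract does not say how the controlled variants are constructed, so the following is only an illustrative sketch of two of the three challenges, not the benchmark's actual procedure. It assumes that anomaly scarcity is simulated by keeping every normal node and subsampling anomalies to a target ratio, and that missing attributes are simulated by hiding the feature vectors of randomly chosen nodes and filling them with the mean of the observed ones. The function names `downsample_anomalies` and `mask_and_mean_impute` are hypothetical and do not come from the released code.

```python
import random


def downsample_anomalies(labels, target_ratio, seed=0):
    """Pick node indices so anomalies form roughly target_ratio of kept nodes.

    Assumption: every normal node (label 0) is kept and anomalies (label 1)
    are subsampled. A real pipeline would also have to drop the edges
    attached to the removed nodes.
    """
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    # Solve k / (len(normal) + k) = target_ratio for k.
    k = round(target_ratio * len(normal) / (1.0 - target_ratio))
    k = max(1, min(k, len(anomalous)))
    kept_anomalies = rng.sample(anomalous, k)
    return sorted(normal + kept_anomalies)


def mask_and_mean_impute(features, missing_rate, seed=0):
    """Hide a random fraction of nodes' attributes, then mean-impute them.

    Returns the imputed feature rows and the set of node indices whose
    attributes were treated as missing.
    """
    if not 0.0 <= missing_rate < 1.0:
        raise ValueError("missing_rate must be in [0, 1)")
    rng = random.Random(seed)
    n, dim = len(features), len(features[0])
    missing = set(rng.sample(range(n), int(missing_rate * n)))
    observed = [features[i] for i in range(n) if i not in missing]
    # Column-wise mean over the nodes whose attributes remain visible.
    mean = [sum(row[d] for row in observed) / len(observed) for d in range(dim)]
    imputed = [list(mean) if i in missing else list(features[i]) for i in range(n)]
    return imputed, missing


# Usage: a toy label vector with 5% anomalies, reduced to about 0.1%.
labels = [1] * 500 + [0] * 9500
keep = downsample_anomalies(labels, target_ratio=0.001)
kept_anomalies = sum(labels[i] for i in keep)
```

At a 0.1% ratio the toy graph retains only a handful of anomalies, so a detector that misses them all reports zero recall. Mean imputation is only one possible strategy; the abstract states that reconstruction-based models are sensitive to the choice.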
