LG ML · Dec 23, 2020

On Using Classification Datasets to Evaluate Graph-Level Outlier Detection: Peculiar Observations and New Insights

arXiv:2012.12931v3 · 17.685 citations · Has Code

Originality: Incremental advance

AI Analysis

This study identifies a critical evaluation flaw in graph-level outlier detection that can lead researchers and practitioners to misinterpret model performance.

This paper investigates the common practice of repurposing classification datasets for evaluating Graph-Level Outlier Detection (GLOD) models. It reveals that ROC-AUC performance significantly changes, even flipping from high to very low, depending on which class is down-sampled, with the ROC-AUCs of the two variants approximately summing to 1.

It is common practice in the outlier mining community to repurpose classification datasets for evaluating various detection models. To that end, a binary classification dataset is often used, where samples from one of the classes are designated as the inlier samples, and the other class is substantially down-sampled to create the ground-truth outlier samples. Graph-level outlier detection (GLOD) is rarely studied but has many potentially influential real-world applications. In this study, we identify an intriguing issue with repurposing graph classification datasets for GLOD. We find that the ROC-AUC performance of the models changes significantly (flips from high to very low, even worse than random) depending on which class is down-sampled. Interestingly, the ROC-AUCs on these two variants approximately sum to 1, and their performance gap is amplified with an increasing number of propagations for a certain family of propagation-based outlier detection models. We carefully study the graph embedding space produced by propagation-based models and find two driving factors: (1) disparity between within-class densities, which is amplified by propagation, and (2) overlapping support (mixing of embeddings) across classes. We also study other graph embedding methods and downstream outlier detectors, and find that the intriguing performance flip issue still widely exists, but which down-sampled variant achieves higher performance may vary. Thoughtful analysis over comprehensive results further deepens our understanding of the established issue.
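The flip described in the abstract can be reproduced in miniature. The sketch below is an illustration only, not the paper's pipeline: it replaces graph embeddings with synthetic 2-D points, and the k-nearest-neighbor distance detector, the class sizes, the spreads, and the 10% down-sampling rate are all assumptions chosen to mimic the two stated factors (unequal within-class densities and overlapping support). All function and variable names are hypothetical.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbor (larger = more outlying)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting, column 0 is each point's zero distance to itself.
    return np.sort(dists, axis=1)[:, k]

def roc_auc(scores, is_outlier):
    """ROC-AUC as the chance that a random outlier outscores a random inlier (ties count half)."""
    pos = scores[is_outlier][:, None]
    neg = scores[~is_outlier][None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def glod_variant_auc(inlier_emb, outlier_class_emb, rate, rng, k=5):
    """Keep one class whole as inliers, down-sample the other as outliers, then score and evaluate."""
    n_out = max(1, int(rate * len(outlier_class_emb)))
    keep = rng.choice(len(outlier_class_emb), size=n_out, replace=False)
    X = np.vstack([inlier_emb, outlier_class_emb[keep]])
    is_outlier = np.arange(len(X)) >= len(inlier_emb)
    return roc_auc(knn_outlier_scores(X, k), is_outlier)

rng = np.random.default_rng(0)
# Assumed stand-ins for embeddings: same center (overlapping support), different density.
dense_class = rng.normal(0.0, 0.5, size=(500, 2))
sparse_class = rng.normal(0.0, 3.0, size=(500, 2))

# Variant A: the sparse class is down-sampled and treated as the outliers.
auc_sparse_down = glod_variant_auc(dense_class, sparse_class, 0.1, rng)
# Variant B: the dense class is down-sampled and treated as the outliers.
auc_dense_down = glod_variant_auc(sparse_class, dense_class, 0.1, rng)

print(f"sparse class down-sampled: ROC-AUC = {auc_sparse_down:.3f}")
print(f"dense class down-sampled:  ROC-AUC = {auc_dense_down:.3f}")
print(f"sum of the two variants:   {auc_sparse_down + auc_dense_down:.3f}")
```

In this toy setting the down-sampled dense class sits inside the densest region of the data, so a density-style detector ranks those "outliers" as more normal than the inliers, which pushes the ROC-AUC below 0.5. For a fixed scoring function, swapping which class is labeled the outlier turns an ROC-AUC of a into 1 - a; because the detector here is refit on each variant, the two values only approximately sum to 1.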

