LG | Apr 5, 2023

A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

arXiv:2304.02189v1 | 9 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses the interactive exploration of big data for regulatory agencies, policy makers, and concerned citizens. Its contribution is incremental: it combines existing techniques, namely k-means clustering and subset scanning, rather than introducing a new method.

The authors tackled the challenge of exploring large datasets by developing a system that uses an iterative k-means searchlight and subset scanning to identify outliers, applying it to open health data from New York State to uncover anomalies like cost overruns at specific hospitals and increases in diagnoses such as suicides.

The interactive exploration of large and evolving datasets is challenging as relationships between underlying variables may not be fully understood. There may be hidden trends and patterns in the data that are worthy of further exploration and analysis. We present a system that methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through a split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters. This algorithm is swept across the dataset in a searchlight manner. The dimensions that contain outliers are combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers. We illustrate this system by analyzing open health care data released by New York State. We apply our iterative k-means searchlight followed by subset scanning. Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides. These constitute novel findings in the literature, and are of potential use to regulatory agencies, policy makers and concerned citizens.
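The pipeline described in the abstract (split-apply-combine features, iterative k-means, small clusters flagged as outliers, swept over variable combinations) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the farthest-point initialisation, the stopping rule (repeat until no cluster has at most `max_size` members), and the toy records are all assumptions made for illustration, and the subset-scan step is not covered.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def split_apply_combine(records, key, value):
    """Split records by `key`, apply a mean to `value`, combine into
    one 1-D feature vector per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    return {k: (mean(v),) for k, v in groups.items()}


def _dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialisation
    (an assumption; the paper's initialisation is not stated here)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(_dist2(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: _dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(mean(dim) for dim in zip(*members))
    return labels


def iterative_kmeans_outliers(features, k=3, max_size=1, max_rounds=10):
    """Flag members of clusters with at most `max_size` points, remove
    them, and re-cluster until no small cluster remains (assumed rule)."""
    remaining = dict(features)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:  # too few groups to cluster meaningfully
            break
        names = list(remaining)
        labels = kmeans([remaining[n] for n in names], k)
        sizes = defaultdict(int)
        for lab in labels:
            sizes[lab] += 1
        small = [n for n, lab in zip(names, labels) if sizes[lab] <= max_size]
        if not small:
            break
        outliers.extend(small)
        for n in small:
            del remaining[n]
    return outliers


def searchlight(records, keys, values, **kwargs):
    """Sweep the outlier detector over every (group key, value) pair."""
    return {
        (key, value): iterative_kmeans_outliers(
            split_apply_combine(records, key, value), **kwargs
        )
        for key, value in product(keys, values)
    }


# Hypothetical toy data: two discharge records per hospital.
mean_cost = {"A": 100, "B": 104, "C": 98, "D": 200, "E": 205,
             "F": 196, "G": 900, "H": 300, "I": 304, "J": 297}
records = [
    {"hospital": h, "region": "north" if h < "F" else "south", "cost": c + d}
    for h, c in mean_cost.items()
    for d in (-10, 10)
]
result = searchlight(records, keys=["hospital", "region"], values=["cost"])
```

On this toy data, hospital G, whose mean cost is far above the three natural cost groups, ends up in a singleton cluster and is flagged, while the `region` dimension has too few groups to cluster and yields no outliers. Results on real data depend heavily on `k`, `max_size`, and the initialisation, so these should be treated as tunable assumptions.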

