Unsupervised Domain Shift Detection with Interpretable Subspace Attribution
For practitioners in domains such as medical imaging or bioinformatics, the method offers a practical tool for uncovering hidden cohort biases in unlabeled datasets before modeling. Its contribution is incremental, however, since it combines existing density-estimation and subspace-attribution techniques.
The paper presents a method for detecting and interpreting domain shifts in high-dimensional data by identifying localized density anomalies and attributing them to specific feature subspaces. It validates the approach on synthetic benchmarks and demonstrates its utility in detecting device-induced shifts in ECG recordings, enabling the extraction of subsets with no detectable residual shift for downstream modeling.
We developed a tool for detecting domain shifts, namely subtle differences in the probability distributions of datasets. We identify these shifts using an algorithm designed to detect localised density anomalies in high-dimensional feature spaces. If an anomaly is present, we then identify the feature subspace in which the anomaly is most pronounced. This allows us to trace the domain shift to a small set of features, making the shift interpretable. Moreover, we provide a protocol for compensating for domain shifts by extracting, from two unlabelled datasets, subsets of samples with no detectable residual distributional difference. We validate the framework on controlled 20-dimensional benchmarks with known ground truth, recovering both broad and localised shifts together with their supporting feature subspaces. We then apply it to healthy electrocardiogram (ECG) recordings represented by 782 features. In age- and sex-matched cohort comparisons differing in measurement-device composition, the method detects device-induced shifts, extracts representative subsets enriched in the imbalanced device components, and identifies ECG features associated with the acquisition contrast. These results suggest that density-shift detection and subspace attribution provide a practical framework for uncovering hidden cohort biases before downstream modelling.
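The detect-then-attribute workflow described above can be illustrated with a small, self-contained sketch. It is not the algorithm used in this work, whose details are not given here. As stand-ins, it assumes a k-nearest-neighbour label-mixing statistic with a permutation test for detection, and a per-feature two-sample Kolmogorov-Smirnov statistic for attribution. All function names, parameter values, and the synthetic data (20 features, two of them shifted) are hypothetical choices made for illustration.

```python
import math
import random


def nearest_neighbours(points, k):
    """For each point, indices of its k nearest other points (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result


def same_label_fraction(neighbours, labels):
    """Fraction of neighbour links joining two points with the same label."""
    same = sum(labels[i] == labels[j]
               for i, nbrs in enumerate(neighbours) for j in nbrs)
    return same / sum(len(nbrs) for nbrs in neighbours)


def knn_shift_test(a, b, k=5, n_perm=200, seed=0):
    """Stand-in detector: permutation test on kNN label mixing.

    If the two samples share one distribution, neighbours are label-mixed;
    an excess of same-label neighbours indicates a density difference.
    """
    rng = random.Random(seed)
    neighbours = nearest_neighbours(a + b, k)
    labels = [0] * len(a) + [1] * len(b)
    observed = same_label_fraction(neighbours, labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between label and position
        if same_label_fraction(neighbours, labels) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic (max gap between ECDFs)."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap


def rank_features(a, b):
    """Stand-in attribution: rank single features by their KS statistic."""
    dim = len(a[0])
    scores = [ks_statistic([row[f] for row in a], [row[f] for row in b])
              for f in range(dim)]
    order = sorted(range(dim), key=lambda f: scores[f], reverse=True)
    return order, scores


def make_data(n, dim, shifted, delta, rng):
    """Hypothetical benchmark: unit Gaussians, mean `delta` on `shifted`."""
    return [[rng.gauss(delta if f in shifted else 0.0, 1.0)
             for f in range(dim)] for _ in range(n)]


rng = random.Random(42)
sample_a = make_data(100, 20, set(), 0.0, rng)
sample_b = make_data(100, 20, {3, 7}, 2.5, rng)  # shift confined to features 3 and 7

statistic, p_value = knn_shift_test(sample_a, sample_b)
order, scores = rank_features(sample_a, sample_b)
print(f"same-label neighbour fraction: {statistic:.3f}, p-value: {p_value:.4f}")
print("top-ranked features:", order[:2])
```

Ranking features one at a time is the simplest possible attribution and would miss a shift that is visible only in the joint distribution of several features; recovering such subspaces would require scoring feature subsets rather than single features.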