Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving
The work addresses the need for reliable safety monitors in complex vision tasks such as autonomous driving, though its contribution is incremental: it builds on existing Vision Foundation Models (VFMs) and density modeling techniques.
The paper tackles the problem of detecting out-of-distribution (OOD) inputs in autonomous driving by proposing a framework that uses Vision Foundation Models (VFMs) with density modeling, achieving improved OOD classification performance compared to state-of-the-art methods.
Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): robustness against as-yet-unknown objects (semantic shift) or styles such as lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors that identify out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches to OOD classification are untested in complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies the detection of semantic and covariate shifts: fit a full model of the training data's feature distribution, then use its density at new points as an in-distribution (ID) score. To implement this, we propose combining Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of the OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks; filtering these inputs out improves overall downstream performance. Overall, VFMs, when coupled with robust density modeling techniques, are a promising basis for model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
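The core idea in the abstract (model the density of ID training features, then score new inputs by that density) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the features are synthetic random vectors standing in for VFM embeddings, the density model is a diagonal Gaussian chosen for simplicity (the source does not say which five density-modeling techniques were benchmarked), and the helper names and the 95% ID keep rate are hypothetical.

```python
import math
import random


def fit_diag_gaussian(features):
    """Estimate per-dimension mean and variance from ID training features."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [
        sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6  # variance floor
        for d in range(dim)
    ]
    return mean, var


def log_density(x, mean, var):
    """Log-density of x under the fitted diagonal Gaussian, used as the ID score."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def fit_threshold(scores, id_keep_rate=0.95):
    """Return the score below which inputs are flagged as OOD.

    The threshold is set so that roughly `id_keep_rate` of the ID training
    scores lie at or above it (an assumed, illustrative operating point).
    """
    ordered = sorted(scores)
    idx = int((1.0 - id_keep_rate) * len(ordered))
    return ordered[idx]


# Synthetic stand-ins for VFM embeddings (assumption: a real pipeline would
# obtain these from a frozen feature extractor applied to camera images).
rng = random.Random(0)
dim = 8
id_train = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(500)]
ood_test = [[rng.gauss(4.0, 1.0) for _ in range(dim)] for _ in range(100)]

# Unsupervised fit: only ID training features are used, no OOD samples.
mean, var = fit_diag_gaussian(id_train)
train_scores = [log_density(x, mean, var) for x in id_train]
threshold = fit_threshold(train_scores)


def is_ood(x):
    """Flag an input as OOD when its ID score falls below the threshold."""
    return log_density(x, mean, var) < threshold


flagged_ood = sum(is_ood(x) for x in ood_test)
flagged_id = sum(is_ood(x) for x in id_train)
print(f"flagged {flagged_ood}/{len(ood_test)} shifted inputs, "
      f"{flagged_id}/{len(id_train)} ID training inputs")
```

Because the monitor only sees features, the same scoring code applies regardless of which downstream model consumes the images, which is the sense in which such a monitor is model-agnostic. A more expressive density model could replace the diagonal Gaussian without changing the thresholding logic.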