CV | Mar 24, 2025

Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang

arXiv:2503.18325v1 | 23 citations | h-index: 5 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work targets anomaly detection for applications such as industrial inspection. It provides a training-free, unified framework that also handles compositional anomalies, though its originality is incremental because it builds on existing foundation models.

The paper detects both logical and structural anomalies in images without any training by combining vision and language foundation models, and it reports state-of-the-art results even when compared with supervised approaches.

Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
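The abstract mentions a calibration module that aligns anomaly scores from different detectors before they are integrated, but it does not say how. The sketch below illustrates one common way to do this and is not taken from the paper or the LogSAD code: it assumes each detector's raw scores are standardized with statistics from a few normal reference images, then fused per image with a maximum. The function names `calibrate` and `fuse`, and all score values, are hypothetical.

```python
from statistics import mean, pstdev


def calibrate(scores, reference):
    """Standardize raw scores with the mean and spread of scores
    measured on normal reference images (assumed calibration scheme)."""
    mu = mean(reference)
    sigma = pstdev(reference) or 1.0  # fall back to 1.0 if the spread is zero
    return [(s - mu) / sigma for s in scores]


def fuse(calibrated_per_detector, how="max"):
    """Combine calibrated scores image by image across detectors."""
    combine = max if how == "max" else mean
    return [combine(column) for column in zip(*calibrated_per_detector)]


# Hypothetical raw scores: the two detectors live on different scales.
structural_ref = [0.10, 0.12, 0.11, 0.09]  # normal images, structural detector
logical_ref = [5.0, 6.0, 5.5, 5.5]         # normal images, logical detector
structural_test = [0.11, 0.30]             # second image looks structurally odd
logical_test = [9.0, 5.5]                  # first image looks logically odd

fused = fuse([
    calibrate(structural_test, structural_ref),
    calibrate(logical_test, logical_ref),
])
print(fused)
```

After calibration both detectors express their scores in units of "standard deviations above normal", so an anomaly flagged by only one detector still yields a high fused score. Taking the maximum is one possible integration strategy; averaging (`how="mean"`) is another.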
