SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection
This addresses the need for scalable and accurate anomaly detection in industrial settings, representing an incremental improvement over existing vision-language models.
The paper tackled the problem of visual anomaly detection in industrial manufacturing by introducing a novel window self-attention mechanism based on CLIP, achieving superior performance by leading in 18 out of 20 metrics on five benchmark datasets.
Visual anomaly detection is essential in industrial manufacturing, yet traditional methods often rely heavily on extensive normal datasets and task-specific models, limiting their scalability. Recent advancements in large-scale vision-language models have significantly enhanced zero- and few-shot anomaly detection. However, these approaches may not fully leverage hierarchical features, potentially overlooking nuanced details crucial for accurate detection. To address this, we introduce a novel window self-attention mechanism based on the CLIP model, augmented with learnable prompts to process multi-level features within a Soldier-Officer Window Self-Attention (SOWA) framework. Our method has been rigorously evaluated on five benchmark datasets, achieving superior performance by leading in 18 out of 20 metrics, setting a new standard against existing state-of-the-art techniques.