ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection
This work improves anomaly detection for industrial and medical applications by enhancing VLMs for dense perception tasks, though it appears incremental as it builds on existing adaptation methods.
The paper tackles Zero-Shot Anomaly Detection (ZSAD) by addressing the adaptation gap in pre-trained Vision-Language Models (VLMs) with an Architectural Co-Design framework, and reports superior accuracy and robustness on diverse industrial and medical benchmarks.
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. We propose a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter that injects local inductive biases for fine-grained representation, and a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
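The abstract names the two components but does not give their equations, so the following is only a minimal, dependency-free sketch of how such modules are commonly built, not the authors' implementation. It assumes that Conv-LoRA means a frozen projection plus a low-rank update with a 3x3 convolution applied inside the rank-r bottleneck, and that the DFG can be approximated by a sigmoid gate computed from mean-pooled patch features. All function names, parameter names, and shapes (conv_lora, gated_prompt, w0, a, b, kernel, gate_w) are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def depthwise_conv3x3(grid, kernel):
    """Apply one shared 3x3 kernel to every channel of an H x W x C grid (zero padding)."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        k = kernel[di + 1][dj + 1]
                        for ch in range(c):
                            out[i][j][ch] += k * grid[ni][nj][ch]
    return out


def conv_lora(grid, w0, a, b, kernel):
    """Assumed Conv-LoRA form: w0 @ x + b @ conv(a @ x) for each patch token x.

    grid: H x W x d patch tokens; w0: d x d frozen weight;
    a: r x d down-projection; b: d x r up-projection; kernel: 3x3 weights.
    """
    down = [[matvec(a, tok) for tok in row] for row in grid]  # d -> r
    mixed = depthwise_conv3x3(down, kernel)                   # local spatial mixing
    out = []
    for row_tok, row_mix in zip(grid, mixed):
        out.append([
            [f + u for f, u in zip(matvec(w0, tok), matvec(b, m))]  # r -> d, added to frozen path
            for tok, m in zip(row_tok, row_mix)
        ])
    return out


def gated_prompt(text_emb, grid, gate_w):
    """Assumed gate: scale a text embedding by sigmoid(gate_w @ mean-pooled patch features)."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    context = [
        sum(grid[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
        for ch in range(d)
    ]
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(gate_w, context)]
    return [t * g for t, g in zip(text_emb, gate)]
```

With b initialised to zeros, as is conventional for LoRA-style adapters, conv_lora reduces to the frozen projection, so training starts from the pre-trained model's behaviour. The gate lets the same text prompt take a different embedding for each image, which is one simple way to make the text side depend on visual context.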