CVNov 5, 2023

GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

arXiv:2311.02612v238 citationsh-index: 34Has Code
Originality Synthesis-oriented
AI Analysis

This work provides a baseline for applying VQA-oriented large multimodal models to zero-shot anomaly detection, which is incremental as it adapts an existing model to a new task without major methodological innovation.

The paper explores using GPT-4V with a Visual Question Answering (VQA) approach for zero-shot anomaly detection, achieving image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on MVTec AD and VisA datasets, but it falls short of state-of-the-art methods like WinCLIP and CLIP-AD.

Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: \textbf{\textit{1)}} Granular Region Division, \textbf{\textit{2)}} Prompt Designing, \textbf{\textit{3)}} Text2Segmentation for easy quantitative evaluation, and have made some different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to the state-of-the-art zero-shot method, \eg, WinCLIP and CLIP-AD, and further researches are needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also post several possible future works. Code is available at \url{https://github.com/zhangzjn/GPT-4V-AD}.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes