Zewen Li

CV
h-index16
8papers
4,278citations
Novelty29%
AI Score43

8 Papers

CVOct 16, 2025Code
IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

Zewen Li, Zitong Yu, Qilang Ye et al.

The robust causal capability of Multimodal Large Language Models (MLLMs) hold the potential of detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods lack the ability to provide multi-turn human-machine dialogues and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pre-trained models have not fully stimulated the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose Text-Guided Enhancer, wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate mask as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks, such as MVTec-AD and VisA datasets. The codes are available at \href{https://github.com/LiZeWen1225/IAD-GPT}{https://github.com/LiZeWen1225/IAD-GPT}.

CVApr 24
Text-Guided Multimodal Unified Industrial Anomaly Detection

Zewen Li, Shuo Ye, Zitong Yu et al.

Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.

CVJul 28, 2021
VirtualConductor: Music-driven Conducting Video Generation System

Delong Chen, Fan Liu, Zewen Li et al.

In this demo, we present VirtualConductor, a system that can generate conducting video from any given music and a single user's image. First, a large-scale conductor motion dataset is collected and constructed. Then, we propose Audio Motion Correspondence Network (AMCNet) and adversarial-perceptual learning to learn the cross-modal relationship and generate diverse, plausible, music-synchronized motion. Finally, we combine 3D animation rendering and a pose transfer model to synthesize conducting video from a single given user's image. Therefore, any user can become a virtual conductor through the system.

LGJul 20, 2021
Significant Wave Height Prediction based on Wavelet Graph Neural Network

Delong Chen, Fan Liu, Zheqi Zhang et al.

Computational intelligence-based ocean characteristics forecasting applications, such as Significant Wave Height (SWH) prediction, are crucial for avoiding social and economic loss in coastal cities. Compared to the traditional empirical-based or numerical-based forecasting models, "soft computing" approaches, including machine learning and deep learning models, have shown numerous success in recent years. In this paper, we focus on enabling the deep learning model to learn both short-term and long-term spatial-temporal dependencies for SWH prediction. A Wavelet Graph Neural Network (WGNN) approach is proposed to integrate the advantages of wavelet transform and graph neural network. Several parallel graph neural networks are separately trained on wavelet decomposed data, and the reconstruction of each model's prediction forms the final SWH prediction. Experimental results show that the proposed WGNN approach outperforms other models, including the numerical models, the machine learning models, and several deep learning models.

LGFeb 1, 2021
NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning

Rongjun Qin, Songyi Gao, Xingyuan Zhang et al.

Offline reinforcement learning (RL) aims at learning a good policy from a batch of collected data, without extra interactions with the environment during training. However, current offline RL benchmarks commonly have a large reality gap, because they involve large datasets collected by highly exploratory policies, and the trained policy is directly evaluated in the environment. In real-world situations, running a highly exploratory policy is prohibited to ensure system safety, the data is commonly very limited, and a trained policy should be well validated before deployment. In this paper, we present a near real-world offline RL benchmark, named NeoRL, which contains datasets from various domains with controlled sizes, and extra test datasets for policy validation. We evaluate existing offline RL algorithms on NeoRL and argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward. The empirical results demonstrate that the tested offline RL algorithms become less competitive to the deterministic policy on many datasets, and the offline policy evaluation hardly helps. The NeoRL suit can be found at http://polixir.ai/research/neorl. We hope this work will shed some light on future research and draw more attention when deploying RL in real-world systems.

IVJun 9, 2020
A Review of Automated Diagnosis of COVID-19 Based on Scanning Images

Delong Chen, Shunhui Ji, Fan Liu et al.

The pandemic of COVID-19 has caused millions of infections, which has led to a great loss all over the world, socially and economically. Due to the false-negative rate and the time-consuming of the conventional Reverse Transcription Polymerase Chain Reaction (RT-PCR) tests, diagnosing based on X-ray images and Computed Tomography (CT) images has been widely adopted. Therefore, researchers of the computer vision area have developed many automatic diagnosing models based on machine learning or deep learning to assist the radiologists and improve the diagnosing accuracy. In this paper, we present a review of these recently emerging automatic diagnosing models. 70 models proposed from February 14, 2020, to July 21, 2020, are involved. We analyzed the models from the perspective of preprocessing, feature extraction, classification, and evaluation. Based on the limitation of existing models, we pointed out that domain adaption in transfer learning and interpretability promotion would be the possible future directions.

CVJun 9, 2020
Deep Learning Based Single Sample Per Person Face Recognition: A Survey

Fan Liu, Delong Chen, Fei Wang et al.

Face recognition has long been an active research area in the field of artificial intelligence, particularly since the rise of deep learning in recent years. In some practical situations, each identity has only a single sample available for training. Face recognition under this situation is referred to as single sample face recognition and poses significant challenges to the effective training of deep models. Therefore, in recent years, researchers have attempted to unleash more potential of deep learning and improve the model recognition performance in the single sample situation. While several comprehensive surveys have been conducted on traditional single sample face recognition approaches, emerging deep learning based methods are rarely involved in these reviews. Accordingly, we focus on the deep learning-based methods in this paper, classifying them into virtual sample methods and generic learning methods. In the former category, virtual images or virtual features are generated to benefit the training of the deep model. In the latter one, additional multi-sample generic sets are used. There are three types of generic learning methods: combining traditional methods and deep features, improving the loss function, and improving network structure, all of which are covered in our analysis. Moreover, we review face datasets that have been commonly used for evaluating single sample face recognition models and go on to compare the results of different types of models. Additionally, we discuss problems with existing single sample face recognition methods, including identity information preservation in virtual sample methods, domain adaption in generic learning methods. Furthermore, we regard developing unsupervised methods is a promising future direction, and point out that the semantic gap as an important issue that needs to be further considered.

CVApr 1, 2020
A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects

Zewen Li, Wenjie Yang, Shouheng Peng et al.

Convolutional Neural Network (CNN) is one of the most significant networks in the deep learning field. Since CNN made impressive achievements in many areas, including but not limited to computer vision and natural language processing, it attracted much attention both of industry and academia in the past few years. The existing reviews mainly focus on the applications of CNN in different scenarios without considering CNN from a general perspective, and some novel ideas proposed recently are not covered. In this review, we aim to provide novel ideas and prospects in this fast-growing field as much as possible. Besides, not only two-dimensional convolution but also one-dimensional and multi-dimensional ones are involved. First, this review starts with a brief introduction to the history of CNN. Second, we provide an overview of CNN. Third, classic and advanced CNN models are introduced, especially those key points making them reach state-of-the-art results. Fourth, through experimental analysis, we draw some conclusions and provide several rules of thumb for function selection. Fifth, the applications of one-dimensional, two-dimensional, and multi-dimensional convolution are covered. Finally, some open issues and promising directions for CNN are discussed to serve as guidelines for future work.