CV · Dec 30, 2025

Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

arXiv:2512.24160v2 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable, data-efficient quality inspection in manufacturing. It is an incremental advance, since it builds on existing multimodal and foundation-model approaches.

The paper tackles industrial defect understanding by introducing IMDD-1M, a large-scale multimodal dataset of 1,000,000 image-text pairs, and by training a diffusion-based vision-language foundation model on it. After lightweight fine-tuning, the model achieves performance comparable to dedicated expert models while using less than 5% of the task-specific data those models require.

We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. This highlights the potential of data-efficient foundation model adaptation for industrial inspection and generation and paves the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence. Additional details and resources can be found at https://ninaneon.github.io/projectpage/

