CV, CL, CR | Sep 4, 2025

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

arXiv:2509.04403v1 | 2 citations | h-index: 9 | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the need for dataset construction methods that cover complex safety challenges in multimodal large language models, though it appears incremental because it builds on existing dataset construction paradigms.

The paper tackles the problem of constructing datasets for real-world multimodal safety scenarios by introducing an image-oriented self-adaptive method that generates 35k image-text pairs with guidance responses, and it demonstrates the approach's scalability and effectiveness through experiments.

Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). Moreover, because there is no unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.
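
The evaluation idea in the abstract (fine-tune a safety judge on the dataset under test, then measure it on other safety datasets) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name a scoring function, so per-benchmark accuracy with a macro average is an assumption, and `SafetyExample`, `Judge`, `judge_accuracy`, and `score_training_set` are hypothetical names. The judge is treated as an already fine-tuned black box.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SafetyExample:
    """One labeled item from a held-out safety dataset (hypothetical schema)."""
    image_path: str
    text: str
    is_unsafe: bool


# Assumption: a judge maps (image_path, text) to a predicted "unsafe" flag.
Judge = Callable[[str, str], bool]


def judge_accuracy(judge: Judge, examples: List[SafetyExample]) -> float:
    """Fraction of examples where the judge's verdict matches the label."""
    if not examples:
        raise ValueError("cannot score an empty dataset")
    correct = sum(judge(ex.image_path, ex.text) == ex.is_unsafe for ex in examples)
    return correct / len(examples)


def score_training_set(
    judge: Judge, benchmarks: Dict[str, List[SafetyExample]]
) -> Dict[str, float]:
    """Score a judge on each external benchmark and add a macro average.

    The judge is assumed to have been fine-tuned on the dataset being evaluated,
    so higher scores on unseen benchmarks suggest a more useful training set.
    """
    if not benchmarks:
        raise ValueError("need at least one benchmark")
    scores = {name: judge_accuracy(judge, exs) for name, exs in benchmarks.items()}
    scores["macro_avg"] = sum(scores.values()) / len(benchmarks)
    return scores
```

Under this framing, two candidate training sets would be compared by fine-tuning one judge per set and comparing their `macro_avg` values on the same external benchmarks.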

