The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning
This work addresses safety concerns for users of MLLMs in multi-image tasks and highlights a critical risk that emerges as capabilities advance, though its contribution is incremental because it focuses on one specific safety aspect.
The paper tackles the safety risks that arise as Multimodal Large Language Models (MLLMs) gain stronger multi-image reasoning capabilities. Using a new benchmark to evaluate 19 models, it shows that more advanced models can be more vulnerable, and that many responses labeled as safe are only superficially so.
As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at https://github.com/thu-coai/MIR-SafetyBench.
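To make the attention-entropy observation concrete, the following is a minimal sketch of how the entropy of an attention distribution could be computed. It assumes the standard Shannon entropy over a normalized row of attention weights. The abstract does not state which layers, heads, or tokens the authors aggregate over, so the function names `attention_entropy` and `mean_attention_entropy` and the simple averaging are illustrative assumptions, not the paper's procedure.

```python
import math
from typing import Iterable, Sequence


def attention_entropy(weights: Sequence[float], eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of a single attention distribution.

    `weights` is one row of non-negative attention scores; it is
    renormalized here so that it sums to 1 before the entropy is taken.
    """
    if any(w < 0 for w in weights):
        raise ValueError("attention weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")

    entropy = 0.0
    for w in weights:
        p = w / total
        if p > eps:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy


def mean_attention_entropy(rows: Iterable[Sequence[float]]) -> float:
    """Average entropy over several attention rows (hypothetical aggregation)."""
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one attention row")
    return sum(attention_entropy(r) for r in rows) / len(rows)


# Toy distributions: attention spread evenly vs. concentrated on one token.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print("uniform:", attention_entropy(uniform))
print("peaked: ", attention_entropy(peaked))
```

Under this definition, a peaked distribution yields lower entropy than a uniform one, which is the sense in which lower attention entropy would correspond to a model concentrating on a narrow part of its input.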