Is GPT-4o mini Blinded by its Own Safety Filters? Exposing the Multimodal-to-Unimodal Bottleneck in Hate Speech Detection
This exposes a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies for AI deployment.
The paper systematically analyzed OpenAI's GPT-4o mini on multimodal hate speech detection, identifying a 'Unimodal Bottleneck' where safety filters preempt multimodal reasoning, with 144 content policy refusals triggered equally by visual (50%) and textual (50%) content.
As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI Alignment. This paper presents a systematic analysis of OpenAI's GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model's reasoning and failure modes. Our central finding is the experimental identification of a "Unimodal Bottleneck," an architectural flaw where the model's advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual 50% and textual 50% content. We further demonstrate that this safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies to ensure AI systems can be deployed both safely and effectively.