CL, AI | Dec 17, 2025

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

arXiv:2512.15052v3 | 1 citation | Has Code
Originality: Incremental advance
AI Analysis

This work addresses safety risks in multimodal generation that affect both users and developers, and it offers an interpretable, low-cost solution. It is an incremental advance, since it builds on existing detoxification methods.

The paper tackles the problem of toxic, biased, and NSFW signals in multimodal large language models (MLLMs) by proposing SGM, a neuron-level intervention that selectively recalibrates toxic neurons, reducing harmful rates from 48.2% to 2.5% while preserving model performance.

Disclaimer: Samples in this paper may be harmful and cause discomfort.

Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
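The abstract does not give the formula behind "expertise-weighted soft suppression", so the following is only a minimal sketch of the general idea under stated assumptions: each neuron has an expertise score in [0, 1] (how strongly it is associated with toxic content), the "toxic expert neurons" are taken to be the top-k by that score, and each selected activation is multiplied by `1 - strength * expertise`. The function name `soft_suppress`, the parameters `top_k` and `strength`, and the scaling rule are all hypothetical and are not taken from the paper. The sketch only rescales activations at inference time, which matches the abstract's claim that no parameters are updated.

```python
import numpy as np

def soft_suppress(activations, expertise, top_k=2, strength=1.0):
    """Dampen the top_k highest-expertise neurons; leave all others untouched.

    Assumed (not from the paper): expertise scores lie in [0, 1] and the
    multiplier for a selected neuron is 1 - strength * expertise.
    """
    activations = np.asarray(activations, dtype=float)
    expertise = np.asarray(expertise, dtype=float)
    # Hypothetical selection rule: indices of the top_k expertise scores.
    toxic_idx = np.argsort(expertise)[-top_k:]
    scale = np.ones_like(expertise)
    # Soft suppression: higher expertise gives a smaller multiplier,
    # clipped to [0, 1] so activations are never amplified or sign-flipped.
    scale[toxic_idx] = np.clip(1.0 - strength * expertise[toxic_idx], 0.0, 1.0)
    # Broadcasts over a leading batch dimension; model weights are not modified.
    return activations * scale

# Toy example: one token, four neurons, made-up expertise scores.
acts = np.array([[2.0, -1.0, 0.5, 4.0]])
expertise = np.array([0.9, 0.1, 0.2, 0.5])
damped = soft_suppress(acts, expertise, top_k=2)
print(damped)
```

In this toy example only the two neurons with the highest expertise scores (the first and the last) are scaled down; the other two pass through unchanged. In a real MLLM such a function would typically be applied through a forward hook on selected layers, which is again an assumption about deployment rather than something the abstract states.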
