Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models

arXiv:2601.17082v1
h-index: 10
Originality: Highly original
AI Analysis

This work addresses the problem of ensuring reliable ethical behavior in VLMs for safe deployment, highlighting a critical vulnerability that alignment alone does not cover.

This study investigates the stability of moral judgments in Vision-Language Models (VLMs) under multimodal perturbations. It finds that their moral stances are highly fragile and frequently flip under simple manipulations, and that lightweight inference-time interventions partially restore stability.

Despite substantial efforts toward improving the moral alignment of Vision-Language Models (VLMs), it remains unclear whether their ethical judgments are stable in realistic settings. This work studies moral robustness in VLMs, defined as the ability to preserve moral judgments under textual and visual perturbations that do not alter the underlying moral context. We systematically probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile, frequently flipping under simple manipulations. Our analysis reveals systematic vulnerabilities across perturbation types, moral domains, and model scales, including a sycophancy trade-off where stronger instruction-following models are more susceptible to persuasion. We further show that lightweight inference-time interventions can partially restore moral stability. These results demonstrate that moral alignment alone is insufficient and that moral robustness is a necessary criterion for the responsible deployment of VLMs.
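To make the notion of a "flipped" moral stance concrete, here is a minimal sketch of one way such fragility could be quantified: the fraction of items whose judgment changes after a perturbation that leaves the moral context intact. The function name `flip_rate`, the label vocabulary, and the example data are all illustrative assumptions; the abstract does not specify the paper's actual metric or data.

```python
from typing import Sequence


def flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose judgment differs after perturbation.

    Hypothetical helper for illustration; not the paper's metric.
    """
    if len(original) != len(perturbed):
        raise ValueError("judgment lists must have the same length")
    if not original:
        raise ValueError("need at least one judgment pair")
    # Count positions where the perturbed judgment disagrees with the original.
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)


# Made-up labels, for illustration only (not results from the paper).
base = ["wrong", "wrong", "acceptable", "wrong"]
after = ["wrong", "acceptable", "acceptable", "acceptable"]
print(flip_rate(base, after))  # 0.5
```

Under this reading, a perfectly robust model would score 0.0 on every perturbation type, and higher values indicate less stable judgments.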
