CYAIDec 18, 2025

The Violation State: Safety State Persistence in a Multimodal Language Model Interface

arXiv:2601.06049v1
Originality Synthesis-oriented
AI Analysis

This incremental finding highlights a potential reliability issue in multimodal AI systems, affecting user experience and safety design for developers and users.

The study identified a reproducible safety-state persistence effect in ChatGPT's multimodal interface, where an initial refusal to remove a watermark from a copyrighted image led to a 96.67% refusal rate for subsequent unrelated image-generation requests, while text-only requests remained unaffected.

Multimodal AI systems integrate text generation, image generation, and other capabilities within a single conversational interface. These systems employ safety mechanisms to prevent disallowed actions, including the removal of watermarks from copyrighted images. While single-turn refusals are expected, the interaction between safety filters and conversation-level state is not well understood. This study documents a reproducible behavioral effect in the ChatGPT (GPT-5.1) web interface. Manual execution was chosen to capture the exact user-facing safety behavior of the production system, rather than isolated API components. When a conversation begins with an uploaded copyrighted image and a request to remove a watermark, which the model correctly refuses, subsequent prompts to generate unrelated, benign images are refused for the remainder of the session. Importantly, text-only requests (e.g., generating a Python function) continue to succeed. Across 40 manually run sessions (30 contaminated and 10 controls), contaminated threads showed 116/120 image-generation refusals (96.67%), while control threads showed 0/40 refusals (Fisher's exact p < 0.0001). All sessions used an identical fixed prompt order, ensuring sequence uniformity across conditions. We describe this as safety-state persistence: a form of conversational over-generalization in which a copyright refusal influences subsequent, unrelated image-generation behavior. We present these findings as behavioral observations, not architectural claims. We discuss possible explanations, methodological limitations (single model, single interface), and implications for multimodal reliability, user experience, and the design of session-level safety systems. These results motivate further examination of session-level safety interactions in multimodal AI systems.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes