LG · Sep 15, 2025

Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

arXiv:2509.12521v1 · 2 citations · h-index: 10 · Has Code · EMNLP
Originality: Highly original
AI Analysis

This paper addresses a safety problem for users of MLLMs: it reveals a subtle attack vector that biases model outputs without producing overtly harmful content. That makes it a novel security concern rather than an incremental improvement.

The paper tackles a safety risk of Multimodal Large Language Models (MLLMs): optimized images can manipulate a model's output preferences, producing contextually relevant but biased responses that are hard to detect. Experiments across various tasks demonstrate the effectiveness of the attack.

Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
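
To make the idea of a "carefully optimized image" concrete, here is a minimal sketch of a bounded perturbation search against a toy preference score. It is not the authors' implementation: the linear scorer `preference_margin`, the function `hijack_image`, and the parameters `eps`, `step`, and `iters` are all hypothetical, and the L-infinity projected sign-gradient update is an assumed, generic choice that the abstract does not specify. A real attack would replace the toy scorer with the gap between an MLLM's log-likelihoods for the attacker-preferred and the opposite response.

```python
import numpy as np


def preference_margin(image, w, b):
    """Toy stand-in for log p(target | image) - log p(opposite | image)."""
    return float(image.ravel() @ w + b)


def hijack_image(image, w, b, eps=8 / 255, step=1 / 255, iters=50):
    """Search for a small perturbation that raises the toy preference margin.

    Assumed setup: sign-gradient ascent on log(sigmoid(margin)), with the
    perturbation projected back into an L-infinity ball of radius eps and
    the perturbed image kept inside the valid pixel range [0, 1].
    """
    delta = np.zeros_like(image)
    for _ in range(iters):
        margin = preference_margin(np.clip(image + delta, 0.0, 1.0), w, b)
        # d/dx log(sigmoid(m)) = (1 - sigmoid(m)) * w for the linear scorer.
        # The pixel-range clip is ignored in the gradient (an approximation).
        grad = (1.0 - 1.0 / (1.0 + np.exp(-margin))) * w.reshape(image.shape)
        # Ascent step, then projection onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
        # Keep the perturbed image a valid image.
        delta = np.clip(image + delta, 0.0, 1.0) - image
    return image + delta


# Hypothetical usage on random data.
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(8, 8, 3))  # clean "image" in [0, 1]
w = rng.normal(size=image.size)                # toy scorer weights
b = -1.0                                       # toy scorer bias
adv = hijack_image(image, w, b)

print("margin before:", preference_margin(image, w, b))
print("margin after: ", preference_margin(adv, w, b))
print("max pixel change:", np.abs(adv - image).max())
```

The perturbation stays within `eps` per pixel, so the modified image remains close to the original while the toy preference score moves toward the attacker's target. Optimizing one shared `delta` over many images, rather than per image, would be one way to approximate the universal perturbation described above, though that is likewise an assumption here.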

Code Implementations: 1 repo
