CV · Nov 14, 2025

Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?

arXiv:2511.11216v13.6 · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses positional bias in multimodal representation models, an issue that has so far received little attention. The contribution is incremental: it extends known positional-bias research from text generation to multimodal contexts.

The study investigates positional bias in multimodal embedding models for image-text retrieval and finds that the bias is prevalent but differs between modalities: text encoders favor the beginning of the input, while image encoders favor both the beginning and the end.

Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.
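
The distinction between context importance and positional bias can be illustrated with a small probe. The following is a minimal sketch under stated assumptions, not the paper's protocol: `token_vector`, `toy_encode`, `position_sensitivity`, the `[MASK]` placeholder, and the `decay` weighting are all hypothetical names and choices introduced here for illustration. The toy encoder stands in for a real text or image encoder. The probe masks one segment at a time and averages the resulting embedding change over all cyclic rotations of the input, so every segment visits every position once. Whatever differs between positions after that averaging can be attributed to position rather than to content.

```python
import hashlib
import math

DIM = 16
MASK = "[MASK]"  # hypothetical neutral placeholder segment
SEGMENTS = ["a dog", "on a beach", "at sunset", "with a ball", "near the pier"]


def token_vector(segment):
    """Deterministic pseudo-random vector per segment (stand-in for learned features)."""
    digest = hashlib.sha256(segment.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def toy_encode(segments, decay=0.5):
    """Hypothetical encoder: weighted mean of segment vectors.

    Position i gets weight decay**i, so decay < 1 builds in a bias toward
    the beginning, while decay == 1.0 is order-invariant (no positional bias).
    """
    weights = [decay ** i for i in range(len(segments))]
    total = sum(weights)
    out = [0.0] * DIM
    for w, seg in zip(weights, segments):
        vec = token_vector(seg)
        for k in range(DIM):
            out[k] += w * vec[k] / total
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def position_sensitivity(segments, encode):
    """Mean cosine distance caused by masking each position.

    Averaging over all cyclic rotations places every segment at every
    position once, which cancels out per-segment (content) importance.
    """
    n = len(segments)
    scores = [0.0] * n
    for r in range(n):
        rotated = segments[r:] + segments[:r]
        full = encode(rotated)
        for p in range(n):
            masked = rotated[:p] + [MASK] + rotated[p + 1:]
            scores[p] += (1.0 - cosine(full, encode(masked))) / n
    return scores


if __name__ == "__main__":
    for decay in (0.5, 1.0):
        scores = position_sensitivity(SEGMENTS, lambda s: toy_encode(s, decay=decay))
        print(decay, [round(x, 4) for x in scores])
```

With `decay=0.5` the scores fall off toward the end of the input, which is the signature of a beginning-of-input bias. With `decay=1.0` the scores are flat across positions, even though individual segments still differ in how much they matter. To probe a real model, `encode` would be replaced by a call into that model's text or image encoder, with segments corresponding to sentences or image regions.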
