The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Akshay Paruchuri, Ishan Chatterjee, Henry Fuchs, Ehsan Adeli, Piotr Didyk

Stanford

arXiv:2604.14363

AI Analysis

Identifies and corrects a structural imbalance in multimodal models, offering a diagnostic and inference-time fix for practitioners.

Multimodal language models underperform on visual tasks due to language representations overshadowing vision. The authors propose centroid replacement to probe this imbalance and use text centroid contrastive decoding to recover up to +16.9% accuracy on individual tasks.

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. The benefit of this intervention varies meaningfully with training approach: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
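The two operations named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say which layer the tokens are taken from, how many centroids are used, or what the exact contrastive rule is, so the function names (`kmeans`, `centroid_replace`, `contrastive_logits`), the plain Lloyd's K-means, and the `(1 + alpha) * full - alpha * erased` combination (a common contrastive-decoding form) are all assumptions made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on float rows of x; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(0)
    return centroids


def centroid_replace(tokens, centroids):
    """Collapse each token embedding to its nearest centroid."""
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids[d2.argmin(1)]


def contrastive_logits(full_logits, erased_logits, alpha=1.0):
    """Assumed contrastive rule: boost what the intact input contributes
    relative to the text-centroid-erased reference."""
    return (1.0 + alpha) * full_logits - alpha * erased_logits


# Hypothetical usage on random stand-in "text token" embeddings.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(64, 8))
centroids = kmeans(text_tokens, k=4)
erased_tokens = centroid_replace(text_tokens, centroids)
```

In a real model, `erased_tokens` would be fed through a second forward pass to produce the reference logits that `contrastive_logits` subtracts. That pass is omitted here because it depends on the specific model.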
