Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
This paper identifies a fundamental information-theoretic bottleneck in multimodal LLMs, explaining why fine-grained modality-specific information remains inaccessible to the decoder even though it is encoded.
Multimodal LLMs struggle with fine-grained details such as speaker voice or object texture, not because the encoder fails to capture them, but because their text-trained decoders cannot use this information. The authors formalize this as a mismatched decoder problem, show that the decoder's scoring rule, rather than the encoder, is the bottleneck, and demonstrate that a targeted training objective can improve the accessibility of the targeted attribute.
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
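The gap between "encoded" and "accessible" described above can be illustrated with a toy linear-probe sketch. This is not the paper's experimental setup: the data are synthetic, and the hidden size, the split into "text-aligned" and remaining dimensions, and the helper `probe_accuracy` are all hypothetical choices made for illustration. The sketch assumes a binary attribute that is written into a single direction outside the text-aligned subspace, then compares a probe that sees the full representation with one restricted to the text-aligned dimensions.

```python
import numpy as np


def probe_accuracy(features, labels, train_frac=0.5):
    """Fit a least-squares linear probe on a train split; return test accuracy."""
    n_train = int(len(labels) * train_frac)
    # Append a constant column so the probe can learn a bias term.
    design = np.hstack([features, np.ones((len(labels), 1))])
    # Map labels {0, 1} to regression targets {-1, +1}.
    targets = 2.0 * labels - 1.0
    weights, *_ = np.linalg.lstsq(design[:n_train], targets[:n_train], rcond=None)
    preds = (design[n_train:] @ weights > 0).astype(int)
    return float((preds == labels[n_train:]).mean())


rng = np.random.default_rng(0)

# Hypothetical sizes: samples, hidden width, number of "text-aligned" dims.
n_samples, hidden_dim, text_dims = 2000, 16, 8

# Toy binary attribute (think of it as a stand-in for a speaker label).
labels = rng.integers(0, 2, size=n_samples)

# Synthetic hidden states; the attribute is written into the LAST dimension,
# which by construction lies outside the first `text_dims` dimensions.
hidden = rng.normal(size=(n_samples, hidden_dim))
hidden[:, -1] += 3.0 * (2 * labels - 1)

# Probe with access to every direction vs. only the text-aligned ones.
full_acc = probe_accuracy(hidden, labels)
text_only_acc = probe_accuracy(hidden[:, :text_dims], labels)

print(f"probe on full representation:   {full_acc:.3f}")
print(f"probe on text-aligned subspace: {text_only_acc:.3f}")
```

In this toy setting the full probe should recover the attribute almost perfectly, while the restricted probe should sit near chance (0.5). That mirrors, in a deliberately simplified linear form, the claim that the information is present in the representation but lies along directions a text-aligned readout does not use.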