CL LG COMP-PH | May 21, 2025

A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

arXiv:2505.17101v3 | 5 citations | h-index: 9

Originality: Incremental advance

AI Analysis

This work provides a method for analyzing semantic alignment in AI models. The advance is incremental but useful for researchers in multimodal AI and representation learning.

The authors tackle the problem of quantifying semantic information in deep representations of text and images. They find that a larger language model (DeepSeek-V3) extracts more general information than a smaller one (Llama3.1-8B), and that the semantic layers of both text and vision models encode cross-domain relationships with significant asymmetries.

Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. Moreover, we find that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
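
The abstract does not say which estimator is used to measure "relative information content". As an illustration only, the sketch below uses one rank-based statistic with the asymmetric behaviour described above, the information imbalance: for each sample, find its nearest neighbour in representation A and record that neighbour's distance rank in representation B. The function names, the Euclidean metric, and the pure-Python implementation are assumptions made for this example, not the authors' code.

```python
# Illustrative sketch only: the estimator, metric and names are assumptions.
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def ranks_from(dist_row, i):
    """Rank every other point by distance from point i (nearest = 1)."""
    order = sorted((j for j in range(len(dist_row)) if j != i),
                   key=lambda j: dist_row[j])
    return {j: r for r, j in enumerate(order, start=1)}


def information_imbalance(space_a, space_b):
    """Return (2 / N) * mean rank in B of each point's nearest neighbour in A.

    Values near 0 mean neighbourhoods in A predict neighbourhoods in B;
    values near 1 mean A carries little information about B.
    Row i of space_a and row i of space_b must describe the same sample.
    """
    n = len(space_a)
    dist_a = pairwise_distances(space_a)
    dist_b = pairwise_distances(space_b)
    total_rank = 0
    for i in range(n):
        # Nearest neighbour of sample i in representation A.
        nearest_in_a = min((j for j in range(n) if j != i),
                           key=lambda j: dist_a[i][j])
        # How far down the neighbour list of sample i that point sits in B.
        total_rank += ranks_from(dist_b[i], i)[nearest_in_a]
    return 2.0 * total_rank / (n * n)
```

Calling `information_imbalance(text_reps, image_reps)` and `information_imbalance(image_reps, text_reps)` on the same paired samples (both argument names are hypothetical) yields two numbers that need not coincide. A gap between them is the kind of directional asymmetry the abstract reports between image and text representations.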
