LG AI CL ML | Jun 6, 2024

What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

arXiv:2406.03707v2 | 10.49 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses a foundational question in machine learning about embedding representation, providing theoretical insights and empirical validation for researchers in NLP and AI, though it is incremental in building on existing understanding of autoregressive models.

The paper tackles the problem of determining what embeddings in autoregressive language models should represent by connecting the prediction objective to constructing predictive sufficient statistics, identifying three optimal settings: sufficient statistics for i.i.d. data, posterior over latent states, and posterior over discrete hypotheses. Empirical probing shows transformers encode these latent generating distributions effectively, with strong performance in out-of-distribution cases and without token memorization.

Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.
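The i.i.d. case can be illustrated with a small sketch. It is not code from the paper: it assumes binary observations and a uniform Beta(1, 1) prior, and the function names `sufficient_stats` and `predictive_prob_one` are hypothetical. Under those assumptions the next-token prediction depends on the sequence only through the count of ones and the length, so two sequences with the same counts call for the same prediction, and an embedding only needs to carry those counts.

```python
from fractions import Fraction


def sufficient_stats(seq):
    """Return (number of ones, sequence length) for a binary sequence."""
    return sum(seq), len(seq)


def predictive_prob_one(seq, alpha=1, beta=1):
    """Posterior predictive P(next = 1 | seq) under an assumed Beta(alpha, beta) prior.

    The result depends on seq only through its sufficient statistics.
    """
    ones, n = sufficient_stats(seq)
    return Fraction(ones + alpha, n + alpha + beta)


# Two different orderings with the same counts (three ones out of five).
seq_a = [1, 1, 0, 1, 0]
seq_b = [0, 1, 0, 1, 1]

p_a = predictive_prob_one(seq_a)
p_b = predictive_prob_one(seq_b)
print(p_a, p_b)  # both equal 4/7
```

Exact fractions are used here so the equality of the two predictions is not obscured by floating-point rounding. The latent-state and discrete-hypothesis settings would replace the counts with a posterior over states or hypotheses, which this sketch does not cover.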
