LG, AI, CL, ML
Jun 6, 2024

What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

arXiv:2406.03707v2
9 citations
Originality: Incremental advance
AI Analysis

This work addresses a foundational question about what embeddings should represent, pairing a theoretical account with empirical validation of interest to NLP and AI researchers. It is an incremental advance that builds on the existing understanding of autoregressive models.

The paper asks what embeddings in autoregressive language models should represent. It connects the prediction objective to the construction of predictive sufficient statistics and identifies three settings in which the optimal embedding content can be characterized: i.i.d. data, where the embedding should capture the sufficient statistics of the data; latent state models, where it should encode the posterior over states; and discrete hypothesis spaces, where it should reflect the posterior over hypotheses. Probing experiments show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization.

Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.
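The three settings named in the abstract can be made concrete with small toy computations. The sketch below is illustrative only and is not taken from the paper: the Beta-Bernoulli model, the two-state hidden Markov model, the two-hypothesis example, and all function names and parameter values are assumptions chosen to keep the arithmetic exact. In each case the returned quantity is the summary that an optimal next-token predictor would need, and hence what the abstract argues an embedding should encode.

```python
from fractions import Fraction as F


def beta_bernoulli_predictive(seq, a=1, b=1):
    """i.i.d. setting (assumed Beta(a, b) prior on a coin bias).

    P(next = 1 | seq) depends on seq only through the sufficient
    statistic (n, k) = (length, number of ones).
    """
    n, k = len(seq), sum(seq)
    return F(k + a, n + a + b)


def hmm_filter(seq, init, trans, emit):
    """Latent state setting: forward filtering in a discrete HMM.

    Returns the posterior over the current hidden state given seq.
    trans[i][j] = P(state j | state i), emit[j][x] = P(symbol x | state j).
    """
    states = range(len(init))
    belief = list(init)
    for t, x in enumerate(seq):
        if t > 0:
            # Propagate the belief one step through the transition matrix.
            belief = [sum(belief[i] * trans[i][j] for i in states) for j in states]
        # Condition on the observed symbol, then renormalize.
        belief = [belief[j] * emit[j][x] for j in states]
        z = sum(belief)
        belief = [p / z for p in belief]
    return belief


def hypothesis_posterior(seq, prior, likelihoods):
    """Discrete hypothesis setting: posterior over a finite hypothesis set.

    likelihoods[h][x] = P(symbol x | hypothesis h), observations i.i.d. given h.
    """
    post = list(prior)
    for x in seq:
        post = [p * likelihoods[h][x] for h, p in enumerate(post)]
    z = sum(post)
    return [p / z for p in post]


# Two sequences with the same counts yield the same prediction.
same = beta_bernoulli_predictive([1, 0, 1, 1]) == beta_bernoulli_predictive([1, 1, 1, 0])
print(same)  # True

# Hypothetical two-state HMM with "sticky" states.
init = [F(1, 2), F(1, 2)]
trans = [[F(9, 10), F(1, 10)], [F(1, 10), F(9, 10)]]
emit = [[F(4, 5), F(1, 5)], [F(1, 5), F(4, 5)]]
print(hmm_filter([0, 0, 1], init, trans, emit))

# Hypothetical hypotheses: a fair coin versus a coin biased toward 1.
print(hypothesis_posterior([1, 1], [F(1, 2), F(1, 2)],
                           [[F(1, 2), F(1, 2)], [F(1, 10), F(9, 10)]]))
```

Exact rationals are used so that the equality check in the i.i.d. case is not affected by floating-point rounding. A probing study of the kind the abstract describes would then test whether such quantities can be decoded from a trained model's embeddings.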
