LG, AI, CL, IT | Feb 1, 2024

The Information of Large Language Model Geometry

arXiv:2402.03471v1 | 6 citations | h-index: 17
AI Analysis

The paper offers theoretical insight into LLM scaling laws and into how information is distributed across tokens, an incremental contribution to understanding model behavior.

This paper investigates the information encoded in large language model embeddings, discovering a power law relationship between representation entropy and model size, and finds that information is distributed across tokens rather than concentrated in specific meaningful tokens.

This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model sizes. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments, and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
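The abstract's link between a new token's information gain and ridge regression can be illustrated with a small sketch. It is not the paper's formulation. It assumes a toy setup in which the last token's embedding is regressed on the context-token embeddings, and the ridge residual serves as a rough proxy for what the last token adds beyond its context. The names `ridge_fit` and `ridge_residual`, the random embeddings, and the regularization strength `lam` are all hypothetical.

```python
import numpy as np


def ridge_fit(context, last, lam=1.0):
    """Closed-form ridge regression of `last` on the columns of `context`.

    context: (d, k) array, one column per context-token embedding (assumed layout).
    last:    (d,) array, embedding of the last token.
    Returns w of shape (k,) minimizing ||last - context @ w||^2 + lam * ||w||^2.
    """
    k = context.shape[1]
    gram = context.T @ context + lam * np.eye(k)
    return np.linalg.solve(gram, context.T @ last)


def ridge_residual(context, last, lam=1.0):
    """Norm of the part of `last` that the ridge fit does not explain."""
    w = ridge_fit(context, last, lam)
    return float(np.linalg.norm(last - context @ w))


# Toy data (hypothetical): 5 context tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
d, k = 16, 5
context = rng.normal(size=(d, k))

# A last token that is an exact mixture of context tokens ...
mixed = context @ np.array([0.5, -0.2, 0.0, 0.1, 0.3])
# ... versus one carrying an extra component independent of the context.
novel = mixed + rng.normal(size=d)

res_mixed = ridge_residual(context, mixed, lam=1e-8)
res_novel = ridge_residual(context, novel, lam=1e-8)
print(f"residual (mixture of context): {res_mixed:.2e}")
print(f"residual (novel component):    {res_novel:.2e}")
```

Under these assumptions, a last token that lies in the span of its context leaves almost no residual, while a token with an independent component leaves a large one. Replacing the squared penalty with an L1 penalty gives the Lasso variant the abstract mentions for selecting individual tokens.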

