CL, LG · Mar 4, 2025

Measuring Intrinsic Dimension of Token Embeddings

arXiv:2503.02142v1 · 5 citations · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work provides a quantitative method for evaluating embedding redundancy in language models, which is incremental but useful for optimizing techniques like LoRA.

The study measured the Intrinsic Dimension (ID) of token embeddings in language models to assess redundancy, finding that embedding spaces often lie on lower-dimensional manifolds, with redundancy increasing with model scale and a rapid ID drop early in training.

In this study, we measure the Intrinsic Dimension (ID) of token embeddings, that is, the dimension of the manifold on which the representations lie, to quantify their redundancy relative to their extrinsic dimensionality. Specifically, (1) we estimate the ID of token embeddings in small-scale language models and in modern large language models, finding that the embedding spaces often reside on manifolds of lower dimension than their extrinsic dimensionality; (2) we measure the ID across various model sizes and observe that the redundancy rate increases as the model scale grows; (3) we track the ID during training and find a rapid drop in the early stages. Moreover, (4) when LoRA is applied to the embedding layers, we observe a sudden drop in perplexity around the estimated IDs, suggesting that the ID can serve as a useful guideline for LoRA application.
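To make the idea of estimating an ID concrete, here is a minimal sketch using the TwoNN estimator, which infers dimension from the ratio of each point's second to first nearest-neighbor distance. The abstract does not say which estimator the authors used, so TwoNN is an assumption chosen for brevity, not their method. The function name `twonn_intrinsic_dimension` and the synthetic data are hypothetical, and a real embedding matrix would replace `points`.

```python
import numpy as np


def twonn_intrinsic_dimension(points):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    points: array of shape (n_points, extrinsic_dim).
    Illustrative only; not necessarily the estimator used in the paper.
    """
    x = np.asarray(points, dtype=np.float64)

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x * x, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)    # clip tiny negatives from rounding
    np.fill_diagonal(d2, np.inf)   # exclude each point's distance to itself

    # Two smallest distances per row: first and second nearest neighbors.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])

    # Drop exact duplicates (r1 == 0), where the ratio is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]

    # Under the TwoNN model, mu follows a Pareto law with exponent d,
    # whose maximum-likelihood estimate is N / sum(log(mu)).
    return keep.sum() / np.sum(np.log(mu))


# Synthetic check: a 2-D Gaussian cloud embedded isometrically in 50-D.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((50, 2)))  # orthonormal columns
points = rng.standard_normal((2000, 2)) @ basis.T      # shape (2000, 50)

estimated_id = twonn_intrinsic_dimension(points)
print(f"extrinsic dimension: {points.shape[1]}, estimated ID: {estimated_id:.2f}")
```

The dense distance matrix uses memory quadratic in the number of points, so for a full vocabulary one would likely subsample tokens or switch to a nearest-neighbor index.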

