LGCTMGMay 20, 2024

Directed Metric Structures arising in Large Language Models

arXiv:2405.12264v12 citationsh-index: 6
Originality Incremental advance
AI Analysis

This work provides a foundational mathematical framework for analyzing LLMs, which could impact theoretical understanding and applications in natural language processing, though it appears incremental in its mathematical formalization.

The paper tackles the problem of understanding the mathematical structure of conditional probability distributions in Large Language Models (LLMs) by reformulating them as metric structures using -log probabilities. It constructs a metric polyhedron and an isometric embedding, proves duality theorems, and shows compatibility with text extensions, deriving approximations for text vectors.

Large Language Models are transformer neural networks which are trained to produce a probability distribution on the possible next words to given texts in a corpus, in such a way that the most likely word predicted is the actual word in the training text. In this paper we find what is the mathematical structure defined by such conditional probability distributions of text extensions. Changing the view point from probabilities to -log probabilities we observe that the subtext order is completely encoded in a metric structure defined on the space of texts $\mathcal{L}$, by -log probabilities. We then construct a metric polyhedron $P(\mathcal{L})$ and an isometric embedding (called Yoneda embedding) of $\mathcal{L}$ into $P(\mathcal{L})$ such that texts map to generators of certain special extremal rays. We explain that $P(\mathcal{L})$ is a $(\min,+)$ (tropical) linear span of these extremal ray generators. The generators also satisfy a system of $(\min+)$ linear equations. We then show that $P(\mathcal{L})$ is compatible with adding more text and from this we derive an approximation of a text vector as a Boltzmann weighted linear combination of the vectors for words in that text. We then prove a duality theorem showing that texts extensions and text restrictions give isometric polyhedra (even though they look a priory very different). Moreover we prove that $P(\mathcal{L})$ is the lattice closure of (a version of) the so called, Isbell completion of $\mathcal{L}$ which turns out to be the $(\max,+)$ span of the text extremal ray generators. All constructions have interpretations in category theory but we don't use category theory explicitly. The categorical interpretations are briefly explained in an appendix. In the final appendix we describe how the syntax to semantics problem could fit in a general well known mathematical duality.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes