John Terilla

CL
h-index: 17
8 papers
214 citations
Novelty: 49%
AI Score: 44

8 Papers

99.4 · LO · Jun 2
A calculus of types in Isbell nuclei

Juan Luis Gastaldi, Samantha Jarvis, Thomas Seiller et al.

We identify two constructions from different mathematical traditions. In linear logic and realisability, logical types are generated rather than fixed in advance: one begins with a universe of realisers equipped with execution, uses orthogonality to test their interactions, and takes types to be the biorthogonally closed subsets. In enriched Isbell duality, a quantitative relation induces an adjunction whose fixed points form a category, its nucleus. These constructions proceed by different means; we show that, in the present setting, they produce the same objects. The shared datum is minimal: an associative product, called execution, and a real-valued measurement, with no compatibility assumed between them. The failure of the measurement to be additive is at once the relation defining orthogonality and the quantitative relation whose Isbell nucleus we form, and the types cut out by orthogonality are exactly the fixed points of the associated adjunction. The identification pays off in both directions. The most natural product of types fails to be associative; repairing this failure forces a different notion of type, sensitive to both sides of a composite, on which the induced product is associative and, when execution has units, carries two residuals. What emerges is a noncommutative Lambek calculus, derived directly from execution and orthogonality rather than imposed. In the reverse direction, each such type, read on the categorical side, generates a quantitative relation of its own, and with it a derived adjunction and a further generation of types; these derived types are again types of the original situation, computed by the residuals of the Lambek calculus. We also prove a coherence theorem for the threefold arrangements of this construction and, in the finite-dimensional case, give explicit formulas for the product.

CL · Jul 16, 2024
The Foundations of Tokenization: Statistical and Computational Concerns

Juan Luis Gastaldi, John Terilla, Luca Malagutti et al.

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on language model estimation has been investigated primarily through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a principled use of tokenizers and, most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. In addition, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, finiteness, and sequentiality. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future theoretical and empirical research.
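The notions of ambiguity and round-trip behavior mentioned in this abstract can be illustrated with a small sketch. It is a deterministic toy, not the paper's stochastic-map framework: the vocabulary `VOCAB`, the greedy longest-match rule in `encode`, and the helper `preimages` are all hypothetical names chosen for illustration.

```python
from itertools import product

# Hypothetical toy vocabulary over the alphabet {a, b} (an assumption).
VOCAB = ["a", "b", "ab"]

def encode(text):
    """Greedy longest-match encoder: characters -> tokens (assumed toy rule)."""
    tokens, i = [], 0
    while i < len(text):
        # Pick the longest vocabulary item that matches at position i.
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tuple(tokens)

def decode(tokens):
    """Decoder: tokens -> characters, by concatenation."""
    return "".join(tokens)

def preimages(text, max_len=4):
    """All token sequences of length <= max_len that decode to `text`."""
    return [seq
            for n in range(1, max_len + 1)
            for seq in product(VOCAB, repeat=n)
            if decode(seq) == text]

# Round trip: decoding the encoding returns the original string.
print(decode(encode("abab")) == "abab")
# Ambiguity: more than one token sequence decodes to "ab".
print(preimages("ab"))
```

Here the encoder picks one token sequence per string, while the decoder maps several token sequences to the same string; that many-to-one behavior is one simple reading of the ambiguity the abstract refers to.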

CL · Dec 4, 2024
From Language Models over Tokens to Language Models over Characters

Tim Vieira, Ben LeBrun, Mario Giulianelli et al.

Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that -- even with a small computation budget -- our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.

CT · Jun 15, 2021
An enriched category theory of language: from syntax to semantics

Tai-Danae Bradley, John Terilla, Yiannis Vlassopoulos

State-of-the-art language models return a natural language text continuation from any piece of input text. This ability to generate coherent text extensions implies significant sophistication, including a knowledge of grammar and semantics. In this paper, we propose a mathematical framework for passing from probability distributions on extensions of given texts, such as the ones learned by today's large language models, to an enriched category containing semantic information. Roughly speaking, we model probability distributions on texts as a category enriched over the unit interval. Objects of this category are expressions in language, and hom objects are conditional probabilities that one expression is an extension of another. This category is syntactical -- it describes what goes with what. Then, via the Yoneda embedding, we pass to the enriched category of unit interval-valued copresheaves on this syntactical category. This category of enriched copresheaves is semantic -- it is where we find meaning, logical operations such as entailment, and the building blocks for more elaborate semantic concepts.
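A small sketch may help make the "category enriched over the unit interval" concrete. It assumes a hypothetical four-text corpus `CORPUS`, takes "extension" to mean "has as a prefix", and estimates probabilities by counting; none of these choices come from the paper. The hom value `hom(x, y)` is the conditional probability that a text beginning with `x` also begins with `y`, and enrichment over ([0, 1], multiplication) asks that `hom(x, y) * hom(y, z) <= hom(x, z)`.

```python
# Hypothetical toy corpus; every text carries equal weight (an assumption).
CORPUS = ["red apple", "red apple pie", "red car", "blue car"]

def prob_prefix(x):
    """Fraction of corpus texts that begin with the expression x."""
    return sum(t.startswith(x) for t in CORPUS) / len(CORPUS)

def hom(x, y):
    """Conditional probability that a text starting with x extends to y.

    Returns 0.0 when y does not extend x or when x never occurs.
    """
    if not y.startswith(x) or prob_prefix(x) == 0:
        return 0.0
    return prob_prefix(y) / prob_prefix(x)

def representable(x, objects):
    """The copresheaf hom(x, -) restricted to a finite list of objects."""
    return {y: hom(x, y) for y in objects}

x, y, z = "red", "red apple", "red apple pie"
# Composition inequality of the enrichment (an equality for prefix chains).
print(hom(x, y) * hom(y, z) <= hom(x, z) + 1e-12)
print(representable("red", [x, y, z, "blue car"]))
```

The dictionary returned by `representable` is a finite stand-in for the image of an expression under the Yoneda embedding: it records how strongly each other expression continues the given one.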

LG · Mar 2, 2020
Tensor Networks for Probabilistic Sequence Modeling

Jacob Miller, Guillaume Rabusseau, John Terilla

Tensor networks are a powerful modeling framework developed for computational many-body physics, which have only recently been applied within machine learning. In this work we utilize a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data. We first show that u-MPS enable sequence-level parallelism, with length-n sequences able to be evaluated in depth O(log n). We then introduce a novel generative algorithm giving trained u-MPS the ability to efficiently sample from a wide variety of conditional distributions, each one defined by a regular expression. Special cases of this algorithm correspond to autoregressive and fill-in-the-blank sampling, but more complex regular expressions permit the generation of richly structured data in a manner that has no direct analogue in neural generative models. Experiments on sequence modeling with synthetic and real text data show u-MPS outperforming a variety of baselines and effectively generalizing their predictions in the presence of limited data.
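The O(log n) depth claim rests on the associativity of matrix multiplication: a product of n matrices can be reduced pairwise in about log2(n) rounds, each of which is a single batched multiplication. The following sketch illustrates that idea only; the function names, the boundary vectors `alpha` and `omega`, and the core layout `(d, D, D)` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sequential_product(mats):
    """Left-to-right product of a stack of shape (n, D, D): n sequential steps."""
    out = np.eye(mats.shape[-1])
    for m in mats:
        out = out @ m
    return out

def tree_product(mats):
    """Same product by pairwise reduction: about log2(n) batched rounds."""
    mats = np.asarray(mats)
    while mats.shape[0] > 1:
        if mats.shape[0] % 2 == 1:
            # Pad with an identity so the matrices pair up evenly.
            eye = np.eye(mats.shape[-1])[None]
            mats = np.concatenate([mats, eye], axis=0)
        # One round: multiply neighbors (0,1), (2,3), ... in a single batch.
        mats = mats[0::2] @ mats[1::2]
    return mats[0]

def umps_amplitude(core, alpha, omega, seq):
    """Amplitude of a sequence under a uniform MPS with one shared core.

    core has shape (d, D, D); alpha and omega are length-D boundary vectors.
    """
    return alpha @ tree_product(core[np.asarray(seq)]) @ omega

rng = np.random.default_rng(0)
core = rng.normal(size=(2, 3, 3))
alpha, omega = rng.normal(size=3), rng.normal(size=3)
print(umps_amplitude(core, alpha, omega, [0, 1, 1, 0, 1]))
```

On hardware that executes each batched round in parallel, the number of rounds, rather than the sequence length, determines the depth of the computation.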

QUANT-PH · Oct 16, 2019
Modeling Sequences with Quantum States: A Look Under the Hood

Tai-Danae Bradley, E. Miles Stoudenmire, John Terilla

Classical probability distributions on sets of sequences can be modeled using quantum states. Here, we do so with a quantum state that is pure and entangled. Because it is entangled, the reduced densities that describe subsystems also carry information about the complementary subsystem. This is in contrast to the classical marginal distributions on a subsystem in which information about the complementary system has been integrated out and lost. A training algorithm based on the density matrix renormalization group (DMRG) procedure uses the extra information contained in the reduced densities and organizes it into a tensor network model. An understanding of the extra information contained in the reduced densities allows us to examine the mechanics of this DMRG algorithm and study the generalization error of the resulting model. As an illustration, we work with the even-parity dataset and produce an estimate for the generalization error as a function of the fraction of the dataset used in training.

QUANT-PH · Feb 19, 2019
Probabilistic Modeling with Matrix Product States

James Stokes, John Terilla

Inspired by the possibility that generative models based on quantum circuits can provide a useful inductive bias for sequence modeling tasks, we propose an efficient training algorithm for a subset of classically simulable quantum circuit models. The gradient-free algorithm, presented as a sequence of exactly solvable effective models, is a modification of the density matrix renormalization group procedure adapted for learning a probability distribution. The conclusion that circuit-based models offer a useful inductive bias for classical datasets is supported by experimental results on the parity learning problem.

CL · Nov 4, 2017
Language as a matrix product state

Vasily Pestun, John Terilla, Yiannis Vlassopoulos

We propose a statistical model for natural language that begins by considering language as a monoid, then represents it in complex matrices with a compatible translation-invariant probability measure. We interpret the probability measure as arising via the Born rule from a translation-invariant matrix product state.
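As a rough illustration of the Born rule applied to a translation-invariant matrix product state, the sketch below assigns each symbol a complex matrix, takes the trace of the product along a string as its amplitude, and squares the modulus. The periodic (trace) boundary condition and the normalization over strings of one fixed length are simplifying assumptions made here, and `born_probability` is a hypothetical name.

```python
import itertools
import numpy as np

def born_probability(mats, seq):
    """P(seq) = |tr(A_{s1} ... A_{sn})|^2 / Z for a string of symbol indices.

    mats has shape (d, D, D): one complex matrix per symbol. Z normalizes
    over all strings with the same length as seq.
    """
    D = mats.shape[-1]
    prod = np.eye(D, dtype=complex)
    for s in seq:
        prod = prod @ mats[s]
    amplitude = np.trace(prod)
    # Transfer matrix E = sum_s A_s (x) conj(A_s); tr(E^n) equals the sum of
    # |amplitude|^2 over all length-n strings.
    E = sum(np.kron(A, A.conj()) for A in mats)
    Z = np.trace(np.linalg.matrix_power(E, len(seq))).real
    return abs(amplitude) ** 2 / Z

rng = np.random.default_rng(0)
mats = rng.normal(size=(2, 3, 3)) + 1j * rng.normal(size=(2, 3, 3))
total = sum(born_probability(mats, s)
            for s in itertools.product(range(2), repeat=3))
print(total)
```

Because the trace is invariant under cyclic permutation, cyclically shifted strings receive the same probability in this sketch, which is a finite-length shadow of translation invariance.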