Juan Luis Gastaldi

CL
h-index17
6papers
548citations
Novelty57%
AI Score53

6 Papers

99.4LOJun 2
A calculus of types in Isbell nuclei

Juan Luis Gastaldi, Samantha Jarvis, Thomas Seiller et al.

We identify two constructions from different mathematical traditions. In linear logic and realisability, logical types are generated rather than fixed in advance: one begins with a universe of realisers equipped with execution, uses orthogonality to test their interactions, and takes types to be the biorthogonally closed subsets. In enriched Isbell duality, a quantitative relation induces an adjunction whose fixed points form a category, its nucleus. These constructions proceed by different means; we show that, in the present setting, they produce the same objects. The shared datum is minimal: an associative product, called execution, and a real-valued measurement, with no compatibility assumed between them. The failure of the measurement to be additive is at once the relation defining orthogonality and the quantitative relation whose Isbell nucleus we form, and the types cut out by orthogonality are exactly the fixed points of the associated adjunction. The identification pays off in both directions. The most natural product of types fails to be associative; repairing this failure forces a different notion of type, sensitive to both sides of a composite, on which the induced product is associative and, when execution has units, carries two residuals. What emerges is a noncommutative Lambek calculus, derived directly from execution and orthogonality rather than imposed. In the reverse direction, each such type, read on the categorical side, generates a quantitative relation of its own, and with it a derived adjunction and a further generation of types; these derived types are again types of the original situation, computed by the residuals of the Lambek calculus. We also prove a coherence theorem for the threefold arrangements of this construction and, in the finite-dimensional case, give explicit formulas for the product.

CLJun 29, 2023
Tokenization and the Noiseless Channel

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi et al. · eth-zurich

Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with $α= 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.

CLJun 29, 2023
A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi et al. · eth-zurich

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{σ(\boldsymbolμ^\star)}(1-e^{-{σ(\boldsymbolμ^\star)}})$-approximation of an optimal merge sequence, where ${σ(\boldsymbolμ^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbolμ^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.

CLJul 16, 2024
The Foundations of Tokenization: Statistical and Computational Concerns

Juan Luis Gastaldi, John Terilla, Luca Malagutti et al.

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on language model estimation has been investigated primarily through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a principled use of tokenizers and, most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. In addition, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, finiteness, and sequentiality. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future theoretical and empirical research.

CLDec 4, 2024
From Language Models over Tokens to Language Models over Characters

Tim Vieira, Ben LeBrun, Mario Giulianelli et al.

Modern language models are internally -- and mathematically -- distributions over $\it{token}$ strings rather than $\it{character}$ strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that -- even with a small computation budget -- our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.

CLJun 9, 2025
Language Models over Canonical Byte-Pair Encodings

Tim Vieira, Tianyu Liu, Clemente Pasti et al.

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.