DSCLFLDec 19, 2024

Tokenisation is NP-Complete

arXiv:2412.15210v18 citationsh-index: 1
Originality Incremental advance
AI Analysis

This is a foundational theoretical result for machine learning and NLP, showing that optimal tokenisation is inherently difficult, which impacts algorithm design and efficiency in text processing.

The authors proved that two variants of tokenisation—direct vocabulary selection and bottom-up merge operations—are NP-complete problems, meaning they are computationally intractable to solve optimally in polynomial time.

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $δ$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes