DS CL FLDec 19, 2024

Tokenisation is NP-Complete

Philip Whittington, Gregor Bachmann, Tiago Pimentel

arXiv:2412.15210v17.38 citationsh-index: 1

Originality Incremental advance

AI Analysis

This is a foundational theoretical result for machine learning and NLP, showing that optimal tokenisation is inherently difficult, which impacts algorithm design and efficiency in text processing.

The authors proved that two variants of tokenisation—direct vocabulary selection and bottom-up merge operations—are NP-complete problems, meaning they are computationally intractable to solve optimally in polynomial time.

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $δ$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).

View on arXiv PDF

Similar