Tokenisation is NP-Complete
This is a foundational theoretical result for machine learning and NLP, showing that optimal tokenisation is inherently difficult, which impacts algorithm design and efficiency in text processing.
The authors proved that two variants of tokenisation—direct vocabulary selection and bottom-up merge operations—are NP-complete problems, meaning they are computationally intractable to solve optimally in polynomial time.
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $δ$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).