Modular Linear Tokenization (MLT)
MLT addresses the need for efficient, scalable categorical encoding in machine learning, particularly in applications with millions of identifiers. The contribution appears incremental, since it builds on existing encoding techniques.
The paper tackles the problem of encoding high-cardinality categorical identifiers by introducing Modular Linear Tokenization (MLT), a reversible and deterministic technique. On the MovieLens 20M dataset, MLT achieves predictive performance comparable to that of supervised embeddings while using significantly fewer parameters and incurring lower training cost.
This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
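To make the idea of a reversible encoding built from modular arithmetic and an invertible linear map more concrete, the following is a minimal sketch, not the paper's exact construction and not the API of the light-mlt package. It assumes one plausible reading of the abstract: an integer identifier is split into N digits in base P (a prime), and the digit vector is multiplied by a matrix A that is invertible modulo P. The names P, N, A, encode, and decode, as well as the specific matrix, are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: P, N, A, encode and decode are hypothetical names,
# not taken from the paper or from the light-mlt package.

P = 101  # prime modulus (assumed); arithmetic is done in the field of integers mod P
N = 4    # output dimensionality (assumed); capacity is P**N distinct identifiers

# Unit upper-triangular matrix: its determinant is 1, so it is invertible mod P.
A = [
    [1, 2, 3, 4],
    [0, 1, 5, 6],
    [0, 0, 1, 7],
    [0, 0, 0, 1],
]


def to_digits(x, p, n):
    """Write x as n base-p digits, least significant digit first."""
    digits = []
    for _ in range(n):
        x, r = divmod(x, p)
        digits.append(r)
    return digits


def from_digits(digits, p):
    """Inverse of to_digits: rebuild the integer from its base-p digits."""
    x = 0
    for d in reversed(digits):
        x = x * p + d
    return x


def mat_vec_mod(m, v, p):
    """Matrix-vector product with every entry reduced mod p."""
    return [sum(a * b for a, b in zip(row, v)) % p for row in m]


def inverse_mod(m, p):
    """Invert a square matrix mod a prime p using Gauss-Jordan elimination."""
    n = len(m)
    # Augment m with the identity matrix: [m | I].
    aug = [[a % p for a in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible mod p")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [(x * inv) % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


A_INV = inverse_mod(A, P)


def encode(identifier):
    """Map an integer identifier in [0, P**N) to a vector of N values in [0, P)."""
    if not 0 <= identifier < P ** N:
        raise ValueError("identifier out of range for the chosen P and N")
    return mat_vec_mod(A, to_digits(identifier, P, N), P)


def decode(vector):
    """Recover the original identifier from its encoded vector."""
    return from_digits(mat_vec_mod(A_INV, vector, P), P)
```

In this sketch the mapping is a bijection on the integers from 0 to P**N - 1, because both the base-P digit expansion and multiplication by an invertible matrix modulo P are bijections. The dimensionality is controlled by N and the value range of each component by P; with P = 101 and N = 4 the capacity is about 104 million identifiers. How the resulting vectors are scaled or fed to a downstream model is not covered here.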