CL, Aug 2, 2022

Lost in Space Marking

arXiv:2208.01561v1, 2 citations, h-index: 17
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific, incremental decision in tokenizer design, offering NLP practitioners guidance that depends on whether the training text is pre-tokenized or raw.

The study asks whether a subword tokenizer should mark word-initial or word-final tokens. Judged on efficiency, cohesion, and morphological coverage, a Unigram LM tokenizer trained on pre-tokenized English text does better with initial marking, while one trained on raw text does better with final marking. The results generalize across domains.

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
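To make the two options concrete, here is a minimal sketch of word-initial versus word-final marking. It is illustrative only: the function names, the hand-written segmentation, and the choice of "▁" as the mark (the SentencePiece convention) are assumptions, not details taken from the paper, and no Unigram LM tokenizer is trained here.

```python
MARK = "\u2581"  # "▁"; assumed mark character, borrowed from SentencePiece


def mark_pieces(words, scheme):
    """Attach a boundary mark to the first or last piece of each word.

    words:  list of words, each already split into a non-empty list of pieces
            (a hypothetical segmentation, not real tokenizer output).
    scheme: "initial" marks the word-initial piece, "final" the word-final one.
    """
    tokens = []
    for pieces in words:
        pieces = list(pieces)  # copy so the caller's lists are not modified
        if scheme == "initial":
            pieces[0] = MARK + pieces[0]
        elif scheme == "final":
            pieces[-1] = pieces[-1] + MARK
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
        tokens.extend(pieces)
    return tokens


def detokenize(tokens):
    """Rebuild the text: every mark stands for one word boundary."""
    return "".join(tokens).replace(MARK, " ").strip()


# Hand-written segmentation of "unmarked tokens" (an assumption for the demo).
segmented = [["un", "mark", "ed"], ["token", "s"]]

initial_tokens = mark_pieces(segmented, "initial")
final_tokens = mark_pieces(segmented, "final")

print(initial_tokens)
print(final_tokens)
print(detokenize(initial_tokens) == detokenize(final_tokens))
```

Both schemes recover the same text. They differ only in which piece carries the boundary information, and so in which vocabulary entries exist in marked and unmarked variants.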

