Lost in Space Marking
This work addresses a specific, incremental decision in subword tokenizer design and offers NLP practitioners guidance that depends on whether the training text is pre-tokenized or raw.
The study investigates whether a subword tokenizer should mark word-initial or word-final tokens, judged by efficiency, cohesion, and morphological coverage. Unigram LM tokenizers trained on pre-tokenized English text benefit from initial marking, while those trained on raw text do better with final marking. The results generalize across domains.
We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
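To make the two options concrete, the following minimal sketch applies each marking scheme to a hand-segmented pair of words. Everything in it is an illustrative assumption rather than the setup used in this work: the marker symbols `_` and `</w>`, the helper names `mark_initial` and `mark_final`, and the segmentation itself, which is written by hand and is not the output of any trained tokenizer.

```python
# Illustrative sketch only: marker symbols and helper names are assumptions,
# not taken from the paper.
INITIAL_MARK = "_"    # stands in for a word-start marker
FINAL_MARK = "</w>"   # stands in for a word-end marker


def mark_initial(pieces):
    """Attach the marker to the first subword piece of a word."""
    if not pieces:
        return []
    return [INITIAL_MARK + pieces[0]] + list(pieces[1:])


def mark_final(pieces):
    """Attach the marker to the last subword piece of a word."""
    if not pieces:
        return []
    return list(pieces[:-1]) + [pieces[-1] + FINAL_MARK]


# Hand-picked segmentation for illustration, not real tokenizer output.
words = [["un", "mark", "ed"], ["token", "s"]]

initial = [tok for word in words for tok in mark_initial(word)]
final = [tok for word in words for tok in mark_final(word)]

print(initial)  # ['_un', 'mark', 'ed', '_token', 's']
print(final)    # ['un', 'mark', 'ed</w>', 'token', 's</w>']
```

In both schemes the word boundaries can be recovered from the token sequence. What changes is which pieces receive a separate marked form: here the marked forms are `_un` and `_token` under initial marking, and `ed</w>` and `s</w>` under final marking.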