CL · Aug 2, 2022

Lost in Space Marking

arXiv:2208.01561v11.12 citations · h-index: 17

Originality Synthesis-oriented

AI Analysis

This work addresses a specific, incremental decision in tokenizer design for NLP practitioners, offering guidance that depends on whether the training text is pre-tokenized or raw.

The study investigates whether a subword tokenizer should mark word-initial or word-final tokens. It finds that a Unigram LM tokenizer trained on pre-tokenized English text does better with word-initial marking, while one trained on raw text does better with word-final marking, and that these results generalize across domains.

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
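
To make the two conventions concrete, here is a minimal sketch of what "marking the word-initial token" and "marking the word-final token" look like on an already segmented sentence. The helper names, the example segmentation, and the choice of "▁" as the mark are assumptions made for illustration; this is not the paper's code and does not train or call any tokenizer.

```python
# Illustrative toy helpers only; not the paper's implementation.
MARK = "\u2581"  # "▁", used here as the boundary mark (an assumption)

def mark_initial(words):
    """Attach the mark to the first subword of each (non-empty) word."""
    return [[MARK + pieces[0]] + pieces[1:] for pieces in words]

def mark_final(words):
    """Attach the mark to the last subword of each (non-empty) word."""
    return [pieces[:-1] + [pieces[-1] + MARK] for pieces in words]

# Hypothetical subword segmentation of the text "unmarked tokens".
segmented = [["un", "mark", "ed"], ["token", "s"]]

initial = mark_initial(segmented)  # mark sits on "un" and "token"
final = mark_final(segmented)      # mark sits on "ed" and "s"
```

Under initial marking, prefixes such as "un" exist in the vocabulary in a marked form while suffixes stay unmarked; under final marking the situation is reversed. This is one way to see why the choice can interact with morphological coverage.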
