Analyzing Cognitive Plausibility of Subword Tokenization
This work addresses a gap relevant to NLP researchers: tokenization is rarely evaluated for cognitive plausibility. The contribution is incremental, since it builds on existing tokenization algorithms.
The paper tackles the problem of evaluating subword tokenization by introducing a new paradigm based on cognitive plausibility. It analyzes how tokenizer output correlates with human performance on lexical decision tasks across languages, and finds that the UnigramLM algorithm yields less cognitively plausible behavior and worse coverage of derivational morphemes.
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
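The correlation analysis described above can be illustrated with a minimal sketch. It is not the paper's implementation: the greedy longest-match tokenizer stands in for the three algorithms compared, the number of subword pieces per word is only one possible tokenizer-output measure, and the vocabulary and response times (`TOY_VOCAB`, `TOY_RT_MS`) are invented toy values, not data from any lexical decision study.

```python
from math import sqrt


def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation (an assumed stand-in tokenizer).

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always succeeds.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy vocabulary and made-up mean response times (milliseconds).
TOY_VOCAB = {"un", "break", "able", "walk", "ing", "cat", "re", "play"}
TOY_RT_MS = {"cat": 520.0, "walking": 580.0, "replay": 600.0, "unbreakable": 690.0}

words = sorted(TOY_RT_MS)
n_pieces = [len(greedy_tokenize(w, TOY_VOCAB)) for w in words]
rts = [TOY_RT_MS[w] for w in words]

# Correlation between how finely a word is split and how slowly it is judged.
r = pearson(n_pieces, rts)
print(f"Pearson r between piece count and response time: {r:.3f}")
```

In a real evaluation, the toy dictionary would be replaced by per-word response times and accuracies from a lexical decision dataset, and the stand-in tokenizer by each trained tokenizer at each vocabulary size, so that the resulting correlations can be compared across algorithms.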