CL AI Oct 19, 2024

Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization

arXiv:2410.17094v1
Originality: Synthesis-oriented
AI Analysis

This work addresses tokenization challenges for natural language processing researchers, but it is incremental as it builds on existing methods without introducing major innovations.

The paper explored using morphological segmentation methods, specifically Morfessor and a transformer-based seq2seq model, as part of subword tokenizers, finding they could be as effective as commonly used tokenizers. It also investigated tokenizer vocabulary influence, showing that a balanced token frequency distribution, achieved by keeping frequent words as unique tokens, tends to improve language model performance.

This paper presents the submission of team Ryu to the canceled SIGMORPHON 2024 shared task on subword tokenization. My submission explores whether morphological segmentation methods can be used as a part of subword tokenizers. I adopt two approaches: the statistical segmentation method Morfessor and a transformer-based sequence-to-sequence (seq2seq) segmentation model, each used inside a tokenizer. The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers. Additionally, I investigate how a tokenizer's vocabulary influences the performance of language models. A tokenizer with a balanced token frequency distribution tends to work better. A balanced token vocabulary can be achieved by keeping frequent words as unique tokens.
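The idea of keeping frequent words as unique tokens can be illustrated with a minimal Python sketch. It does not reproduce the submission's actual pipeline: `chunk_segmenter` is a hypothetical stand-in for a morphological segmenter such as Morfessor or the seq2seq model, and `normalized_entropy` is only one possible proxy for how balanced a token frequency distribution is, chosen here as an assumption.

```python
import math
from collections import Counter


def top_k_words(corpus, k):
    """Return the k most frequent whitespace-separated words in the corpus."""
    counts = Counter(word for line in corpus for word in line.split())
    return {word for word, _ in counts.most_common(k)}


def chunk_segmenter(word, size=3):
    """Hypothetical stand-in for a morphological segmenter: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def tokenize(line, keep_whole, segmenter=chunk_segmenter):
    """Keep words in `keep_whole` as single tokens; segment all other words."""
    tokens = []
    for word in line.split():
        if word in keep_whole:
            tokens.append(word)
        else:
            tokens.extend(segmenter(word))
    return tokens


def normalized_entropy(tokens):
    """Entropy of the token distribution divided by its maximum (1.0 = uniform).

    Assumed here as a rough balance measure; not taken from the paper.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))
```

For example, with a corpus in which "the" is the most frequent word, `tokenize("the walking dog", top_k_words(corpus, 1))` leaves "the" intact and passes the remaining words to the segmenter. Swapping in a real segmenter only requires a function that maps a word to a list of pieces.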

