CL, Apr 24, 2023

Semantic Tokenizer for Enhanced Natural Language Processing

Sandeep Mehta, Darpan Shah, Ravindra Kulkarni, Cornelia Caragea

arXiv:2304.12404v1, 1.35 citations, h-index: 15

Originality: Incremental advance

AI Analysis

This paper addresses vocabulary limitations in NLP, a concern for both researchers and practitioners, though the contribution appears incremental because it builds on existing tokenizer frameworks.

The authors tackled the problem of NLP vocabulary construction by developing a semantic tokenizer that uses stemming to enhance subword formation, which more than doubles the number of wordforms represented and improves model convergence and embedding quality. Experimental results show top performance on two GLUE tasks using BERT-base, outperforming models over 50 times larger.

Traditionally, NLP performance improvement has focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword regularization. We present a novel tokenizer that uses semantics to drive vocabulary construction. The tokenizer includes a trainer that uses stemming to enhance subword formation. Further optimizations and adaptations are implemented to minimize the number of words that cannot be encoded. The encoder is updated to integrate with the trainer. The tokenizer is implemented as a drop-in replacement for the SentencePiece tokenizer. The new tokenizer more than doubles the number of wordforms represented in the vocabulary. The enhanced vocabulary significantly improves NLP model convergence and improves the quality of word and sentence embeddings. Our experimental results show top performance on two GLUE tasks using BERT-base, improving on models more than 50X in size.
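
The idea of letting stems drive subword formation can be illustrated with a toy sketch. This is not the authors' trainer or encoder: the suffix list, the `##` continuation marker, and the function names `split_stem`, `train_vocab`, and `encode` are all assumptions made for illustration, and the naive suffix stripper stands in for a real stemmer. The sketch only shows why storing stems and suffix pieces separately can let a fixed-size vocabulary represent more wordforms than storing whole words.

```python
from collections import Counter

# Hypothetical suffix list for illustration only; a real stemmer is far richer.
SUFFIXES = ("ing", "ed", "es", "s", "ly")


def split_stem(word):
    """Split a word into (stem, suffix) using a naive longest-suffix rule."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a stem of at least 3 characters to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""


def train_vocab(corpus_words, vocab_size):
    """Keep the most frequent stems and suffix pieces, up to vocab_size entries."""
    counts = Counter()
    for word in corpus_words:
        stem, suffix = split_stem(word)
        counts[stem] += 1
        if suffix:
            counts["##" + suffix] += 1
    return {piece for piece, _ in counts.most_common(vocab_size)}


def encode(word, vocab):
    """Return the pieces for a word, or None if any piece is missing from vocab."""
    stem, suffix = split_stem(word)
    pieces = [stem] + (["##" + suffix] if suffix else [])
    return pieces if all(piece in vocab for piece in pieces) else None


corpus = (
    "walk walks walked walking "
    "talk talks talked talking "
    "jump jumps jumped jumping"
).split()

vocab = train_vocab(corpus, vocab_size=6)
covered = [word for word in corpus if encode(word, vocab) is not None]
print(len(vocab), "vocabulary entries encode", len(covered), "wordforms")
```

In this toy corpus, six entries (three stems and three suffix pieces) encode all twelve wordforms, whereas a six-entry whole-word vocabulary could hold at most six of them. The numbers are an artifact of the hand-picked corpus and say nothing about the gains reported in the paper.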
