CL Sep 9, 2022

MaxMatch-Dropout: Subword Regularization for WordPiece

arXiv:2209.04126v131.0585 citations h-index: 7 Has Code

Originality: Incremental advance

AI Analysis

This work addresses tokenization robustness for NLP practitioners using models like BERT, but it is incremental as it builds on existing subword regularization techniques.

The paper introduces MaxMatch-Dropout, a subword regularization method for WordPiece that randomly drops words during the maximum matching search used for tokenization. This makes subword regularization available when fine-tuning pretrained language models such as BERT-base. In experiments, it improves text classification and machine translation performance, as other subword regularization methods do.

We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization. The proposed method, MaxMatch-Dropout, randomly drops words in a search using the maximum matching algorithm. It realizes finetuning with subword regularization for popular pretrained language models such as BERT-base. The experimental results demonstrate that MaxMatch-Dropout improves the performance of text classification and machine translation tasks as well as other subword regularization methods. Moreover, we provide a comparative analysis of subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
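The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `maxmatch_dropout`, the toy vocabulary `VOCAB`, and the dropout probability `p` are hypothetical, and the `##` continuation prefix follows the convention of BERT's WordPiece vocabularies. The sketch also assumes that single-character matches are never dropped, so that a word whose characters are all in the vocabulary can always be tokenized; the paper may handle this case differently.

```python
import random


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Tokenize one word by greedy longest match, randomly skipping matches.

    With p=0.0 this reduces to plain greedy longest-match-first tokenization.
    """
    rng = rng or random.Random()
    tokens, start = [], 0
    while start < len(word):
        prefix = "##" if start > 0 else ""  # non-initial pieces carry "##"
        match = None
        # Try the longest candidate first, then shrink from the right.
        for end in range(len(word), start, -1):
            piece = prefix + word[start:end]
            if piece not in vocab:
                continue
            # Assumption (not from the source): never drop a single character.
            is_single_char = (end - start) == 1
            if not is_single_char and rng.random() < p:
                continue  # drop this vocabulary word and keep searching
            match = (piece, end)
            break
        if match is None:
            return [unk]  # no usable piece at this position
        tokens.append(match[0])
        start = match[1]
    return tokens


# Hypothetical toy vocabulary for illustration only.
VOCAB = {
    "word", "##piece", "##pie", "##ce",
    "w", "##o", "##r", "##d", "##p", "##i", "##e", "##c",
}

print(maxmatch_dropout("wordpiece", VOCAB, p=0.0))  # ['word', '##piece']
print(maxmatch_dropout("wordpiece", VOCAB, p=0.3, rng=random.Random(0)))
```

With `p=0.0` the output is the deterministic segmentation; with `p > 0` the same word can be split differently on each call, which is the source of the regularization effect during fine-tuning.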
