CL · Sep 9, 2022

MaxMatch-Dropout: Subword Regularization for WordPiece

arXiv:2209.04126v1 · 585 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses tokenization robustness for NLP practitioners using models like BERT, but it is incremental as it builds on existing subword regularization techniques.

The paper introduces MaxMatch-Dropout, a subword regularization method for WordPiece that randomly drops vocabulary words during the maximum matching search, so the same input can be tokenized in different ways while finetuning pretrained language models such as BERT. In text classification and machine translation experiments, it improves performance about as well as other subword regularization methods.

We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization. The proposed method, MaxMatch-Dropout, randomly drops words in a search using the maximum matching algorithm. It realizes finetuning with subword regularization for popular pretrained language models such as BERT-base. The experimental results demonstrate that MaxMatch-Dropout improves the performance of text classification and machine translation tasks as well as other subword regularization methods. Moreover, we provide a comparative analysis of subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
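
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It runs a greedy longest-match (maximum matching) search over a WordPiece-style vocabulary and skips each matching candidate with probability `p`. The function name `maxmatch_dropout`, the toy vocabulary, the `##` continuation prefix handling, and the rule that single-character pieces are never dropped are all assumptions made for illustration.

```python
import random


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Greedy longest-match tokenization with random candidate dropout.

    Illustrative sketch only: at each position, every vocabulary piece that
    matches is discarded with probability p, and the longest surviving piece
    is emitted. Assumption: single-character pieces are never dropped, so a
    word whose characters are all in the vocabulary can always be tokenized.
    """
    rng = rng or random
    tokens = []
    start = 0
    while start < len(word):
        # WordPiece convention: non-initial pieces carry a "##" prefix.
        prefix = "##" if start > 0 else ""
        best_end = None
        for end in range(start + 1, len(word) + 1):
            piece = prefix + word[start:end]
            if piece not in vocab:
                continue
            # Randomly drop this candidate (multi-character pieces only).
            if end - start > 1 and rng.random() < p:
                continue
            best_end = end  # longest surviving match so far
        if best_end is None:
            # No piece matched at this position: map the whole word to unk.
            return [unk]
        tokens.append(prefix + word[start:best_end])
        start = best_end
    return tokens


# Toy vocabulary (hypothetical, for demonstration only).
VOCAB = {
    "w", "wo", "word",
    "##o", "##r", "##d", "##rd",
    "##p", "##i", "##e", "##c", "##pie", "##ce", "##piece",
}

print(maxmatch_dropout("wordpiece", VOCAB, p=0.0))
# → ['word', '##piece']
print(maxmatch_dropout("wordpiece", VOCAB, p=1.0))
# → ['w', '##o', '##r', '##d', '##p', '##i', '##e', '##c', '##e']
print(maxmatch_dropout("wordpiece", VOCAB, p=0.3, rng=random.Random(0)))
```

In this sketch, `p=0.0` reduces to ordinary maximum matching and `p=1.0` falls back to character-level pieces. Intermediate values sample a different segmentation on each call, which is what lets finetuning see several tokenizations of the same word.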

Code Implementations · 1 repo
