CL · Mar 12, 2022

MarkBERT: Marking Word Boundaries Improves Chinese BERT

arXiv:2203.06378v2 · 16 citations · h-index: 66 · Has Code
AI Analysis

MarkBERT addresses a bottleneck in Chinese NLP by improving how words are handled in tasks such as text classification and named entity recognition. The contribution is incremental, since it builds on the existing BERT architecture.

The paper tackles the problem of Chinese BERT models struggling with out-of-vocabulary words by introducing MarkBERT, which uses boundary markers between words while keeping a character-based vocabulary, resulting in improved performance on downstream tasks like language understanding and sequence labeling.

We present MarkBERT, a Chinese BERT model that uses word information. Existing word-based BERT models treat words as basic units; however, because of BERT's vocabulary limit, they cover only high-frequency words and fall back to the character level when they encounter out-of-vocabulary (OOV) words. In contrast to existing work, MarkBERT keeps a vocabulary of Chinese characters and inserts boundary markers between contiguous words. This design lets the model handle all words in the same way, whether or not they are OOV. The model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which complement traditional character-level and sentence-level pretraining tasks; second, it can easily incorporate richer semantics, such as the POS tags of words, by replacing generic markers with POS-tag-specific markers. With this simple marker insertion, MarkBERT can improve performance on various downstream tasks, including language understanding and sequence labeling. The code and models will be made publicly available at https://github.com/daiyongya/markbert.
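The marker insertion described in the abstract can be illustrated with a minimal sketch. It assumes the input is already segmented into words by some external segmenter. The marker string `[S]`, the function name `insert_markers`, and the choice to tag each marker with the POS of the preceding word are all illustrative assumptions, not details confirmed by the abstract.

```python
from typing import List, Optional

# Hypothetical generic boundary marker; the abstract does not name the token.
GENERIC_MARKER = "[S]"


def insert_markers(words: List[str],
                   pos_tags: Optional[List[str]] = None) -> List[str]:
    """Turn pre-segmented words into character tokens with boundary markers.

    Each word is split into single characters, so the vocabulary stays
    character-based. A marker is placed between contiguous words. If POS
    tags are given, the generic marker is replaced by a tag-specific one;
    using the tag of the preceding word is an assumption of this sketch.
    """
    if pos_tags is not None and len(pos_tags) != len(words):
        raise ValueError("pos_tags must have one tag per word")

    tokens: List[str] = []
    for i, word in enumerate(words):
        tokens.extend(word)  # iterating a str yields its characters
        if i < len(words) - 1:  # markers go between words only
            if pos_tags is None:
                tokens.append(GENERIC_MARKER)
            else:
                tokens.append(f"[{pos_tags[i]}]")
    return tokens


if __name__ == "__main__":
    print(insert_markers(["学生", "喜欢", "学习"]))
    print(insert_markers(["学生", "喜欢", "学习"], ["NN", "VV", "VV"]))
```

Because every word is reduced to characters plus a marker, a rare or unseen word goes through exactly the same path as a frequent one, which is the property the abstract highlights for OOV handling.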
