CL | Jun 1, 2021

Sub-Character Tokenization for Chinese Pretrained Language Models

arXiv:2106.00400v3 | 224 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses tokenization inefficiency and homophone typos in Chinese NLP, offering an incremental improvement over existing methods.

The paper proposes sub-character tokenization for Chinese pretrained language models: each character is encoded by its glyph or pronunciation so that the tokenizer can use sub-character linguistic information. The resulting sequences are shorter, which improves computational efficiency, and the pronunciation-based variant is robust to homophone typos, while performance on downstream tasks remains competitive.

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
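The encoding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny pronunciation table `TOY_PINYIN`, the separator `SEP`, and the function `encode_pronunciation` are all hypothetical names chosen for illustration, and tones are dropped as a simplifying assumption. The sketch only shows why homophones collapse to the same encoded sequence before any sub-word vocabulary is built.

```python
# Toy pronunciation table (hypothetical, for illustration only).
# A real system would need a full character-to-pronunciation dictionary.
TOY_PINYIN = {
    "他": "ta",
    "她": "ta",   # homophone of 他
    "在": "zai",
    "再": "zai",  # homophone of 在
    "家": "jia",
}

# Assumed marker closing each character's encoding, so that character
# boundaries stay recoverable in the encoded text.
SEP = "#"


def encode_pronunciation(text: str) -> str:
    """Replace each known character with its toneless transliteration plus SEP."""
    pieces = []
    for ch in text:
        if ch in TOY_PINYIN:
            pieces.append(TOY_PINYIN[ch] + SEP)
        else:
            pieces.append(ch)  # leave characters outside the table unchanged
    return "".join(pieces)


correct = encode_pronunciation("他在家")
typo = encode_pronunciation("她再家")  # same sounds, two characters swapped
print(correct)          # ta#zai#jia#
print(correct == typo)  # True
```

Because both inputs map to the same string, any sub-word segmentation run on the encoded text would tokenize them identically, which is the robustness property the abstract attributes to pronunciation-based SubChar tokenizers. A glyph-based encoding would follow the same pattern with a table of glyph components in place of transliterations.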

Code Implementations: 2 repos
