CL | AI | Aug 2, 2021

LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization

arXiv:2108.00801v2711 citations
AI Analysis

The paper addresses the problem of learning the precise meanings of words and phrases in language models for NLU tasks. It is an incremental improvement over existing methods.

The paper tackles the limitation of single-grained tokenization in pre-trained language models by proposing LICHEE, a method that incorporates multi-grained information. The authors report comprehensive improvements on NLU tasks in Chinese and English with little extra inference cost, and their best ensemble model achieves state-of-the-art performance on the CLUE benchmark.

Language model pre-training based on large corpora has achieved tremendous success in terms of constructing enriched contextual representations and has led to significant performance gains on a diverse range of Natural Language Understanding (NLU) tasks. Despite the success, most current pre-trained language models, such as BERT, are trained based on single-grained tokenization, usually with fine-grained characters or sub-words, making it hard for them to learn the precise meaning of coarse-grained words and phrases. In this paper, we propose a simple yet effective pre-training method named LICHEE to efficiently incorporate multi-grained information of input text. Our method can be applied to various pre-trained language models and improve their representation capability. Extensive experiments conducted on CLUE and SuperGLUE demonstrate that our method achieves comprehensive improvements on a wide variety of NLU tasks in both Chinese and English with little extra inference cost incurred, and that our best ensemble model achieves state-of-the-art performance in the CLUE benchmark competition.
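To make the idea of multi-grained input concrete, here is a minimal sketch of one way fine-grained and coarse-grained embeddings could be combined at the input layer. The abstract does not say how the grains are merged, so the element-wise max used below is an assumption chosen for illustration. The function names (`align_grains`, `merge_embeddings`) and the toy embedding tables are likewise hypothetical and not taken from the paper or its code.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def align_grains(words: List[str]) -> List[Tuple[str, int]]:
    """Split each coarse-grained word into characters (the fine grain)
    and record the index of the word each character came from."""
    pairs: List[Tuple[str, int]] = []
    for word_index, word in enumerate(words):
        for char in word:
            pairs.append((char, word_index))
    return pairs


def merge_embeddings(
    words: List[str],
    fine_table: Dict[str, Vector],
    coarse_table: Dict[str, Vector],
) -> List[Vector]:
    """Return one merged vector per fine-grained token.

    ASSUMPTION: the merge is an element-wise max of the character vector
    and the vector of the word containing it. Other merges (sum, concat
    plus projection) would fit the same interface.
    """
    merged: List[Vector] = []
    for char, word_index in align_grains(words):
        fine_vec = fine_table[char]
        coarse_vec = coarse_table[words[word_index]]
        merged.append([max(f, c) for f, c in zip(fine_vec, coarse_vec)])
    return merged


# Toy, made-up 2-dimensional embeddings for a pre-segmented input.
words = ["北京", "大学"]
fine_table = {
    "北": [0.1, 0.9],
    "京": [0.4, 0.2],
    "大": [0.7, 0.3],
    "学": [0.2, 0.6],
}
coarse_table = {"北京": [0.5, 0.5], "大学": [0.3, 0.8]}

merged = merge_embeddings(words, fine_table, coarse_table)
for (char, _), vec in zip(align_grains(words), merged):
    print(char, vec)
```

In this sketch the merged sequence has one vector per character, so its length equals that of the fine-grained input, and the encoder that would consume it is unchanged.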

Code Implementations: 1 repo
