CL · Apr 8, 2024

Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training

Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, Min Zhang

arXiv:2404.05560v123.781 citations · h-index: 14 · Has Code · LREC

Originality: Incremental advance

AI Analysis

This work addresses the need for better boundary-aware models in Chinese NLP, offering incremental improvements over existing methods.

The paper addresses Chinese sequence labeling by enhancing BABERT, a boundary-aware pre-trained language model, with supervised boundary information. The resulting model improves performance on sequence labeling and on a broader range of Chinese natural language understanding tasks. The paper also introduces a new metric for evaluating boundary awareness without fine-tuning.

Chinese sequence labeling tasks are heavily reliant on accurate word boundary demarcation. Although current pre-trained language models (PLMs) have achieved substantial gains on these tasks, they rarely explicitly incorporate boundary information into the modeling process. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT's learning, developing a semi-supervised boundary-aware PLM. To assess PLMs' ability to encode boundaries, we introduce a novel "Boundary Information Metric" that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT variant outperforms the vanilla version, not only on these tasks but also more broadly across a range of Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs' boundary awareness.
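
The abstract does not define the Boundary Information Metric, so the following is only a hypothetical illustration of what a fine-tuning-free boundary probe could look like, not the paper's formula. It assumes that a boundary-aware model places adjacent characters of the same word closer together than adjacent characters that straddle a word boundary. The function name `boundary_gap`, its inputs, and the use of cosine similarity are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def boundary_gap(char_vectors, word_lengths):
    """Hypothetical probe, NOT the paper's metric.

    char_vectors: one vector per character, e.g. taken from a frozen PLM.
    word_lengths: gold segmentation given as the length of each word in order.
    Returns mean similarity of adjacent characters inside a word minus
    mean similarity of adjacent characters across a word boundary.
    """
    if sum(word_lengths) != len(char_vectors):
        raise ValueError("word lengths must cover every character exactly once")

    # Index i is a boundary position when a word ends at character i.
    boundaries = set()
    position = 0
    for length in word_lengths:
        position += length
        boundaries.add(position - 1)

    inside, across = [], []
    for i in range(len(char_vectors) - 1):
        similarity = cosine(char_vectors[i], char_vectors[i + 1])
        (across if i in boundaries else inside).append(similarity)

    if not inside or not across:
        raise ValueError("need at least one within-word and one cross-boundary pair")
    return sum(inside) / len(inside) - sum(across) / len(across)


# Toy vectors standing in for PLM outputs: two two-character words.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
gap = boundary_gap(toy_vectors, [2, 2])
```

Under these assumptions, a larger gap would suggest that the representations separate words more clearly, and the same computation could be run on several frozen models without any task-specific training.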
