Unified Multi-Criteria Chinese Word Segmentation with BERT
This work addresses the challenge of segmenting Chinese text under multiple criteria. The contribution is incremental: it builds on existing unified frameworks and BERT models.
The paper tackles multi-criteria Chinese word segmentation by proposing a unified BERT-based model enhanced with bigram features and an auxiliary criterion classification task, and it reports new state-of-the-art results on eight datasets with diverse criteria.
Multi-Criteria Chinese Word Segmentation (MCCWS) aims to find word boundaries in a Chinese sentence, which is written as a continuous sequence of characters, when multiple segmentation criteria exist. The unified framework has been widely used in MCCWS and has proven effective. In addition, the pre-trained BERT language model has also been introduced into the MCCWS task in a multi-task learning framework. In this paper, we combine the strengths of the unified framework and the pre-trained language model, and propose a unified MCCWS model based on BERT. Moreover, we augment the unified BERT-based MCCWS model with bigram features and an auxiliary criterion classification task. Experiments on eight datasets with diverse criteria demonstrate that our methods achieve new state-of-the-art results for MCCWS.
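
As a rough illustration of the ingredients named in the abstract, the following Python sketch shows one way the inputs to a unified multi-criteria segmenter might be prepared. It covers data preparation only, not the BERT model or its training. The abstract does not specify these details: the criterion token format (`[pku]`, `[ctb]`), the `</s>` bigram padding symbol, the BMES tag scheme, and all function names are assumptions made for illustration, and the example segmentations are illustrative rather than taken from any dataset.

```python
from typing import List


def build_input(sentence: str, criterion: str) -> List[str]:
    """Prepend a criterion token so one shared model can serve all criteria.

    The "[criterion]" token format is an assumption, not the paper's spec.
    """
    return ["[CLS]", f"[{criterion}]"] + list(sentence) + ["[SEP]"]


def char_bigrams(sentence: str, pad: str = "</s>") -> List[str]:
    """Return one bigram per character: the character plus its right neighbour.

    The last character is paired with a padding symbol (assumed here).
    """
    chars = list(sentence) + [pad]
    return [chars[i] + chars[i + 1] for i in range(len(sentence))]


def bmes_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S labels."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# The same sentence under two hypothetical criteria (illustrative segmentations).
sentence = "李娜进入半决赛"
coarse = ["李娜", "进入", "半决赛"]
fine = ["李", "娜", "进入", "半", "决赛"]

tokens_ctb = build_input(sentence, "ctb")
tokens_pku = build_input(sentence, "pku")
bigrams = char_bigrams(sentence)
tags_coarse = bmes_tags(coarse)
tags_fine = bmes_tags(fine)
```

In this sketch the two inputs differ only in the criterion token, while the target tag sequences differ. That is the property a unified model has to learn, and under the same assumptions the criterion token could also serve as the label for an auxiliary criterion classification objective.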