CL | AI | Jun 28, 2019

A Concise Model for Multi-Criteria Chinese Word Segmentation with Transformer Encoder

arXiv:1906.12035v21000 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses segmentation variability in Chinese NLP, offering a concise solution for researchers and practitioners, though it is incremental as it builds on existing Transformer methods.

The paper tackles multi-criteria Chinese word segmentation by proposing a unified Transformer encoder model that uses criterion-tokens to handle multiple segmentation criteria, outperforming baseline and other multi-criteria models on eight datasets.

Multi-criteria Chinese word segmentation (MCCWS) aims to exploit the relations among multiple heterogeneous segmentation criteria and thereby improve performance on each individual criterion. Previous work usually treats the criteria as different tasks that are learned together under a multi-task learning framework. In this paper, we propose a concise but effective unified model for MCCWS that is fully shared across all criteria. By leveraging the Transformer encoder, the unified model segments Chinese text according to a unique criterion-token that indicates the output criterion. The model can also segment both simplified and traditional Chinese and has excellent transfer capability. Experiments on eight datasets with different criteria show that our model outperforms our single-criterion baseline model and other multi-criteria models. The source code for this paper is available on GitHub: https://github.com/acphile/MCCWS.
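
The criterion-token idea can be illustrated with a minimal sketch. It only shows the input and output handling around such a model, not the Transformer encoder itself. The token spelling `<pku>` / `<ctb>`, the helper names `build_input` and `tags_to_words`, and the B/M/E/S tagging scheme are assumptions made for illustration and are not taken from the paper or its repository. The tag strings are hand-written stand-ins for model predictions, chosen to show how one sentence could be segmented differently under two criteria.

```python
from typing import List


def build_input(sentence: str, criterion: str) -> List[str]:
    """Prepend a (hypothetical) criterion token to the character sequence."""
    return [f"<{criterion}>"] + list(sentence)


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Turn per-character B/M/E/S tags into a list of words.

    B = begin, M = middle, E = end of a multi-character word, S = single.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # flush a dangling B/M run instead of dropping characters
        words.append(current)
    return words


sentence = "林丹赢得总冠军"

# The same shared model would receive a different first token per criterion.
print(build_input(sentence, "ctb"))
print(build_input(sentence, "pku"))

# Hand-written tags standing in for model output under two criteria.
print(tags_to_words(list(sentence), list("BEBEBME")))  # coarser segmentation
print(tags_to_words(list(sentence), list("SSBESBE")))  # finer segmentation
```

Because the criterion is carried by an input token rather than by separate output layers, switching criteria in this sketch means changing only the first element of the input sequence.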

Code Implementations: 1 repo