CL, Sep 18, 2020

fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

Zhichao Geng, Hang Yan, Xipeng Qiu, Xuanjing Huang

arXiv:2009.08633v227.6712 citations, Has Code

Originality: Synthesis-oriented

AI Analysis

fastHan is a practical, user-friendly toolkit for Chinese NLP tasks, though the contribution is incremental because it builds on existing BERT-based methods.

The authors developed fastHan, a multi-task toolkit for Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), and dependency parsing. It achieves state-of-the-art (SOTA) performance in CWS and POS tagging and near-SOTA performance in dependency parsing and NER.

We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT that uses only the first 8 layers of BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora across the four tasks, achieving state-of-the-art (SOTA) performance in CWS and POS tagging and near-SOTA performance in dependency parsing and NER. fastHan also transfers well, performing much better than popular segmentation tools on a corpus not used in training. To better meet the needs of practical applications, we allow users to further fine-tune fastHan on their own labeled data. In addition to its small size and excellent performance, fastHan is user-friendly: implemented as a Python package, it isolates users from the internal technical details and is convenient to use. The project is released on GitHub.
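The architecture described in the abstract, a shared encoder truncated to its first 8 layers feeding four task-specific outputs, can be illustrated with a toy sketch. This is not fastHan's code or API: `make_layer`, `PrunedEncoder`, and `MultiTaskModel` are hypothetical names, the "layers" are simple arithmetic stand-ins for Transformer layers, and the 12-layer starting depth is assumed (it matches BERT-base). The sketch shows only the structural idea of layer pruning plus a shared encoder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a Transformer layer (hypothetical, not fastHan's code).
Layer = Callable[[List[float]], List[float]]


def make_layer(index: int) -> Layer:
    """Return a toy layer that adds its index to every feature."""
    def layer(features: List[float]) -> List[float]:
        return [x + index for x in features]
    return layer


@dataclass
class PrunedEncoder:
    layers: List[Layer]

    @classmethod
    def from_full(cls, full_layers: List[Layer], keep: int) -> "PrunedEncoder":
        # Pruning here means keeping only the first `keep` layers.
        return cls(layers=list(full_layers[:keep]))

    def encode(self, features: List[float]) -> List[float]:
        for layer in self.layers:
            features = layer(features)
        return features


class MultiTaskModel:
    """One shared encoder, one lightweight head per task."""

    def __init__(self, encoder: PrunedEncoder,
                 heads: Dict[str, Callable[[List[float]], Tuple[str, float]]]):
        self.encoder = encoder
        self.heads = heads

    def predict(self, features: List[float], task: str) -> Tuple[str, float]:
        shared = self.encoder.encode(features)  # computed once, reused by any head
        return self.heads[task](shared)


# Assumed 12-layer full model, pruned to the first 8 layers.
full_model_layers = [make_layer(i) for i in range(12)]
encoder = PrunedEncoder.from_full(full_model_layers, keep=8)

# Toy heads for the four tasks named in the abstract.
heads = {
    task: (lambda shared, t=task: (t, sum(shared)))
    for task in ("CWS", "POS", "NER", "Parsing")
}
model = MultiTaskModel(encoder, heads)
```

Because every task reads the same encoder output, adding a task in this sketch costs only one extra head, which is the usual motivation for a joint model over four separate ones.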
