AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation
This work addresses the challenge of processing ancient Chinese for cultural and linguistic applications, but the contribution is incremental: it adapts an existing method (BERT) to a new domain.
The authors tackle natural language processing for ancient Chinese, a domain that lacks large-scale parallel data, by releasing AnchiBERT, a pre-trained model based on the BERT architecture and trained on monolingual ancient Chinese corpora. AnchiBERT outperforms both BERT and non-pretrained models, achieving state-of-the-art results on poem classification, ancient-modern Chinese translation, poem generation, and couplet generation.
Ancient Chinese is the essence of Chinese culture. There are several natural language processing tasks in the ancient Chinese domain, such as ancient-modern Chinese translation, poem generation, and couplet generation. Previous studies usually use supervised models, which rely heavily on parallel data. However, large-scale parallel data for ancient Chinese is difficult to obtain. To make full use of the more easily available monolingual ancient Chinese corpora, we release AnchiBERT, a pre-trained language model based on the BERT architecture and trained on large-scale ancient Chinese corpora. We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification, ancient-modern Chinese translation, poem generation, and couplet generation. The experimental results show that AnchiBERT outperforms BERT as well as the non-pretrained models and achieves state-of-the-art results in all cases.