CL Oct 12, 2023

The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS

arXiv:2310.08496v1 36.2584 citations h-index: 10 Has Code

Originality: Incremental advance

AI Analysis

This work addresses text mining for ancient Chinese, which is less studied than modern Chinese. It aims to improve the accuracy of word segmentation and part-of-speech tagging, two steps that support the comprehension of classical literature.

The authors tackle word segmentation and part-of-speech tagging in ancient Chinese texts with a framework that captures wordhood semantics and re-predicts uncertain samples using external knowledge. They report that it outperforms pre-trained BERT with CRF and existing tools such as Jiayan.

Automatic analysis for modern Chinese has greatly improved the accuracy of text mining in related fields, but the study of ancient Chinese is still relatively rare. Ancient text segmentation and lexical annotation are important parts of classical literature comprehension, and previous studies have tried to construct auxiliary dictionaries and fuse other knowledge to improve performance. In this paper, we propose a framework for ancient Chinese Word Segmentation (CWS) and Part-of-Speech (POS) Tagging that makes a twofold effort: on the one hand, we try to capture the wordhood semantics; on the other hand, we re-predict the uncertain samples of the baseline model by introducing external knowledge. Our architecture outperforms pre-trained BERT with CRF and existing tools such as Jiayan.
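The re-prediction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the entropy-based uncertainty measure, the threshold value, the character-level lexicon, and the names `entropy` and `repredict_uncertain` are all assumptions made for illustration. The sketch keeps the baseline label where the model is confident and consults a hypothetical external lexicon where it is not.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def repredict_uncertain(chars, label_probs, labels, lexicon, threshold=0.5):
    """Return one label per character.

    Assumption: uncertainty is measured by entropy, and the external
    knowledge is a toy dict mapping a character to a preferred label.
    Confident positions keep the baseline argmax label; uncertain
    positions use the lexicon entry when one exists.
    """
    result = []
    for ch, probs in zip(chars, label_probs):
        best_index = max(range(len(probs)), key=probs.__getitem__)
        label = labels[best_index]
        if entropy(probs) > threshold and ch in lexicon:
            label = lexicon[ch]  # hypothetical external knowledge overrides
        result.append(label)
    return result


# Toy example with BMES segmentation labels (values are made up).
labels = ["B", "M", "E", "S"]
chars = ["天", "下"]
label_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident: keep baseline label
    [0.30, 0.25, 0.25, 0.20],  # uncertain: consult the lexicon
]
lexicon = {"下": "E"}
tags = repredict_uncertain(chars, label_probs, labels, lexicon)
```

In this toy run the first character keeps its baseline label and the second is re-labeled from the lexicon. A real system would more likely retrieve richer external knowledge and feed it to a second model rather than overriding labels directly.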
