CL · AI · Nov 18, 2024

CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters

arXiv:2411.11770v4 · 1 citation · IJCNN

Originality: Incremental advance

AI Analysis

The paper addresses a specific problem within Chinese Spelling Correction that matters for downstream applications such as named entity recognition and sentiment analysis. It represents an incremental improvement.

The paper tackles the problem of converting Hanyu Pinyin abbreviations to Chinese characters, a challenging task due to limited information, and proposes CNMBERT, which achieves a 61.53% MRR score and 51.86% accuracy on a test dataset of 10,373 samples.

The task of converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch within the domain of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. This task typically involves text-length alignment and seems easy to solve; however, due to the limited information content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we treat this as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model, as a solution to this issue. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o with a 61.53% MRR score and 51.86% accuracy on a 10,373-sample test dataset.
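The two reported metrics can be illustrated with a minimal sketch. It assumes the standard definitions: MRR is the mean of 1/rank of the correct answer in each ranked candidate list (0 when the answer is absent), and accuracy is the share of samples whose top-ranked candidate is correct. The function names and the toy abbreviations below are hypothetical and are not taken from the paper or its test set.

```python
from typing import Dict, Sequence


def reciprocal_rank(candidates: Sequence[str], gold: str) -> float:
    """Return 1/rank of `gold` in a ranked candidate list, or 0.0 if it is absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def evaluate(predictions: Sequence[Sequence[str]], golds: Sequence[str]) -> Dict[str, float]:
    """Compute MRR and top-1 accuracy over ranked predictions (assumed standard definitions)."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("at least one sample is required")
    n = len(golds)
    mrr = sum(reciprocal_rank(c, g) for c, g in zip(predictions, golds)) / n
    accuracy = sum(1 for c, g in zip(predictions, golds) if c and c[0] == g) / n
    return {"mrr": mrr, "accuracy": accuracy}


# Hypothetical toy data: ranked candidates for the abbreviations "bhys", "xswl", "yyds".
toy_predictions = [
    ["不好意思", "不会演示"],  # correct answer ranked 1st
    ["想死我了", "笑死我了"],  # correct answer ranked 2nd
    ["一样的事"],              # correct answer missing
]
toy_golds = ["不好意思", "笑死我了", "永远的神"]

scores = evaluate(toy_predictions, toy_golds)
print(scores)
```

On this toy data the reciprocal ranks are 1, 0.5, and 0, so MRR is 0.5 while accuracy is 1/3. MRR is never lower than top-1 accuracy because it also gives partial credit to correct answers ranked below first place, which is consistent with the paper's 61.53% MRR sitting above its 51.86% accuracy.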
