CL | Nov 15, 2022

Error-Robust Retrieval for Chinese Spelling Check

arXiv:2211.07843v2 | 82 citations | h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses two challenges in CSC: insufficient annotated data and underuse of the existing datasets. The problem matters for applications in Chinese text processing, though the contribution appears incremental because it builds on existing models.

The paper tackles the problem of Chinese Spelling Check (CSC) by introducing a plug-and-play retrieval method (RERIC) that uses multimodal representations and n-gram reranking, achieving substantial improvements on SIGHAN benchmarks.

Chinese Spelling Check (CSC) aims to detect and correct erroneous tokens in Chinese text and has a wide range of applications. However, it faces two challenges: annotated data is scarce, and previous methods may not fully leverage the existing datasets. In this paper, we introduce RERIC, a plug-and-play retrieval method with error-robust information for Chinese Spelling Check, which can be applied directly to existing CSC models. The retrieval datastore is built entirely from the training data, with a design tailored to the characteristics of CSC. Specifically, we compute queries and keys from multimodal representations that fuse phonetic, morphologic, and contextual information, which makes retrieval more robust to potential errors. Furthermore, to better judge the retrieved candidates, the n-gram surrounding the token to be checked is stored as the value and used for reranking. Experimental results on the SIGHAN benchmarks demonstrate that the proposed method achieves substantial improvements over existing work.
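
The retrieval idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the concatenation-based fusion, the Euclidean distance, the scoring formula, the `alpha` weight, and the toy vectors are all assumptions made for illustration. It shows only the general shape: keys are fused phonetic, glyph, and contextual features, values are the n-grams around each training token, and retrieved candidates are reranked by how well their n-gram context matches the input.

```python
import math
from collections import Counter


def fuse(phonetic, glyph, context, weights=(1.0, 1.0, 1.0)):
    """Fuse three feature vectors by weighted concatenation (an assumed scheme)."""
    fused = []
    for w, vec in zip(weights, (phonetic, glyph, context)):
        fused.extend(w * x for x in vec)
    return fused


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def context_overlap(query_ngram, stored_ngram):
    """Share of non-centre positions where the two n-grams agree.

    The centre is skipped because it is the token being checked and may be wrong.
    """
    centre = len(stored_ngram) // 2
    pairs = [(q, s) for i, (q, s) in enumerate(zip(query_ngram, stored_ngram))
             if i != centre]
    if not pairs:
        return 0.0
    return sum(q == s for q, s in pairs) / len(pairs)


def retrieve(query, query_ngram, datastore, k=3, alpha=0.5):
    """Return candidate characters ranked by key similarity and n-gram match.

    datastore is a list of (key, ngram) pairs; the candidate correction is the
    centre character of each stored n-gram.
    """
    nearest = sorted(datastore, key=lambda entry: distance(query, entry[0]))[:k]
    scores = Counter()
    for key, ngram in nearest:
        candidate = ngram[len(ngram) // 2]
        similarity = math.exp(-distance(query, key))
        rerank = alpha + (1 - alpha) * context_overlap(query_ngram, ngram)
        scores[candidate] += similarity * rerank
    return scores.most_common()


# Toy datastore with hand-made 2-d features per modality (hypothetical values).
datastore = [
    (fuse([1, 0], [0.9, 0.1], [1, 0]), ("天", "气", "好")),
    (fuse([1, 0], [0.8, 0.2], [0, 1]), ("生", "气", "了")),
    (fuse([0, 1], [0, 1], [0.5, 0.5]), ("汽", "车", "站")),
]

# Input "天汽好": the centre character 汽 is a misspelling of 气.
query = fuse([1, 0], [0.7, 0.3], [0.9, 0.1])
ranked = retrieve(query, ("天", "汽", "好"), datastore, k=3)
print(ranked[0][0])  # 气
```

In this toy setup the misspelled query still lands near the keys of the correct character because the phonetic and contextual features agree, and the n-gram match then pushes the best-fitting candidate to the top. A real system would replace the hand-made vectors with learned encoders and combine the retrieved distribution with the base CSC model's prediction.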

Code Implementations: 1 repo
