CL · Feb 21, 2025

A Training-free LLM-based Approach to General Chinese Character Error Correction

Houquan Zhou, Bo Zhang, Zhenghua Li, Ming Yan, Min Zhang

arXiv:2502.15266v2 · 5 citations · h-index: 8 · ACL

Originality: Incremental advance

AI Analysis

This work addresses a practical limitation of Chinese text correction by covering more error types, though the advance is incremental because it builds on existing methods.

The paper extends Chinese spelling correction to cover missing and redundant character errors, introducing the General Chinese Character Error Correction (C2EC) task. Without any fine-tuning, its method lets a 14B-parameter LLM perform comparably to models nearly 50 times larger.

Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated, which limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from the CCTC and Lemon datasets. We extend the training-free, prompt-free CSC method to C2EC by using Levenshtein distance to handle length changes and by leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
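To make the three error types concrete, the sketch below aligns an erroneous sentence with its correction using a standard Levenshtein dynamic program and labels each edit as a substitution, a missing character, or a redundant character. It is only an illustration of how Levenshtein alignment accommodates length changes, not the authors' correction method; the function name `char_edits` and the label strings are assumptions made for this example.

```python
from typing import List, Tuple

Edit = Tuple[str, str, str]  # (error type, source char, target char)


def char_edits(source: str, target: str) -> Tuple[int, List[Edit]]:
    """Return the Levenshtein distance and the edits turning source into target.

    Hypothetical helper for illustration. Error types are named from the
    point of view of the erroneous source text:
      "substitution": a wrong character replaced a correct one
      "redundant":    the source has an extra character
      "missing":      the source lacks a character present in the target
    """
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or substitute
                dist[i - 1][j] + 1,         # drop a redundant source char
                dist[i][j - 1] + 1,         # insert a missing target char
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    edits.append(("substitution", source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("redundant", source[i - 1], ""))
            i -= 1
        else:
            edits.append(("missing", "", target[j - 1]))
            j -= 1
    edits.reverse()
    return dist[n][m], edits
```

For example, `char_edits("我喜欢吃苹", "我喜欢吃苹果")` reports a single missing character, which a substitution-only CSC evaluation could not represent because the two strings differ in length.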
