CL, AI | Nov 14, 2023

Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction

Kunting Li, Yong Hu, Shaolei Wang, Hanhan Ma, Liang He, Fandong Meng, Jie Zhou

arXiv:2311.08219v1 | 0.51 citations | h-index: 40 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the evaluation gap for generative models in Chinese Spelling Correction. The contribution is incremental, since it improves how corrections are assessed rather than the correction method itself.

The paper argues that traditional metrics underestimate ChatGPT's performance in Chinese Spelling Correction because of their strict constraints. It proposes a new metric, Eval-GCSC, which aligns closely with human evaluations and under which ChatGPT's performance is comparable to that of traditional models.

ChatGPT has demonstrated impressive performance in various downstream tasks. However, in the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics. We believe this inconsistency arises because the traditional metrics are not well-suited for evaluating generative models. Their overly strict length and phonics constraints may lead to underestimating ChatGPT's correction capabilities. To better evaluate generative models in the CSC task, this paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints. Experimental results show that Eval-GCSC closely aligns with human evaluations. Under this metric, ChatGPT's performance is comparable to traditional token-level classification models (TCM), demonstrating its potential as a CSC tool. The source code and scripts can be accessed at https://github.com/ktlKTL/Eval-GCSC.
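The contrast the abstract draws between strict and relaxed scoring can be sketched roughly as follows. This is not the Eval-GCSC implementation (the linked repository holds that); the function names, the `is_equivalent` hook standing in for a semantic similarity judgment, the toy synonym table, and the pre-segmented example sentence are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def strict_char_match(source, prediction, reference):
    """Strict check: the output must keep the source length
    and reproduce the reference exactly."""
    if len(prediction) != len(source):
        return False
    return prediction == reference


def word_edits(source_words, target_words):
    """Map each edited source span (start, end) to its replacement text."""
    matcher = SequenceMatcher(a=source_words, b=target_words, autojunk=False)
    return {
        (i1, i2): "".join(target_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def relaxed_word_match(source_words, prediction_words, reference_words,
                       is_equivalent=None):
    """Relaxed check (hypothetical): the prediction must edit the same
    source words as the reference, and each replacement must be judged
    equivalent. Replacement length is not constrained."""
    if is_equivalent is None:
        # Default: no semantic judgment, exact match only.
        is_equivalent = lambda a, b: a == b
    pred = word_edits(source_words, prediction_words)
    ref = word_edits(source_words, reference_words)
    if pred.keys() != ref.keys():
        return False
    return all(is_equivalent(pred[span], ref[span]) for span in ref)


# Toy stand-in for a semantic similarity judgment (assumption).
SYNONYMS = {frozenset({"开心", "高兴"})}


def toy_equivalent(a, b):
    return a == b or frozenset({a, b}) in SYNONYMS


# Pre-segmented example; a real pipeline would need a word segmenter.
source_words = ["我", "今天", "很", "高心"]      # "高心" is the misspelling
reference_words = ["我", "今天", "很", "高兴"]
prediction_words = ["我", "今天", "很", "开心"]  # synonym of the reference word

strict_ok = strict_char_match(
    "".join(source_words), "".join(prediction_words), "".join(reference_words)
)
relaxed_ok = relaxed_word_match(
    source_words, prediction_words, reference_words, toy_equivalent
)
```

In this toy case the strict check rejects the prediction because it differs from the reference string, while the relaxed check accepts it because the same word was edited and the replacement is judged equivalent. How Eval-GCSC actually combines its word-level and semantic judgments is defined in the paper and its code, not here.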
