CL AI Jun 5, 2024

ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

arXiv:2406.03202v2
1 citation
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in GEC, a practical concern for NLP researchers. It is an incremental advance, since it builds on existing LLM-based generation methods.

The paper tackles synthetic data generation for grammatical error correction (GEC). It proposes an automated framework and a new dataset, ChatLang-8, which contains 1 million pairs with human-like errors and is reported to improve model performance compared with existing GEC datasets.

We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
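As an illustration only, the four components named in the abstract could be wired together roughly as follows. This is a minimal sketch and not the authors' implementation: the category lists, the prompt wording, the validity check, and every function name (`build_prompt`, `is_valid`, `generate_pairs`, `stub_llm`) are assumptions, and the LLM call is replaced by a stub so the example runs offline.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (erroneous sentence, corrected sentence)

# Hypothetical category lists. The paper's actual eight subject types and
# 23 grammar types are not reproduced here.
SUBJECTS = ["pronoun", "proper noun", "plural noun"]
GRAMMARS = ["present perfect", "passive voice", "conditional"]


def build_prompt(subject: str, grammar: str) -> str:
    """Prompt Manager stand-in: merge the selected categories into one instruction."""
    return (
        f"Write a sentence whose subject is a {subject} and that uses the "
        f"{grammar}. Give a version with a learner-like grammatical error, "
        f"then the corrected version."
    )


def is_valid(pair: Pair) -> bool:
    """Evaluator stand-in (assumed rule): both sides non-empty and different."""
    source, target = (s.strip() for s in pair)
    return bool(source) and bool(target) and source != target


def generate_pairs(llm: Callable[[str], Pair], n: int, seed: int = 0) -> List[Pair]:
    """Make n generation attempts and keep the pairs that pass the evaluator."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(n):
        subject = rng.choice(SUBJECTS)   # Subject Selector stand-in
        grammar = rng.choice(GRAMMARS)   # Grammar Selector stand-in
        pair = llm(build_prompt(subject, grammar))
        if is_valid(pair):
            pairs.append(pair)
    return pairs


def stub_llm(prompt: str) -> Pair:
    """Offline placeholder for a real LLM call; always returns the same pair."""
    return ("She have finished the report.", "She has finished the report.")
```

Calling `generate_pairs(stub_llm, 3)` makes three attempts and keeps each pair that passes the check. In a real setup `stub_llm` would be swapped for a function that queries a model and parses its reply into an (erroneous, corrected) pair.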
