CL AI Jun 26, 2023

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

arXiv:2306.14377v1 h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in data-centric AI by highlighting potential pitfalls in applying established data quality techniques to synthetic data, which is important for researchers and practitioners relying on synthetic datasets for model development.

The study investigated whether data quality control methods, which improve models trained on real-world data, have the same effect when applied to models trained solely on synthetic data for grammatical error correction. The results showed that these methods had a negative impact on models trained with synthetic data, contrasting with the positive effects seen in real-world data scenarios.

The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to have a positive effect on performance. While data-centric AI based on synthetic data has recently attracted attention for its potential to improve performance, data-centric AI has long been validated exclusively on real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and models trained on synthetic data have not yet been thoroughly verified. Given these challenges, we ask the question: Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between models trained on synthetic and real-world data on the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.
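
To make the two data quality control steps named in the abstract more concrete, here is a minimal Python sketch of token-level noise injection and label balancing on toy GEC-style data. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the noise operations, the balancing criterion, or any data format, so the function names `inject_noise` and `balance_by_label`, the drop/duplicate/swap operations, and the `"label"` field are all hypothetical.

```python
import random
from collections import defaultdict


def inject_noise(tokens, rate, rng):
    """Corrupt a token list: each position is perturbed with probability `rate`.

    The three operations (drop, duplicate, swap with the next token) are
    assumptions chosen for illustration; the paper's noise functions may differ.
    """
    noisy = []
    i = 0
    while i < len(tokens):
        if rng.random() < rate:
            op = rng.choice(["drop", "dup", "swap"])
            if op == "drop":
                i += 1
                continue
            if op == "dup":
                noisy.extend([tokens[i], tokens[i]])
                i += 1
                continue
            if op == "swap" and i + 1 < len(tokens):
                noisy.extend([tokens[i + 1], tokens[i]])
                i += 2
                continue
        # No perturbation (or a swap requested on the last token): keep as-is.
        noisy.append(tokens[i])
        i += 1
    return noisy


def balance_by_label(examples, rng):
    """Downsample so every label has as many examples as the rarest label.

    `examples` is a list of dicts with a hypothetical "label" key
    (for instance, an error type).
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["label"]].append(ex)
    n = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n))
    return balanced


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "she goes to school every day".split()
    # Build a (noisy source, clean target) training pair.
    print(inject_noise(clean, 0.3, rng), "->", clean)
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which matters when comparing models trained on different data variants.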

