CL | AI | SD | AS | Jul 25, 2025

Data Augmentation for Spoken Grammatical Error Correction

arXiv:2507.19374v1 | 1 citation | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses a data scarcity problem for researchers and practitioners in speech processing and language learning. It is incremental, however, because it builds on existing GEC methods by extending them to the spoken domain.

The paper tackles the lack of high-quality annotated datasets for Spoken Grammatical Error Correction (SGEC) by proposing an automated method to generate audio-text pairs with grammatical errors and disfluencies. It evaluates the augmented dataset on the S&I Corpus and reports improvements in both written and spoken GEC tasks.

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.

