Improving Grammatical Error Correction with Machine Translation Pairs
This work addresses the data scarcity issue in grammatical error correction for ESL learners, offering an incremental improvement over existing methods.
The paper tackles the limited training data available for grammatical error correction. It proposes a novel data synthesis method that pairs machine translations of different qualities to generate diverse error-corrected sentence pairs. The resulting data improves performance and can be combined with other synthetic data sources for further gains.
We propose a novel data synthesis method that generates diverse error-corrected sentence pairs for improving grammatical error correction. The method is based on a pair of machine translation models of different qualities (i.e., poor and good). The poor translation model resembles an ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammatical correctness, while the good translation model generally generates fluent and grammatically correct translations. We build the poor translation model as a phrase-based statistical machine translation model with a decreased language model weight, and the good translation model as a neural machine translation model. By taking the two models' translations of the same bridge-language sentences as error-corrected sentence pairs, we can construct an unlimited amount of pseudo-parallel data. Our approach is capable of generating diverse fluency-improving patterns without being limited by a pre-defined rule set or by seed error-corrected data. Experimental results demonstrate the effectiveness of our approach and show that it can be combined with other synthetic data sources to yield further improvements.
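The pairing step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `synthesize_pairs`, the `POOR` and `GOOD` lookup tables, the French example sentences, and the rule that drops identical pairs are all hypothetical. In practice the two callables would wrap a phrase-based SMT system with a lowered language model weight and an NMT system.

```python
from typing import Callable, Iterable, List, Tuple

# A translator maps one bridge-language sentence to one English sentence.
Translator = Callable[[str], str]


def synthesize_pairs(
    bridge_sentences: Iterable[str],
    poor_translate: Translator,
    good_translate: Translator,
) -> List[Tuple[str, str]]:
    """Pair the poor and good translations of each bridge-language sentence."""
    pairs = []
    for sentence in bridge_sentences:
        source = poor_translate(sentence).strip()  # stands in for learner text
        target = good_translate(sentence).strip()  # stands in for the correction
        # Assumption (not stated in the abstract): skip empty or identical
        # pairs, since they carry no correction signal.
        if source and target and source != target:
            pairs.append((source, target))
    return pairs


# Hypothetical toy stand-ins for the two translation systems.
POOR = {
    "Il mange une pomme.": "He eat a apple.",
    "Elle aime les chats.": "She likes cats.",
}
GOOD = {
    "Il mange une pomme.": "He is eating an apple.",
    "Elle aime les chats.": "She likes cats.",
}

pairs = synthesize_pairs(POOR.keys(), POOR.__getitem__, GOOD.__getitem__)
for source, target in pairs:
    print(f"{source} -> {target}")
```

Because the translators are passed in as plain callables, the same loop works over any monolingual bridge-language corpus, which is what makes the amount of synthetic data effectively unbounded.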