CL · Nov 1, 2021

A New Tool for Efficiently Generating Quality Estimation Datasets

Sugyeong Eo, Chanjun Park, Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim

arXiv:2111.00767v1 · 0.51 citations

Originality: Incremental advance

AI Analysis

The tool gives the machine translation community an inexpensive way to develop QE datasets, though the contribution appears incremental because it automates an existing data-centric approach.

The paper tackles the high cost of building quality estimation datasets by proposing a fully automatic tool that generates pseudo-QE datasets from monolingual or parallel corpora, enhancing QE performance through data augmentation and enabling use across multiple language pairs.

Building data for quality estimation (QE) training is expensive and requires significant human labor. In this study, we take a data-centric approach to QE and propose a fully automatic pseudo-QE dataset generation tool that generates QE datasets from only a monolingual or parallel corpus as input. Consequently, QE performance is enhanced either by data augmentation or by extending the applicability of QE to multiple language pairs. Further, we intend to publicly release this user-friendly QE dataset generation tool, as we believe it provides the community with a new, inexpensive method for developing QE datasets.
