CL, Aug 28, 2022

CJaFr-v3 : A Freely Available Filtered Japanese-French Aligned Corpus

arXiv:2208.13170v10.31 citations, h-index: 9

Originality: Synthesis-oriented

AI Analysis

The corpus is a valuable resource for machine translation researchers and practitioners working on the Japanese-French language pair, though the contribution is incremental because it builds on existing data.

The authors tackled the lack of freely available Japanese-French parallel data by creating CJaFr-v3, a filtered corpus with 15M aligned segments compiled from existing resources, and demonstrated its usefulness by training standard MT systems.

We present a free Japanese-French parallel corpus. It includes 15M aligned segments and is obtained by compiling and filtering several existing resources. In this paper, we describe the existing resources, their quantity and quality, the filtering we applied to improve the quality of the corpus, and the content of the ready-to-use corpus. We also evaluate the usefulness of this corpus and the quality of our filtering by training and evaluating some standard MT systems with it.
