CL | Dec 11, 2020

Document-aligned Japanese-English Conversation Parallel Corpus

Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa

arXiv:2012.06143 | Has Code

Originality: Incremental advance

AI Analysis

This work provides a new dataset and evaluation method for researchers working on document-level machine translation, particularly for Japanese-English, addressing the lack of resources in this area.

The authors created a document-aligned Japanese-English conversation corpus to address the scarcity of document-level (DL) machine translation (MT) data. They also developed an evaluation set with annotated phenomena where sentence-level MT fails, demonstrating that using context from their corpus leads to improvements in MT models.

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but document-level (DL) MT has not. DL MT is difficult to 1) train, because little DL data is available; and 2) evaluate, because the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set in which these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
