CL · Dec 11, 2020

Document-aligned Japanese-English Conversation Parallel Corpus

arXiv:2012.06143v1995 citations
AI Analysis

This work provides a new dataset and evaluation method for researchers working on document-level machine translation, particularly for Japanese-English, addressing the lack of resources in this area.

The authors created a document-aligned Japanese-English conversation corpus to address the scarcity of document-level (DL) machine translation (MT) data. They also developed an evaluation set with annotated phenomena where sentence-level MT fails, demonstrating that using context from their corpus leads to improvements in MT models.

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resource languages, but document-level (DL) MT has not: it is difficult to 1) train with the small amount of DL data available; and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set where these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
