Exploiting Sentence Order in Document Alignment
This work addresses document alignment for machine translation, particularly for web-scraped data, and improves on existing methods by exploiting sentence order.
The paper tackles document alignment by incorporating sentence order information, achieving a 61% relative error reduction on the WMT16 task and improving machine translation performance on Sinhala-English data from ParaCrawl.
We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method yields a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. It also improves downstream MT performance on web-scraped Sinhala-English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. Finally, it outperforms a comparable corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.
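To make the "relative reduction in error" figure concrete, the following minimal sketch shows how such a number is computed. The function names and the 5% baseline error rate are hypothetical, chosen only for illustration; they are not results or code from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new method removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def error_after_reduction(baseline_error: float, reduction: float) -> float:
    """Error rate that remains after a given relative reduction."""
    return baseline_error * (1.0 - reduction)


# Hypothetical baseline: a system that misaligns 5% of documents.
# This value is an assumption for illustration, not a reported result.
hypothetical_baseline_error = 0.05

# A 61% relative reduction leaves 39% of the original error.
remaining_error = error_after_reduction(hypothetical_baseline_error, 0.61)
print(f"remaining error: {remaining_error:.4f}")
print(f"reduction: {relative_error_reduction(hypothetical_baseline_error, remaining_error):.2f}")
```

A relative reduction is measured against the baseline's error, not its accuracy, so a large relative reduction can correspond to a small absolute change when the baseline is already strong.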