Bilingual Text Extraction as Reading Comprehension
This work improves bilingual text extraction for machine translation and other NLP applications, but the contribution is incremental: it builds on existing span prediction methods and pre-trained models.
The paper frames the extraction of bilingual texts from noisy parallel corpora as token-level span prediction, similar to SQuAD-style reading comprehension. It finds that QANet combined with integer linear programming achieves significantly better accuracy than baseline methods, particularly for distant language pairs such as En-Ja.
In this paper, we propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as token-level span prediction, as in SQuAD-style reading comprehension. To extract the span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT. QANet can be trained from scratch on a specific parallel corpus, while multilingual BERT can exploit pre-trained multilingual representations. For the span prediction method using QANet, we introduce a total optimization method based on integer linear programming to make the predicted parallel spans consistent. We conduct a parallel sentence extraction experiment on simulated noisy parallel corpora for two language pairs (En-Fr and En-Ja) and find that the proposed method using QANet achieves significantly better accuracy than a baseline method using two bi-directional RNN encoders, particularly for the distant language pair (En-Ja). We also conduct a sentence alignment experiment on En-Ja newspaper articles and find that the proposed method using multilingual BERT achieves significantly better accuracy than a baseline method using a bilingual dictionary and dynamic programming.
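To make the span prediction framing concrete, the following is a minimal sketch of SQuAD-style span decoding, not the paper's implementation. It assumes a model (QANet or multilingual BERT in the paper) has already produced one start score and one end score per target token for a given source sentence. The function name `best_span`, the `max_len` limit, and the additive scoring rule are illustrative assumptions.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_len: int,
) -> Tuple[int, int, float]:
    """Pick the target span (start, end) with the highest combined score.

    Assumption: a span's score is start_scores[start] + end_scores[end],
    the usual decoding rule in SQuAD-style reading comprehension.
    Indices are token positions in the target document, end inclusive.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    best = (0, 0, float("-inf"))
    n = len(start_scores)
    for start, s_score in enumerate(start_scores):
        # Only consider spans that end at or after their start
        # and contain at most max_len tokens.
        for end in range(start, min(start + max_len, n)):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical scores for a four-token target document.
start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.5, 1.5, 3.0]
print(best_span(start, end, max_len=2))
```

Decoding each source sentence independently in this way can yield target spans that overlap or conflict, which is the inconsistency the integer linear programming step in the abstract is meant to resolve.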