CL IR Jun 4, 2021

Cross-language Sentence Selection via Data Augmentation and Rationale Training

Yanda Chen, Chris Kedzie, Suraj Nair, Petra Galuščáková, Rui Zhang, Douglas W. Oard, Kathleen McKeown

arXiv:2106.02293v1 · 31.57 · 14 citations

Originality: Incremental advance

AI Analysis

The paper addresses cross-language retrieval for low-resource languages. The advance is incremental because it builds on existing embedding and alignment methods.

The paper tackles cross-language sentence selection in low-resource settings by using data augmentation and rationale training, achieving performance comparable to or better than state-of-the-art systems across three language pairs.

This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.
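The augmentation and rationale ideas described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `build_examples` and `rationale_penalty` are hypothetical, queries are drawn as single words from the English side of each parallel pair, negatives are foreign sentences whose translation lacks the query word, and the rationale term is shown as a plain squared distance between a model's attention weights and an alignment hint. The abstract does not specify the sampling scheme or the exact form of the loss.

```python
import random


def build_examples(parallel, num_negatives=1, seed=0):
    """Turn parallel sentences into (query, foreign_sentence, label) triples.

    parallel: list of (foreign_sentence, english_translation) pairs.
    Assumption: a query is one word sampled from the English translation.
    """
    rng = random.Random(seed)
    examples = []
    for idx, (foreign, english) in enumerate(parallel):
        words = english.lower().split()
        if not words:
            continue
        query = rng.choice(words)
        # Positive: the foreign sentence whose translation contains the query.
        examples.append((query, foreign, 1))
        # Negatives: foreign sentences whose translation lacks the query word.
        candidates = [
            f for j, (f, e) in enumerate(parallel)
            if j != idx and query not in e.lower().split()
        ]
        k = min(num_negatives, len(candidates))
        for negative in rng.sample(candidates, k):
            examples.append((query, negative, 0))
    return examples


def rationale_penalty(attention, alignment_hint):
    """Hypothetical secondary objective: squared distance between the
    model's attention over sentence tokens and an alignment hint."""
    return sum((a - h) ** 2 for a, h in zip(attention, alignment_hint))


# Toy data, invented for this sketch.
toy_parallel = [
    ("paka amelala", "the cat sleeps"),
    ("mbwa anakimbia", "a dog runs"),
]
toy_examples = build_examples(toy_parallel)
```

In a training loop, a penalty like this would be added to the relevance loss with a weighting factor, so that the model is encouraged, but not forced, to attend to the tokens the alignment hints point to.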
