CL Oct 7, 2020

Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank

arXiv:2010.03662v131.2996 citations h-index: 31 Has Code

Originality: Incremental advance

AI Analysis

The work addresses the cost of annotation for cross-lingual NLP tasks, though the advance is incremental because it builds on existing multilingual models and methods.

The paper tackles unsupervised detection of fine-grained semantic differences between sentences in different languages. It introduces a training strategy for multilingual BERT models that uses learning to rank on synthetic divergent examples, and it reports improved accuracy over a strong sentence-level similarity baseline on a new English-French dataset.

Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.
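To make the idea of learning to rank examples of varying granularity concrete, here is a minimal sketch of a pairwise margin (hinge) ranking objective. It is an illustration under stated assumptions, not the paper's implementation: the function names, the margin value, and the example scores are all hypothetical, and the scores stand in for whatever a multilingual BERT model would output for each sentence pair.

```python
def margin_ranking_loss(score_less_divergent, score_more_divergent, margin=1.0):
    """Hinge loss: zero once the less divergent pair outscores the
    more divergent pair by at least `margin` (assumed value)."""
    return max(0.0, margin - (score_less_divergent - score_more_divergent))


def ranking_loss_over_chain(scores, margin=1.0):
    """Average hinge loss over adjacent items in a list of scores that is
    ordered from least divergent to most divergent."""
    losses = [
        margin_ranking_loss(scores[i], scores[i + 1], margin)
        for i in range(len(scores) - 1)
    ]
    return sum(losses) / len(losses)


# Hypothetical scores for: an equivalent pair, a synthetic fine-grained
# divergence, and a synthetic coarse divergence (in that order).
well_ordered = [2.5, 0.5, -2.0]
mis_ordered = [0.2, 0.6, -2.0]

print(ranking_loss_over_chain(well_ordered))
print(ranking_loss_over_chain(mis_ordered))
```

The loss is zero when every less divergent example outscores the next more divergent one by the margin, and positive when the order is violated, as with the first two scores in `mis_ordered`. No annotated divergences are needed for this kind of objective, since the ordering comes from how the synthetic examples were constructed.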

