CL · AI · May 16, 2023

Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

arXiv:2305.09148v1227 citations · Has Code
Originality: Incremental advance
AI Analysis

This work aims to improve cross-lingual sentence embeddings for multilingual NLP applications. It is an incremental advance: it combines existing sentence-level alignment methods with a novel token-level task.

The paper proposes a dual-alignment pre-training (DAP) framework that incorporates both sentence-level and token-level alignment. It introduces a representation translation learning (RTL) task, which embeds translation information into token representations, and reports significant improvements on three sentence-level cross-lingual benchmarks.

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective methods for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, which has not been fully explored previously. Based on our findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representation. Compared to other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embedding. Our code is available at https://github.com/ChillingDream/DAP.
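As a rough illustration of the sentence-level translation ranking task mentioned in the abstract, the sketch below scores each source sentence embedding against every target embedding in a batch and applies a softmax cross-entropy with the true translation as the gold label. This is not the authors' implementation: the function names, the cosine similarity, the temperature value, and the choice to sum both ranking directions are all assumptions made for the example, and the toy vectors stand in for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_ranking_loss(src, tgt, temperature=0.1):
    """In-batch ranking loss: src[i] and tgt[i] are assumed to be translations.

    Every other sentence in the batch acts as a negative. The temperature
    and the bidirectional sum are illustrative assumptions, not values
    taken from the paper.
    """
    n = len(src)
    # sims[i][j] = scaled similarity between source i and target j
    sims = [[cosine(s, t) / temperature for t in tgt] for s in src]

    def cross_entropy(scores, gold):
        # Numerically stable log-sum-exp minus the gold score
        m = max(scores)
        log_z = m + math.log(sum(math.exp(x - m) for x in scores))
        return log_z - scores[gold]

    # Rank targets for each source sentence
    src_to_tgt = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Rank sources for each target sentence (columns of the matrix)
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    tgt_to_src = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return src_to_tgt + tgt_to_src


# Toy embeddings: each target is close to the source at the same index
aligned_src = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[0.9, 0.1], [0.1, 0.9]]
# Swapping the targets breaks the pairing, so the loss should increase
swapped_tgt = [aligned_tgt[1], aligned_tgt[0]]

loss_aligned = translation_ranking_loss(aligned_src, aligned_tgt)
loss_swapped = translation_ranking_loss(aligned_src, swapped_tgt)
print(loss_aligned < loss_swapped)
```

The loss is low when each sentence is most similar to its own translation and high when the pairing is broken. It acts only on pooled sentence vectors, which is the gap the abstract's token-level RTL objective is meant to fill.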

Code Implementations: 1 repo
